I256 Applied Natural Language Processing, Fall 2009
Lecture 6: Introduction to Graphical Models; Part-of-speech tagging
Barbara Rosario

Page 1:

I256

Applied Natural Language Processing

Fall 2009

Lecture 6

• Introduction to Graphical Models

• Part of speech tagging

Barbara Rosario

Page 2:

Graphical Models

• Within the machine learning framework
• Probability theory plus graph theory
• Widely used
  – NLP
  – Speech recognition
  – Error-correcting codes
  – Systems diagnosis
  – Computer vision
  – Filtering (Kalman filters)
  – Bioinformatics

Page 3:

(Quick intro to) Graphical Models

Nodes are random variables

[Diagram: nodes A, B, C, D with edges A→B, A→C, D→C, annotated with P(A), P(D), P(B|A), P(C|A,D)]

Edges are annotated with conditional probabilities

Absence of an edge between nodes implies conditional independence

“Probabilistic database”

Page 4:

Graphical Models

[Diagram: the same graph over A, B, C, D]

• Define a joint probability distribution:

• P(X1, …, XN) = ∏i P(Xi | Par(Xi))
• P(A, B, C, D) = P(A) P(D) P(B|A) P(C|A, D)
• Learning

– Given data, estimate the parameters P(A), P(D), P(B|A), P(C | A, D)
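To make the estimation step concrete, here is a minimal sketch (not from the slides) of maximum-likelihood learning by counting for the A, B, C, D network above; the toy data values are invented for illustration.

```python
from collections import Counter

# Toy fully observed data: each sample assigns a value to A, B, C, D (invented).
data = [
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 0, "B": 0, "C": 1, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 1, "D": 0},
]

n = len(data)
# P(A) and P(D): relative frequencies of the root variables.
p_A = {a: c / n for a, c in Counter(s["A"] for s in data).items()}
p_D = {d: c / n for d, c in Counter(s["D"] for s in data).items()}

# P(B|A): count (a, b) pairs and normalize within each value of the parent A.
count_A = Counter(s["A"] for s in data)
p_B_given_A = {(a, b): c / count_A[a]
               for (a, b), c in Counter((s["A"], s["B"]) for s in data).items()}

# P(C|A,D): count (a, d, c) triples and normalize within each parent configuration.
count_AD = Counter((s["A"], s["D"]) for s in data)
p_C_given_AD = {(a, d, c): m / count_AD[(a, d)]
                for (a, d, c), m in Counter((s["A"], s["D"], s["C"]) for s in data).items()}

print(p_A, p_D, p_B_given_A, p_C_given_AD, sep="\n")
```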

Page 5:

Graphical Models

• Define a joint probability distribution:
• P(X1, …, XN) = ∏i P(Xi | Par(Xi))
• P(A, B, C, D) = P(A) P(D) P(B|A) P(C|A, D)
• Learning

– Given data, estimate P(A), P(B|A), P(D), P(C | A, D)

• Inference: compute conditional probabilities, e.g., P(A|B, D) or P(C | D)

• Inference = probabilistic queries
• General inference algorithms (e.g., Junction Tree)

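As a concrete illustration of such a query (again a sketch, not from the slides): for a network this small we can answer P(A | B, D) by brute-force enumeration, summing the factorized joint over the unobserved variable C. The probability values below are invented; general algorithms such as the Junction Tree are needed once networks grow larger.

```python
# Invented conditional probability tables for the A, B, C, D network above.
p_A = {0: 0.4, 1: 0.6}
p_D = {0: 0.7, 1: 0.3}
p_B_given_A = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}              # key: (a, b)
p_C_given_AD = {(a, d, c): 0.5 for a in (0, 1) for d in (0, 1) for c in (0, 1)}  # key: (a, d, c)

def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D), i.e. the factorization above."""
    return p_A[a] * p_D[d] * p_B_given_A[(a, b)] * p_C_given_AD[(a, d, c)]

def p_A_given_BD(b, d):
    """P(A | B=b, D=d): sum the joint over the hidden variable C, then normalize."""
    unnorm = {a: sum(joint(a, b, c, d) for c in (0, 1)) for a in (0, 1)}
    z = sum(unnorm.values())
    return {a: v / z for a, v in unnorm.items()}

print(p_A_given_BD(b=1, d=0))
```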

Page 6:

Naïve Bayes models

• Simple graphical model

• The xi depend on Y

• Naïve Bayes assumption: all xi are independent given Y

• Currently used for text classification and spam detection

[Graph: class Y with children x1, x2, x3]

Page 7:

Naïve Bayes models

Naïve Bayes for document classification

[Graph: topic node with children w1, w2, …, wn]

Inference task: P(topic | w1, w2 … wn)
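A minimal sketch of what this looks like in code (not from the slides; the tiny corpus and topic labels are invented): the classifier picks the topic maximizing P(topic) ∏i P(wi | topic), with add-one smoothing so unseen words do not zero out the product.

```python
from collections import Counter
from math import log

# Tiny labeled corpus: (document tokens, topic); purely illustrative.
train = [
    (["the", "match", "ended", "in", "a", "draw"], "sports"),
    (["the", "team", "won", "the", "match"], "sports"),
    (["the", "election", "results", "were", "announced"], "politics"),
    (["the", "party", "won", "the", "election"], "politics"),
]

topics = {t for _, t in train}
topic_counts = Counter(t for _, t in train)
word_counts = {t: Counter() for t in topics}
for words, t in train:
    word_counts[t].update(words)
vocab = {w for words, _ in train for w in words}

def classify(words):
    """argmax over topics of log P(topic) + sum_i log P(w_i | topic)."""
    def score(t):
        s = log(topic_counts[t] / len(train))
        denom = sum(word_counts[t].values()) + len(vocab)   # add-one smoothing
        for w in words:
            s += log((word_counts[t][w] + 1) / denom)
        return s
    return max(topics, key=score)

print(classify(["the", "match", "was", "won"]))   # -> 'sports'
```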

Page 8:

Naïve Bayes for SWD

[Graph: sense sk with children v1, v2, v3]

• Recall the general joint probability distribution:

P(X1, …, XN) = ∏i P(Xi | Par(Xi))

P(sk, v1, …, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)

Page 9:

Naïve Bayes for SWD


P(sk, v1, …, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)

Estimation (Training): Given data, estimate: P(sk) P(v1| sk) P(v2| sk) P(v3| sk )

Page 10:

Naïve Bayes for SWD


P(sk, v1, …, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)

Estimation (Training): Given data, estimate: P(sk) P(v1| sk) P(v2| sk) P(v3| sk )

Inference (Testing): Compute conditional probabilities of interest:

P(sk| v1, v2, v3)
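Writing the inference step out explicitly (a standard application of Bayes' rule to the factorization above, added here for clarity):

```latex
P(s_k \mid v_1, v_2, v_3)
  = \frac{P(s_k)\,\prod_{i=1}^{3} P(v_i \mid s_k)}
         {\sum_{s'} P(s')\,\prod_{i=1}^{3} P(v_i \mid s')},
\qquad
\hat{s} = \arg\max_{s_k} \; P(s_k)\prod_{i=1}^{3} P(v_i \mid s_k)
```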

Page 11:

Graphical Models

• Given a graphical model:
  – Do estimation (find parameters from data)
  – Do inference (compute conditional probabilities)

• How do I choose the model structure (i.e. the edges)?

Page 12:

How to choose the model structure?

[Four candidate graph structures over sk, v1, v2, v3, differing in which edges are present]

Page 13:

Model structure

• Learn it: structure learning
  – Difficult, and needs a lot of data
• Use knowledge of the domain and of the relationships between the variables
  – Heuristics
  – The fewer dependencies (edges) we can have, the “better”
    • Sparsity: more edges need more data
    • Next class…
  – Direction of arrows

[Example graph over sk, v1, v2, v3 in which v3 has parents sk, v1, v2, i.e. P(v3 | sk, v1, v2)]

Page 14:

Generative vs. discriminative

Generative:

P(sk, v1, …, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)

Estimation (training): given data, estimate P(sk), P(v1|sk), P(v2|sk) and P(v3|sk)

Inference (testing): compute P(sk | v1, v2, v3) (there are algorithms to find these conditional probabilities, not covered here)

Do inference to find the probability of interest

[Graph: sense sk with children v1, v2, v3]

Discriminative:

P(sk, v1, …, v3) = P(v1) P(v2) P(v3) P(sk | v1, v2, v3)

The conditional probability of interest, P(sk | v1, v2, v3), is “ready”, i.e. modeled directly

Estimation (training): given data, estimate P(v1), P(v2), P(v3), and P(sk | v1, v2, v3)

[Graph: v1, v2, v3 with child sk]

Page 15:

Generative vs. discriminative

• Don’t worry… you can use both models
• If you are interested, let me know
• But in short:
  – If the Naive Bayes assumption made by the generative method is not met (the conditional independencies do not hold), the discriminative method can have an edge
  – But the generative model may converge faster
  – Generative learning can sometimes be more efficient than discriminative learning, at least when the number of features is large compared to the number of samples

Page 16:

Graphical Models

• Provides a convenient framework for visualizing conditional independence
• Provides general inference algorithms
• Next, we’ll see a GM (a Hidden Markov Model) for POS tagging

Page 17:

Part-of-speech (English)

From Dan Klein’s cs 288 slides

Page 18:

Modified from Diane Litman's version of Steve Bird's notes


Terminology

• Tagging
  – The process of associating labels with each token in a text
• Tags
  – The labels
  – Syntactic word classes
• Tag set
  – The collection of tags used

Page 19:


Example

• Typically a tagged text is a sequence of white-space-separated base/tag tokens:

These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.

Page 20:

Part-of-speech (English)

From Dan Klein’s cs 288 slides

Page 21:

POS tagging vs. WSD

• Similar task: assign POS vs. assign word sense
  – You should butter your toast
  – Bread and butter
• Using a word as a noun or a verb involves a different meaning, like WSD
• In practice the two tasks have been treated separately, because of their different nature and because the methods used are different:
  – Nearby structures are most useful for POS (e.g., is the preceding word a determiner?) but are of little use for word sense
  – Conversely, quite distant content words are very effective for determining the semantic sense, but not the POS

Page 22:

Part-of-Speech Ambiguity

From Dan Klein’s cs 288 slides

[Figure: example of a word ambiguous between particle, preposition, and adverb readings]

Page 23:

Part-of-Speech Ambiguity

Words that are highly ambiguous as to their part of speech tag

Page 24:

Sources of information

• Syntagmatic: tags of the other words
  – AT JJ NN is common
  – AT JJ VBP impossible (or unlikely)
• Lexical: look at the words
  – The: AT
  – Flour is more likely to be a noun than a verb
  – A tagger that always chooses the most common tag is 90% correct (often used as a baseline)

• Most taggers use both

Page 25:

Modified from Diane Litman's version of Steve Bird's notes


What does Tagging do?

1. Collapses distinctions
  • Lexical identity may be discarded
  • e.g., all personal pronouns tagged with PRP
2. Introduces distinctions
  • Ambiguities may be resolved
  • e.g., deal tagged with NN or VB

3. Helps in classification and prediction

Page 26:

Modified from Diane Litman's version of Steve Bird's notes


Why POS?

• A word’s POS tells us a lot about the word and its neighbors:
  – Limits the range of meanings (deal), pronunciation (text-to-speech: OBject vs. obJECT, record), or both (wind)
  – Helps in stemming: saw[v] → see, saw[n] → saw
  – Limits the range of following words
  – Can help select nouns from a document for summarization
  – Basis for partial parsing (chunked parsing)

Page 27:

Why POS?

From Dan Klein’s cs 288 slides

Page 28:

Slide modified from Massimo Poesio's


Choosing a tagset

• The choice of tagset greatly affects the difficulty of the problem

• Need to strike a balance between
  – Getting better information about context
  – Making it possible for classifiers to do their job

Page 29:

Slide modified from Massimo Poesio's


Some of the best-known Tagsets

• Brown corpus: 87 tags (more when tags are combined)
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the BNC): 61 tags
• Lancaster C7: 145 tags!

Page 30:

NLTK

• Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.
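For example, a quick sketch with the Brown corpus (assuming the corpus data has been downloaded; output abbreviated):

```python
import nltk
from nltk.corpus import brown

# nltk.download('brown')   # uncomment on first use

print(brown.tagged_words()[:5])    # [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
print(brown.tagged_sents()[0])     # the first sentence as (word, tag) pairs
```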

Page 31:

Tagging methods

• Hand-coded

• Statistical taggers
  – N-gram tagging
  – HMM
  – (Maximum Entropy)

• Brill (transformation-based) tagger

Page 32:

Hand-coded Tagger

• The Regular Expression Tagger
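A sketch of such a tagger in NLTK (essentially the pattern list from the NLTK book; patterns are tried in order and the final catch-all makes NN the default):

```python
import nltk

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # everything else: noun (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag("The little dogs barked loudly".split()))
```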

Page 33:

Unigram Tagger

• Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
  – For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g., a frequent word) more often than it is used as a verb (e.g., I frequent this cafe).

P(tn | wn)

Page 34:

Unigram Tagger

• We train a UnigramTagger by specifying tagged sentence data as a parameter when we initialize the tagger. The training process involves inspecting the tag of each word and storing the most likely tag for any word in a dictionary, stored inside the tagger.

• We must be careful not to test it on the same data. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text.

• Instead, we should split the data, training on 90% and testing on the remaining 10% (or 75% and 25%).

• Calculate performance on previously unseen text.
  – Note: this is the general procedure for learning systems
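A sketch of this procedure in NLTK, along the lines of the NLTK book (assuming the Brown corpus is installed; in newer NLTK releases .evaluate() is called .accuracy()):

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

unigram_tagger = nltk.UnigramTagger(train_sents)     # training: store most likely tag per word
print(unigram_tagger.tag("The race for outer space".split()))
print(unigram_tagger.evaluate(test_sents))           # accuracy on the held-out 10%
```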

Page 35:

N-Gram Tagging

• An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens

• A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.

Trigram tagger: P(tn | wn, tn-1, tn-2)
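A quick sketch of a bigram tagger in NLTK (same train/test split as above). Trained on its own it scores poorly on held-out text, precisely because many word-plus-previous-tag contexts in the test data were never seen in training; the next slides take this up.

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.evaluate(test_sents))   # low on its own: unseen contexts get tag None
```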

Page 36:

N-Gram Tagging

• Why not 10-gram taggers?

Page 37:

N-Gram Tagging

• Why not 10-gram taggers?
• As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data.

• This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off)

• Next week: sparsity

Page 38:

Markov Model Tagger

• Bigram tagger

• Assumptions:
  – Words are independent of each other
  – A word’s identity depends only on its tag
  – A tag depends only on the previous tag

• What does a GM with these assumptions look like?

Page 39:

Markov Model Tagger

[Graph: tag chain t1 → t2 → … → tn, with each tag ti emitting its word wi]

P(t, w) = P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)

Page 40:

Markov Model Tagger Training

• For all tags ti do
  – For all tags tj do
    P(tj | ti) = C(ti, tj) / C(ti)
  – end

• For all tags ti do
  – For all words wi do
    P(wi | ti) = C(wi, ti) / C(ti)
  – end

where C(ti, tj) = number of occurrences of ti followed by tj, and C(wi, ti) = number of occurrences of wi labeled with tag ti.

P(t, w) = P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)

Page 41:

Markov Model Tagger Estimation

• Goal:
  – Find the optimal tag sequence for a given sentence
  – The Viterbi algorithm

P(t, w) = P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)

The best tag sequence: argmax over t1, …, tn of P(t1, …, tn | w1, …, wn)

Page 42:

Sequence free tagging?

From Dan Klein’s cs 288 slides

Page 43:

Sequence free tagging?

• Solution: maximum entropy sequence models (MEMMs, maximum entropy Markov models; CRFs, conditional random fields)

From Dan Klein’s cs 288 slides

Page 44:

Modified from Diane Litman's version of Steve Bird's notes


Rule-Based Tagger

• The linguistic complaint
  – Where is the linguistic knowledge of a tagger?
  – Just massive tables of numbers
  – Aren’t there any linguistic insights that could emerge from the data?
  – Could thus use handcrafted sets of rules to tag input sentences, for example: if a word follows a determiner, tag it as a noun.

P(tn | wn, tn-1, tn-2)

Page 45:

Slide modified from Massimo Poesio's


The Brill tagger (transformation-based tagger)

• An example of transformation-based learning
  – Basic idea: do a quick job first (using frequency), then revise it using contextual rules
• Very popular (freely available, works fairly well)
  – Probably the most widely used tagger (esp. outside NLP)
  – … but not the most accurate: 96.6% / 82.0%

• A supervised method: requires a tagged corpus

Page 46:

Brill Tagging: In more detail

• Start with simple (less accurate) rules… learn better ones from the tagged corpus
  – Tag each word initially with its most likely POS
  – Examine a set of transformations to see which improves tagging decisions compared to the tagged corpus
  – Re-tag the corpus using the best transformation
  – Repeat until, e.g., performance doesn’t improve
  – Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text

Page 47:

Slide modified from Massimo Poesio's


An example

• Examples:
  – They are expected to race tomorrow.
  – The race for outer space.
• Tagging algorithm:
  1. Tag all uses of “race” as NN (the most likely tag in the Brown corpus)
     • They are expected to race/NN tomorrow
     • the race/NN for outer space
  2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:
     • They are expected to race/VB tomorrow
     • the race/NN for outer space
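The two steps of this example, written out as a toy sketch in plain Python (illustrative only: the tiny lexicon and its tags are invented, NLTK ships a full Brill tagger, and real transformations are learned from the corpus rather than written by hand):

```python
# Step 1: tag every word with its most likely tag (tiny invented lexicon).
most_likely_tag = {'They': 'PRP', 'are': 'VBP', 'expected': 'VBN', 'to': 'TO',
                   'race': 'NN', 'tomorrow': 'NR', 'The': 'AT', 'for': 'IN',
                   'outer': 'JJ', 'space': 'NN'}

def initial_tag(words):
    return [(w, most_likely_tag.get(w, 'NN')) for w in words]

# Step 2: one hand-written transformation of the kind Brill's tagger learns:
# change from_tag to to_tag whenever the previous tag is prev_tag.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

tagged = initial_tag("They are expected to race tomorrow".split())   # ... to/TO race/NN ...
print(apply_rule(tagged, 'NN', 'VB', 'TO'))                          # ... to/TO race/VB ...
```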

Page 48:

What gets learned? [from Brill 95]

[Figure: tag-triggered transformations and morphology-triggered transformations]

Rules are linguistically interpretable

Page 49:

Tagging accuracies (overview)

From Dan Klein’s cs 288 slides

Page 50:

Tagging accuracies

From Dan Klein’s cs 288 slides

Page 51:

Tagging accuracies

• Taggers are already pretty good on WSJ text…
• What we need is taggers that work on other text!
• Performance depends on several factors:
  – The amount of training data
  – The tag set (the larger, the harder the task)
  – The difference between training and testing corpus
  – Unknown words
    • For example, in technical domains

Page 52:

Common Errors

From Dan Klein’s cs 288 slides

Page 53:

Next week

• What happens when P(tn | wn, tn-1, tn-2) = 0?

• Sparsity
• Methods to deal with it
  – For example, back-off: if P(tn | wn, tn-1, tn-2) = 0, use P(tn | wn) instead
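In NLTK this back-off idea is expressed by chaining taggers, each falling back to the next when it has no answer for a context (a sketch along the lines of the NLTK book, using the same train/test split as earlier):

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Back-off chain: trigram -> bigram -> unigram -> default tag NN.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
print(t3.evaluate(test_sents))   # noticeably better than any single n-gram tagger alone
```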

Page 54:

Administrativia

• Assignment 2 is out
  – Due September 22
  – Soon: grades and “best” solutions to assignment 1

• Reading for next class
  – Chapter 6, Statistical NLP