I256 Applied Natural Language Processing Fall 2009 Lecture 6 Introduction of Graphical Models Part...
I256
Applied Natural Language Processing
Fall 2009
Lecture 6
• Introduction to Graphical Models
• Part of speech tagging
Barbara Rosario
Graphical Models
• Within the machine learning framework
• Probability theory plus graph theory
• Widely used in:
– NLP
– Speech recognition
– Error-correcting codes
– Systems diagnosis
– Computer vision
– Filtering (Kalman filters)
– Bioinformatics
(Quick intro to) Graphical Models
Nodes are random variables
[Figure: a DAG with nodes A, B, C, D; edges A→B, A→C, D→C; nodes annotated with P(A), P(D), P(B|A), P(C|A,D)]
Edges are annotated with conditional probabilities
Absence of an edge between nodes implies conditional independence
“Probabilistic database”
Graphical Models
[Figure: the same DAG over A, B, C, D]
• Define a joint probability distribution:
• P(X1, …, XN) = ∏i P(Xi | Par(Xi))
• P(A, B, C, D) = P(A) P(D) P(B | A) P(C | A, D)
• Learning
– Given data, estimate the parameters P(A), P(D), P(B | A), P(C | A, D)
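The factorization above can be sketched in code. This is a minimal illustration, not part of the lecture: all four conditional probability tables below are invented hand-set numbers for binary variables, just to show that multiplying the factors gives a valid joint distribution.

```python
from itertools import product

# Hypothetical parameter tables for binary A, B, C, D (invented numbers)
P_A = {True: 0.3, False: 0.7}
P_D = {True: 0.6, False: 0.4}
P_B_given_A = {True: {True: 0.9, False: 0.1},    # P(B | A)
               False: {True: 0.2, False: 0.8}}
P_C_given_AD = {(True, True): {True: 0.8, False: 0.2},   # P(C | A, D)
                (True, False): {True: 0.5, False: 0.5},
                (False, True): {True: 0.4, False: 0.6},
                (False, False): {True: 0.1, False: 0.9}}

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(A) P(D) P(B|A) P(C|A,D)."""
    return P_A[a] * P_D[d] * P_B_given_A[a][b] * P_C_given_AD[(a, d)][c]

# Sanity check: the factored joint sums to 1 over all 16 assignments
total = sum(joint(a, b, c, d) for a, b, c, d in product([True, False], repeat=4))
```

Learning would amount to estimating these tables from data; here they are simply fixed by hand.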
Graphical Models
• Define a joint probability distribution:
• P(X1, …, XN) = ∏i P(Xi | Par(Xi))
• P(A, B, C, D) = P(A) P(D) P(B | A) P(C | A, D)
• Learning
– Given data, estimate P(A), P(B | A), P(D), P(C | A, D)
• Inference: compute conditional probabilities, e.g., P(A | B, D) or P(C | D)
• Inference = probabilistic queries
• General inference algorithms (e.g., Junction Tree)
Naïve Bayes models
• Simple graphical model
• The features xi depend on the class Y
• Naïve Bayes assumption: all xi are independent given Y
• Currently used for text classification and spam detection
[Figure: class node Y with children x1, x2, x3]
Naïve Bayes models
Naïve Bayes for document classification
[Figure: class node topic with children w1, w2, …, wn]
Inference task: P(topic | w1, w2 … wn)
Naïve Bayes for SWD
[Figure: sense node sk with children v1, v2, v3]
• Recall the general joint probability distribution:
P(X1, …, XN) = ∏i P(Xi | Par(Xi))
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Naïve Bayes for SWD
[Figure: sense node sk with children v1, v2, v3]
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (Training): Given data, estimate P(sk), P(v1 | sk), P(v2 | sk), P(v3 | sk)
Naïve Bayes for SWD
[Figure: sense node sk with children v1, v2, v3]
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (Training): Given data, estimate P(sk), P(v1 | sk), P(v2 | sk), P(v3 | sk)
Inference (Testing): Compute conditional probabilities of interest:
P(sk| v1, v2, v3)
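The training and testing steps above can be sketched as a tiny naïve Bayes sense disambiguator. This is an illustrative sketch only: the two senses, the context words, and the four training examples are all invented, and add-one smoothing (not discussed on the slide) is used so unseen words do not zero out a sense.

```python
from collections import Counter, defaultdict

# Invented training data: (sense, [context words v1..vn]) for "bank"-style ambiguity
data = [("financial", ["money", "loan", "rate"]),
        ("financial", ["money", "deposit", "rate"]),
        ("river", ["water", "shore", "loan"]),
        ("river", ["water", "shore", "fish"])]

prior = Counter(s for s, _ in data)        # counts for P(sk)
cond = defaultdict(Counter)                # counts for P(vi | sk)
for s, words in data:
    cond[s].update(words)

def p_sense_given_words(words):
    """Inference: P(sk | v1..vn), normalized, with add-one smoothing."""
    vocab = {w for _, ws in data for w in ws}
    scores = {}
    for s in prior:
        p = prior[s] / len(data)                       # P(sk)
        denom = sum(cond[s].values()) + len(vocab)
        for w in words:
            p *= (cond[s][w] + 1) / denom              # P(vi | sk), smoothed
        scores[s] = p
    z = sum(scores.values())
    return {s: p / z for s, p in scores.items()}

post = p_sense_given_words(["money", "rate"])
```

Training is just counting (the MLE of each factor); inference multiplies the factors and renormalizes over the senses.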
Graphical Models
• Given a graphical model:
– Do estimation (find the parameters from data)
– Do inference (compute conditional probabilities)
• How do I choose the model structure (i.e., the edges)?
How to choose the model structure?
[Figure: four candidate model structures over sk and v1, v2, v3, differing in which edges are present and in their direction]
Model structure
• Learn it: structure learning
– Difficult, and needs a lot of data
• Use knowledge of the domain and of the relationships between the variables
– Heuristics
– The fewer dependencies (edges) we can have, the “better”
• Sparsity: more edges means more parameters, so more data is needed
• Next class…
– Direction of the arrows matters, e.g., P(v3 | sk, v1, v2)
[Figure: model with edges from sk, v1, v2 into v3]
Generative vs. discriminative
Generative:
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (Training): Given data, estimate P(sk), P(v1 | sk), P(v2 | sk) and P(v3 | sk)
Inference (Testing): Compute P(sk | v1, v2, v3) (there are algorithms to find these conditional probabilities, not covered here)
[Figure: generative model, arrows from sk to v1, v2, v3]
Discriminative:
P(sk, v1, …, v3) = P(v1) P(v2) P(v3) P(sk | v1, v2, v3)
The conditional probability of interest is “ready”: P(sk | v1, v2, v3), i.e., modeled directly
Estimation (Training): Given data, estimate P(v1), P(v2), P(v3), and P(sk | v1, v2, v3)
[Figure: discriminative model, arrows from v1, v2, v3 to sk]
Generative: do inference to find the probability of interest. Discriminative: the probability of interest is modeled directly.
Generative vs. discriminative
• Don’t worry… you can use both models
• If you are interested, let me know
• But in short:
– If the naïve Bayes assumption made by the generative method is not met (the conditional independencies do not hold), the discriminative method can have an edge
– But the generative model may converge faster
– Generative learning can sometimes be more efficient than discriminative learning, at least when the number of features is large compared to the number of samples
Graphical Models
• Provide a convenient framework for visualizing conditional independencies
• Provide general inference algorithms
• Next, we’ll see a GM (the Hidden Markov Model) for POS tagging
Part-of-speech (English)
From Dan Klein’s cs 288 slides
Modified from Diane Litman's version of Steve Bird's notes
Terminology
• Tagging
– The process of associating labels with each token in a text
• Tags
– The labels
– Syntactic word classes
• Tag set
– The collection of tags used
Example
• Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
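The base/tag format above can be split into (word, tag) pairs with a one-liner; splitting on the last "/" also handles the final "./." token correctly. A minimal sketch:

```python
# Split whitespace-separated base/tag tokens into (word, tag) pairs.
# rsplit on the last "/" so tokens like "./." come apart correctly.
tagged = "These/DT findings/NNS should/MD be/VB useful/JJ for/IN ./."
pairs = [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]
```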
Part-of-speech (English)
From Dan Klein’s cs 288 slides
POS tagging vs. WSD
• Similar tasks: assign a POS vs. assign a word sense
– You should butter your toast
– Bread and butter
• Using a word as a noun or a verb involves a different meaning, as in WSD
• In practice the two topics, POS tagging and WSD, have been treated separately, because of their different nature and because the methods used are different
– Nearby structures are most useful for POS (e.g., is the preceding word a determiner?) but are of little use for WSD
– Conversely, quite distant content words are very effective for determining the semantic sense, but not the POS
Part-of-Speech Ambiguity
From Dan Klein’s cs 288 slides
[Slide example: one word tagged as particle, preposition, or adverb depending on context]
Part-of-Speech Ambiguity
Words that are highly ambiguous as to their part of speech tag
Sources of information
• Syntagmatic: the tags of the other words
– AT JJ NN is common
– AT JJ VBP is impossible (or unlikely)
• Lexical: look at the words themselves
– The → AT
– Flour is more likely to be a noun than a verb
– A tagger that always chooses the most common tag is 90% correct (often used as a baseline)
• Most taggers use both
Modified from Diane Litman's version of Steve Bird's notes
What does Tagging do?
1. Collapses distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
2. Introduces distinctions
• Ambiguities may be resolved
• e.g., deal tagged with NN or VB
3. Helps in classification and prediction
Modified from Diane Litman's version of Steve Bird's notes
Why POS?
• A word’s POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), pronunciation in text-to-speech (object vs. object, record), or both (wind)
– Helps in stemming: saw[v] → see, saw[n] → saw
– Limits the range of following words
– Can help select nouns from a document for summarization
– Basis for partial parsing (chunked parsing)
Why POS?
From Dan Klein’s cs 288 slides
Slide modified from Massimo Poesio's
Choosing a tagset
• The choice of tagset greatly affects the difficulty of the problem
• Need to strike a balance between
– Getting better information about context
– Making it possible for classifiers to do their job
Slide modified from Massimo Poesio's
Some of the best-known Tagsets
• Brown corpus: 87 tags
– (more when tags are combined)
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the BNC): 61 tags
• Lancaster C7: 145 tags!
NLTK
• Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.
Tagging methods
• Hand-coded
• Statistical taggers– N-Gram Tagging– HMM– (Maximum Entropy)
• Brill (transformation-based) tagger
Hand-coded Tagger
• The Regular Expression Tagger
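A hand-coded regular-expression tagger can be sketched in a few lines, in the spirit of NLTK's RegexpTagger: patterns are tried in order and the first match wins, with a catch-all default at the end. The pattern set below is an invented illustration, not the one from the lecture.

```python
import re

# Illustrative patterns: first match wins, last pattern is the default
patterns = [(r".*ing$", "VBG"),                    # gerunds
            (r".*ed$", "VBD"),                     # simple past
            (r".*s$", "NNS"),                      # plural nouns
            (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),      # cardinal numbers
            (r".*", "NN")]                         # default: noun

def regexp_tag(word):
    for pat, tag in patterns:
        if re.match(pat, word):
            return tag

tags = [regexp_tag(w) for w in ["running", "jumped", "cats", "42", "table"]]
```

Pattern order matters: a word like "sings" would hit the gerund-free ".*s$" rule only because the earlier patterns fail first.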
Unigram Tagger
• Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
– For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g., a frequent word) more often than it is used as a verb (e.g., I frequent this cafe).
P(tn | wn)
Unigram Tagger
• We train a UnigramTagger by specifying tagged sentence data as a parameter when we initialize the tagger. The training process involves inspecting the tag of each word and storing the most likely tag for each word in a dictionary inside the tagger.
• We must be careful not to test it on the same data. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text.
• Instead, we should split the data, training on 90% and testing on the remaining 10% (or 75% and 25%)
• Calculate performance on previously unseen text.
– Note: this is the general procedure for learning systems
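The train/test procedure above can be sketched without NLTK. The three-sentence corpus is invented for illustration, the split is a crude 2-sentences-train / 1-sentence-test stand-in for the 90/10 split, and unseen words default to NN.

```python
from collections import Counter, defaultdict

# Invented tagged corpus: list of sentences of (word, tag) pairs
corpus = [[("the", "DT"), ("race", "NN"), ("was", "VBD"), ("long", "JJ")],
          [("they", "PRP"), ("race", "VBP"), ("daily", "RB")],
          [("the", "DT"), ("race", "NN"), ("ended", "VBD")]]

train, test = corpus[:2], corpus[2:]          # crude 2/1 train-test split

counts = defaultdict(Counter)                 # counts[word][tag]
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

# most_common() sorts stably, so ties go to the tag seen first in training
model = {w: c.most_common()[0][0] for w, c in counts.items()}

def unigram_tag(word, default="NN"):          # unseen words default to NN
    return model.get(word, default)

accuracy = (sum(unigram_tag(w) == t for s in test for w, t in s)
            / sum(len(s) for s in test))
```

On the held-out sentence the tagger gets "the" and "race" right but misses the unseen word "ended", which shows why evaluation must use unseen text.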
N-Gram Tagging
• An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens
• A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.
Trigram tagger: P(tn | wn, tn-1, tn-2)
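An n-gram tagger's lookup can be sketched as a table keyed on the context. Below, a bigram-style context of (previous tag, current word); the counts are invented toy data around the "race" example, and an unseen context returns None (in NLTK this is where back-off to a simpler tagger would happen).

```python
from collections import Counter, defaultdict

# Invented counts for contexts (previous tag, word) -> tag
bigram_counts = defaultdict(Counter)
for prev_tag, word, tag in [("TO", "race", "VB"), ("TO", "race", "VB"),
                            ("DT", "race", "NN"), ("DT", "race", "NN")]:
    bigram_counts[(prev_tag, word)][tag] += 1

def bigram_tag(prev_tag, word):
    """Most frequent tag for this context; None if the context is unseen."""
    c = bigram_counts.get((prev_tag, word))
    return c.most_common()[0][0] if c else None
```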
N-Gram Tagging
• Why not 10-gram taggers?
N-Gram Tagging
• Why not 10-gram taggers?
• As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data.
• This is known as the sparse data problem, and it is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (related to the precision/recall trade-off)
• Next week: sparsity
Markov Model Tagger
• Bigram tagger
• Assumptions:
– Words are independent of each other
– A word’s identity depends only on its tag
– A tag depends only on the previous tag
• What does a GM with these assumptions look like?
Markov Model Tagger
[Figure: HMM, a tag chain t1 → t2 → … → tn, each ti emitting the word wi]
P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)
Markov Model Tagger Training
• For all tags ti do
– For all tags tj do
– P(tj | ti) = C(ti, tj) / C(ti)
– end
• For all tags ti do
– For all words wi do
– P(wi | ti) = C(wi, ti) / C(ti)
– end
where
C(ti, tj) = number of occurrences of ti followed by tj
C(wi, ti) = number of occurrences of wi that are labeled as ti
P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)
Markov Model Tagger Estimation
• Goal:
– Find the optimal tag sequence for a given sentence
– The Viterbi algorithm
P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)
t̂1..n = argmax over t1, …, tn of P(t1, …, tn | w1, …, wn)
Sequence free tagging?
From Dan Klein’s cs 288 slides
Sequence free tagging?
• Solution: maximum-entropy sequence models (MEMMs, maximum-entropy Markov models; CRFs, conditional random fields)
From Dan Klein’s cs 288 slides
Modified from Diane Litman's version of Steve Bird's notes
Rule-Based Tagger
• The linguistic complaint
– Where is the linguistic knowledge of a tagger?
– Just massive tables of numbers, e.g., P(tn | wn, tn-1, tn-2)
– Aren’t there any linguistic insights that could emerge from the data?
– Could instead use handcrafted sets of rules to tag input sentences; for example, if the input follows a determiner, tag it as a noun.
Slide modified from Massimo Poesio's
The Brill tagger(transformation-based tagger)
• An example of transformation-based learning
– Basic idea: do a quick job first (using frequency), then revise it using contextual rules
• Very popular (freely available, works fairly well)
– Probably the most widely used tagger (esp. outside NLP)
– … but not the most accurate: 96.6% / 82.0%
• A supervised method: requires a tagged corpus
Brill Tagging: In more detail
• Start with simple (less accurate) rules… learn better ones from a tagged corpus
– Tag each word initially with its most likely POS
– Examine the set of transformations to see which most improves the tagging decisions compared to the tagged corpus
– Re-tag the corpus using the best transformation
– Repeat until, e.g., performance doesn’t improve
– Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text
Slide modified from Massimo Poesio's
An example
• Examples:
– They are expected to race tomorrow.
– The race for outer space.
• Tagging algorithm:
1. Tag all uses of “race” as NN (the most likely tag in the Brown corpus)
• They are expected to race/NN tomorrow
• the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:
• They are expected to race/VB tomorrow
• the race/NN for outer space
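The transformation in step 2 can be sketched as a function over an initially tagged sentence. The rule shape ("change tag X to Y when the previous tag is Z") follows the example above; the initial tags for the other words are invented for illustration.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Brill-style transformation: from_tag -> to_tag when preceded by prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

# Initial (most-likely-tag) pass; surrounding tags invented for illustration
initial = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NR")]
fixed = apply_rule(initial, "NN", "VB", "TO")
```

Note that "the race/NN" is untouched, since its preceding tag is not TO; learning consists of searching for the rule that corrects the most errors against the tagged corpus.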
What gets learned? [from Brill 95]
Tag-triggered transformations; morphology-triggered transformations
Rules are linguistically interpretable
Tagging accuracies (overview)
From Dan Klein’s cs 288 slides
Tagging accuracies
From Dan Klein’s cs 288 slides
Tagging accuracies
• Taggers are already pretty good on WSJ journal text…
• What we need are taggers that work on other text!
• Performance depends on several factors:
– The amount of training data
– The tag set (the larger, the harder the task)
– Differences between the training and testing corpus
– Unknown words (for example, in technical domains)
Common Errors
From Dan Klein’s cs 288 slides
Next week
• What happens when P(tn | wn, tn-1, tn-2) = 0?
• Sparsity
• Methods to deal with it
– For example, back-off: if P(tn | wn, tn-1, tn-2) = 0, use P(tn | wn) instead
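The back-off idea previewed above can be sketched directly: consult the trigram table first, and fall back to the unigram table when the trigram context was never seen. Both tables below are invented toy numbers.

```python
# Invented toy tables for illustration
trigram = {("race", "TO", "DT"): {"VB": 1.0}}   # P(t | w, t-1, t-2)
unigram = {"race": {"NN": 0.98, "VB": 0.02}}    # P(t | w)

def backoff_tag_dist(word, prev_tag, prev_prev_tag):
    """Trigram estimate if the context was seen; otherwise back off to unigram."""
    dist = trigram.get((word, prev_tag, prev_prev_tag))
    return dist if dist else unigram.get(word, {})
```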
Administrativia
• Assignment 2 is out
– Due September 22
– Soon: grades and “best” solutions to assignment 1
• Reading for next class
– Chapter 6 of Statistical NLP