I256 Applied Natural Language Processing Fall 2009 Lecture 6 Introduction of Graphical Models Part...
I256
Applied Natural Language Processing
Fall 2009
Lecture 6
• Introduction to Graphical Models
• Part of speech tagging
Barbara Rosario
Graphical Models
• Within the machine learning framework
• Probability theory plus graph theory
• Widely used in:
– NLP
– Speech recognition
– Error-correcting codes
– Systems diagnosis
– Computer vision
– Filtering (Kalman filters)
– Bioinformatics
(Quick intro to) Graphical Models
Nodes are random variables
[Figure: a DAG with nodes A, B, C, D; edges A→B, A→C, D→C; nodes annotated with P(A), P(D), P(B|A), P(C|A,D)]
Edges are annotated with conditional probabilities
Absence of an edge between nodes implies conditional independence
“Probabilistic database”
Graphical Models
[Figure: the same DAG over A, B, C, D]
• Define a joint probability distribution:
• P(X1, …, XN) = ∏i P(Xi | Par(Xi))
• P(A, B, C, D) = P(A) P(D) P(B | A) P(C | A, D)
• Learning
– Given data, estimate the parameters P(A), P(D), P(B | A), P(C | A, D)
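The factorization above can be sketched in code. This is a minimal illustration, not part of the lecture: all four conditional probability tables below are invented hand-set numbers for binary variables, just to show that multiplying the factors gives a valid joint distribution.

```python
from itertools import product

# Hypothetical parameter tables for binary A, B, C, D (invented numbers)
P_A = {True: 0.3, False: 0.7}
P_D = {True: 0.6, False: 0.4}
P_B_given_A = {True: {True: 0.9, False: 0.1},    # P(B | A)
               False: {True: 0.2, False: 0.8}}
P_C_given_AD = {(True, True): {True: 0.8, False: 0.2},   # P(C | A, D)
                (True, False): {True: 0.5, False: 0.5},
                (False, True): {True: 0.4, False: 0.6},
                (False, False): {True: 0.1, False: 0.9}}

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(A) P(D) P(B|A) P(C|A,D)."""
    return P_A[a] * P_D[d] * P_B_given_A[a][b] * P_C_given_AD[(a, d)][c]

# Sanity check: the factored joint sums to 1 over all 16 assignments
total = sum(joint(a, b, c, d) for a, b, c, d in product([True, False], repeat=4))
```

Learning would amount to estimating these tables from data; here they are simply fixed by hand.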
Graphical Models
• Define a joint probability distribution:
• P(X1, …, XN) = ∏i P(Xi | Par(Xi))
• P(A, B, C, D) = P(A) P(D) P(B | A) P(C | A, D)
• Learning
– Given data, estimate P(A), P(B | A), P(D), P(C | A, D)
• Inference: compute conditional probabilities, e.g., P(A | B, D) or P(C | D)
• Inference = probabilistic queries
• General inference algorithms (e.g., Junction Tree)
Naïve Bayes models
• Simple graphical model
• The features xi depend on the class Y
• Naïve Bayes assumption: all xi are independent given Y
• Currently used for text classification and spam detection
[Figure: class node Y with children x1, x2, x3]
Naïve Bayes models
Naïve Bayes for document classification
[Figure: class node topic with children w1, w2, …, wn]
Inference task: P(topic | w1, w2 … wn)
Naïve Bayes for SWD
[Figure: sense node sk with children v1, v2, v3]
• Recall the general joint probability distribution:
P(X1, …, XN) = ∏i P(Xi | Par(Xi))
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Naïve Bayes for SWD
[Figure: sense node sk with children v1, v2, v3]
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (Training): Given data, estimate P(sk), P(v1 | sk), P(v2 | sk), P(v3 | sk)
Naïve Bayes for SWD
[Figure: sense node sk with children v1, v2, v3]
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (Training): Given data, estimate P(sk), P(v1 | sk), P(v2 | sk), P(v3 | sk)
Inference (Testing): Compute conditional probabilities of interest:
P(sk| v1, v2, v3)
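The training and testing steps above can be sketched as a tiny naïve Bayes sense disambiguator. This is an illustrative sketch only: the two senses, the context words, and the four training examples are all invented, and add-one smoothing (not discussed on the slide) is used so unseen words do not zero out a sense.

```python
from collections import Counter, defaultdict

# Invented training data: (sense, [context words v1..vn]) for "bank"-style ambiguity
data = [("financial", ["money", "loan", "rate"]),
        ("financial", ["money", "deposit", "rate"]),
        ("river", ["water", "shore", "loan"]),
        ("river", ["water", "shore", "fish"])]

prior = Counter(s for s, _ in data)        # counts for P(sk)
cond = defaultdict(Counter)                # counts for P(vi | sk)
for s, words in data:
    cond[s].update(words)

def p_sense_given_words(words):
    """Inference: P(sk | v1..vn), normalized, with add-one smoothing."""
    vocab = {w for _, ws in data for w in ws}
    scores = {}
    for s in prior:
        p = prior[s] / len(data)                       # P(sk)
        denom = sum(cond[s].values()) + len(vocab)
        for w in words:
            p *= (cond[s][w] + 1) / denom              # P(vi | sk), smoothed
        scores[s] = p
    z = sum(scores.values())
    return {s: p / z for s, p in scores.items()}

post = p_sense_given_words(["money", "rate"])
```

Training is just counting (the MLE of each factor); inference multiplies the factors and renormalizes over the senses.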
Graphical Models
• Given a graphical model:
– Do estimation (find the parameters from data)
– Do inference (compute conditional probabilities)
• How do I choose the model structure (i.e., the edges)?
How to choose the model structure?
[Figure: four candidate model structures over sk and v1, v2, v3, differing in which edges are present and in their direction]
Model structure
• Learn it: structure learning
– Difficult, and needs a lot of data
• Use knowledge of the domain and of the relationships between the variables
– Heuristics
– The fewer dependencies (edges) we can have, the “better”
• Sparsity: more edges means more parameters, so more data is needed
• Next class…
– Direction of the arrows matters, e.g., P(v3 | sk, v1, v2)
[Figure: model with edges from sk, v1, v2 into v3]
Generative vs. discriminative
Generative:
P(sk, v1, …, v3) = P(sk) ∏i P(vi | sk) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (Training): Given data, estimate P(sk), P(v1 | sk), P(v2 | sk) and P(v3 | sk)
Inference (Testing): Compute P(sk | v1, v2, v3) (there are algorithms to find these conditional probabilities, not covered here)
[Figure: generative model, arrows from sk to v1, v2, v3]
Discriminative:
P(sk, v1, …, v3) = P(v1) P(v2) P(v3) P(sk | v1, v2, v3)
The conditional probability of interest is “ready”: P(sk | v1, v2, v3), i.e., modeled directly
Estimation (Training): Given data, estimate P(v1), P(v2), P(v3), and P(sk | v1, v2, v3)
[Figure: discriminative model, arrows from v1, v2, v3 to sk]
Generative: do inference to find the probability of interest. Discriminative: the probability of interest is modeled directly.
Generative vs. discriminative
• Don’t worry… you can use both models
• If you are interested, let me know
• But in short:
– If the naïve Bayes assumption made by the generative method is not met (the conditional independencies do not hold), the discriminative method can have an edge
– But the generative model may converge faster
– Generative learning can sometimes be more efficient than discriminative learning, at least when the number of features is large compared to the number of samples
Graphical Models
• Provide a convenient framework for visualizing conditional independencies
• Provide general inference algorithms
• Next, we’ll see a GM (the Hidden Markov Model) for POS tagging
Part-of-speech (English)
From Dan Klein’s cs 288 slides
Modified from Diane Litman's version of Steve Bird's notes
Terminology
• Tagging
– The process of associating labels with each token in a text
• Tags
– The labels
– Syntactic word classes
• Tag set
– The collection of tags used
Example
• Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
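The base/tag format above can be split into (word, tag) pairs with a one-liner; splitting on the last "/" also handles the final "./." token correctly. A minimal sketch:

```python
# Split whitespace-separated base/tag tokens into (word, tag) pairs.
# rsplit on the last "/" so tokens like "./." come apart correctly.
tagged = "These/DT findings/NNS should/MD be/VB useful/JJ for/IN ./."
pairs = [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]
```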
Part-of-speech (English)
From Dan Klein’s cs 288 slides
POS tagging vs. WSD
• Similar tasks: assign a POS vs. assign a word sense
– You should butter your toast
– Bread and butter
• Using a word as a noun or a verb involves a different meaning, as in WSD
• In practice the two topics, POS tagging and WSD, have been treated separately, because of their different nature and because the methods used are different
– Nearby structures are most useful for POS (e.g., is the preceding word a determiner?) but are of little use for WSD
– Conversely, quite distant content words are very effective for determining the semantic sense, but not the POS
Part-of-Speech Ambiguity
From Dan Klein’s cs 288 slides
[Slide example: one word tagged as particle, preposition, or adverb depending on context]
Part-of-Speech Ambiguity
Words that are highly ambiguous as to their part of speech tag
Sources of information
• Syntagmatic: the tags of the other words
– AT JJ NN is common
– AT JJ VBP is impossible (or unlikely)
• Lexical: look at the words themselves
– The → AT
– Flour is more likely to be a noun than a verb
– A tagger that always chooses the most common tag is 90% correct (often used as a baseline)
• Most taggers use both
Modified from Diane Litman's version of Steve Bird's notes
What does Tagging do?
1. Collapses distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
2. Introduces distinctions
• Ambiguities may be resolved
• e.g., deal tagged with NN or VB
3. Helps in classification and prediction
Modified from Diane Litman's version of Steve Bird's notes
Why POS?
• A word’s POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), pronunciation in text-to-speech (object vs. object, record), or both (wind)
– Helps in stemming: saw[v] → see, saw[n] → saw
– Limits the range of following words
– Can help select nouns from a document for summarization
– Basis for partial parsing (chunked parsing)
Why POS?
From Dan Klein’s cs 288 slides
Slide modified from Massimo Poesio's
Choosing a tagset
• The choice of tagset greatly affects the difficulty of the problem
• Need to strike a balance between
– Getting better information about context
– Making it possible for classifiers to do their job
Slide modified from Massimo Poesio's
Some of the best-known Tagsets
• Brown corpus: 87 tags
– (more when tags are combined)
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the BNC): 61 tags
• Lancaster C7: 145 tags!
NLTK
• Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.
Tagging methods
• Hand-coded
• Statistical taggers– N-Gram Tagging– HMM– (Maximum Entropy)
• Brill (transformation-based) tagger
Hand-coded Tagger
• The Regular Expression Tagger
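A hand-coded regular-expression tagger can be sketched in a few lines, in the spirit of NLTK's RegexpTagger: patterns are tried in order and the first match wins, with a catch-all default at the end. The pattern set below is an invented illustration, not the one from the lecture.

```python
import re

# Illustrative patterns: first match wins, last pattern is the default
patterns = [(r".*ing$", "VBG"),                    # gerunds
            (r".*ed$", "VBD"),                     # simple past
            (r".*s$", "NNS"),                      # plural nouns
            (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),      # cardinal numbers
            (r".*", "NN")]                         # default: noun

def regexp_tag(word):
    for pat, tag in patterns:
        if re.match(pat, word):
            return tag

tags = [regexp_tag(w) for w in ["running", "jumped", "cats", "42", "table"]]
```

Pattern order matters: a word like "sings" would hit the gerund-free ".*s$" rule only because the earlier patterns fail first.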
Unigram Tagger
• Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
– For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g., a frequent word) more often than it is used as a verb (e.g., I frequent this cafe).
P(tn | wn)
Unigram Tagger
• We train a UnigramTagger by specifying tagged sentence data as a parameter when we initialize the tagger. The training process involves inspecting the tag of each word and storing the most likely tag for each word in a dictionary inside the tagger.
• We must be careful not to test it on the same data. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text.
• Instead, we should split the data, training on 90% and testing on the remaining 10% (or 75% and 25%)
• Calculate performance on previously unseen text.
– Note: this is the general procedure for learning systems
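The train/test procedure above can be sketched without NLTK. The three-sentence corpus is invented for illustration, the split is a crude 2-sentences-train / 1-sentence-test stand-in for the 90/10 split, and unseen words default to NN.

```python
from collections import Counter, defaultdict

# Invented tagged corpus: list of sentences of (word, tag) pairs
corpus = [[("the", "DT"), ("race", "NN"), ("was", "VBD"), ("long", "JJ")],
          [("they", "PRP"), ("race", "VBP"), ("daily", "RB")],
          [("the", "DT"), ("race", "NN"), ("ended", "VBD")]]

train, test = corpus[:2], corpus[2:]          # crude 2/1 train-test split

counts = defaultdict(Counter)                 # counts[word][tag]
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

# most_common() sorts stably, so ties go to the tag seen first in training
model = {w: c.most_common()[0][0] for w, c in counts.items()}

def unigram_tag(word, default="NN"):          # unseen words default to NN
    return model.get(word, default)

accuracy = (sum(unigram_tag(w) == t for s in test for w, t in s)
            / sum(len(s) for s in test))
```

On the held-out sentence the tagger gets "the" and "race" right but misses the unseen word "ended", which shows why evaluation must use unseen text.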
N-Gram Tagging
• An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens
• A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.
Trigram tagger: P(tn | wn, tn-1, tn-2)
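An n-gram tagger's lookup can be sketched as a table keyed on the context. Below, a bigram-style context of (previous tag, current word); the counts are invented toy data around the "race" example, and an unseen context returns None (in NLTK this is where back-off to a simpler tagger would happen).

```python
from collections import Counter, defaultdict

# Invented counts for contexts (previous tag, word) -> tag
bigram_counts = defaultdict(Counter)
for prev_tag, word, tag in [("TO", "race", "VB"), ("TO", "race", "VB"),
                            ("DT", "race", "NN"), ("DT", "race", "NN")]:
    bigram_counts[(prev_tag, word)][tag] += 1

def bigram_tag(prev_tag, word):
    """Most frequent tag for this context; None if the context is unseen."""
    c = bigram_counts.get((prev_tag, word))
    return c.most_common()[0][0] if c else None
```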
N-Gram Tagging
• Why not 10-gram taggers?
N-Gram Tagging
• Why not 10-gram taggers?
• As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data.
• This is known as the sparse data problem, and it is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (related to the precision/recall trade-off)
• Next week: sparsity
Markov Model Tagger
• Bigram tagger
• Assumptions:
– Words are independent of each other
– A word’s identity depends only on its tag
– A tag depends only on the previous tag
• What does a GM with these assumptions look like?
Markov Model Tagger
[Figure: HMM, a tag chain t1 → t2 → … → tn, each ti emitting the word wi]
P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)
Markov Model Tagger Training
• For all tags ti do
– For all tags tj do
– P(tj | ti) = C(ti, tj) / C(ti)
– end
• For all tags ti do
– For all words wi do
– P(wi | ti) = C(wi, ti) / C(ti)
– end
where
C(ti, tj) = number of occurrences of ti followed by tj
C(wi, ti) = number of occurrences of wi that are labeled as ti
P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)
Markov Model Tagger Estimation
• Goal:
– Find the optimal tag sequence for a given sentence
– The Viterbi algorithm
P(t1, …, tn, w1, …, wn) = ∏i P(ti | ti-1) P(wi | ti)
t̂1..n = argmax over t1, …, tn of P(t1, …, tn | w1, …, wn)
Sequence free tagging?
From Dan Klein’s cs 288 slides
Sequence free tagging?
• Solution: maximum-entropy sequence models (MEMMs, maximum-entropy Markov models; CRFs, conditional random fields)
From Dan Klein’s cs 288 slides
Modified from Diane Litman's version of Steve Bird's notes
Rule-Based Tagger
• The linguistic complaint
– Where is the linguistic knowledge of a tagger?
– Just massive tables of numbers, e.g., P(tn | wn, tn-1, tn-2)
– Aren’t there any linguistic insights that could emerge from the data?
– Could instead use handcrafted sets of rules to tag input sentences; for example, if the input follows a determiner, tag it as a noun.
Slide modified from Massimo Poesio's
The Brill tagger(transformation-based tagger)
• An example of transformation-based learning
– Basic idea: do a quick job first (using frequency), then revise it using contextual rules
• Very popular (freely available, works fairly well)
– Probably the most widely used tagger (esp. outside NLP)
– … but not the most accurate: 96.6% / 82.0%
• A supervised method: requires a tagged corpus
Brill Tagging: In more detail
• Start with simple (less accurate) rules… learn better ones from a tagged corpus
– Tag each word initially with its most likely POS
– Examine the set of transformations to see which most improves the tagging decisions compared to the tagged corpus
– Re-tag the corpus using the best transformation
– Repeat until, e.g., performance doesn’t improve
– Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text
Slide modified from Massimo Poesio's
An example
• Examples:
– They are expected to race tomorrow.
– The race for outer space.
• Tagging algorithm:
1. Tag all uses of “race” as NN (the most likely tag in the Brown corpus)
• They are expected to race/NN tomorrow
• the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:
• They are expected to race/VB tomorrow
• the race/NN for outer space
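The transformation in step 2 can be sketched as a function over an initially tagged sentence. The rule shape ("change tag X to Y when the previous tag is Z") follows the example above; the initial tags for the other words are invented for illustration.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Brill-style transformation: from_tag -> to_tag when preceded by prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

# Initial (most-likely-tag) pass; surrounding tags invented for illustration
initial = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NR")]
fixed = apply_rule(initial, "NN", "VB", "TO")
```

Note that "the race/NN" is untouched, since its preceding tag is not TO; learning consists of searching for the rule that corrects the most errors against the tagged corpus.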
What gets learned? [from Brill 95]
Tag-triggered transformations; morphology-triggered transformations
Rules are linguistically interpretable
Tagging accuracies (overview)
From Dan Klein’s cs 288 slides
Tagging accuracies
From Dan Klein’s cs 288 slides
Tagging accuracies
• Taggers are already pretty good on WSJ journal text…
• What we need are taggers that work on other text!
• Performance depends on several factors:
– The amount of training data
– The tag set (the larger, the harder the task)
– Differences between the training and testing corpus
– Unknown words (for example, in technical domains)
Common Errors
From Dan Klein’s cs 288 slides
Next week
• What happens when P(tn | wn, tn-1, tn-2) = 0?
• Sparsity
• Methods to deal with it
– For example, back-off: if P(tn | wn, tn-1, tn-2) = 0, use P(tn | wn) instead
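The back-off idea previewed above can be sketched directly: consult the trigram table first, and fall back to the unigram table when the trigram context was never seen. Both tables below are invented toy numbers.

```python
# Invented toy tables for illustration
trigram = {("race", "TO", "DT"): {"VB": 1.0}}   # P(t | w, t-1, t-2)
unigram = {"race": {"NN": 0.98, "VB": 0.02}}    # P(t | w)

def backoff_tag_dist(word, prev_tag, prev_prev_tag):
    """Trigram estimate if the context was seen; otherwise back off to unigram."""
    dist = trigram.get((word, prev_tag, prev_prev_tag))
    return dist if dist else unigram.get(word, {})
```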
Administrativia
• Assignment 2 is out
– Due September 22
– Soon: grades and “best” solutions to assignment 1
• Reading for next class
– Chapter 6 of Statistical NLP