Part of Speech Tagging & Hidden Markov Models (Part 1)
Part of Speech Tagging
& Hidden Markov Models (Part 1)
Mitch Marcus
CIS 421/521
CIS 421/521 - Intro to AI 2
NLP Task I – Determining Part of Speech Tags
• Given a text, assign each token its correct part of speech
(POS) tag, given its context and a list of possible POS tags
for each word type
Word POS listing in Brown Corpus
heat noun verb
oil noun
in prep noun adv
a det noun noun-proper
large adj noun adv
pot noun
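The lookup table above can be sketched as a small Python dictionary (the tag lists are copied from the Brown Corpus listing; the variable names and sentence are illustrative):

```python
# Possible POS tags per word type, as listed in the Brown Corpus table above.
lexicon = {
    "heat": ["noun", "verb"],
    "oil": ["noun"],
    "in": ["prep", "noun", "adv"],
    "a": ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot": ["noun"],
}

# The tagger's job: pick one tag per token, in context.
sentence = "heat oil in a large pot".split()
ambiguous = [w for w in sentence if len(lexicon[w]) > 1]
print(ambiguous)  # the tokens that are ambiguous out of context
```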
What is POS tagging good for?
• Speech synthesis:
• How to pronounce “lead”?
• INsult vs. inSULT
• OBject vs. obJECT
• OVERflow vs. overFLOW
• DIScount vs. disCOUNT
• CONtent vs. conTENT
• Machine Translation
• translations of nouns and verbs are different
• Stemming for search
• Knowing a word is a V tells you it gets past tense, participles, etc.
• Can search for “walk” and get “walked”, “walking”, …
Equivalent Problem in Bioinformatics
• From a sequence of amino
acids (primary structure):
ATCPLELLLD
• Infer secondary structure
(features of the 3D
structure, like helices,
sheets, etc.):
HHHBBBBBC..
Figure from: http://www.particlesciences.com/news/technical-briefs/2009/protein-structure.html
Penn Treebank Tagset I
Tag Description Example
CC coordinating conjunction and
CD cardinal number 1, third
DT determiner the
EX existential there there is
FW foreign word d'oeuvre
IN preposition/subordinating conjunction in, of, like
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal could, will
NN noun, singular or mass table
NNS noun plural tables (supports)
NNP proper noun, singular John
NNPS proper noun, plural Vikings
Tag Description Example
PDT predeterminer both the boys
POS possessive ending friend 's
PRP personal pronoun I, me, him, he, it
PRP$ possessive pronoun my, his
RB adverb however, usually, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to to go, to him
UH interjection uhhuhhuhh
Penn Treebank Tagset II
Tag Description Example
VB verb, base form take (support)
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes (supports)
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
Penn Treebank Tagset III
NLP Task I – Determining Part of Speech Tags
• The Old Solution: depth-first search.
• If each of n word tokens has k tags on average, try the k^n combinations until one works.
• Machine Learning Solutions: Automatically learn
Part of Speech (POS) assignment.
• The best techniques achieve 97+% accuracy per word on new materials, given a POS-tagged training corpus of 10^6 tokens (with 3% error) over a set of ~40 POS tags (the tags on the last three slides)
Simple Statistical Approaches: Idea 1
Simple Statistical Approaches: Idea 2
For a string of words
W = w1w2w3…wn
find the string of POS tags
T = t1 t2 t3 …tn
which maximizes P(T|W)
• i.e., the most likely POS tag ti for each word wi given its surrounding context
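What “maximize P(T|W)” means can be shown by brute force: enumerate every tag sequence for the sentence and keep the best-scoring one. A minimal sketch, where `score` is a hypothetical stand-in for P(T|W) (a real tagger derives it from corpus statistics); it also makes the k^n cost of exhaustive search obvious:

```python
from itertools import product

# Possible tags per word (toy subset of the earlier lexicon).
tags_per_word = {"heat": ["noun", "verb"], "oil": ["noun"], "pot": ["noun"]}

def score(tag_seq):
    # Hypothetical scoring function standing in for P(T|W), for illustration only.
    return sum(1.0 if t == "noun" else 0.5 for t in tag_seq)

words = ["heat", "oil", "pot"]
# Enumerate all k^n tag combinations and take the argmax under score().
candidates = product(*(tags_per_word[w] for w in words))
best = max(candidates, key=score)
print(best)
```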
The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W):
Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.
One more time: A BOTEC Estimate of What Works
What parameters can we estimate with a million words of hand tagged training data?
• Assume a uniform distribution of 5000 words and 40 part-of-speech tags.
We can get reasonable estimates of
• Tag bigrams
• Word x tag pairs
Bayes Rule plus Markov Assumptions
yields a practical POS tagger!
I. By Bayes Rule
II. So we want to find
III. To compute P(W|T):
• use the chain rule + a Markov assumption
• Estimation requires word x tag and tag counts
IV. To compute P(T):
• use the chain rule + a slightly different Markov assumption
• Estimation requires tag unigram and bigram counts
P(T|W) = P(W|T)*P(T) / P(W)

argmax_T P(T|W) = argmax_T P(W|T)*P(T)
IV. To compute P(T):
Just like computing P(W) last lecture
I. By the chain rule,
II. Applying the 1st order Markov Assumption
Estimated using tag bigrams/tag unigrams!
P(T) = P(t1)*P(t2|t1)*P(t3|t1t2)*...*P(tn|t1...tn-1)

P(T) ≈ P(t1)*P(t2|t1)*P(t3|t2)*...*P(tn|tn-1)
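The first-order approximation above can be computed directly from tag unigram and bigram counts. A sketch; the counts and start distribution below are toy numbers, not from a real corpus:

```python
# Estimating P(T) under the 1st-order Markov assumption,
# using toy tag unigram/bigram counts (illustrative numbers only).
unigram = {"det": 100, "noun": 200, "verb": 150}
bigram = {("det", "noun"): 90, ("noun", "verb"): 60}
start = {"det": 0.5, "noun": 0.3, "verb": 0.2}  # assumed P(t1)

def p_tags(tags):
    # P(T) ≈ P(t1) * product of P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    p = start[tags[0]]
    for prev, cur in zip(tags, tags[1:]):
        p *= bigram.get((prev, cur), 0) / unigram[prev]
    return p

print(p_tags(["det", "noun", "verb"]))  # 0.5 * (90/100) * (60/200)
```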
III. To compute P(W|T):
I. Assume that the words wi are conditionally independent
given the tag sequence T = t1 t2 … tn:

P(W|T) = ∏_{i=1}^{n} P(wi|T)

II. Applying a zeroth-order Markov Assumption:

P(wi|T) ≈ P(wi|ti)

by which

P(W|T) ≈ ∏_{i=1}^{n} P(wi|ti)

So, for a given string W = w1w2w3…wn, the tagger needs to find
the string of tags T which maximizes P(T)*P(W|T)
Hidden Markov Models
This model is an instance of a Hidden Markov
Model. Viewed graphically:
[State-transition diagram over the tags Det, Adj, Noun, Verb, with transition probabilities (values shown include .3, .6, .02, .47, .7, .51, .1) and per-state emission tables:]

P(w|Det): a .4, the .4
P(w|Adj): good .02, low .04
P(w|Noun): price .001, deal .0001
Viewed as a generator, an HMM:

[The same diagram redrawn as a generator: each state emits a word according to its emission table (P(w|Det): the .4, a .4; P(w|Adj): low .04, good .02; P(w|Noun): deal .0001, price .001) and then transitions to the next state.]
Summary: Recognition using an HMM
I. By Bayes Rule:

P(T|W) = P(T)*P(W|T) / P(W)

II. We select the Tag sequence T that maximizes P(T|W):

argmax_T P(T|W)
  = argmax_T P(T)*P(W|T)
  = argmax_{t1…tn} π(t1) * ∏_{i=1}^{n-1} a(ti, ti+1) * ∏_{i=1}^{n} b(ti, wi)
Training and Performance
• To estimate the parameters of this model given an annotated training corpus, use the MLE (relative frequency estimates): a(ti, tj) = Count(ti, tj) / Count(ti) and b(ti, w) = Count(w tagged ti) / Count(ti)
• Because many of these counts are small, smoothing is necessary for best results…
• Such taggers typically achieve about 95-96% correct tagging, for the standard 40-tag POS set.
• A few tricks for unknown words increase accuracy to 97%.
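The MLE count-and-divide step might look like this on a toy tagged corpus (three tiny made-up sentences; smoothing and unknown-word handling are omitted for clarity):

```python
from collections import Counter

# Toy annotated corpus: sentences of (word, tag) pairs.
corpus = [[("the", "DT"), ("heat", "NN")],
          [("the", "DT"), ("pot", "NN")],
          [("heat", "VB"), ("oil", "NN")]]

tag_count, bigram_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    for i, (word, tag) in enumerate(sent):
        tag_count[tag] += 1
        emit_count[(tag, word)] += 1
        if i > 0:
            bigram_count[(sent[i - 1][1], tag)] += 1

# MLE: a(t, t') = C(t, t') / C(t),  b(t, w) = C(t tagged w) / C(t)
a = lambda t1, t2: bigram_count[(t1, t2)] / tag_count[t1]
b = lambda t, w: emit_count[(t, w)] / tag_count[t]
print(a("DT", "NN"), b("NN", "heat"))
```

Because many of these counts are small in a real corpus, the raw ratios above would be smoothed in practice, as the slide notes.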
POS from bigram and word-tag pairs?? A Practical Compromise
• Rich models often require vast amounts of data
• Well-estimated bad models often outperform badly estimated truer models
(Mutt & Jeff 1942)
Practical Tagging using HMMs
• Finding this maximum naively requires an exponential search through all possible tag strings T.
• However, there is a solution linear in sentence length, using dynamic programming: Viterbi decoding.
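A minimal Viterbi decoder sketch, under the assumption of a two-state model with hand-set probabilities (not learned from data). `delta` holds the best score of any tag path ending in each state, and back-pointers recover the argmax path:

```python
# Toy HMM: states, start distribution pi, transitions a, emissions b.
states = ["Det", "Noun"]
pi = {"Det": 0.6, "Noun": 0.4}
a = {("Det", "Noun"): 0.9, ("Det", "Det"): 0.1,
     ("Noun", "Det"): 0.3, ("Noun", "Noun"): 0.7}
b = {("Det", "the"): 0.4, ("Det", "deal"): 0.0,
     ("Noun", "the"): 0.0, ("Noun", "deal"): 0.2}

def viterbi(words):
    # delta[s] = best score of a tag path ending in state s
    delta = {s: pi[s] * b[(s, words[0])] for s in states}
    back = []
    for w in words[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] * a[(r, s)])
            delta[s] = prev[best_prev] * a[(best_prev, s)] * b[(s, w)]
            ptr[s] = best_prev
        back.append(ptr)
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "deal"]))
```

Each step considers every (previous state, current state) pair once per word, which is the O(N²T) cost rather than the exponential N^T of enumerating paths.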
The three basic HMM problems
Parameters of an HMM
• States: a set of states S = s1, …, sn
• Transition probabilities: A = a1,1, a1,2, …, an,n. Each ai,j represents the probability of transitioning from state si to sj.
• Emission probabilities: a set B of functions of the form bi(ot), which is the probability of observation ot being emitted by si
• Initial state distribution: π, where πi is the probability that si is a start state
(This and later slides follow classic formulation by Ferguson, as
published by Rabiner and Juang, as adapted by Manning and Schutze.
Note the change in notation!!)
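Under this formulation, λ = (A, B, π) can be bundled in one small container; the states and probabilities below are toy values for illustration:

```python
from dataclasses import dataclass

# The HMM parameters lambda = (A, B, pi), following this slide's notation.
@dataclass
class HMM:
    states: list   # S = s1, ..., sn
    A: dict        # A[(i, j)] = probability of transitioning from state i to j
    B: dict        # B[(i, o)] = probability that state i emits observation o
    pi: dict       # pi[i] = probability that state i is the start state

hmm = HMM(states=["Noun", "Verb"],
          A={("Noun", "Verb"): 0.7, ("Noun", "Noun"): 0.3,
             ("Verb", "Noun"): 0.6, ("Verb", "Verb"): 0.4},
          B={("Noun", "deal"): 0.2, ("Verb", "deal"): 0.1},
          pi={"Noun": 0.5, "Verb": 0.5})

# Sanity check: pi and each row of A are probability distributions.
assert abs(sum(hmm.pi.values()) - 1.0) < 1e-9
```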
The Three Basic HMM Problems
• Problem 1 (Evaluation): Given the observation sequence O = o1,…,oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?
• Problem 2 (Decoding): Given the observation sequence O and an HMM model λ, how do we find the state sequence that best explains the observations?
• Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Problem 1: Probability of an Observation Sequence
• Q: What is P(O | λ)?
• A: The sum of the probabilities of all possible state sequences in the HMM.
• Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
• (for T = 10 and N = 10, that's 10 billion different paths!!)
• Solution: dynamic programming, linear in T!
The Crucial Data Structure: The Trellis
Forward Probabilities:

α_t(i) = P(o1 … ot, qt = si | λ)

• For a given HMM λ, for some time t, what is the probability that the partial observation o1 … ot has been generated and that the state at time t is si?
• The Forward algorithm computes α_t(i) for 1 ≤ i ≤ N, 1 ≤ t ≤ T in time O(N²T) using the trellis
Forward Algorithm: Induction step
α_t(j) = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] · b_j(o_t)

where α_t(i) = P(o1 … ot, qt = si | λ)
Forward Algorithm
• Initialization (the probability that o1 has been generated and that the state is i at time t = 1):

α_1(i) = π_i · b_i(o1),  1 ≤ i ≤ N

• Induction:

α_t(j) = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] · b_j(o_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N

• Termination:

P(O | λ) = Σ_{i=1}^{N} α_T(i)
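The three steps above can be sketched directly; the two-state model and its probabilities below are toy values for illustration:

```python
# Forward algorithm sketch, following the slide's three steps.
states = [0, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]                       # A[i][j] = a_ij
B = [{"x": 0.5, "y": 0.5}, {"x": 0.1, "y": 0.9}]   # B[i][o] = b_i(o)

def forward(obs):
    # Initialization: alpha_1(i) = pi_i * b_i(o1)
    alpha = [pi[i] * B[i][obs[0]] for i in states]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                 for j in states]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

print(forward(["x", "y"]))
```

Only the N alpha values from the previous column of the trellis are kept at each step, which is what brings the cost down from N^T paths to O(N²T) operations.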
Forward Algorithm Complexity
• The naïve approach requires exponential time to evaluate all N^T state sequences
• The Forward algorithm, using dynamic programming, takes O(N²T) computations