COMP790: Statistical NLP


POS Tagging (Chap. 10)

POS tagging
Goal: assign the right part of speech (noun, verb, …) to words in a text
  “The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN.”
Terminology
  POS, part-of-speech tag
  word class
  morphological class
  lexical tag
  grammatical tag

Why do POS Tagging?
Purpose:
  1st step to NLU
  easier than full NLU (results > 95% accuracy)
Useful for:
  speech recognition / synthesis (better accuracy)
    how to recognize/pronounce a word: CONtent (noun) vs. conTENT (adj)
  stemming in IR
    which morphological affixes the word can take
    adverb - ly = noun (friendly - ly = friend)
  IR and QA
    pick out nouns, which may be more important than other words in indexing documents or query analysis
  partial parsing/chunking (for IE)
    to find noun phrases/verb phrases

Tag sets
Different tag sets, depending on the purpose of the application
  45 tags in Penn Treebank
  62 tags in CLAWS with BNC corpus
  79 tags in Church (1991)
  87 tags in Brown corpus
  147 tags in C7 tagset
  258 tags in Tzoukermann and Radev (1995)


Tag set: Penn TreeBank

IN preposition or subordinating conjunct.

JJ adjective or numeral, ordinal

JJR adjective, comparative

NN noun, common, singular or mass

NNP noun, proper, singular

NNS noun, common, plural

TO "to" as preposition or infinitive marker

VB verb, base form

VBD verb, past tense

VBG verb, present participle or gerund

VBN verb, past participle

VBP verb, present tense, not 3rd p. singular

VBZ verb, present tense, 3rd p. singular

…

45 tags

Most word types are not ambiguous but...
Brown corpus (Francis & Kucera, 1982):
  11.5% of word types are ambiguous (>1 tag)
  40% of word tokens are ambiguous (>1 tag)

                          Nb word types
  Unambiguous (1 tag)     35 340
  Ambiguous (>1 tag)       4 100
    2 tags                 3 760
    3 tags                   264
    4 tags                    61
    5 tags                    12
    6 tags                     2
    7 tags                     1  (“still”)

...but most word types are rare…

Techniques for POS tagging
  rule-based tagging
    uses hand-written rules
  stochastic tagging
    uses probabilities computed from a training corpus
  transformation-based tagging
    uses rules learned automatically

Information sources for tagging
All techniques are based on the same observations…
Syntagmatic information:
  some tag sequences are more probable than others
  ART+ADJ+N is more probable than ART+ADJ+VB
Lexical information:
  knowing the word to be tagged gives a lot of information about the correct tag
  “table”: {noun, verb} but not {adj, prep, …}
  “rose”: {noun, adj, verb} but not {prep, ...}

Naïve POS tagging
using only syntagmatic patterns:
  Greene & Rubin (1971)
  accuracy of 77%
using the most-likely tag for each word:
  Charniak et al. (1993)
  accuracy of 90%
  much better, but not very good... 1 mistake every 10 words
  used as baseline for evaluation
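The most-likely-tag baseline can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the function names and the fallback tag for unseen words (`default_tag`) are assumptions, not part of the course material.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, default_tag="NN"):
    """Tag each word independently; unseen words get default_tag (an assumption)."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]

# Toy hand-tagged corpus (hypothetical data, for illustration only).
corpus = [
    [("the", "AT"), ("race", "NN"), ("is", "VBZ"), ("long", "JJ")],
    [("the", "AT"), ("race", "NN"), ("ended", "VBD")],
    [("they", "PPS"), ("race", "VB"), ("to", "TO"), ("the", "AT"), ("table", "NN")],
]
model = train_baseline(corpus)
tagged = tag_baseline(["They", "race", "to", "the", "table"], model)
print(tagged)
```

Because context is ignored, "race" after "to" still receives NN here, which is exactly the kind of mistake the later techniques try to correct.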


--> rule-based tagging uses hand-written rules





Rule-based POS tagging
Step 1: Assign each word all of its possible tags
  use a dictionary
Step 2: Use if-then rules to identify the correct tag in context (disambiguation rules)

Sample rules
N-IP rule: A tag N (noun) cannot be followed by a tag IP (interrogative pronoun)
  ... man who …
  man: {N}
  who: {RP, IP} --> {RP} relative pronoun
ART-V rule: A tag ART (article) cannot be followed by a tag V (verb)
  ... the book …
  the: {ART}
  book: {N, V} --> {N}
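The two steps and the two sample rules can be sketched as follows. This is a minimal sketch under stated assumptions: the mini-dictionary `LEXICON`, the encoding of rules as forbidden tag pairs and the function name are all hypothetical choices made for illustration.

```python
# Hypothetical mini-dictionary: every tag each word can take (Step 1).
LEXICON = {
    "the": {"ART"},
    "man": {"N"},
    "who": {"RP", "IP"},
    "book": {"N", "V"},
}

# Disambiguation rules (Step 2) encoded as forbidden (previous tag, tag) pairs.
FORBIDDEN_BIGRAMS = {("N", "IP"), ("ART", "V")}

def rule_based_tag(words):
    """Return the set of tags left for each word after applying the rules."""
    result = []
    prev_tags = None
    for word in words:
        candidates = set(LEXICON[word])
        # Only apply a rule when the previous word is already unambiguous.
        if prev_tags is not None and len(prev_tags) == 1:
            (prev,) = prev_tags
            kept = {t for t in candidates if (prev, t) not in FORBIDDEN_BIGRAMS}
            # Never eliminate every reading: keep the original set if no tag survives.
            candidates = kept or candidates
        result.append((word, candidates))
        prev_tags = candidates
    return result

tagged_book = rule_based_tag(["the", "book"])
tagged_who = rule_based_tag(["man", "who"])
print(tagged_book, tagged_who)
```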

--> stochastic tagging uses probabilities computed from a training corpus




Stochastic POS tagging
Assume that a word’s tag only depends on the previous tags (not following ones)
Use a training set (manually tagged corpus) to:
  learn the regularities of tag sequences
  learn the possible tags for a word
  model this info through a language model (n-gram)

Hidden Markov Model (HMM) Taggers
Goal: maximize P(word|tag) x P(tag|previous n tags)
P(word|tag): lexical information
  word/lexical likelihood
  probability that, given this tag, we have this word
  NOT the probability that this word has this tag
  modeled through a language model (word-tag matrix)
P(tag|previous n tags): syntagmatic information
  tag sequence likelihood
  probability that this tag follows these previous tags
  modeled through a language model (tag-tag matrix)

Tag sequence probability
P(tag|previous n tags)
  if we look (n-1) tags before to find the current tag --> n-gram model
trigram model
  chooses the most probable tag ti for word wi given:
    the previous 2 tags ti-2 & ti-1 and
    the current word wi
bigram model
  chooses the most probable tag ti for word wi given:
    the previous tag ti-1 and
    the current word wi
unigram model (just most-likely tag)
  chooses the most probable tag ti for word wi given:
    the current word wi

Example
“race” can be VB or NN
  “Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV”
  “People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN”
Let’s tag the word “race” in the 1st sentence with a bigram model.

Example (con’t)
Assuming the previous words have been tagged, we have:
  “Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow”
P(race|VB) x P(VB|TO) ?
  P(race|VB): given that we have a VB, how likely is the current word to be race
  P(VB|TO): given that the previous tag is TO, how likely is the current tag to be VB
P(race|NN) x P(NN|TO) ?
  P(race|NN): given that we have a NN, how likely is the current word to be race
  P(NN|TO): given that the previous tag is TO, how likely is the current tag to be NN

Example (con’t)
From the training corpus, we found that:
  P(NN|TO) = .021       // given that the previous tag is TO
                        // 2.1% chance that the current tag is NN
  P(VB|TO) = .34        // given that the previous tag is TO
                        // 34% chance that the current tag is VB
  P(race|NN) = .00041   // given that we have an NN
                        // 0.041% chance that this word is "race"
  P(race|VB) = .00003   // given that we have a VB
                        // 0.003% chance that this word is "race"
so:
  P(VB|TO) x P(race|VB) = .34 x .00003 ≈ .00001
  P(NN|TO) x P(race|NN) = .021 x .00041 ≈ .000009
so: VB is more probable!
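The comparison for "race" can be checked with a short Python sketch. The four probabilities are the ones quoted in the example; the dictionary layout and the function name `bigram_score` are assumptions made for illustration.

```python
# Probabilities quoted in the example, estimated from a training corpus.
transition = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}           # P(tag | previous tag)
emission = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}    # P(word | tag)

def bigram_score(word, tag, prev_tag):
    """P(word | tag) x P(tag | previous tag) for one candidate tag."""
    return emission[(word, tag)] * transition[(prev_tag, tag)]

scores = {tag: bigram_score("race", tag, "TO") for tag in ("VB", "NN")}
best_tag = max(scores, key=scores.get)
print(scores, best_tag)
```

The two products are close (about 1.02e-5 versus 8.61e-6), so the transition probability P(VB|TO) is what tips the decision towards VB.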

Example (con’t)
and by the way: race is a NN 98% of the time !!!
  P(VB|race) = 0.02
  P(NN|race) = 0.98 !!!
How are the probabilities found?
  using a training corpus of hand-tagged text
  long & meticulous work done by linguists

HMM Tagging
But HMM tagging tries to find:
  the best sequence of tags for a sentence
  not just the best tag for a single word
goal: maximize the probability of a tag sequence, given a word sequence
  i.e. choose the sequence of tags that maximizes P(tag sequence|word sequence)

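HMM Tagging (con’t)
$$\text{bestTagSeq} = \arg\max_{\text{tagSeq}} P(\text{tagSeq} \mid \text{wordSeq})$$
By Bayes law:
$$P(\text{tagSeq} \mid \text{wordSeq}) = \frac{P(\text{tagSeq}) \times P(\text{wordSeq} \mid \text{tagSeq})}{P(\text{wordSeq})}$$
wordSeq is given… so P(wordSeq) will be the same for all tagSeq, so we can drop it from the equation:
$$\text{bestTagSeq} = \arg\max_{\text{tagSeq}} P(\text{tagSeq}) \times P(\text{wordSeq} \mid \text{tagSeq})$$
$$(t_1,\dots,t_n)^* = \arg\max_{t_1,\dots,t_n} P(t_1,\dots,t_n) \times P(w_1,\dots,w_n \mid t_1,\dots,t_n)$$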

Assumptions in HMM Tagging
1. words are independent
$$P(w_1,\dots,w_n \mid t_1,\dots,t_n) = P(w_1 \mid t_1,\dots,t_n) \times P(w_2 \mid t_1,\dots,t_n) \times \dots \times P(w_n \mid t_1,\dots,t_n)$$
2. Markov assumption (approximation to short history)
  ex. with bigram approximation:
$$P(t_i \mid t_1,\dots,t_{i-1}) = P(t_i \mid t_{i-1})$$
3. probability of a word is only dependent on its tag
$$P(w_i \mid t_1,\dots,t_n) = P(w_i \mid t_i)$$
so
$$(t_1,\dots,t_n)^* = \arg\max_{t_1,\dots,t_n} P(t_1,\dots,t_n) \times P(w_1,\dots,w_n \mid t_1,\dots,t_n) = \arg\max_{t_1,\dots,t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \times P(w_i \mid t_i)$$
  P(t_i | t_{i-1}): state transition probability
  P(w_i | t_i): emission probability

Emission & transition probabilities
let
  N: number of possible tags (size of tag set)
  V: number of word types (vocabulary)
from a tagged training corpus, we compute the frequency of:
  Emission probabilities P(wi|ti)
    stored in an N x V matrix
    emission[i,j] = probability that tag i emits word j, i.e. P(word j | tag i)
  Transition probabilities P(ti|ti-1)
    stored in an N x N matrix
    transition[i,j] = probability that tag i follows tag j
In practice, these matrices are very sparse
  so these models are smoothed to avoid zero probabilities
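One way to estimate both tables from a tagged corpus is sketched below. Add-one (Laplace) smoothing is used purely as an illustrative choice; the slides do not say which smoothing method is meant. The toy corpus, the `<s>` start symbol and all names are assumptions.

```python
from collections import Counter

def estimate_hmm(tagged_sentences, start="<s>"):
    """Relative-frequency estimates with add-one smoothing (one possible choice)."""
    emit_counts, trans_counts = Counter(), Counter()
    tag_counts, prev_counts = Counter(), Counter()
    vocab, tags = set(), set()
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            emit_counts[(tag, word)] += 1
            trans_counts[(prev, tag)] += 1
            tag_counts[tag] += 1
            prev_counts[prev] += 1
            vocab.add(word)
            tags.add(tag)
            prev = tag
    V, N = len(vocab), len(tags)

    def emission(word, tag):       # smoothed P(word | tag)
        return (emit_counts[(tag, word)] + 1) / (tag_counts[tag] + V)

    def transition(tag, prev):     # smoothed P(tag | previous tag)
        return (trans_counts[(prev, tag)] + 1) / (prev_counts[prev] + N)

    return emission, transition, sorted(tags), sorted(vocab)

# Hypothetical toy corpus, for illustration only.
corpus = [
    [("the", "AT"), ("table", "NN")],
    [("the", "AT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
]
emission, transition, tags, vocab = estimate_hmm(corpus)
print(emission("race", "NN"), transition("NN", "AT"), emission("table", "VB"))
```

With smoothing, a pair never seen in training (such as "table" emitted by VB) still gets a small non-zero probability, so a single unseen event no longer forces a whole tag sequence to probability zero.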


Efficiency issues
finding the most probable tag sequence by scoring every possible sequence takes time exponential in the length of the sentence
for efficiency, we usually use the Viterbi algorithm
  for global maximisation, i.e. the best tag sequence
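To get a feel for the gap, the sketch below compares the number of complete tag sequences an exhaustive search would score with a rough count of the tag pairs a bigram Viterbi search examines. The function names are hypothetical, and the Viterbi figure is an order-of-magnitude count (words x tags squared), not an exact operation count.

```python
def brute_force_paths(num_tags, num_words):
    """Number of complete tag sequences an exhaustive search would have to score."""
    return num_tags ** num_words

def viterbi_steps(num_tags, num_words):
    """Rough count of (previous tag, current tag) pairs examined by bigram Viterbi."""
    return num_words * num_tags ** 2

T, n = 45, 20  # Penn Treebank tag set size and a 20-word sentence
print(brute_force_paths(T, n))
print(viterbi_steps(T, n))
```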

An example
Emission probabilities
  rows: vocabulary; columns: tags PN, VB, TO, IN, AT, NN; only the non-empty cells of each row are listed, left to right
  John    0.9  0.1
  likes   0.2  0.3
  to      0.5
  fish    0.1  0.8  0.3
  in      0.5  1
  the     1
  sea     0.3
Transition probabilities
  rows: first tag; columns: second tag PN, VB, TO, IN, AT, NN, None (last tag); only the non-empty cells of each row are listed, left to right
  PN               0.2  0.7  0.1
  VB               0.1  0.2  0.2  0.5
  TO               1
  IN               0.1  0.9
  AT               0.05  0.95
  NN               0.3  0.25  0.25  0.1  0.5  0.05
  None (1st tag)   0.7  0.2  0.1

State Transition Diagram (VMM)
[Figure: transition probabilities between the states start, PN, VB, TO, IN, AT, NN and end]

State Transition Diagram (HMM)
but the states are "invisible" (we only see the words)
[Figure: the same state transition diagram, in which each state also emits words (John, likes, to, fish, in, the, sea, …) with some probability]

The Viterbi Algorithm
best tag sequence for "John likes to fish in the sea"?
efficiently computes the most likely state sequence given a particular output sequence
based on dynamic programming

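A smaller example
[Figure: state diagram with the states start, q, r and end
  transitions: start --> q 1, q --> q 0.3, q --> r 0.7, r --> q 0.5, r --> r 0.5
  emissions: q emits b with 0.6 and a with 0.4; r emits b with 0.8 and a with 0.2]
What is the best sequence of states for the input string “bbba”?
Computing all possible paths and finding the one with the max probability is exponential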

A smaller example (con’t)
For each state, store the most likely sequence that could lead to it (and its probability)
Path probability matrix:
  an array of states versus time (tags versus words)
  that stores the prob. of being at each state at each time in terms of the prob. of being in each state at the preceding time

Best sequence, by input sequence / time:

                   ε --> b             b --> b              bb --> b               bbb --> a
leading to q
  coming from q    ε --> q  0.6        q --> q  0.108       qq --> q  0.01944      qrq --> q  0.012096
                   (1.0x0.6)           (0.6x0.3x0.6)        (0.108x0.3x0.6)        (0.1008x0.3x0.4)
  coming from r                        r --> q  0           qr --> q  0.1008       qrr --> q  0.02688
                                       (0x0.5x0.6)          (0.336x0.5x0.6)        (0.1344x0.5x0.4)
leading to r
  coming from q    ε --> r  0          q --> r  0.336       qq --> r  0.06048      qrq --> r  0.014112
                   (0x0.8)             (0.6x0.7x0.8)        (0.108x0.7x0.8)        (0.1008x0.7x0.2)
  coming from r                        r --> r  0           qr --> r  0.1344       qrr --> r  0.01344
                                       (0x0.5x0.8)          (0.336x0.5x0.8)        (0.1344x0.5x0.2)
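The path probability matrix for “bbba” can be reproduced with a small Viterbi sketch in Python. The transition and emission values are the ones used in the computations above; the dictionary layout and function signature are assumptions, and the transition into the end state is ignored here for simplicity.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its probability."""
    # vit[i][s]: probability of the best path ending in state s after i+1 symbols.
    vit = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[i][s]: best predecessor of state s at position i
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: vit[-1][p] * trans_p[p][s])
            scores[s] = vit[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        vit.append(scores)
        back.append(pointers)
    # Termination: pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: vit[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, vit[-1][last]

states = ["q", "r"]
start_p = {"q": 1.0, "r": 0.0}
trans_p = {"q": {"q": 0.3, "r": 0.7}, "r": {"q": 0.5, "r": 0.5}}
emit_p = {"q": {"a": 0.4, "b": 0.6}, "r": {"a": 0.2, "b": 0.8}}

best_path, best_prob = viterbi("bbba", states, start_p, trans_p, emit_p)
print(best_path, best_prob)
```

Only the best incoming path is kept for each state at each step, so the work grows linearly with the length of the input instead of exponentially.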

Viterbi for POS tagging
Let:
  n = nb of words in sentence to tag (nb of input tokens)
  T = nb of tags in the tag set (nb of states)
  vit = path probability matrix (viterbi)
    vit[i,j] = probability of being at state (tag) j at word i
  path = matrix to recover the nodes of the best path (best tag sequence)
    path[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1

// Initialization
vit[1,PERIOD] := 1.0             // pretend that there is a period before
                                 // our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD

Viterbi for POS tagging (con’t)
// Induction (build the path probability matrix)
for i := 1 to n step 1 do        // for all words in the sentence
  for all tags tj do             // for all possible tags
    // store the max prob of the path
    vit[i+1,tj] := max_(1≤k≤T) (vit[i,tk] x P(wi+1|tj) x P(tj|tk))
    // store the actual state
    path[i+1,tj] := argmax_(1≤k≤T) (vit[i,tk] x P(wi+1|tj) x P(tj|tk))
  end
end
// where:
//   vit[i,tk]  = probability of the best path leading to state tk at word i
//   P(wi+1|tj) = emission probability
//   P(tj|tk)   = state transition probability

// Termination and path-readout
bestState_(n+1) := argmax_(1≤j≤T) vit[n+1,j]
for j := n to 1 step -1 do       // for all the words in the sentence
  bestState_j := path[j+1, bestState_(j+1)]
end
P(bestState_1, …, bestState_n) := max_(1≤j≤T) vit[n+1,j]

Possible improvements
in bigram POS tagging, we condition a tag only on the preceding tag
why not...
  use more context (ex. use trigram model)
    more precise:
      “is clearly marked” --> verb, past participle
      “he clearly marked” --> verb, past tense
  combine trigram, bigram, unigram models
  condition on words too
but with an n-gram approach, this is too costly (too many parameters to model)
  transformation-based tagging...

--> transformation-based tagging uses rules learned automatically



Transformation-based tagging
Due to Eric Brill (1995)
basic idea:
  take a non-optimal sequence of tags and
  improve it successively by applying a series of well-ordered re-write rules
rule-based
  but rules are learned automatically by training on a pre-tagged corpus

An example
1. Assign to words their most likely tag
  P(NN|race) = .98
  P(VB|race) = .02
2. Change some tags by applying transformation rules

Rule: NN --> VB (noun --> verb)
  Context (trigger), apply the rule when: the previous tag is the preposition to
  Examples: go to sleep(VB) ? go to school(VB)
Rule: VBD --> VB (past tense --> base form)
  Context (trigger), apply the rule when: one of the previous 3 tags is a modal (MD)
  Examples: you may cut (VB)
Rule: JJR --> RBR (comparative adj --> comparative adv)
  Context (trigger), apply the rule when: next tag is an adjective (JJ)
  Examples: a more (RBR) valuable
Rule: VBP --> VB (present tense, not 3rd p. singular --> base form)
  Context (trigger), apply the rule when: one of the previous 2 words is “n’t”
  Examples: should (VB) n’t
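Applying a single tag-triggered rule, such as NN --> VB when the previous tag is TO, can be sketched as below. The helper names `apply_rule` and `previous_tag_is` are hypothetical, and the initial tags stand in for the output of a most-likely-tag tagger.

```python
def apply_rule(tags, from_tag, to_tag, trigger):
    """Rewrite from_tag as to_tag wherever trigger(tags, i) holds.

    The trigger is evaluated on the tags as they were before this rule ran.
    """
    return [
        to_tag if tag == from_tag and trigger(tags, i) else tag
        for i, tag in enumerate(tags)
    ]

def previous_tag_is(wanted):
    """Build a trigger that fires when the preceding tag equals `wanted`."""
    return lambda tags, i: i > 0 and tags[i - 1] == wanted

words = ["Secretariat", "is", "expected", "to", "race", "tomorrow"]
# Step 1: most likely tag for each word ("race" starts out as NN).
initial = ["NNP", "VBZ", "VBN", "TO", "NN", "ADV"]
# Step 2: apply the rule NN --> VB when the previous tag is TO.
retagged = apply_rule(initial, "NN", "VB", previous_tag_is("TO"))
print(list(zip(words, retagged)))
```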

Types of context
lots of latitude… can be:
  tag-triggered transformation
    The preceding/following word is tagged this way
    The word two before/after is tagged this way
    ...
  word-triggered transformation
    The preceding/following word is this word
    …
  morphology-triggered transformation
    The preceding/following word finishes with an s
    …
  a combination of the above
    The preceding word is tagged this way AND the following word is this word

Learning the transformation rules
Input: A corpus with each word:
  correctly tagged (for reference)
  tagged with its most frequent tag (C0)
Output: A bag of transformation rules
Algorithm:
  Instantiates a small set of hand-written templates (generic rules) by comparing the reference corpus to C0
  Change tag a to tag b when…
    The preceding/following word is tagged z
    The word two before/after is tagged z
    One of the 2 preceding/following words is tagged z
    One of the 2 preceding words is z
    …


Learning the transformation rules (con't)

Run the initial tagger and compile types of errors <incorrect tag, desired tag, # of occurrences>

For each error type, instantiate all templates to generate candidate transformations

Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces

Save the transformation that yields the greatest improvement

Stop when no transformation can reduce the error rate by a predetermined threshold

Example
if the initial tagger mistags 159 words as verbs instead of nouns
  create the error triple: <verb, noun, 159>
Suppose template #3 is instantiated as the rule:
  Change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner.
When this rule is applied to the corpus:
  it corrects 98 of the 159 errors
  but it also creates 18 new errors
Error reduction is 98-18=80

Learning the best transformations
input:
  a corpus with each word:
    correctly tagged (for reference)
    tagged with its most frequent tag (C0)
  a bag of unordered transformation rules
output:
  an ordering of the best transformation rules

Learning the best transformations (con’t)
let:
  E(Ck) = nb of words incorrectly tagged in the corpus at iteration k
  v(C) = the corpus obtained after applying rule v on the corpus C
  ε = minimum reduction in the number of errors required to keep a rule

for k := 0 step 1 do
  bt := argmin_t E(t(Ck))          // find the transformation t that minimizes
                                   // the error rate
  if (E(Ck) - E(bt(Ck))) < ε       // if bt does not improve the tagging significantly
    then goto finished
  Ck+1 := bt(Ck)                   // apply rule bt to the current corpus
  Tk+1 := bt                       // bt will be kept as the current transformation
                                   // rule
end
finished: the sequence T1 T2 … Tk is the ordered list of transformation rules
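The greedy loop can be sketched in Python as follows. This is a simplified illustration, not Brill's implementation: rules are restricted to the single template "change tag a to tag b when the previous tag is z", the corpus is one flat tag sequence, and the data and names are hypothetical.

```python
def errors(tags, gold):
    """E(C): number of positions where the current tag differs from the reference."""
    return sum(t != g for t, g in zip(tags, gold))

def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag): rewrite from_tag when preceded by prev_tag."""
    from_tag, to_tag, prev_tag = rule
    return [
        to_tag if tag == from_tag and i > 0 and tags[i - 1] == prev_tag else tag
        for i, tag in enumerate(tags)
    ]

def learn_rules(initial, gold, candidates, epsilon=1):
    """Greedy loop: keep the best rule while it removes at least epsilon errors."""
    current, learned = list(initial), []
    while candidates:
        best = min(candidates, key=lambda r: errors(apply_rule(current, r), gold))
        gain = errors(current, gold) - errors(apply_rule(current, best), gold)
        if gain < epsilon:
            break
        current = apply_rule(current, best)
        learned.append(best)
    return learned, current

# Hypothetical toy corpus, flattened to one tag sequence.
gold    = ["TO", "VB", "AT", "NN", "TO", "VB", "MD", "VB"]
initial = ["TO", "NN", "AT", "NN", "TO", "NN", "MD", "NN"]
candidates = [("NN", "VB", "TO"), ("NN", "VB", "MD"), ("NN", "VB", "AT")]

learned, final = learn_rules(initial, gold, candidates)
print(learned, errors(final, gold))
```

The order of the learned rules matters: each rule is scored against the corpus as already modified by the rules chosen before it.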

Strengths of transformation-based tagging
exploits a wider range of lexical and syntactic regularities
can look at a wider context
  condition the tags on preceding/next words, not just preceding tags
  can use more context than bigram or trigram
transformation rules are easier to understand than matrices of probabilities

Evaluation of POS taggers
compared with a gold standard of human performance
metric:
  accuracy = % of tags that are identical to gold standard
most taggers: ~96-97% accuracy
must compare accuracy to:
  ceiling (best possible results)
    how do human annotators score compared to each other? (96-97%)
    so systems are not bad at all!
  baseline (worst possible results)
    what if we take the most-likely tag (unigram model) regardless of previous tags? (90-91%)
    so anything less is really bad
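The accuracy metric itself is a one-line computation; a minimal sketch follows, with hypothetical tag sequences and function name.

```python
def accuracy(predicted, gold):
    """Share of tokens whose predicted tag equals the gold-standard tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical example: one VBD/VBN confusion out of seven tokens.
gold      = ["AT", "NN", "VBD", "NNS", "IN", "AT", "NN"]
predicted = ["AT", "NN", "VBN", "NNS", "IN", "AT", "NN"]
score = accuracy(predicted, gold)
print(f"{100 * score:.1f}%")
```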

More on tagger accuracy
is 95% good?
  that’s 5 mistakes every 100 words
  if on average a sentence is 20 words, that’s 1 mistake per sentence
when comparing tagger accuracy, beware of:
  size of training corpus
    the bigger, the better the results
  difference between training & testing corpora (genre, domain…)
    the closer, the better the results
  size of tag set
    prediction versus classification
  unknown words
    the more unknown words (not in dictionary), the worse the results

Error analysis of POS taggers
Where did the tagger go wrong?
Use a confusion matrix / contingency table
  rows: correct tag; columns: tags assigned by the tagger (Penn Treebank tags); values in %

  correct tag    DT     NNP    JJ     NN     VBD    VBN    …    Total
  DT             99.4   .3     0      0      .3     0           100
  NNP            0      90.2   3.3    4.1    0      0           100
  JJ             0      .1     93.9   1.8    .1     1.9         100
  NN             0      .5     2.2    95.5   .2     0           100
  VBD            0      0      .3     1.4    96.0   2.5         100
  VBN            0      0      1.9    0      3.4    93.3        100
  …

Most confused:
  NN (noun) vs. NNP (proper noun) vs. JJ (adjective)
  VBD (verb, past tense) vs. VBN (past participle) vs. JJ (adjective)
    he chopped carrots, the carrots were chopped, the chopped carrots
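A confusion matrix of this kind can be built by counting (correct tag, assigned tag) pairs and normalising each row to percentages. The sketch below uses hypothetical data and a hypothetical function name.

```python
from collections import Counter, defaultdict

def confusion_rows(gold, predicted):
    """For each correct tag, the percentage of its tokens assigned each tag."""
    counts = defaultdict(Counter)
    for g, p in zip(gold, predicted):
        counts[g][p] += 1
    return {
        g: {p: 100.0 * c / sum(row.values()) for p, c in row.items()}
        for g, row in counts.items()
    }

# Hypothetical gold and predicted tags with some VBD/VBN confusions.
gold      = ["VBD", "VBN", "JJ", "VBD", "VBN", "NN", "VBD", "VBD"]
predicted = ["VBD", "VBD", "JJ", "VBD", "VBN", "NN", "VBN", "VBD"]
matrix = confusion_rows(gold, predicted)
for correct_tag, row in matrix.items():
    print(correct_tag, row)
```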

Major difficulties in POS tagging
Unknown words (proper names)
  because we do not know the set of tags it can take
  and knowing this takes you a long way (cf. baseline POS tagger)
  possible solutions:
    assign all possible tags, with a probability distribution identical to that of the lexicon as a whole
    use morphological cues to infer possible tags
      ex. words ending in -ed are likely to be past tense verbs or past participles
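The morphological-cue idea can be sketched as a small suffix-based guesser. The suffix list, the capitalisation heuristic and the default tag are illustrative assumptions; real taggers typically learn such cues from training data.

```python
# Hypothetical suffix cues, checked in order.
SUFFIX_TAGS = [
    ("ed", ["VBD", "VBN"]),   # past tense verb or past participle
    ("ing", ["VBG"]),         # present participle or gerund
    ("ly", ["RB"]),           # adverb
    ("s", ["NNS", "VBZ"]),    # plural noun or 3rd person singular verb
]

def guess_tags(word, default=("NN",)):
    """Guess candidate tags for a word that is missing from the dictionary."""
    if word[:1].isupper():
        return ["NNP"]        # capitalised unknown words are often proper names
    for suffix, tags in SUFFIX_TAGS:
        if word.endswith(suffix):
            return list(tags)
    return list(default)

print(guess_tags("chopped"), guess_tags("Secretariat"), guess_tags("table"))
```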

Frequently confused tag pairs
  preposition vs. particle
    <running> <up> a hill (prep) / <running up> a bill (particle)
  verb, past tense vs. past participle vs. adjective
