The Tagging Task
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Uses:
- text-to-speech (how do we pronounce “lead”?)
- can write regexps like (Det) Adj* N+ over the output (see the sketch below)
- preprocessing to speed up the parser (but a little dangerous)
- if you know the tag, you can back off to it in other tasks
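A minimal Python sketch of that regexp idea, run directly over tagger output in word/TAG form; the sentence is the slide's example, but the particular regex encoding is just one illustrative way to write (Det) Adj* N+:

    import re

    # Tagger output in word/TAG form, as in the example above.
    tagged = "the/Det lead/N paint/N is/V unsafe/Adj"

    # (Det) Adj* N+ over the tags: optional determiner, any adjectives, one or more nouns.
    np = re.compile(r"(?:\S+/Det\s+)?(?:\S+/Adj\s+)*(?:\S+/N\b\s*)+")

    print([m.group().strip() for m in np.finditer(tagged)])
    # -> ['the/Det lead/N paint/N']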
Why Do We Care?
- The first statistical NLP task
- Been done to death by different methods
- Easy to evaluate (how many tags are correct?)
- Canonical finite-state task
- Can be done well with methods that look at local context
- Though should “really” do it by parsing!
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Degree of Supervision
- Supervised: Training corpus is tagged by humans
- Unsupervised: Training corpus isn’t tagged
- Partly supervised: Training corpus isn’t tagged, but you have a dictionary giving possible tags for each word

We’ll start with the supervised case and move to decreasing levels of supervision.
Current Performance
- How many tags are correct? About 97% currently (for English)
- But the baseline is already 90%
- Baseline is the performance of the stupidest possible method (a sketch follows the example below):
  - Tag every word with its most frequent tag
  - Tag unknown words as nouns
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
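A minimal sketch of that baseline in Python (the corpus format and function names are assumptions for illustration, not a reference implementation):

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        """tagged_corpus: iterable of (word, tag) pairs from a hand-tagged corpus."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        # most frequent tag for each known word
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def baseline_tag(words, most_frequent_tag):
        """Tag each known word with its most frequent tag; tag unknown words as nouns."""
        return [(w, most_frequent_tag.get(w, "N")) for w in words]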
What Should We Look At?
Bill directed a cortege of autos through the dunes
PN   Verb     Det Noun   Prep Noun Prep    Det Noun      (correct tags)

[figure: beneath each word, a column of some possible tags for that word (maybe more), e.g. PN or Verb for “Bill”, Adj or Verb for “directed”]

Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too …
Three Finite-State Approaches
Noisy Channel Model (statistical)

[figure: real language Y = part-of-speech tags (n-gram model); a noisy channel replaces tags with words, yielding the observed string X = text; we want to recover Y from X]
Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters
Review: Noisy Channel
[figure: real language Y passes through a noisy channel to give the observed string X]

p(Y) * p(X | Y) = p(X, Y)

We want to recover y ∈ Y from x ∈ X: choose the y that maximizes p(y | x), or equivalently p(x, y).
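A minimal sketch of that decision rule in Python; the toy distributions below are made up for illustration, not taken from the slides:

    # Toy source model p(y) and channel model p(x | y); values are illustrative only.
    p_y = {"A": 0.6, "B": 0.4}
    p_x_given_y = {("C", "A"): 0.1, ("C", "B"): 0.8}

    def decode(x):
        # choose the y that maximizes p(x, y) = p(y) * p(x | y)
        return max(p_y, key=lambda y: p_y[y] * p_x_given_y.get((x, y), 0.0))

    print(decode("C"))   # -> 'B'  (0.4 * 0.8 beats 0.6 * 0.1)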
Review: Noisy Channel
p(Y) * p(X | Y) = p(X, Y), computed by composition: p(Y) .o. p(X | Y) = p(X, Y)

[figure: a weighted acceptor for p(Y) composed with a weighted transducer for p(X | Y), giving a weighted transducer for p(X, Y)]

Note p(x, y) sums to 1. Suppose x = “C”; what is best “y”?
Review: Noisy Channel
p(Y) * p(X | Y) * (X = x)?  =  p(x, Y)
As composition: p(Y) .o. p(X | Y) .o. (X = x)  =  p(x, Y)

Restrict just to the paths compatible with the output “C”.
Noisy Channel for Tagging
p(Y) * p(X | Y) * (X = x)?  =  p(x, Y)

As composition:
- p(Y): acceptor giving p(tag sequence) (“Markov Model”)
- .o. p(X | Y): transducer from tags to words (“Unigram Replacement”)
- .o. (X = x): acceptor for the observed words (“straight line”)
- = p(x, Y): transducer that scores candidate tag seqs on their joint probability with the observed words; pick the best path
Markov Model

[figure: a Markov model over tags, drawn as a graph with states Start, Det, Adj, Noun, Verb, Prep, Stop; arcs carry transition probabilities, among them Start -> Det 0.8, Det -> Adj 0.3, Det -> Noun 0.7, Adj -> Adj 0.4, Adj -> Noun 0.5, Noun -> Stop 0.2]

p(tag seq):
Start Det Adj Adj Noun Stop  =  0.8 * 0.3 * 0.4 * 0.5 * 0.2
Markov Model as an FSA

[figure: the same Markov model as a finite-state acceptor; each arc is labeled with the tag it reads plus its probability, e.g. Det 0.8 out of Start, Adj 0.3 and Noun 0.7 out of Det, Adj 0.4 and Noun 0.5 out of Adj, and a 0.2 stopping arc out of Noun]

p(tag seq):
Start Det Adj Adj Noun Stop  =  0.8 * 0.3 * 0.4 * 0.5 * 0.2
Markov Model (tag bigrams)

[figure: only the states and arcs used by the example path, one arc per tag bigram: Start -(Det 0.8)-> Det -(Adj 0.3)-> Adj -(Adj 0.4)-> Adj -(Noun 0.5)-> Noun -(0.2)-> Stop]

p(tag seq):
Start Det Adj Adj Noun Stop  =  0.8 * 0.3 * 0.4 * 0.5 * 0.2
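A minimal sketch of that computation in Python; the bigram table holds just the arc probabilities shown in the figure:

    # Tag-bigram probabilities read off the figure (only the arcs shown there).
    p_bigram = {
        ("Start", "Det"): 0.8,
        ("Det", "Adj"): 0.3,
        ("Adj", "Adj"): 0.4,
        ("Adj", "Noun"): 0.5,
        ("Noun", "Stop"): 0.2,
    }

    def p_tag_seq(tags, p_bigram):
        """Product of tag-bigram probabilities along the path."""
        p = 1.0
        for prev, cur in zip(tags, tags[1:]):
            p *= p_bigram.get((prev, cur), 0.0)
        return p

    print(round(p_tag_seq(["Start", "Det", "Adj", "Adj", "Noun", "Stop"], p_bigram), 6))
    # -> 0.0096  (= 0.8 * 0.3 * 0.4 * 0.5 * 0.2)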
Noisy Channel for Tagging
p(Y) * p(X | Y) * p(x | X)  =  p(x, Y)

The composed transducer scores candidate tag seqs on their joint probability with the observed words; we should pick the best path.

Observed words: the cool directed autos

[figure: the Markov-model FSA over tags (states Start, Det, Adj, Noun, Verb, Prep, Stop; arcs such as Det 0.8, Adj 0.3, Noun 0.7, Adj 0.4, Noun 0.5, 0.2) shown alongside unigram-replacement arcs such as Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/0.000001, …]
Unigram Replacement Model
p(word seq | tag seq)

[figure: the unigram replacement transducer; each arc replaces a tag by a word, weighted by p(word | tag): Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, …, Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/0.000001, …; each tag's word probabilities sum to 1]
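A minimal sketch of the unigram replacement score in Python; the table holds only the p(word | tag) values visible on the slide:

    # p(word | tag) values shown on the slide (a tiny fragment of the full model).
    p_word_given_tag = {
        ("the", "Det"): 0.4, ("a", "Det"): 0.6,
        ("cool", "Adj"): 0.003, ("directed", "Adj"): 0.0005, ("cortege", "Adj"): 0.000001,
        ("Bill", "Noun"): 0.002, ("autos", "Noun"): 0.001, ("cortege", "Noun"): 0.000001,
    }

    def p_words_given_tags(words, tags, p_word_given_tag):
        """Unigram replacement: each word depends only on its own tag."""
        p = 1.0
        for w, t in zip(words, tags):
            p *= p_word_given_tag.get((w, t), 0.0)
        return p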
Compose

p(tag seq)

[figure: the Markov-model FSA over tags (states Start, Det, Adj, Noun, Verb, Prep, Stop; arcs Det 0.8, Adj 0.3, Noun 0.7, Adj 0.4, Noun 0.5, 0.2) is composed with the unigram replacement transducer (Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/0.000001, …)]
Compose

p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)

[figure: composing the Markov model with the unigram replacement transducer gives a machine whose arcs carry a tag:word pair and the product of the two probabilities, e.g. out of Start: Det:a 0.48, Det:the 0.32; from Det into Adj: Adj:cool 0.0009, Adj:directed 0.00015, Adj:cortege 0.000003; on the Adj loop: Adj:cool 0.0012, Adj:directed 0.00020, Adj:cortege 0.000004; plus N:cortege, N:autos, … arcs into Noun]
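Each composed arc's weight is just the tag-bigram probability times p(word | tag); a quick check of the slide's numbers in Python:

    # composed arc weight = tag-bigram probability * word emission probability
    assert round(0.8 * 0.6, 6)    == 0.48      # Start->Det reading "a"
    assert round(0.8 * 0.4, 6)    == 0.32      # Start->Det reading "the"
    assert round(0.3 * 0.003, 6)  == 0.0009    # Det->Adj reading "cool"
    assert round(0.4 * 0.003, 6)  == 0.0012    # Adj->Adj reading "cool"
    assert round(0.4 * 0.0005, 6) == 0.0002    # Adj->Adj reading "directed" (0.00020 on the slide)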
Observed Words as Straight-Line FSA
word seq: the cool directed autos

[figure: a straight-line FSA with one arc per observed word]
Compose with: the cool directed autos

p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)

[figure: the composed tagging transducer from the previous slide (arcs Det:a 0.48, Det:the 0.32, Adj:cool 0.0009, Adj:directed 0.00015, Adj:cortege 0.000003, Adj:cool 0.0012, Adj:directed 0.00020, Adj:cortege 0.000004, N:cortege, N:autos, …), about to be composed with the straight-line FSA for “the cool directed autos”]
Compose with: the cool directed autos

p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)

[figure: after composing with the observed sentence, only the compatible arcs survive, e.g. Det:the 0.32, Adj:cool 0.0009, Adj:directed 0.00020, N:autos; the Adj loop is gone]

Why did this loop go away?
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)

[figure: the same composed machine, with the arcs Det:the 0.32, Adj:cool 0.0009, Adj:directed 0.00020, N:autos on the best path]

The best path:
Start Det Adj Adj Noun Stop  =  0.32 * 0.0009 * …
      the cool directed autos
In Fact, Paths Form a “Trellis”
p(word seq, tag seq)

[figure: a trellis with Start on the left, Stop on the right, and a column of states {Det, Adj, Noun} for each of the four observed words; arcs such as Adj:directed connect adjacent columns]

The best path:
Start Det Adj Adj Noun Stop  =  0.32 * 0.0009 * …
      the cool directed autos
The Trellis Shape Emerges from the Cross-Product Construction for Finite-State Composition

[figure: the trellis arises as the cross product of the tagging machine's states (0, 1, 2, 3, 4) with the positions of the straight-line machine for the 4 observed words (0, 1, 2, 3, 4); its states are the pairs (0,0); (1,1) (2,1) (3,1); (1,2) (2,2) (3,2); (1,3) (2,3) (3,3); (1,4) (2,4) (3,4); (4,4). All paths in the straight-line machine are 4 words, so all paths here must have 4 words on the output side.]
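A minimal sketch of that cross-product construction in Python (the arc-list encoding is an assumption for illustration; emit is a p(word | tag) table like the fragment above):

    def trellis(tag_arcs, emit, words):
        """Cross product of tag-machine states with sentence positions.
        tag_arcs: (src_state, dst_state, tag, p_transition) tuples;
        emit: dict mapping (word, tag) -> p(word | tag).
        Each composed arc goes (src, i) -> (dst, i + 1) and multiplies the two weights."""
        arcs = []
        for i, w in enumerate(words):
            for src, dst, tag, p in tag_arcs:
                if (w, tag) in emit:   # arcs (and states) that cannot read w are never built
                    arcs.append(((src, i), (dst, i + 1), tag, w, p * emit[(w, tag)]))
        return arcs

The "(w, tag) in emit" test is also why the trellis on the next slides is not complete: arcs and states that cannot read the observed word simply never get built.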
Actually, Trellis Isn’t Complete

p(word seq, tag seq)

The best path:
Start Det Adj Adj Noun Stop  =  0.32 * 0.0009 * …
      the cool directed autos

[figure: the trellis again, shown with progressively more of it missing]

- The trellis has no Det -> Det or Det -> Stop arcs; why?
- The lattice is missing some other arcs; why?
- The lattice is missing some states; why?
Find best path from Start to Stop
- Use dynamic programming, like probabilistic parsing: what is the best path from Start to each node? (a sketch follows below)
- Work from left to right
- Each node stores its best path from Start (as a probability plus one backpointer)
- Special acyclic case of Dijkstra’s shortest-path algorithm
- Faster if some arcs/states are absent

[figure: the trellis from the previous slides]
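A minimal sketch of that dynamic program (the standard Viterbi algorithm) in Python, using p_bigram and p_word_given_tag tables like the fragments in the earlier sketches:

    def best_path(words, tags, p_bigram, p_word_given_tag):
        """Highest-probability tag path Start ... Stop for the given words.
        best[(i, t)] = (prob of the best path reaching tag t after i words, backpointer tag)."""
        best = {(0, "Start"): (1.0, None)}
        for i, w in enumerate(words, start=1):          # work from left to right
            for t in tags:
                e = p_word_given_tag.get((w, t), 0.0)
                cands = [(p * p_bigram.get((prev, t), 0.0) * e, prev)
                         for (j, prev), (p, _) in best.items() if j == i - 1]
                p_t, back = max(cands, default=(0.0, None))
                if p_t > 0.0:                           # absent arcs/states are never created
                    best[(i, t)] = (p_t, back)
        n = len(words)
        finals = [(p * p_bigram.get((t, "Stop"), 0.0), t)
                  for (i, t), (p, _) in best.items() if i == n]
        p_star, t = max(finals, default=(0.0, None))
        path, i = [], n
        while t is not None and t != "Start":           # follow backpointers right to left
            path.append(t)
            t = best[(i, t)][1]
            i -= 1
        return p_star, path[::-1]

With the table fragments from the earlier sketches and tags = ["Det", "Adj", "Noun"], this returns the path Det Adj Adj Noun for “the cool directed autos”, matching the best path shown on the slides.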
In Summary

- We are modeling p(word seq, tag seq)
- The tags are hidden, but we see the words
- Is tag sequence X likely with these words?
- The noisy channel model is a “Hidden Markov Model”:

Start PN Verb Det Noun Prep Noun Prep Det Noun
      Bill directed a cortege of autos through the dunes

[figure: tag-to-tag arrows carry probs from the tag bigram model (e.g. 0.4, 0.6) and tag-to-word arrows carry probs from unigram replacement (e.g. 0.001)]

- Find X that maximizes the probability product
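Concretely, the product being maximized factors as follows (the standard HMM joint probability, stated here in the slides' notation rather than copied from them):

  p(word seq, tag seq) = p(tag_1 | Start) * p(word_1 | tag_1) * p(tag_2 | tag_1) * p(word_2 | tag_2) * … * p(Stop | tag_n)

where each p(tag_i | tag_{i-1}) comes from the tag bigram model and each p(word_i | tag_i) from unigram replacement.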
Another Viewpoint

- We are modeling p(word seq, tag seq)
- Why not use chain rule + some kind of backoff?
- Actually, we are!

p( Start PN Verb Det … , Bill directed a … )
  = p(Start) * p(PN | Start) * p(Verb | Start PN) * p(Det | Start PN Verb) * …
  * p(Bill | Start PN Verb …) * p(directed | Bill, Start PN Verb Det …)
  * p(a | Bill directed, Start PN Verb Det …) * …
Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
Bill directed a cortege of autos through the dunes
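The backoff in use, a standard fact about HMMs rather than something spelled out on these slides: each tag is conditioned only on the previous tag, and each word only on its own tag, so the chain-rule factors above collapse to the tag bigram and unigram replacement probabilities:

  p(tag_i | Start tag_1 … tag_{i-1}, word_1 … word_{i-1})  ≈  p(tag_i | tag_{i-1})
  p(word_i | Start tag_1 … tag_i, word_1 … word_{i-1})  ≈  p(word_i | tag_i)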
Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters
Another FST Paradigm: Successive Fixups

- Like successive markups but alter
- Morphology
- Phonology
- Part-of-speech tagging
- …

[figure: a cascade of fixup transducers applied in sequence between input and output]
Transformation-Based Tagging (Brill 1995)

[figure from Brill’s thesis]
Transformations Learned

[figure from Brill’s thesis]

Compose this cascade of FSTs. Gets a big FST that does the initial tagging and the sequence of fixups “all at once.”

Baseline Tag*
NN @ VB // TO _
VBP @ VB // ... _
etc.
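A minimal sketch of applying one such fixup rule after the baseline pass (Python; the list encoding and the example phrase are illustrative assumptions, not Brill's code):

    def apply_fixup(tagged, from_tag, to_tag, prev_tag):
        """Rewrite from_tag as to_tag whenever the preceding tag is prev_tag,
        i.e. a rule of the form  from_tag @ to_tag // prev_tag _"""
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == from_tag and out[i - 1][1] == prev_tag:
                out[i] = (word, to_tag)
        return out

    # NN @ VB // TO _ : suppose the baseline tagged "increase" as NN after "to"/TO
    tagged = [("to", "TO"), ("increase", "NN"), ("grants", "NNS")]
    print(apply_fixup(tagged, "NN", "VB", "TO"))
    # -> [('to', 'TO'), ('increase', 'VB'), ('grants', 'NNS')]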
Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters
Variations
- Multiple tags per word
- Transformations to knock some of them out
- How to encode multiple tags and knockouts?

Use the above for partly supervised learning:
- Supervised: You have a tagged training corpus
- Unsupervised: You have an untagged training corpus
- Here: You have an untagged training corpus and a dictionary giving possible tags for each word