POS Tagging: Introduction
description
Transcript of POS Tagging: Introduction
1
POS Tagging Introduction
Heng Ji
hengjicsqccunyeduFeb 2 2008
Acknowledgement some slides from Ralph Grishman Nicolas Nicolov JampM
239
Some Administrative Stuff Assignment 1 due on Feb 17 Textbook required for assignments and final
exam
339
Outline
Parts of speech (POS) Tagsets POS Tagging
Rule-based tagging Markup Format Open source Toolkits
439
What is Part-of-Speech (POS) Generally speaking Word Classes (=POS)
Verb Noun Adjective Adverb Article hellip We can also include inflection
Verbs Tense number hellip Nouns Number propercommon hellip Adjectives comparative superlative hellip hellip
539
Parts of Speech 8 (ish) traditional parts of speech
Noun verb adjective preposition adverb article interjection pronoun conjunction etc
Called parts-of-speech lexical categories word classes morphological classes lexical tags
Lots of debate within linguistics about the number nature and universality of these
Wersquoll completely ignore this debate
639
7 Traditional POS Categories N noun chair bandwidth
pacing V verb study debate munch ADJ adj purple tall ridiculous ADV adverb unfortunately slowly P preposition of by to PRO pronoun I me mine DET determiner the a that those
739
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag
the DETkoala Nput Vthe DETkeys Non Pthe DETtable N
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
239
Some Administrative Stuff Assignment 1 due on Feb 17 Textbook required for assignments and final
exam
339
Outline
Parts of speech (POS) Tagsets POS Tagging
Rule-based tagging Markup Format Open source Toolkits
439
What is Part-of-Speech (POS) Generally speaking Word Classes (=POS)
Verb Noun Adjective Adverb Article hellip We can also include inflection
Verbs Tense number hellip Nouns Number propercommon hellip Adjectives comparative superlative hellip hellip
539
Parts of Speech 8 (ish) traditional parts of speech
Noun verb adjective preposition adverb article interjection pronoun conjunction etc
Called parts-of-speech lexical categories word classes morphological classes lexical tags
Lots of debate within linguistics about the number nature and universality of these
Wersquoll completely ignore this debate
639
7 Traditional POS Categories N noun chair bandwidth
pacing V verb study debate munch ADJ adj purple tall ridiculous ADV adverb unfortunately slowly P preposition of by to PRO pronoun I me mine DET determiner the a that those
739
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag
the DETkoala Nput Vthe DETkeys Non Pthe DETtable N
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
339
Outline
Parts of speech (POS) Tagsets POS Tagging
Rule-based tagging Markup Format Open source Toolkits
439
What is Part-of-Speech (POS) Generally speaking Word Classes (=POS)
Verb Noun Adjective Adverb Article hellip We can also include inflection
Verbs Tense number hellip Nouns Number propercommon hellip Adjectives comparative superlative hellip hellip
539
Parts of Speech 8 (ish) traditional parts of speech
Noun verb adjective preposition adverb article interjection pronoun conjunction etc
Called parts-of-speech lexical categories word classes morphological classes lexical tags
Lots of debate within linguistics about the number nature and universality of these
Wersquoll completely ignore this debate
639
7 Traditional POS Categories N noun chair bandwidth
pacing V verb study debate munch ADJ adj purple tall ridiculous ADV adverb unfortunately slowly P preposition of by to PRO pronoun I me mine DET determiner the a that those
739
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag
the DETkoala Nput Vthe DETkeys Non Pthe DETtable N
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
439
What is Part-of-Speech (POS) Generally speaking Word Classes (=POS)
Verb Noun Adjective Adverb Article hellip We can also include inflection
Verbs Tense number hellip Nouns Number propercommon hellip Adjectives comparative superlative hellip hellip
539
Parts of Speech 8 (ish) traditional parts of speech
Noun verb adjective preposition adverb article interjection pronoun conjunction etc
Called parts-of-speech lexical categories word classes morphological classes lexical tags
Lots of debate within linguistics about the number nature and universality of these
Wersquoll completely ignore this debate
639
7 Traditional POS Categories N noun chair bandwidth
pacing V verb study debate munch ADJ adj purple tall ridiculous ADV adverb unfortunately slowly P preposition of by to PRO pronoun I me mine DET determiner the a that those
739
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag
the DETkoala Nput Vthe DETkeys Non Pthe DETtable N
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
539
Parts of Speech 8 (ish) traditional parts of speech
Noun verb adjective preposition adverb article interjection pronoun conjunction etc
Called parts-of-speech lexical categories word classes morphological classes lexical tags
Lots of debate within linguistics about the number nature and universality of these
Wersquoll completely ignore this debate
639
7 Traditional POS Categories N noun chair bandwidth
pacing V verb study debate munch ADJ adj purple tall ridiculous ADV adverb unfortunately slowly P preposition of by to PRO pronoun I me mine DET determiner the a that those
739
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag
the DETkoala Nput Vthe DETkeys Non Pthe DETtable N
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
639
7 Traditional POS Categories N noun chair bandwidth
pacing V verb study debate munch ADJ adj purple tall ridiculous ADV adverb unfortunately slowly P preposition of by to PRO pronoun I me mine DET determiner the a that those
739
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag
the DETkoala Nput Vthe DETkeys Non Pthe DETtable N
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
739
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag
the DETkoala Nput Vthe DETkeys Non Pthe DETtable N
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
839
Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of
Wall Street Journal 1M words 46 tags Some particularities
to TO not disambiguated Auxiliaries and verbs not distinguished
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
939
Penn Treebank Tagset
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1039
Why POS tagging is useful Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1139
Equivalent Problem in Bioinformatics
Durbin et al Biological Sequence Analysis Cambridge University Press
Several applications eg proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1239
Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis
How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT
Parsing Need to know if a word is an N or V before you can parse
Information extraction Finding names relations etc
Machine Translation
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1339
Open and Closed Classes Closed class a small fixed membership
Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which
play a role in grammar) Open class new ones can be created all the time
English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1439
Open Class Words Nouns
Proper nouns (Boulder Granby Eli Manning) English capitalizes these
Common nouns (the rest) Count nouns and mass nouns
Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)
Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)
Verbs In English have morphological affixes (eateatseaten)
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1539
Closed Class Words
Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1639
Prepositions from CELEX
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1739
English Particles
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1839
Conjunctions
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
1939
POS TaggingChoosing a Tagset
There are so many parts of speech potential distinctions we can draw
To do POS tagging we need to choose a standard set of tags to work with
Could pick very coarse tagsets N V Adj Adv
More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG
Even more fine-grained tagsets exist
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2039
Using the Penn Tagset
TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS
Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)
Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2139
POS Tagging
Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word
These examples from Dekang Lin
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2239
How Hard is POS Tagging Measuring Ambiguity
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2339
Current Performance How many tags are correct
About 97 currently But baseline is already 90 Baseline algorithm
Tag every word with its most frequent tag Tag unknown words as nouns
How well do people do
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2439
Quick Test Agreement the students went to class plays well with others fruit flies like a banana
DT the this thatNN nounVB verbP prepostionADV adverb
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2539
Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2639
How to do it History
1960 1970 1980 1990 2000
Brown Corpus Created (EN-US)1 Million Words
Brown Corpus Tagged
HMM Tagging (CLAWS)93-95
Greene and RubinRule Based - 70
LOB Corpus Created (EN-UK)1 Million Words
DeRoseChurchEfficient HMMSparse Data
95+
British National Corpus
(tagged by CLAWS)
POS Tagging separated from
other NLP
Transformation Based Tagging
(Eric Brill)Rule Based ndash 95+
Tree-Based Statistics (Helmut Shmid)
Rule Based ndash 96+
Neural Network 96+
Trigram Tagger(Kempe)
96+
Combined Methods98+
Penn Treebank Corpus
(WSJ 45M)
LOB Corpus Tagged
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2739
Two Methods for POS Tagging
1 Rule-based tagging (ENGTWOL)
2 Stochastic1 Probabilistic sequence models
HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2839
Rule-Based Tagging
Start with a dictionary Assign all possible tags to words from the
dictionary Write rules by hand to selectively remove
tags Leaving the correct tag for each word
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
2939
Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin
1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential
POSs Stage 2 Apply rules which certify or disallow tag
sequences Rules originally handwritten more recently Machine
Learning methods can be used
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3039
Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB
bull Etchellip for the ~100000 words of English with more than 1 tag
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3139
Assign Every Possible Tag
NNRB
VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3239
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo
NN RB JJ VB
PRP VBD TO VB DT NNShe promised to back the bill
VBN
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3339
Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological
analyzer to get all parts of speech Example Pavlov had shown that salivation hellip
Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO
HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV
PRON DEM SGDET CENTRAL DEM SGCS
salivation N NOM SG
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3439
Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule
Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo
Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which
allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3539
Inline Mark-up POS Tagging
httpnlpcsqccunyeduwsj_poszip Input Format
Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29
Output FormatPierreNNP VinkenNNP 61CD yearsNNS
oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3639
POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger
(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml
Demo How it works
Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3739
POS Tagging Tools Stanford tagger (Loglinear tagger )
httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED
_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE
CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3839
NLP Toolkits Uniform CL Annotation Platform
UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet
Natural langauge toolkit with data sets Demo
Information Extraction Jet (NYU IE toolkit)
httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml
University of Sheffield IE toolkit Information Retrieval
INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit
Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-
3939
Looking Ahead Next Class Machine Learning for POS Tagging
Hidden Markov Model
- POS Tagging Introduction
- Some Administrative Stuff
- Outline
- What is Part-of-Speech (POS)
- Parts of Speech
- 7 Traditional POS Categories
- POS Tagging
- Penn TreeBank POS Tag Set
- Penn Treebank Tagset
- Why POS tagging is useful
- Equivalent Problem in Bioinformatics
- Why is POS Tagging Useful
- Open and Closed Classes
- Open Class Words
- Closed Class Words
- Prepositions from CELEX
- English Particles
- Conjunctions
- POS Tagging Choosing a Tagset
- Using the Penn Tagset
- Slide 21
- How Hard is POS Tagging Measuring Ambiguity
- Current Performance
- Quick Test Agreement
- Quick Test
- How to do it History
- Two Methods for POS Tagging
- Rule-Based Tagging
- Rule-based taggers
- Start With a Dictionary
- Assign Every Possible Tag
- Write Rules to Eliminate Tags
- Stage 1 of ENGTWOL Tagging
- Stage 2 of ENGTWOL Tagging
- Inline Mark-up
- POS Tagging Tools
- Slide 37
- NLP Toolkits
- Looking Ahead Next Class
-