School of Computing, Faculty of Engineering
Word Bi-grams and PoS Tags
COMP3310 Natural Language Processing
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)
Reminder
FreqDist counts of tokens and their distribution can be useful
Eg find main characters in Gutenberg texts
Eg compare word-lengths in different languages
Humans can predict the next word …
N-gram models are based on counts in a large corpus
Auto-generate a story ... (but gets stuck in a local maximum)
Grammatical trends: modal verb distribution predicts genre
Why do puns make us groan?
He drove his expensive car into a tree and found out how the Mercedes bends.
Isn't the Grand Canyon just gorges?
Time flies like an arrow. Fruit flies like a banana.
Predicting Next Words
One reason puns make us groan is that they play on our assumptions about what the next word will be: human language processing involves predicting the most probable next word
They also exploit
• homonymy (strictly, homophony) – same sound, different spelling and meaning (bends, Benz; gorges, gorgeous)
• polysemy – same spelling, different meaning
NLP programs can also make use of word-sequence modeling
Auto-generate a Story
How to fix getting stuck in a local maximum? Use a random number generator.
The choice() function picks one item at random from a list (from random import *)
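A minimal sketch of such a generator, assuming the NLTK Genesis corpus is installed; the seed word 'living' and the 15-word length are illustrative choices:

import nltk
from random import choice

text = nltk.corpus.genesis.words('english-kjv.txt')
cfd = nltk.ConditionalFreqDist(nltk.bigrams(text))  # count word bigrams

word = 'living'
for i in range(15):
    print(word, end=' ')
    # choice() picks any observed successor at random, instead of always
    # taking the most frequent one (which loops in a local maximum)
    word = choice(list(cfd[word]))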
Part-of-Speech Tagging: Terminology
Tagging
• The process of associating labels with each token in a text, using an algorithm to select a tag for each word, eg
Hand-coded rules
Statistical taggers
Brill (transformation-based) tagger
Hybrid tagger: combination, eg by “vote”
Tags
• The labels
Tag Set
• The collection of tags used for a particular task, eg Brown or LOB tagset
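For a quick illustration, NLTK ships a ready-trained statistical tagger that uses the Penn Treebank tagset (assuming its model has been fetched with nltk.download()):

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("They refuse to permit us to obtain the refuse permit"))
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Note how refuse and permit each receive a different tag on their two occurrences, resolved by context.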
Example from the GENIA corpus
Typically a tagged text is a sequence of white-space separated word/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ
strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS
targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
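NLTK can split such word/tag strings back into (word, tag) pairs with nltk.tag.str2tuple:

>>> import nltk
>>> sent = 'These/DT findings/NNS should/MD be/VB useful/JJ'
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('These', 'DT'), ('findings', 'NNS'), ('should', 'MD'), ('be', 'VB'), ('useful', 'JJ')]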
What does Tagging do?
Collapses Distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
Introduces Distinctions
• Ambiguities may be resolved
• e.g. deal tagged with NN or VB
Helps in classification and prediction
Significance of Parts of Speech
A word’s POS tells us a lot about the word and its neighbors:
• Limits the range of meanings (deal), pronunciation (OBject the noun vs obJECT the verb), or both (wind)
• Helps in stemming
• Limits the range of following words
• Can help select nouns from a document for summarization
• Basis for partial parsing (chunked parsing) – see the sketch after this list
• Parsers can build trees directly on the POS tags instead of maintaining a lexicon
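As a sketch of tag-driven partial parsing, nltk.RegexpParser chunks a tagged sentence using patterns over the tags alone (the NP pattern here is illustrative):

import nltk
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>+}")  # determiner + adjectives + nouns
sent = [('the', 'DT'), ('CD28', 'NN'), ('costimulatory', 'NN'), ('pathway', 'NN')]
print(chunker.parse(sent))
# (S (NP the/DT CD28/NN costimulatory/NN pathway/NN))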
Choosing a tagset
The choice of tagset greatly affects the difficulty of the problem
Need to strike a balance between
• Getting better information about context (a richer tagset helps)
• Making it possible for classifiers to do their job (a smaller tagset helps)
Some of the best-known Tagsets
Brown corpus: 87 tags
• (more when tags are combined, eg isn’t)
LOB corpus: 132 tags
Penn Treebank: 45 tags
Lancaster UCREL C5 (used to tag the BNC): 61 tags
Lancaster C7: 145 tags
The Brown Corpus
An early digital corpus (1961)
• Francis and Kucera, Brown University
Contents: 500 texts, each 2000 words long
• From American books, newspapers, magazines
• Representing genres:
• Science fiction, romance fiction, press reportage, scientific writing, popular lore
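The genre labels are exposed directly in NLTK:

>>> nltk.corpus.brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', …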
help(nltk.corpus.brown)
>>> help(nltk.corpus.brown)
| paras(self, fileids=None, categories=None)
|
| raw(self, fileids=None, categories=None)
|
| sents(self, fileids=None, categories=None)
|
| tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
|
| tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
|
| tagged_words(self, fileids=None, categories=None, simplify_tags=False)
|
| words(self, fileids=None, categories=None)
|
nltk.corpus.brown
>>> nltk.corpus.brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_sents()
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …
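A FreqDist over just the tags shows which grammatical category dominates; NN (singular common noun) comes out on top:

>>> fd = nltk.FreqDist(tag for (word, tag) in nltk.corpus.brown.tagged_words())
>>> fd.max()
'NN'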
Penn Treebank
First large syntactically annotated corpus
1 million words from Wall Street Journal
Part-of-speech tags and syntax trees
help(nltk.corpus.treebank)
| parsed(*args, **kwargs)
| @deprecated: Use .parsed_sents() instead.
|
| parsed_sents(self, files=None)
|
| raw(self, files=None)
|
| read(*args, **kwargs)
| @deprecated: Use .raw() or .sents() or .tagged_sents() or
| .parsed_sents() instead.
|
| sents(self, files=None)
|
| tagged(*args, **kwargs)
| @deprecated: Use .tagged_sents() instead.
|
| tagged_sents(self, files=None)
|
| tagged_words(self, files=None)
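A quick look at the tagged and parsed views, using the Treebank sample shipped with NLTK:

>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> print(nltk.corpus.treebank.parsed_sents()[0])
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) …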
How hard is POS tagging?
Number of tags        1      2     3    4   5   6   7
Number of word types  35340  3760  264  61  12  2   1
In the Brown corpus, 12% of word types are ambiguous, but these account for 40% of word tokens
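These counts can be reproduced with a ConditionalFreqDist mapping each word type to the tags it receives; a sketch (the exact numbers depend on case-folding and how combined tags are treated):

import nltk
cfd = nltk.ConditionalFreqDist((w.lower(), tag)
                               for (w, tag) in nltk.corpus.brown.tagged_words())
ambiguous = [w for w in cfd.conditions() if len(cfd[w]) > 1]
print(len(ambiguous), 'of', len(cfd.conditions()), 'word types take more than one tag')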
Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: we choose the tag that has the greater probability
• P(race|VB)
• P(race|NN)
Actual estimate from the Switchboard corpus:
• P(race|NN) = .00041
• P(race|VB) = .00003
This suggests we should always tag race/NN (correct 41/44=93%)
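This "pick the most frequent tag for each word" strategy is what NLTK's UnigramTagger learns from a tagged corpus; a minimal sketch trained on the Brown news category (it uses Brown tags, and returns None for words unseen in training):

import nltk
train = nltk.corpus.brown.tagged_sents(categories='news')
tagger = nltk.UnigramTagger(train)
print(tagger.tag(['the', 'race', 'for', 'outer', 'space']))
# race is tagged NN, its most frequent tag in the training data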
Reminder
Puns play on our assumptions of the next word…
… eg they present us with an unexpected homonym (bends)
ConditionalFreqDist() counts word-pairs: word bigrams
Used for story generation, speech recognition, …
Parts of Speech: groups words into grammatical categories
… and separates different functions of a word
In English, many words are ambiguous: 2 or more PoS-tags
Very simple tagger: choose by lexical probability (only)
Better PoS-taggers: to come…