Natural Language Processing



Transcript of Natural Language Processing

Page 1: Natural Language Processing
Page 2: Natural Language Processing

Week 1: Natural Language Processing. Work in partners on lab with NLTK. Brainstorm and start projects using either or both NLP and speech recognition.

Week 2: Speech Recognition. Speech lab. Finish projects and short critical reading.

Week 3: Present projects. Discuss reading.

Page 3: Natural Language Processing

What is “Natural Language”?

Page 4: Natural Language Processing

Phonetics

Page 5: Natural Language Processing

Phonetics – the sounds which make up a word

e.g. “cat” – k a t

Page 6: Natural Language Processing

Phonetics, Morphology

Page 7: Natural Language Processing

Phonetics, Morphology – the rules by which words are composed

e.g. run + ing

Page 8: Natural Language Processing

Phonetics, Morphology, Syntax

Page 9: Natural Language Processing

Phonetics, Morphology, Syntax – rules for the formation of grammatical sentences

e.g. “Colorless green ideas sleep furiously.” but not “Colorless ideas green sleep furiously.”

Page 10: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics

Page 11: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics – meaning; e.g. “rose”

Page 12: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics

Page 13: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics – relationship of meaning to the context, goals, and intent of the speaker

e.g. “Duck!”

Page 14: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics, Discourse

Page 15: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics, Discourse – 'beyond the sentence boundary'

Page 16: Natural Language Processing

Truly interdisciplinary

Page 17: Natural Language Processing

Truly interdisciplinary
Probabilistic methods

Page 18: Natural Language Processing

Truly interdisciplinary
Probabilistic methods
APIs

Page 19: Natural Language Processing

Natural Language Toolkit for Python

Page 20: Natural Language Processing

Natural Language Toolkit for Python
Text, not speech

Page 21: Natural Language Processing

Natural Language Toolkit for Python
Text, not speech
Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…

Page 22: Natural Language Processing

Natural Language Toolkit for Python
Text, not speech
Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…

words = book.words()
bigrams = nltk.bigrams(words)
cfd = nltk.ConditionalFreqDist(bigrams)
pos = nltk.pos_tag(words)

Page 23: Natural Language Processing
Page 24: Natural Language Processing

Token – an instance of a symbol, commonly a word; a linguistic unit

Page 25: Natural Language Processing

Tokenize – to break a sequence of characters into constituent parts

Often uses delimiters such as whitespace, special characters, or newlines

Page 26: Natural Language Processing

Tokenize – to break a sequence of characters into constituent parts

Often uses delimiters such as whitespace, special characters, or newlines

“The quick brown fox jumped over the log.”

Page 27: Natural Language Processing

Tokenize – to break a sequence of characters into constituent parts

Often uses delimiters such as whitespace, special characters, or newlines

“The quick brown fox jumped over the log.”

“Mr. Brown, we’re confused by your article in the newspaper regarding widely-used words.”
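Tokenization is rarely as simple as splitting on whitespace, as the second sentence shows. A minimal sketch of the difference, assuming NLTK is installed and the punkt tokenizer models have been downloaded:

>>> import nltk
>>> # nltk.download('punkt')  # one-time download of the tokenizer models
>>> sent = "Mr. Brown, we're confused by your article in the newspaper regarding widely-used words."
>>> sent.split()              # whitespace only: punctuation stays glued on ('Brown,', 'words.')
>>> nltk.word_tokenize(sent)  # splits off punctuation and contractions ('Brown', ',', 'we', "'re", ...)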

Page 28: Natural Language Processing

Lexeme – the set of forms taken by a single word; main entries in a dictionary

ex: run [ruhn] – verb: ran, run, runs, running; noun: run; adjective: runny

Page 29: Natural Language Processing

Morpheme – the smallest meaningful unit in the grammar of a language

e.g. Unladylike, Dogs, Technique

Page 30: Natural Language Processing

Sememe – a unit of meaning attached to a morpheme

Dog – a domesticated carnivorous mammal

-s – a plural marker on nouns

Page 31: Natural Language Processing

Phoneme – the smallest contrastive unit in the sound system of a language

e.g. the /k/ sound in the words kit and skill; /e/ in peg and bread
See the International Phonetic Alphabet (IPA)

Page 32: Natural Language Processing

Lexicon – a vocabulary; the set of a language’s lexemes

Page 33: Natural Language Processing

Lexical Ambiguity – multiple alternative linguistic structures can be built for the input

e.g. “I made her duck”

Page 34: Natural Language Processing

Lexical Ambiguity – multiple alternative linguistic structures can be built for the input

e.g. “I made her duck”
We use POS tagging and word sense disambiguation to ATTEMPT to resolve these issues

Page 35: Natural Language Processing

Part of Speech - how a word is used in a sentence

Page 36: Natural Language Processing

Grammar – the syntax and morphology of a natural language

Page 37: Natural Language Processing

Corpus/Corpora - a body of text which may or may not include meta-information such as POS, syntactic structure, and semantics

Page 38: Natural Language Processing

Concordance – list of the usages of a word in its immediate context from a specific text

>>> text1.concordance("monstrous")
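Here text1 comes from NLTK’s book module (Moby Dick). A minimal setup sketch, assuming the example texts have been downloaded:

>>> import nltk
>>> # nltk.download('book')  # one-time download of the example texts and corpora
>>> from nltk.book import text1       # text1 is Moby Dick
>>> text1.concordance("monstrous")    # prints every occurrence with its surrounding context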

Page 39: Natural Language Processing

Collocation – a sequence of words that occur together unusually often

e.g. red wine
>>> text4.collocations()

Page 40: Natural Language Processing

Hapax – a word that appears once in a corpus

>>> fdist.hapaxes()

Page 41: Natural Language Processing

Bigram – sequential pair of words

From the sentence fragment “The quick brown fox…”: (“The”, “quick”), (“quick”, “brown”), (“brown”, “fox…”)
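A quick sketch with NLTK (in NLTK 3, nltk.bigrams returns a generator, hence the list() call):

>>> import nltk
>>> words = "The quick brown fox jumped over the log".split()
>>> list(nltk.bigrams(words))
[('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', 'over'), ('over', 'the'), ('the', 'log')]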

Page 42: Natural Language Processing

Frequency Distribution – tabulation of values according to how often a value occurs in a sample

e.g. word frequency in a corpus, word length in a corpus
>>> fdist = FreqDist(samples)
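A minimal sketch using one of NLTK’s Gutenberg texts (assumes the gutenberg corpus has been downloaded):

>>> import nltk
>>> words = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
>>> fdist = nltk.FreqDist(w.lower() for w in words)
>>> fdist.most_common(10)    # the ten most frequent words
>>> fdist['whale']           # raw count for a single word
>>> fdist.hapaxes()[:10]     # words occurring exactly once (see Hapax above)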

Page 43: Natural Language Processing

Conditional Frequency Distribution – tabulation of values according to how often a value occurs in a sample given a condition

e.g. how often is a word tagged as a noun compared to a verb?

>>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
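A sketch of that noun-vs-verb question on the tagged Brown corpus (assumes the corpus has been downloaded; the (word, tag) pairs themselves serve as the condition/value pairs):

>>> import nltk
>>> tagged = nltk.corpus.brown.tagged_words(categories='news')   # (word, tag) pairs
>>> cfd = nltk.ConditionalFreqDist(tagged)
>>> cfd['run']   # a FreqDist over the tags seen for "run", e.g. how often VB vs. NN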

Page 44: Natural Language Processing

POS tagging

Page 45: Natural Language Processing

Default – tags everything as a noun
Accuracy = 0.13
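A sketch of the default tagger; the 0.13 figure is roughly what tagging everything as NN scores against the Brown news corpus:

>>> import nltk
>>> t0 = nltk.DefaultTagger('NN')
>>> t0.tag("I do not like green eggs and ham".split())   # every token comes back as 'NN'
>>> t0.evaluate(nltk.corpus.brown.tagged_sents(categories='news'))   # about 0.13 (evaluate() is accuracy() in newer NLTK)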

Page 46: Natural Language Processing

Regular Expression – uses a set of regexes to tag based on word patterns

Accuracy = 0.2
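A sketch in the spirit of the NLTK book’s regexp tagger; these particular patterns are illustrative, not a fixed set:

>>> import nltk
>>> patterns = [
...     (r'.*ing$', 'VBG'),                 # gerunds
...     (r'.*ed$', 'VBD'),                  # simple past
...     (r'.*es$', 'VBZ'),                  # 3rd person singular present
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
...     (r'.*', 'NN'),                      # everything else defaults to noun
... ]
>>> rt = nltk.RegexpTagger(patterns)
>>> rt.tag("The striped bats were hanging upside down".split())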

Page 47: Natural Language Processing

Unigram – learns the best possible tag for an individual word regardless of context

e.g. a lookup table
NLTK example accuracy = 0.46
Supervised learning

Page 48: Natural Language Processing

Based on conditional frequency analysis of a corpus

P(word | tag), e.g. the probability of the word “run” given the tag “verb”
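A training sketch on the Brown news corpus (supervised: the tagged corpus supplies the word/tag counts):

>>> import nltk
>>> train_sents = nltk.corpus.brown.tagged_sents(categories='news')
>>> ut = nltk.UnigramTagger(train_sents)   # learns each word's single most likely tag
>>> ut.tag("The race is tomorrow".split())
# "race" always gets its overall most frequent tag, regardless of context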

Page 49: Natural Language Processing

Ngram tagger – expands unigram tagger concept to include the context of N previous tokens

Including 1 previous token is a bigram tagger; including 2 previous tokens is a trigram tagger

Page 50: Natural Language Processing

N-gram taggers use Hidden Markov Models

P(word | tag) * P(tag | previous n tags)

e.g. the probability of the word “run” given the tag “verb” * the probability of the tag “verb” given that the previous tag was “noun”

Page 51: Natural Language Processing

Tradeoff between coverage and accuracy

Page 52: Natural Language Processing

Ex. If we train on ('In', 'IN'), ('this', 'DT'), ('light', 'NN'), ('we', 'PPSS'), ('need', 'VB'), ('1,000', 'CD'), ('churches', 'NNS'), ('in', 'IN'), ('Illinois', 'NP'), (',', ','), ('where', 'WRB'), ('we', 'PPSS'), ('have', 'HV'), ('200', 'CD'), (';', '.')

Page 53: Natural Language Processing

Bigrams for “light” are (('this', 'DT'), ('light', 'NN'))
Trigrams for “light” are ('In', 'IN'), ('this', 'DT'), ('light', 'NN')

Page 54: Natural Language Processing

Bigrams for “light” are (('this', 'DT'), ('light', 'NN'))
Trigrams for “light” are ('In', 'IN'), ('this', 'DT'), ('light', 'NN')
Try to tag: “Turn on this light”
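A sketch of what goes wrong, training a bigram tagger on just the sentence above; “Turn” was never seen, and once one tag is unknown every later context is unknown too, which motivates backoff:

>>> import nltk
>>> train = [[('In', 'IN'), ('this', 'DT'), ('light', 'NN'), ('we', 'PPSS'), ('need', 'VB'),
...           ('1,000', 'CD'), ('churches', 'NNS'), ('in', 'IN'), ('Illinois', 'NP'), (',', ','),
...           ('where', 'WRB'), ('we', 'PPSS'), ('have', 'HV'), ('200', 'CD'), (';', '.')]]
>>> bt = nltk.BigramTagger(train)
>>> bt.tag("Turn on this light".split())
# [('Turn', None), ('on', None), ('this', None), ('light', None)]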

Page 55: Natural Language Processing

The higher the value of N, the more accurate the tagging, but the less coverage of unseen phrases

Page 56: Natural Language Processing

The higher the value of N, the more accurate the tagging, but the less coverage of unseen phrases
The sparse data problem

Page 57: Natural Language Processing

Backoff

Page 58: Natural Language Processing

Backoff
Primary – Trigram
Secondary – Bigram
Tertiary – Unigram or default

Page 59: Natural Language Processing

Backoff

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)

Page 60: Natural Language Processing

Backoff
Accuracy = 0.84491179108940495
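A sketch reproducing the backoff chain end-to-end on the Brown news corpus with a 90/10 split; accuracy lands around 0.84, in line with the figure above:

>>> import nltk
>>> tagged = nltk.corpus.brown.tagged_sents(categories='news')
>>> cut = int(len(tagged) * 0.9)
>>> train_sents, test_sents = tagged[:cut], tagged[cut:]   # 90% train / 10% test
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)   # about 0.84 (evaluate() is accuracy() in newer NLTK)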

Page 61: Natural Language Processing

Brill – inductive transformation-based learning

Page 62: Natural Language Processing

Brill – inductive transformation-based learning

“painting with progressively finer brush strokes”

Page 63: Natural Language Processing

Brill – inductive transformation-based learning

“painting with progressively finer brush strokes”

Supervised learning using a tagged corpus as training data

Page 64: Natural Language Processing

1. Every word is labeled based on the most likely tag

ex. Sentence “She is expected to race tomorrow”

PRO/She VBZ/is VBN/expected TO/to NN/race NN/tomorrow.

Page 65: Natural Language Processing

2. Candidate transformation rules are proposed based on errors made.

Ex. Change NN to VB when the previous tag is TO

Page 66: Natural Language Processing

3. The rule that results in the most improved tagging is chosen and the training data is re-tagged

PRO/She VBZ/is VBN/expected TO/to VB/race NN/tomorrow.

Page 67: Natural Language Processing

nltk.tag.brill.demo()
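Beyond the demo, a minimal training sketch; the module layout (the brill24 template set, BrillTaggerTrainer) assumes NLTK 3:

>>> import nltk
>>> from nltk.tag.brill import brill24              # a standard set of rule templates
>>> from nltk.tag.brill_trainer import BrillTaggerTrainer
>>> train_sents = nltk.corpus.brown.tagged_sents(categories='news')
>>> baseline = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))  # step 1: most likely tags
>>> trainer = BrillTaggerTrainer(baseline, brill24())
>>> brill_tagger = trainer.train(train_sents, max_rules=10)   # steps 2-3: learn the 10 best correction rules
>>> brill_tagger.rules()   # rules of the form "change NN to VB if the previous tag is TO"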

Page 68: Natural Language Processing

For efficiency Brill uses templates for rule creation

e.g. “the previous word is tagged Z”, “one of the two preceding words is tagged Z”, etc.

Page 69: Natural Language Processing

More info on POS tagging in Jurafsky ch. 5, available at Bobst

Page 70: Natural Language Processing

Critical!

Page 71: Natural Language Processing

Critical!
Human annotators create a “gold standard”

Page 72: Natural Language Processing

Critical!
Human annotators create a “gold standard”
Inter-annotator agreement

Page 73: Natural Language Processing

Critical!
Human annotators create a “gold standard”
Inter-annotator agreement
Separating training and test data

Page 74: Natural Language Processing

Critical!
Human annotators create a “gold standard”
Inter-annotator agreement
Separating training and test data
90% train / 10% test

Page 75: Natural Language Processing

Confusion Matrix

Page 76: Natural Language Processing

Predicted / Gold    NN    VB   ADJ
NN                 103    10     7
VB                   3   117     0
ADJ                  9    13    98
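A sketch of building such a matrix with nltk.ConfusionMatrix, reusing t2 and test_sents from the backoff example above (flatten the gold and predicted tag sequences first):

>>> import nltk
>>> gold = [tag for sent in test_sents for (word, tag) in sent]
>>> pred_sents = t2.tag_sents([[word for (word, tag) in sent] for sent in test_sents])
>>> pred = [tag for sent in pred_sents for (word, tag) in sent]
>>> cm = nltk.ConfusionMatrix(gold, pred)
>>> print(cm)   # reference (gold) tags down the rows, predicted tags across the columns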

Page 77: Natural Language Processing

Remember: performance is always limited by ambiguity in training set and agreement between human annotators

Page 78: Natural Language Processing

Translation – Google Translate

Page 79: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word

Page 80: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha

Page 81: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha
Text analysis of online material for marketing

Page 82: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha
Text analysis of online material for marketing
IM help desk interfaces

Page 83: Natural Language Processing

Break! And pair up for the lab.