Natural Language Processing



Transcript of Natural Language Processing

Page 1: Natural Language Processing
Page 2: Natural Language Processing

Week 1: Natural Language Processing. Work in partners on lab with NLTK. Brainstorm and start projects using either or both NLP and speech recognition.

Week 2: Speech Recognition. Speech lab. Finish projects and short critical reading.

Week 3: Present projects. Discuss reading.

Page 3: Natural Language Processing

What is “Natural Language”?

Page 4: Natural Language Processing

Phonetics

Page 5: Natural Language Processing

Phonetics – the sounds which make up a word

e.g. “cat” – k a t

Page 6: Natural Language Processing

Phonetics, Morphology

Page 7: Natural Language Processing

Phonetics, Morphology – the rules by which words are composed

e.g. run + ing

Page 8: Natural Language Processing

Phonetics, Morphology, Syntax

Page 9: Natural Language Processing

Phonetics, Morphology, Syntax – rules for the formation of grammatical sentences

e.g. “Colorless green ideas sleep furiously.” but not “Colorless ideas green sleep furiously.”

Page 10: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics

Page 11: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics – meaning; e.g. “rose”

Page 12: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics

Page 13: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics – relationship of meaning to the context, goals, and intent of the speaker

e.g. “Duck!”

Page 14: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics, Discourse

Page 15: Natural Language Processing

Phonetics, Morphology, Syntax, Semantics, Pragmatics, Discourse – 'beyond the sentence boundary'

Page 16: Natural Language Processing

Truly interdisciplinary

Page 17: Natural Language Processing

Truly interdisciplinary
Probabilistic methods

Page 18: Natural Language Processing

Truly interdisciplinary
Probabilistic methods
APIs

Page 19: Natural Language Processing

Natural Language Toolkit for Python

Page 20: Natural Language Processing

Natural Language Toolkit for Python
Text, not speech

Page 21: Natural Language Processing

Natural Language Toolkit for Python
Text, not speech
Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…

Page 22: Natural Language Processing

Natural Language Toolkit for Python
Text, not speech
Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…

words = book.words()
bigrams = nltk.bigrams(words)
cfd = nltk.ConditionalFreqDist(bigrams)
pos = nltk.pos_tag(words)

Page 23: Natural Language Processing
Page 24: Natural Language Processing

Token – an instance of a symbol, commonly a word; a linguistic unit

Page 25: Natural Language Processing

Tokenize – to break a sequence of characters into constituent parts

Often uses delimiters such as whitespace, special characters, or newlines

Page 26: Natural Language Processing

Tokenize – to break a sequence of characters into constituent parts

Often uses delimiters such as whitespace, special characters, or newlines

“The quick brown fox jumped over the log.”

Page 27: Natural Language Processing

Tokenize – to break a sequence of characters into constituent parts

Often uses delimiters such as whitespace, special characters, or newlines

“The quick brown fox jumped over the log.”

“Mr. Brown, we’re confused by your article in the newspaper regarding widely-used words.”
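Tokenization is rarely as simple as splitting on whitespace, as the second sentence shows. A minimal sketch of the difference, assuming NLTK is installed and the punkt tokenizer models have been downloaded:

>>> import nltk
>>> # nltk.download('punkt')  # one-time download of the tokenizer models
>>> sent = "Mr. Brown, we're confused by your article in the newspaper regarding widely-used words."
>>> sent.split()              # whitespace only: punctuation stays glued on ('Brown,', 'words.')
>>> nltk.word_tokenize(sent)  # splits off punctuation and contractions ('Brown', ',', 'we', "'re", ...)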

Page 28: Natural Language Processing

Lexeme – the set of forms taken by a single word; main entries in a dictionary

ex: run [ruhn] – verb: ran, run, runs, running; noun: run; adjective: runny

Page 29: Natural Language Processing

Morpheme – the smallest meaningful unit in the grammar of a language

e.g. Unladylike, Dogs, Technique

Page 30: Natural Language Processing

Sememe – a unit of meaning attached to a morpheme

Dog – a domesticated carnivorous mammal

-s – a plural marker on nouns

Page 31: Natural Language Processing

Phoneme – the smallest contrastive unit in the sound system of a language

e.g. the /k/ sound in the words kit and skill; /e/ in peg and bread
See the International Phonetic Alphabet (IPA)

Page 32: Natural Language Processing

Lexicon – a vocabulary; the set of a language’s lexemes

Page 33: Natural Language Processing

Lexical Ambiguity – multiple alternative linguistic structures can be built for the input

e.g. “I made her duck”

Page 34: Natural Language Processing

Lexical Ambiguity – multiple alternative linguistic structures can be built for the input

e.g. “I made her duck”
We use POS tagging and word sense disambiguation to ATTEMPT to resolve these issues

Page 35: Natural Language Processing

Part of Speech - how a word is used in a sentence

Page 36: Natural Language Processing

Grammar – the syntax and morphology of a natural language

Page 37: Natural Language Processing

Corpus/Corpora - a body of text which may or may not include meta-information such as POS, syntactic structure, and semantics

Page 38: Natural Language Processing

Concordance – list of the usages of a word in its immediate context from a specific text

>>> text1.concordance("monstrous")
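Here text1 comes from NLTK’s book module (Moby Dick). A minimal setup sketch, assuming the example texts have been downloaded:

>>> import nltk
>>> # nltk.download('book')  # one-time download of the example texts and corpora
>>> from nltk.book import text1       # text1 is Moby Dick
>>> text1.concordance("monstrous")    # prints every occurrence with its surrounding context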

Page 39: Natural Language Processing

Collocation – a sequence of words that occur together unusually often

e.g. red wine
>>> text4.collocations()

Page 40: Natural Language Processing

Hapax – a word that appears once in a corpus

>>> fdist.hapaxes()

Page 41: Natural Language Processing

Bigram – sequential pair of words

From the sentence fragment “The quick brown fox…”: (“The”, “quick”), (“quick”, “brown”), (“brown”, “fox…”)
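A quick sketch with NLTK (in NLTK 3, nltk.bigrams returns a generator, hence the list() call):

>>> import nltk
>>> words = "The quick brown fox jumped over the log".split()
>>> list(nltk.bigrams(words))
[('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', 'over'), ('over', 'the'), ('the', 'log')]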

Page 42: Natural Language Processing

Frequency Distribution – tabulation of values according to how often a value occurs in a sample

e.g. word frequency in a corpus, word length in a corpus
>>> fdist = FreqDist(samples)
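A minimal sketch using one of NLTK’s Gutenberg texts (assumes the gutenberg corpus has been downloaded):

>>> import nltk
>>> words = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
>>> fdist = nltk.FreqDist(w.lower() for w in words)
>>> fdist.most_common(10)    # the ten most frequent words
>>> fdist['whale']           # raw count for a single word
>>> fdist.hapaxes()[:10]     # words occurring exactly once (see Hapax above)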

Page 43: Natural Language Processing

Conditional Frequency Distribution – tabulation of values according to how often a value occurs in a sample given a condition

e.g. how often is a word tagged as a noun compared to a verb?

>>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
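A sketch of that noun-vs-verb question on the tagged Brown corpus (assumes the corpus has been downloaded; the (word, tag) pairs themselves serve as the condition/value pairs):

>>> import nltk
>>> tagged = nltk.corpus.brown.tagged_words(categories='news')   # (word, tag) pairs
>>> cfd = nltk.ConditionalFreqDist(tagged)
>>> cfd['run']   # a FreqDist over the tags seen for "run", e.g. how often VB vs. NN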

Page 44: Natural Language Processing

POS tagging

Page 45: Natural Language Processing

Default – tags everything as a noun
Accuracy = 0.13
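A sketch of the default tagger; the 0.13 figure is roughly what tagging everything as NN scores against the Brown news corpus:

>>> import nltk
>>> t0 = nltk.DefaultTagger('NN')
>>> t0.tag("I do not like green eggs and ham".split())   # every token comes back as 'NN'
>>> t0.evaluate(nltk.corpus.brown.tagged_sents(categories='news'))   # about 0.13 (evaluate() is accuracy() in newer NLTK)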

Page 46: Natural Language Processing

Regular Expression – uses a set of regexes to tag based on word patterns

Accuracy = 0.2
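A sketch in the spirit of the NLTK book’s regexp tagger; these particular patterns are illustrative, not a fixed set:

>>> import nltk
>>> patterns = [
...     (r'.*ing$', 'VBG'),                 # gerunds
...     (r'.*ed$', 'VBD'),                  # simple past
...     (r'.*es$', 'VBZ'),                  # 3rd person singular present
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
...     (r'.*', 'NN'),                      # everything else defaults to noun
... ]
>>> rt = nltk.RegexpTagger(patterns)
>>> rt.tag("The striped bats were hanging upside down".split())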

Page 47: Natural Language Processing

Unigram – learns the best possible tag for an individual word regardless of context

e.g. a lookup table
NLTK example accuracy = 0.46
Supervised learning

Page 48: Natural Language Processing

Based on conditional frequency analysis of a corpus

P(word | tag), e.g. the probability of the word “run” given the tag “verb”
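A training sketch on the Brown news corpus (supervised: the tagged corpus supplies the word/tag counts):

>>> import nltk
>>> train_sents = nltk.corpus.brown.tagged_sents(categories='news')
>>> ut = nltk.UnigramTagger(train_sents)   # learns each word's single most likely tag
>>> ut.tag("The race is tomorrow".split())
# "race" always gets its overall most frequent tag, regardless of context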

Page 49: Natural Language Processing

Ngram tagger – expands unigram tagger concept to include the context of N previous tokens

Including 1 previous token is a bigram tagger; including 2 previous tokens is a trigram tagger

Page 50: Natural Language Processing

N-gram taggers use Hidden Markov Models

P(word | tag) * P(tag | previous n tags)

e.g. the probability of the word “run” given the tag “verb” * the probability of the tag “verb” given that the previous tag was “noun”

Page 51: Natural Language Processing

Tradeoff between coverage and accuracy

Page 52: Natural Language Processing

Ex. If we train on ('In', 'IN'), ('this', 'DT'), ('light', 'NN'), ('we', 'PPSS'), ('need', 'VB'), ('1,000', 'CD'), ('churches', 'NNS'), ('in', 'IN'), ('Illinois', 'NP'), (',', ','), ('where', 'WRB'), ('we', 'PPSS'), ('have', 'HV'), ('200', 'CD'), (';', '.')

Page 53: Natural Language Processing

Bigrams for “light” are (('this', 'DT'), ('light', 'NN'))
Trigrams for “light” are ('In', 'IN'), ('this', 'DT'), ('light', 'NN')

Page 54: Natural Language Processing

Bigrams for “light” are (('this', 'DT'), ('light', 'NN'))
Trigrams for “light” are ('In', 'IN'), ('this', 'DT'), ('light', 'NN')
Try to tag: “Turn on this light”
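A sketch of what goes wrong, training a bigram tagger on just the sentence above; “Turn” was never seen, and once one tag is unknown every later context is unknown too, which motivates backoff:

>>> import nltk
>>> train = [[('In', 'IN'), ('this', 'DT'), ('light', 'NN'), ('we', 'PPSS'), ('need', 'VB'),
...           ('1,000', 'CD'), ('churches', 'NNS'), ('in', 'IN'), ('Illinois', 'NP'), (',', ','),
...           ('where', 'WRB'), ('we', 'PPSS'), ('have', 'HV'), ('200', 'CD'), (';', '.')]]
>>> bt = nltk.BigramTagger(train)
>>> bt.tag("Turn on this light".split())
# [('Turn', None), ('on', None), ('this', None), ('light', None)]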

Page 55: Natural Language Processing

The higher the value of N, the more accurate the tagging, but the less coverage of unseen phrases

Page 56: Natural Language Processing

The higher the value of N, the more accurate the tagging, but the less coverage of unseen phrases
The sparse data problem

Page 57: Natural Language Processing

Backoff

Page 58: Natural Language Processing

Backoff
Primary – Trigram
Secondary – Bigram
Tertiary – Unigram or default

Page 59: Natural Language Processing

Backoff

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)

Page 60: Natural Language Processing

Backoff
Accuracy = 0.84491179108940495
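A sketch reproducing the backoff chain end-to-end on the Brown news corpus with a 90/10 split; accuracy lands around 0.84, in line with the figure above:

>>> import nltk
>>> tagged = nltk.corpus.brown.tagged_sents(categories='news')
>>> cut = int(len(tagged) * 0.9)
>>> train_sents, test_sents = tagged[:cut], tagged[cut:]   # 90% train / 10% test
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)   # about 0.84 (evaluate() is accuracy() in newer NLTK)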

Page 61: Natural Language Processing

Brill – inductive transformation-based learning

Page 62: Natural Language Processing

Brill – inductive transformation-based learning

“painting with progressively finer brush strokes”

Page 63: Natural Language Processing

Brill – inductive transformation-based learning

“painting with progressively finer brush strokes”

Supervised learning using a tagged corpus as training data

Page 64: Natural Language Processing

1. Every word is labeled based on the most likely tag

ex. Sentence “She is expected to race tomorrow”

PRO/She VBZ/is VBN/expected TO/to NN/race NN/tomorrow.

Page 65: Natural Language Processing

2. Candidate transformation rules are proposed based on errors made.

Ex. Change NN to VB when the previous tag is TO

Page 66: Natural Language Processing

3. The rule that results in the most improved tagging is chosen and the training data is re-tagged

PRO/She VBZ/is VBN/expected TO/to VB/race NN/tomorrow.

Page 67: Natural Language Processing

nltk.tag.brill.demo()
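Beyond the demo, a minimal training sketch; the module layout (the brill24 template set, BrillTaggerTrainer) assumes NLTK 3:

>>> import nltk
>>> from nltk.tag.brill import brill24              # a standard set of rule templates
>>> from nltk.tag.brill_trainer import BrillTaggerTrainer
>>> train_sents = nltk.corpus.brown.tagged_sents(categories='news')
>>> baseline = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))  # step 1: most likely tags
>>> trainer = BrillTaggerTrainer(baseline, brill24())
>>> brill_tagger = trainer.train(train_sents, max_rules=10)   # steps 2-3: learn the 10 best correction rules
>>> brill_tagger.rules()   # rules of the form "change NN to VB if the previous tag is TO"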

Page 68: Natural Language Processing

For efficiency Brill uses templates for rule creation

e.g. “the previous word is tagged Z”, “one of the two preceding words is tagged Z”, etc.

Page 69: Natural Language Processing

More info on POS tagging in Jurafsky ch. 5, available at Bobst

Page 70: Natural Language Processing

Critical!

Page 71: Natural Language Processing

Critical!
Human annotators create a “gold standard”

Page 72: Natural Language Processing

Critical!
Human annotators create a “gold standard”
Inter-annotator agreement

Page 73: Natural Language Processing

Critical!
Human annotators create a “gold standard”
Inter-annotator agreement
Separating training and test data

Page 74: Natural Language Processing

Critical!
Human annotators create a “gold standard”
Inter-annotator agreement
Separating training and test data
90% train / 10% test

Page 75: Natural Language Processing

Confusion Matrix

Page 76: Natural Language Processing

Predicted / Gold    NN    VB   ADJ
NN                 103    10     7
VB                   3   117     0
ADJ                  9    13    98
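A sketch of building such a matrix with nltk.ConfusionMatrix, reusing t2 and test_sents from the backoff example above (flatten the gold and predicted tag sequences first):

>>> import nltk
>>> gold = [tag for sent in test_sents for (word, tag) in sent]
>>> pred_sents = t2.tag_sents([[word for (word, tag) in sent] for sent in test_sents])
>>> pred = [tag for sent in pred_sents for (word, tag) in sent]
>>> cm = nltk.ConfusionMatrix(gold, pred)
>>> print(cm)   # reference (gold) tags down the rows, predicted tags across the columns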

Page 77: Natural Language Processing

Remember: performance is always limited by ambiguity in training set and agreement between human annotators

Page 78: Natural Language Processing

Translation – Google Translate

Page 79: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word

Page 80: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha

Page 81: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha
Text analysis of online material for marketing

Page 82: Natural Language Processing

Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha
Text analysis of online material for marketing
IM help desk interfaces

Page 83: Natural Language Processing

Break! And pair up for the lab.