Natural Language Processing
Transcript of Natural Language Processing
Week 1: Natural Language Processing
Work in partners on lab with NLTK
Brainstorm and start projects using either or both NLP and speech recognition

Week 2: Speech Recognition
Speech lab
Finish projects and short critical reading

Week 3
Present projects
Discuss reading
What is “Natural Language”?
Phonetics – the sounds that make up a word
e.g. “cat” – k a t
Morphology – the rules by which words are composed
e.g. run + ing
Syntax – the rules for the formation of grammatical sentences
e.g. "Colorless green ideas sleep furiously." is grammatical, but "Colorless ideas green sleep furiously." is not
Semantics – meaning
e.g. “rose”
Pragmatics – the relationship of meaning to the context, goals, and intent of the speaker
e.g. “Duck!”
Discourse – 'beyond the sentence boundary'
Truly interdisciplinary
Probabilistic methods
APIs
Natural Language Toolkit for Python
Text, not speech
Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…

words = book.words()
bigrams = nltk.bigrams(words)
cfd = nltk.ConditionalFreqDist(bigrams)
pos = nltk.pos_tag(words)
Token – an instance of a symbol, commonly a word; a linguistic unit

Tokenize – to break a sequence of characters into constituent parts
Often uses a delimiter like whitespace, special characters, or newlines

“The quick brown fox jumped over the log.”

“Mr. Brown, we’re confused by your article in the newspaper regarding widely-used words.”
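Whitespace splitting handles the first sentence but not the second: punctuation sticks to the words, and “Mr.” is genuinely ambiguous. A minimal sketch of the idea in plain Python (not NLTK's tokenizer), separating word-like runs from punctuation with a regex:

```python
import re

def tokenize(text):
    # Word-like runs (keeping internal apostrophes/hyphens),
    # or any single punctuation character.
    return re.findall(r"[\w'-]+|[^\w\s]", text)

better = tokenize("Mr. Brown, we're confused by your article "
                  "in the newspaper regarding widely-used words.")
```

Note that even this splits "Mr." into "Mr" and "." while keeping "widely-used" whole; deciding which periods end sentences needs more than delimiters.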
Lexeme – the set of forms taken by a single word; the main entries in a dictionary
e.g. run [ruhn]: verb – ran, run, runs, running; noun – run; adjective – runny
Morpheme – the smallest meaningful unit in the grammar of a language
e.g. unladylike, dogs, technique
Sememe – a unit of meaning attached to a morpheme
dog – a domesticated carnivorous mammal
-s – a plural marker on nouns
Phoneme – the smallest contrastive unit in the sound system of a language
e.g. the /k/ sound in kit and skill, or the /e/ in peg and bread
See the International Phonetic Alphabet (IPA)
Lexicon – a vocabulary; the set of a language’s lexemes
Lexical Ambiguity – multiple alternative linguistic structures can be built for the input
e.g. “I made her duck”
We use POS tagging and word sense disambiguation to attempt to resolve these issues
Part of Speech – how a word is used in a sentence
Grammar – the syntax and morphology of a natural language
Corpus (plural: corpora) – a body of text which may or may not include meta-information such as POS tags, syntactic structure, and semantics
Concordance – a list of the usages of a word in its immediate context from a specific text
>>> text1.concordance("monstrous")
Collocation – a sequence of words that occur together unusually often
e.g. red wine
>>> text4.collocations()
Hapax – a word that appears only once in a corpus
>>> fdist.hapaxes()
Bigram – a sequential pair of words
From the sentence fragment “The quick brown fox…”:
(“The”, “quick”), (“quick”, “brown”), (“brown”, “fox…”)
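nltk.bigrams does this for you; the underlying idea is just pairing each token with its successor. A plain-Python sketch:

```python
def bigrams(tokens):
    # Pair each token with the token that follows it.
    return list(zip(tokens, tokens[1:]))

pairs = bigrams(["The", "quick", "brown", "fox"])
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```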
Frequency Distribution – a tabulation of values according to how often each value occurs in a sample
e.g. word frequency in a corpus, or word length in a corpus
>>> fdist = FreqDist(samples)
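NLTK's FreqDist behaves much like Python's collections.Counter. A sketch of both examples, using a small hand-made sample rather than a real corpus:

```python
from collections import Counter

words = "the cat sat on the mat the end".split()
word_freq = Counter(words)                      # word frequency
length_freq = Counter(len(w) for w in words)    # word-length frequency

most_common_word, count = word_freq.most_common(1)[0]
# "the" occurs 3 times; seven of the eight words are 3 letters long
```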
Conditional Frequency Distribution – a tabulation of values according to how often each value occurs in a sample, given a condition
e.g. how often is a word tagged as a noun compared to a verb?
>>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
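The same idea in plain Python: condition on the word, then count its tags. A sketch with a tiny hypothetical tagged corpus (not NLTK's class):

```python
from collections import defaultdict, Counter

tagged_corpus = [("run", "VB"), ("run", "NN"), ("run", "VB"),
                 ("dog", "NN"), ("dog", "NN")]

# One frequency distribution per condition (here, per word).
cfd = defaultdict(Counter)
for word, tag in tagged_corpus:
    cfd[word][tag] += 1
# cfd["run"] is Counter({'VB': 2, 'NN': 1})
```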
POS tagging
Default tagger – tags everything as a noun
Accuracy ≈ 0.13
Regular expression tagger – uses a set of regexes to tag based on word patterns
Accuracy ≈ 0.2
Unigram tagger – learns the best possible tag for an individual word regardless of context, i.e. a lookup table
Supervised learning; NLTK example accuracy ≈ 0.46
Based on conditional frequency analysis of a corpus:
P(word | tag)
e.g. the probability of seeing the word “run” given the tag “verb”
N-gram tagger – extends the unigram tagger to include the context of previous tokens
Including 1 previous token gives a bigram tagger; including 2 previous tokens gives a trigram tagger
N-gram taggers use Hidden Markov Model probabilities:
P(word | tag) * P(tag | previous n tags)
e.g. the probability of the word “run” given the tag “verb”, times the probability of the tag “verb” given that the previous tag was “noun”
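Both factors are just relative frequencies read off the tagged training corpus. A sketch with made-up counts (all numbers hypothetical):

```python
# Hypothetical counts from a tagged training corpus.
count_tag = {"verb": 200, "noun": 300}
count_word_given_tag = {("run", "verb"): 20}   # "run" tagged as verb
count_tag_bigram = {("noun", "verb"): 50}      # noun followed by verb

# P(word | tag): how often this tag emits this word.
p_word_given_tag = count_word_given_tag[("run", "verb")] / count_tag["verb"]
# P(tag | previous tag): how often verb follows noun.
p_tag_given_prev = count_tag_bigram[("noun", "verb")] / count_tag["noun"]

score = p_word_given_tag * p_tag_given_prev
```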
Tradeoff between coverage and accuracy
Ex. If we train on:
('In', 'IN'), ('this', 'DT'), ('light', 'NN'), ('we', 'PPSS'), ('need', 'VB'), ('1,000', 'CD'), ('churches', 'NNS'), ('in', 'IN'), ('Illinois', 'NP'), (',', ','), ('where', 'WRB'), ('we', 'PPSS'), ('have', 'HV'), ('200', 'CD'), (';', '.')

The bigram context for “light” is (('this', 'DT'), ('light', 'NN')); the trigram context is (('In', 'IN'), ('this', 'DT'), ('light', 'NN')).
Now try to tag “Turn on this light” – that trigram context was never seen in training.
The higher the value of N:
more accurate tagging, but less coverage of unseen phrases – the sparse data problem
Backoff
Primary – trigram; secondary – bigram; tertiary – unigram or default
Backoff in NLTK:
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
Accuracy ≈ 0.845
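The backoff chain itself is simple: each tagger answers only when it has seen the word, otherwise it defers down the chain. A plain-Python sketch of the mechanism (hypothetical classes, not NLTK's):

```python
class DefaultTagger:
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag  # always answers

class LookupTagger:
    def __init__(self, table, backoff):
        self.table = table      # word -> tag, learned from training data
        self.backoff = backoff

    def tag_word(self, word):
        # Answer if the word was seen in training; otherwise defer.
        if word in self.table:
            return self.table[word]
        return self.backoff.tag_word(word)

t0 = DefaultTagger("NN")
t1 = LookupTagger({"run": "VB", "the": "DT"}, backoff=t0)

tags = [t1.tag_word(w) for w in ["the", "run", "zebra"]]
# "zebra" was never seen, so it falls back to the default NN
```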
Brill tagging – inductive, transformation-based learning: “painting with progressively finer brush strokes”
Supervised learning using a tagged corpus as training data
1. Every word is labeled with its most likely tag.
e.g. the sentence “She is expected to race tomorrow”:
PRO/She VBZ/is VBN/expected TO/to NN/race NN/tomorrow
2. Candidate transformation rules are proposed based on the errors made.
e.g. change NN to VB when the previous tag is TO
3. The rule that most improves the tagging is chosen, and the training data is re-tagged.
PRO/She VBZ/is VBN/expected TO/to VB/race NN/tomorrow
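A single Brill transformation is easy to state in code. A sketch applying the rule "change NN to VB when the previous tag is TO" to the example sentence (hypothetical helper, not NLTK's implementation):

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    # Rewrite from_tag -> to_tag wherever the previous token's tag matches.
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("She", "PRO"), ("is", "VBZ"), ("expected", "VBN"),
            ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
fixed = apply_rule(sentence, "NN", "VB", "TO")
# "race" becomes VB; "tomorrow" keeps NN, since its previous tag is not TO
```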
>>> nltk.tag.brill.demo()
For efficiency, Brill uses templates for rule creation
e.g. “the previous word is tagged Z”, “one of the two preceding words is tagged Z”, etc.
More info on POS tagging in Jurafsky, ch. 5 (available at Bobst)
Evaluation is critical!
Human annotators create a “gold standard”
Inter-annotator agreement
Separate training and test data: 90% train, 10% test
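The split and the accuracy measure are both one-liners. A sketch, assuming the data is a list of tagged sentences and accuracy is measured per token:

```python
def split_90_10(sentences):
    # Hold out the last 10% of the data for testing.
    cut = int(len(sentences) * 0.9)
    return sentences[:cut], sentences[cut:]

def accuracy(predicted, gold):
    # Fraction of tokens whose predicted tag matches the gold tag.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

train, test = split_90_10([f"sent{i}" for i in range(100)])
acc = accuracy(["NN", "VB", "NN", "DT"], ["NN", "VB", "JJ", "DT"])  # 0.75
```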
Confusion Matrix

Predicted \ Gold    NN    VB   ADJ
NN                 103    10     7
VB                   3   117     0
ADJ                  9    13    98
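From the matrix you can read off overall accuracy (the diagonal over the total) and per-tag precision. A sketch using the numbers above:

```python
# Rows: predicted tag; columns: gold tag.
labels = ["NN", "VB", "ADJ"]
matrix = {
    "NN":  {"NN": 103, "VB": 10,  "ADJ": 7},
    "VB":  {"NN": 3,   "VB": 117, "ADJ": 0},
    "ADJ": {"NN": 9,   "VB": 13,  "ADJ": 98},
}

total = sum(matrix[p][g] for p in labels for g in labels)
correct = sum(matrix[t][t] for t in labels)   # the diagonal
overall_accuracy = correct / total

# Precision for VB: of everything predicted VB, how much really was VB?
precision_vb = matrix["VB"]["VB"] / sum(matrix["VB"][g] for g in labels)
```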
Remember: performance is always limited by ambiguity in training set and agreement between human annotators
Applications:
Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha
Text analysis of online material for marketing
IM-interface help desks
Break! And pair up for the lab.