CIS 192 Python Programming
NLP
Harry Smith
University of Pennsylvania
April 12, 2017
Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 1 / 25
Outline
1 Natural Language Processing (NLP)
    Motivation
    Tokenization
    Counting Word Frequencies
    Generating Random Sentences
    Part of Speech Tagging
    Free Word Association
    Sentiment Analysis
    Next Steps
Natural Language Processing
source: researchperspectives.org
Language is Hard
How can a computer:
- Recognize parts of speech of sentences?
- Tell whether a sentence is positive or negative?
- Figure out what words are most commonly used?
- Summarize text?
Some Applications of NLP
- Sentiment analysis
- Spam filtering
- Plagiarism detection
- Document categorization
- Summarization
- Text search
- Much more...
Natural Language Tool Kit (NLTK)
- NLP toolkit for English in Python
- Developed at Penn in 2001!
- nltk.org
Terminology
- Corpus: a body of text
- Token: each meaningful "entity" in a string
  - Depending on context, tokens can be words, sentences, or paragraphs
- Part of Speech: categories that words are assigned to
  - noun, verb, adjective, ...
- Stopwords: the most common words in a language, filtered out before NLP tasks
  - the, is, at, which, on, ...
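To make the stopword idea concrete, here is a minimal sketch in plain Python. The tiny STOPWORDS set and the remove_stopwords helper are illustrative assumptions, not part of the lecture; NLTK ships fuller per-language lists in nltk.corpus.stopwords.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# real stopword lists are much longer.
STOPWORDS = {"the", "is", "at", "which", "on", "of"}


def remove_stopwords(tokens):
    """Return the tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


tokens = ["The", "mitochondria", "is", "the", "powerhouse", "of", "the", "cell"]
content_words = remove_stopwords(tokens)
print(content_words)
```

Only the content-bearing words survive the filter, which is usually what later steps such as frequency counting care about.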
Word Tokenization
>>> import nltk
>>> nltk.word_tokenize('The mitochondria is the powerhouse of the cell.')
['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']
Sentence Tokenization
>>> sentences = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
...              "It was a great class...and I wasn't able to get off "
...              "the waitlist for CIS 677.")
>>> nltk.sent_tokenize(sentences)
['Prof. Sanjeev Khanna taught CIS 320 last spring.',
 "It was a great class...and I wasn't able to get off the waitlist for CIS 677."]
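A deliberately naive regex version shows why a trained tokenizer is worth having. The helper names naive_word_tokenize and naive_sent_split are hypothetical, and the patterns are assumptions chosen for illustration; this is not how NLTK works internally.

```python
import re


def naive_word_tokenize(text):
    """Return runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def naive_sent_split(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


words = naive_word_tokenize("The mitochondria is the powerhouse of the cell.")

text = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
        "It was a great class...and I wasn't able to get off "
        "the waitlist for CIS 677.")
pieces = naive_sent_split(text)

print(words)
print(pieces)
```

The word pattern copes with the simple sentence, but the sentence splitter cuts right after "Prof." and returns three pieces instead of two. Abbreviations like this are the kind of case nltk.sent_tokenize handles in the example above.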
Counting Words in a Corpus
Before today:
>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word in words:
...     counts[word] += 1

Better:

>>> from nltk import FreqDist
>>> counts = FreqDist(words)
>>> counts.most_common(10)  # => [('the', 49), ...]
Neat!
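FreqDist is built on collections.Counter from the standard library, so the same counting pattern also works without NLTK. A minimal sketch; the words list is made-up sample data, not the corpus used in lecture:

```python
from collections import Counter

# Made-up sample tokens (assumption, for illustration only).
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

counts = Counter(words)          # maps each word to its number of occurrences
top_two = counts.most_common(2)  # (word, count) pairs, most frequent first
print(top_two)
```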
Creating "random" sentences from a corpus
- In probability theory, Markov chains are "memoryless"
- "Future state depends on current state only"
- To create a "random" sentence:
  - Take your current word
  - Add a new word that typically appears after your current word
  - Repeat!
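The three steps above can be sketched as a first-order Markov chain over words. This is one possible implementation, not necessarily the one shown in lecture; build_chain, generate, and the toy corpus are hypothetical names and data.

```python
import random
from collections import defaultdict


def build_chain(tokens):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, rng=random):
    """Walk the chain from `start`, picking each next word at random."""
    sentence = [start]
    current = start
    # Stop at the length limit or when the current word has no known successor.
    while len(sentence) < max_words and chain.get(current):
        current = rng.choice(chain[current])
        sentence.append(current)
    return " ".join(sentence)


# Toy corpus (assumption); a real corpus would be far larger.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)
print(generate(chain, "the", rng=random.Random(0)))
```

Because successors are stored with repetition, rng.choice picks a follower with probability proportional to how often it appeared after the current word. The choice depends only on the current word, which is the "memoryless" property.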
Part of Speech Tagging
- Use nltk.pos_tag(list_of_tokens) to identify part of speech tags
- nltk.help.upenn_tagset() shows what each tag code means
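nltk.pos_tag returns a list of (token, tag) tuples, so a common next step is grouping tokens by tag. In this sketch, tag_text and group_by_tag are hypothetical helper names. tag_text is defined but not called, because it needs NLTK plus its tokenizer and tagger data (installed through nltk.download). The tagged pairs are hand-written for illustration, not actual tagger output.

```python
from collections import defaultdict


def tag_text(text):
    """Tokenize and tag `text` with NLTK (requires nltk and its data)."""
    import nltk  # imported lazily so group_by_tag works without NLTK
    return nltk.pos_tag(nltk.word_tokenize(text))


def group_by_tag(tagged):
    """Collect tokens under their tag, e.g. {'NN': [...], 'DT': [...]}."""
    groups = defaultdict(list)
    for token, tag in tagged:
        groups[tag].append(token)
    return dict(groups)


# Hand-written (token, tag) pairs in the shape pos_tag returns (assumption).
tagged = [("The", "DT"), ("dog", "NN"), ("chased", "VBD"),
          ("the", "DT"), ("cat", "NN")]
groups = group_by_tag(tagged)
print(groups["NN"])
```

With NLTK installed, group_by_tag(tag_text("some sentence")) would apply the same grouping to real tagger output.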
Free Word Association
After hearing a word, what’s the word that immediately comes to mind?
"dog" → "cat" → "meow" → ...
How would we do this?
Simple way:
- For each token in our corpus, count the occurrences of surrounding tokens
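A minimal sketch of that counting approach, assuming a fixed window of neighboring tokens on each side. The window size, the helper names build_associations and associate, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_associations(tokens, window=2):
    """For each token, count the tokens within `window` positions of it."""
    neighbors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbors[token][tokens[j]] += 1
    return neighbors


def associate(neighbors, word):
    """Return the most frequent neighbor of `word`, or None if unseen."""
    if word not in neighbors or not neighbors[word]:
        return None
    return neighbors[word].most_common(1)[0][0]


# Toy corpus (assumption); stopwords are left out so they cannot dominate.
tokens = "dog chases cat cat says meow dog sees cat".split()
neighbors = build_associations(tokens)
print(associate(neighbors, "dog"))
```

On a real corpus, filtering stopwords first matters: otherwise nearly every word would associate to "the".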
Sentiment Analysis
Is some particular text positive or negative (and to what degree)?
How might we do this?
- Machine learning (last two lectures)
  - Try to "learn" the sentiment-relevant features of text
  - Needs lots of training data
  - Data-driven approach
- Rule-based methods
  - "Rule of thumb": uses heuristics to determine sentiment
  - Needs little training data
  - Good for production: fast, but harder to create initially
  - VADER: popular rule-based model aimed at social media
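To give a feel for the rule-based approach, here is a toy scorer with a hand-written lexicon, negation flipping, and intensifiers. It is not VADER: the lexicon, weights, and rules are invented for illustration. NLTK includes a VADER implementation as nltk.sentiment.vader.SentimentIntensityAnalyzer, which uses a much larger lexicon and more rules.

```python
# Hypothetical mini-lexicon and rules, invented for illustration.
LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 1.5, "really": 1.5}


def score(text):
    """Sum word scores; boosters scale and negations flip the next scored word."""
    total = 0.0
    multiplier = 1.0
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            multiplier *= -1.0
        elif word in BOOSTERS:
            multiplier *= BOOSTERS[word]
        elif word in LEXICON:
            total += multiplier * LEXICON[word]
            multiplier = 1.0
        else:
            multiplier = 1.0  # modifiers only carry over to the very next word
    return total


print(score("It was a great class"))
print(score("not very good"))
```

A positive total suggests positive sentiment, a negative total suggests negative sentiment, and the magnitude gives a rough degree. No training data is needed, but every rule has to be written and tuned by hand.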
Next Steps
- CIS 526: Machine Translation
- CIS 530: Computational Linguistics
- Kaggle for large datasets, competitions
- awesome-nlp: curated list of NLP resources on GitHub