CIS 192 Python Programming
NLP
Harry Smith
University of Pennsylvania
April 12, 2017


Outline

1 Natural Language Processing (NLP)
    Motivation
    Tokenization
    Counting Word Frequencies
    Generating Random Sentences
    Part of Speech Tagging
    Free Word Association
    Sentiment Analysis
    Next Steps


Natural Language Processing


[Figure omitted. Source: researchperspectives.org]


Language is Hard

How can a computer:
- Recognize parts of speech of sentences?
- Tell whether a sentence is positive or negative?
- Figure out what words are most commonly used?
- Summarize text?


Some Applications of NLP

- Sentiment analysis
- Spam filtering
- Plagiarism detection
- Document categorization
- Summarization
- Text search
- Much more...


Natural Language Tool Kit (NLTK)

- NLP toolkit for English in Python
- Developed at Penn in 2001!
- nltk.org


Terminology

- Corpus: a body of text
- Token: each meaningful "entity" in a string
    - Depending on context, tokens can be words, sentences, or paragraphs
- Part of Speech: categories that words are assigned to
    - noun, verb, adjective, ...
- Stopwords: most common words in a language, filtered out before NLP tasks
    - the, is, at, which, on, ...
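As a rough illustration of stopword filtering, the sketch below drops common words from a token list. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical and hand-picked for this example; NLTK ships a fuller English list as `nltk.corpus.stopwords.words('english')` (it needs a one-time data download).

```python
# Hypothetical, hand-picked stopword set for illustration only.
STOPWORDS = {"the", "is", "at", "which", "on", "of", "a", "an"}

def remove_stopwords(tokens):
    """Return the tokens whose lowercase form is not in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["The", "mitochondria", "is", "the", "powerhouse", "of", "the", "cell"]
content_words = remove_stopwords(tokens)
print(content_words)  # ['mitochondria', 'powerhouse', 'cell']
```

Lowercasing before the lookup is a design choice here so that "The" and "the" are treated alike.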


Word Tokenization

>>> import nltk
>>> nltk.word_tokenize('The mitochondria is the powerhouse of the cell.')
['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']
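To get a feel for what a word tokenizer has to do, here is a minimal regex-based sketch using only the standard library. It is not how NLTK tokenizes; `simple_word_tokenize` and `TOKEN_PATTERN` are hypothetical names for this example.

```python
import re

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_word_tokenize("The mitochondria is the powerhouse of the cell.")
print(tokens)
# ['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']
```

This simple pattern breaks down on contractions: "wasn't" comes out as three tokens ('wasn', "'", 't'), which is one reason to prefer a purpose-built tokenizer.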


Sentence Tokenization

>>> sentences = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
...              "It was a great class...and I wasn't able to get off the waitlist for CIS 677.")
>>> nltk.sent_tokenize(sentences)
['Prof. Sanjeev Khanna taught CIS 320 last spring.',
 "It was a great class...and I wasn't able to get off the waitlist for CIS 677."]
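Why not just split on periods? The sketch below is a deliberately naive splitter (the name `naive_sent_tokenize` is hypothetical) that breaks after ".", "!" or "?" followed by whitespace. It shows the kind of mistake a sentence tokenizer has to avoid.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
        "It was a great class...and I wasn't able to get off the waitlist for CIS 677.")
pieces = naive_sent_tokenize(text)
print(pieces[0])    # Prof.
print(len(pieces))  # 3
```

The abbreviation "Prof." is wrongly treated as a sentence of its own, so the naive version returns three pieces where the output above has two.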


Counting Words in a Corpus

Before today:

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word in words:
...     counts[word] += 1

Better:

>>> from nltk import FreqDist
>>> counts = FreqDist(words)
>>> counts.most_common(10)  #=> [('the', 49), ...]

Neat!
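FreqDist is built on top of `collections.Counter` from the standard library, which is where `most_common` comes from. A small sketch with `Counter` alone, on a made-up toy sentence:

```python
from collections import Counter

# Toy corpus, invented for this example.
words = "the cat sat on the mat and the dog sat too".split()

counts = Counter(words)        # maps each word to its number of occurrences
print(counts.most_common(2))   # [('the', 3), ('sat', 2)]
print(counts["zebra"])         # 0: missing words count as zero, no KeyError
```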


Creating "random" sentences from a corpus

- In probability theory, Markov chains are "memoryless"
- "Future state depends on current state only"
- To create a "random" sentence:
    - Take your current word
    - Add a new word that typically appears after your current word
    - Repeat!
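The three steps can be sketched as a word-level (bigram) Markov chain. This is one possible reading of the recipe, not necessarily the version shown in lecture; `build_chain`, `generate`, and the toy corpus are hypothetical.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each token to the list of tokens that follow it in the corpus."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=10, rng=random):
    """Walk the chain from `start`, choosing each next word at random."""
    words = [start]
    while len(words) < max_words:
        followers = chain.get(words[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        words.append(rng.choice(followers))
    return " ".join(words)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)
print(generate(chain, "the", rng=random.Random(0)))
```

Followers are stored with repeats, so a word that appears after "the" more often is proportionally more likely to be chosen. The next word depends only on the current word, which is the "memoryless" property.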


Part of Speech Tagging

- Use nltk.pos_tag(list_of_tokens) to identify part of speech tags
- nltk.help.upenn_tagset shows what each tag code means
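nltk.pos_tag returns a list of (token, tag) pairs. To show that shape without NLTK's trained model, here is a toy dictionary-lookup tagger. It is not NLTK's tagger: `TOY_LEXICON` and `lookup_tag` are hypothetical, and the lexicon covers only a few words, using Penn Treebank tag codes.

```python
# Hypothetical toy lexicon: lowercase word -> Penn Treebank tag.
TOY_LEXICON = {
    "the": "DT",   # determiner
    "is": "VBZ",   # verb, 3rd person singular present
    "of": "IN",    # preposition
    ".": ".",      # sentence-final punctuation
}

def lookup_tag(tokens, default="NN"):
    """Tag tokens from the lexicon; unknown words get `default` (singular noun)."""
    return [(tok, TOY_LEXICON.get(tok.lower(), default)) for tok in tokens]

tokens = ["The", "mitochondria", "is", "the", "powerhouse", "of", "the", "cell", "."]
tagged = lookup_tag(tokens)
print(tagged[:3])  # [('The', 'DT'), ('mitochondria', 'NN'), ('is', 'VBZ')]
```

A lookup with a default tag ignores context entirely, so it cannot tell a noun from a verb for words like "run"; a trained tagger uses the surrounding words as well.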


Free Word Association

- After hearing a word, what’s the word that immediately comes to mind?
- "dog" → "cat" → "meow" → ...
- How would we do this?
- Simple way:
    - For each token in our corpus, count the occurrences of surrounding tokens
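A minimal sketch of the simple way: count co-occurrences inside a small window around each token, then answer with the most frequent neighbor. The function names, the window size, and the toy token list are assumptions made for this example.

```python
from collections import Counter, defaultdict

def build_associations(tokens, window=1):
    """For each token, count the tokens seen within `window` positions of it."""
    neighbors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbors[token][tokens[j]] += 1
    return neighbors

def associate(neighbors, word):
    """Return the most frequent neighbor of `word`, or None if it was never seen."""
    if word not in neighbors:
        return None
    return neighbors[word].most_common(1)[0][0]

# Toy corpus, invented so that the chain from the slide appears.
tokens = ["dog", "cat", "meow", "cat", "meow"]
neighbors = build_associations(tokens)
print(associate(neighbors, "dog"))  # cat
print(associate(neighbors, "cat"))  # meow
```

On real text it usually helps to remove stopwords first; otherwise words like "the" tend to be the top neighbor of almost everything.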


Sentiment Analysis

- Is some particular text positive or negative (and to what degree)?
- How might we do this?
    - Machine learning (last two lectures)
        - Try to "learn" the sentiment-relevant features of text
        - Needs lots of training data
        - Data-driven approach
    - Rule-based methods
        - "Rule of thumb": uses heuristics to determine sentiment
        - Needs little training data
        - Good for production: fast, but harder to create initially
        - VADER: popular rule-based model aimed at social media
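To make "rule-based" concrete, here is a tiny lexicon-plus-negation scorer. It is a toy sketch and not VADER: the `LEXICON` scores, the `NEGATIONS` set, and the single negation rule are all made up for this example. NLTK includes a VADER implementation as `nltk.sentiment.vader.SentimentIntensityAnalyzer` (it needs a one-time lexicon download).

```python
# Hypothetical toy lexicon: word -> sentiment score.
LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}

def rule_based_sentiment(tokens):
    """Sum lexicon scores, flipping the sign of a word right after a negation."""
    score = 0.0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok.lower(), 0.0)
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            value = -value  # heuristic: "not good" counts as negative
        score += value
    return score

print(rule_based_sentiment("It was a great class".split()))  # 2.0
print(rule_based_sentiment("It was not good".split()))       # -1.0
```

No training data is involved; all of the behavior comes from the hand-written lexicon and rule, which is also why such systems take effort to build well.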


Next Steps

- CIS 526: Machine Translation
- CIS 530: Computational Linguistics
- Kaggle for large datasets, competitions
- awesome-nlp: curated list of NLP resources on GitHub
