NLP Harry Smith - Penn EngineeringCIS192 Python Programming NLP Harry Smith University of...

Post on 11-Jul-2020

2 views 0 download

Transcript of NLP Harry Smith - Penn EngineeringCIS192 Python Programming NLP Harry Smith University of...

CIS192 Python ProgrammingNLP

Harry Smith

University of Pennsylvania

April 12, 2017

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 1 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 2 / 25

Natural Language Processing

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 3 / 25

Natural Language Processing

source: researchperspectives.org

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 4 / 25

Language is Hard

How can a computer:Recognize parts of speech of sentences?Tell whether a sentence is positive or negative?Figure out what words are most commonly used?Summarize text?

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 5 / 25

Some Applications of NLP

Sentiment analysisSpam filteringPlagiarism detectionDocument categorizationSummarizationText searchMuch more...

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 6 / 25

Natural Language Tool Kit (NLTK)

NLP toolkit for English in PythonDeveloped at Penn in 2001!nltk.org

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 7 / 25

Terminology

Corpus: a body of textToken: Each meaningful "entity" in a string

I Depending on context, tokens can be words, sentences,paragraphs

Part of Speech: categories that words are assigned toI noun, verb, adjective, ...

Stopwords: most common words in a language, filtered out beforeNLP tasks

I the, is, at, which, on, ...

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 8 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 9 / 25

Word Tokenization

>>> nltk.word_tokenize(’The mitochondria is thepowerhouse of the cell.’)[’The’, ’mitochondria’, ’is’, ’the’, ’powerhouse’, ’of’, ’the’, ’cell’, ’.’]

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 10 / 25

Sentence Tokenization

>>> sentences = "Prof. Sanjeev Khanna taught CIS 320last spring. It was a great class...and I wasn’t

able to get off the waitlist for CIS 677.">>> nltk.sent_tokenize(sentences)[’Prof. Sanjeev Khanna taught CIS 320 last spring.’,"It was a great class...and I wasn’t able to getoff the waitlist for CIS 677."]

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 11 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 12 / 25

Counting Words in a Corpus

Before today:

>>> counts = defaultdict(int)>>> for word in words:

counts[word] += 1

Better:

>>> counts = FreqDist(words)>>> counts.most_common(10) #=> [(’the’, 49), ...]

Neat!

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 13 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 14 / 25

Creating "random" sentences from a corpus

In probability theory, Markov Chains are "memoryless""Future state depends on current state only"To create a "random" sentence:

I Take your current wordI Add a new word that typically appears after your current wordI Repeat!

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 15 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 16 / 25

Part of Speech Tagging

Use nltk.pos_tag(list_of_tokens) to identify part ofspeech tagsnltk.help.upenn_tagset shows what each tag code means

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 17 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 18 / 25

Free Word Association

After hearing a word, what’s the word that immediately comes tomind?"dog" → "cat" → "meow" → ...How would we do this?

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 19 / 25

Free Word Association

After hearing a word, what’s the word that immediately comes tomind?"dog" → "cat" → "meow" → ...Simple way:

I For each token in our corpus, count the occurrences of surroundingtokens

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 20 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 21 / 25

Sentiment Analysis

Is some particular text is positive or negative (and to whatdegree?)How might we do this?

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 22 / 25

Sentiment Analysis

Is some particular text is positive or negative (and to whatdegree?)How might we do this?

I Machine learning (last two lectures)F Try to "learn" the sentiment-relevant features of textF Need lots of training dataF Data driven approach

I Rule-based methodsF "Rule of thumb": uses heuristics to determine sentimentsF Needs little training dataF Good for production: fast, but harder to initially createF VADER: popular rule based model aimed for social media

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 23 / 25

Outline

1 Natural Language Processing (NLP)MotivationTokenizationCounting Word FrequenciesGenerating Random SentencesPart of Speech TaggingFree Word AssociationSentiment AnalysisNext Steps

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 24 / 25

Next Steps

CIS 526: Machine TranslationCIS 530: Computational LinguisticsKaggle for large datasets, competitionsawesome-nlp: curated list of NLP resources on GitHub

Harry Smith (University of Pennsylvania) CIS 192 April 12, 2017 25 / 25