Introduction to Artificial Intelligence
CoreNLP, Semantic Analysis, Naive Bayes Classifier
Janyl Jumadinova
November 18, 2016
CoreNLP
- Reference: http://stanfordnlp.github.io/CoreNLP/
- Package available in /opt/corenlp/
- Run: java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file input.txt
CoreNLP Annotators
http://stanfordnlp.github.io/CoreNLP/annotators.html
- tokenize: Creates tokens from the given text.
- ssplit: Separates a sequence of tokens into sentences.
- pos: Creates Parts of Speech (POS) tags for tokens.
- ner: Performs Named Entity Recognition classification.
CoreNLP Annotators
http://stanfordnlp.github.io/CoreNLP/annotators.html
- lemma: Creates word lemmas for tokens.
  - The goal of lemmatization (as of stemming) is to reduce related forms of a word to a common base form.
  - Lemmatization usually uses a vocabulary and morphological analysis of words to:
    - remove inflectional endings only, and
    - return the base or dictionary form of a word, which is known as the lemma.
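A toy sketch of the idea: real lemmatizers such as CoreNLP's lemma annotator use a vocabulary and morphological analysis, but a small hand-made lookup table (entirely illustrative) is enough to show the mapping from inflected forms to dictionary forms:

```python
# Illustrative lemma table standing in for a real morphological analysis.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "saw": "see", "better": "good", "corpora": "corpus"}

def lemmatize(token):
    """Return the base (dictionary) form of a token if known."""
    token = token.lower()
    return LEMMAS.get(token, token)

print([lemmatize(w) for w in "The corpora are better".split()])
# → ['the', 'corpus', 'be', 'good']
```

Unlike a stemmer, which chops suffixes heuristically, this maps irregular forms ("better", "corpora") to their true lemmas.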
Sentiment Analysis
- https://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
- http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis
- www.sentiment140.com
Sentiment analysis has many other names
- Opinion extraction
- Opinion mining
- Sentiment mining
- Subjectivity analysis
Sentiment analysis is the detection of attitudes
- “enduring, affectively colored beliefs, dispositions towards objects or persons”
Attitudes
- Holder (source) of attitude
- Target (aspect) of attitude
- Type of attitude
  - From a set of types: like, love, hate, value, desire, etc.
  - Or (more commonly) simple weighted polarity: positive, negative, or neutral, together with strength
- Text containing the attitude
  - Sentence or entire document
Sentiment analysis
- Simplest task: Is the attitude of this text positive or negative?
- More complex: Rank the attitude of this text from 1 to 5
- Advanced: Detect the target, source, or complex attitude types
Baseline Algorithm
- Tokenization
- Feature Extraction
- Classification using different classifiers
  - Naive Bayes
  - MaxEnt
  - SVM
Sentiment Tokenization Issues
- Deal with HTML and XML markup
- Twitter/Facebook/... mark-up (names, hash tags)
- Capitalization (preserve for words in all caps)
- Phone numbers, dates
- Emoticons
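A minimal sketch of a sentiment-aware tokenizer covering a few of these issues (the regular expression is a toy, not CoreNLP's tokenizer): it keeps emoticons, @-mentions, and #hashtags intact instead of splitting them on punctuation, and leaves all-caps words untouched.

```python
import re

# Order matters: try emoticons and social-media markup before plain words.
TOKEN_RE = re.compile(r"""
    [:;=][-o]?[)(DPp]     # emoticons such as :-) ;) =D
  | [@\#]\w+              # @mentions and #hashtags
  | \w+(?:'\w+)?          # words, with an optional internal apostrophe
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("GREAT movie :-) thanks @janyl #AI"))
# → ['GREAT', 'movie', ':-)', 'thanks', '@janyl', '#AI']
```

A standard word tokenizer would shred ":-)" and "#AI" into punctuation fragments; for sentiment, those units carry signal and are worth preserving whole.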
Extracting Features for Sentiment Classification
- How to handle negation: I didn't like this movie vs. I really like this movie
- Which words to use?
  - Only adjectives
  - All words
Negation
Add NOT to every word between the negation and the following punctuation.
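This heuristic can be sketched in a few lines (the negation word list and the NOT_ prefix spelling are illustrative choices):

```python
# Prefix every token between a negation word and the next punctuation
# mark with "NOT_", so "like" after "didn't" becomes a distinct feature.
NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't"}
PUNCT = {".", ",", "!", "?", ";"}

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCT:
            negating = False      # negation scope ends at punctuation
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negating = True   # start marking following words
    return out

print(mark_negation("I didn't like this movie , but ok".split()))
# → ['I', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'ok']
```

This way "NOT_like" and "like" are counted as different words by the classifier, which is the whole point of the trick.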
Naive Bayes Algorithm
- Simple (“naive”) classification method based on Bayes rule
- Relies on very simple representation of document:
  - Bag of words
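The bag-of-words representation reduces a document to its word counts, discarding word order entirely. A one-liner with `collections.Counter` shows the idea on a toy sentence:

```python
from collections import Counter

# Order is thrown away; only (word, count) pairs remain.
doc = "I love this movie . I love it".lower().split()
bag = Counter(doc)

for word, count in bag.most_common(3):
    print(word, count)
```

Two documents with the same words in different orders produce identical bags, which is exactly the simplification Naive Bayes relies on.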
Naive Bayes Algorithm
For a document d and a class c
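The equation slides here did not survive extraction as text. The standard derivation this material follows (a reconstruction, not the slide images themselves) picks the maximum a posteriori class via Bayes rule, then drops the constant denominator and applies the bag-of-words and conditional-independence assumptions:

```latex
\begin{aligned}
c_{\mathrm{MAP}} &= \arg\max_{c \in C} P(c \mid d)
  && \text{most likely class} \\
 &= \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}
  && \text{Bayes rule} \\
 &= \arg\max_{c \in C} P(d \mid c)\, P(c)
  && P(d) \text{ is the same for every class} \\
c_{\mathrm{NB}} &= \arg\max_{c \in C} P(c) \prod_{i} P(w_i \mid c)
  && \text{bag of words + conditional independence}
\end{aligned}
```

In practice the product is computed as a sum of log probabilities, and each $P(w_i \mid c)$ is estimated from counts with add-one (Laplace) smoothing.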
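Putting the pieces together, here is a compact multinomial Naive Bayes with add-one smoothing on a toy sentiment corpus (data, labels, and function names are all illustrative, not from the lecture):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (token_list, label). Returns the count-based model."""
    prior = Counter(label for _, label in docs)   # class document counts
    counts = defaultdict(Counter)                 # per-class word counts
    vocab = set()
    for tokens, label in docs:
        counts[label].update(tokens)
        vocab.update(tokens)
    return prior, counts, vocab, len(docs)

def predict(tokens, prior, counts, vocab, n_docs):
    best, best_lp = None, float("-inf")
    for c in prior:
        lp = math.log(prior[c] / n_docs)          # log P(c)
        total = sum(counts[c].values())
        for w in tokens:
            if w in vocab:                        # ignore unseen words
                # add-one smoothed log P(w | c)
                lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [("good great fun".split(), "+"), ("boring bad plot".split(), "-"),
        ("great plot".split(), "+"), ("bad boring".split(), "-")]
model = train(docs)
print(predict("great fun".split(), *model))  # → '+'
```

Working in log space avoids underflow from multiplying many small probabilities, and the +1 in each numerator keeps a single unseen-in-class word from zeroing out an entire class.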
Binarized (Boolean feature) Multinomial Naive Bayes
Intuition:
- Word occurrence may matter more than word frequency
- The occurrence of the word fantastic tells us a lot
- The fact that it occurs 5 times may not tell us much more
Boolean Multinomial Naive Bayes clips all the word counts in each document at 1.
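The clipping step is a one-line change to the training pipeline: deduplicate each document's tokens before counting. A toy sketch:

```python
from collections import Counter

def binarize(tokens):
    """Clip per-document word counts at 1 by deduplicating tokens."""
    return sorted(set(tokens))

doc = "fantastic fantastic fantastic great fantastic".split()
raw = Counter(doc)        # raw counts: fantastic appears 4 times
print(binarize(doc))      # → ['fantastic', 'great']
```

Training the multinomial model on `binarize(doc)` instead of `doc` gives the Boolean variant; everything else in the algorithm stays the same.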
Neural Networks and Deep Learning: Next!
- http://nlp.stanford.edu/sentiment/
- java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.sentiment.SentimentPipeline -file input.txt