Introduction to Artificial Intelligence
CoreNLP, Semantic Analysis, Naive Bayes Classifier
Janyl Jumadinova
November 18, 2016
CoreNLP
- Reference: http://stanfordnlp.github.io/CoreNLP/
- Package available in /opt/corenlp/
- Run: java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file input.txt
CoreNLP Annotators
http://stanfordnlp.github.io/CoreNLP/annotators.html
- tokenize: Creates tokens from the given text.
- ssplit: Separates a sequence of tokens into sentences.
- pos: Creates Parts of Speech (POS) tags for tokens.
- ner: Performs Named Entity Recognition classification.
CoreNLP Annotators
http://stanfordnlp.github.io/CoreNLP/annotators.html
- lemma: Creates word lemmas for tokens.
  - The goal of lemmatization (as of stemming) is to reduce related forms of a word to a common base form.
  - Lemmatization usually uses a vocabulary and morphological analysis of words to:
    - remove inflectional endings only, and
    - return the base or dictionary form of a word, which is known as the lemma.
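A toy sketch of the idea: real lemmatizers such as CoreNLP's lemma annotator use a vocabulary and morphological analysis, but a small hand-made lookup table (entirely illustrative) is enough to show the mapping from inflected forms to dictionary forms:

```python
# Illustrative lemma table standing in for a real morphological analysis.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "saw": "see", "better": "good", "corpora": "corpus"}

def lemmatize(token):
    """Return the base (dictionary) form of a token if known."""
    token = token.lower()
    return LEMMAS.get(token, token)

print([lemmatize(w) for w in "The corpora are better".split()])
# → ['the', 'corpus', 'be', 'good']
```

Unlike a stemmer, which chops suffixes heuristically, this maps irregular forms ("better", "corpora") to their true lemmas.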
Sentiment Analysis
- https://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
- http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis
- www.sentiment140.com
Sentiment analysis has many other names
- Opinion extraction
- Opinion mining
- Sentiment mining
- Subjectivity analysis
Sentiment analysis is the detection of attitudes
- “enduring, affectively colored beliefs, dispositions towards objects or persons”
Attitudes
- Holder (source) of attitude
- Target (aspect) of attitude
- Type of attitude
  - From a set of types: like, love, hate, value, desire, etc.
  - Or (more commonly) simple weighted polarity: positive, negative, or neutral, together with strength
- Text containing the attitude
  - Sentence or entire document
Sentiment analysis
- Simplest task: Is the attitude of this text positive or negative?
- More complex: Rank the attitude of this text from 1 to 5
- Advanced: Detect the target, source, or complex attitude types
Baseline Algorithm
- Tokenization
- Feature Extraction
- Classification using different classifiers
  - Naive Bayes
  - MaxEnt
  - SVM
Sentiment Tokenization Issues
- Deal with HTML and XML markup
- Twitter/Facebook/... mark-up (names, hash tags)
- Capitalization (preserve for words in all caps)
- Phone numbers, dates
- Emoticons
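A minimal sketch of a sentiment-aware tokenizer covering a few of these issues (the regular expression is a toy, not CoreNLP's tokenizer): it keeps emoticons, @-mentions, and #hashtags intact instead of splitting them on punctuation, and leaves all-caps words untouched.

```python
import re

# Order matters: try emoticons and social-media markup before plain words.
TOKEN_RE = re.compile(r"""
    [:;=][-o]?[)(DPp]     # emoticons such as :-) ;) =D
  | [@\#]\w+              # @mentions and #hashtags
  | \w+(?:'\w+)?          # words, with an optional internal apostrophe
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("GREAT movie :-) thanks @janyl #AI"))
# → ['GREAT', 'movie', ':-)', 'thanks', '@janyl', '#AI']
```

A standard word tokenizer would shred ":-)" and "#AI" into punctuation fragments; for sentiment, those units carry signal and are worth preserving whole.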
Extracting Features for Sentiment Classification
- How to handle negation: I didn't like this movie vs. I really like this movie
- Which words to use?
  - Only adjectives
  - All words
Negation
Add NOT to every word between the negation and the following punctuation.
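This heuristic can be sketched in a few lines (the negation word list and the NOT_ prefix spelling are illustrative choices):

```python
# Prefix every token between a negation word and the next punctuation
# mark with "NOT_", so "like" after "didn't" becomes a distinct feature.
NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't"}
PUNCT = {".", ",", "!", "?", ";"}

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCT:
            negating = False      # negation scope ends at punctuation
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negating = True   # start marking following words
    return out

print(mark_negation("I didn't like this movie , but ok".split()))
# → ['I', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'ok']
```

This way "NOT_like" and "like" are counted as different words by the classifier, which is the whole point of the trick.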
Naive Bayes Algorithm
- Simple (“naive”) classification method based on Bayes rule
- Relies on very simple representation of document:
  - Bag of words
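The bag-of-words representation reduces a document to its word counts, discarding word order entirely. A one-liner with `collections.Counter` shows the idea on a toy sentence:

```python
from collections import Counter

# Order is thrown away; only (word, count) pairs remain.
doc = "I love this movie . I love it".lower().split()
bag = Counter(doc)

for word, count in bag.most_common(3):
    print(word, count)
```

Two documents with the same words in different orders produce identical bags, which is exactly the simplification Naive Bayes relies on.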
Naive Bayes Algorithm
For a document d and a class c
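The equation slides here did not survive extraction as text. The standard derivation this material follows (a reconstruction, not the slide images themselves) picks the maximum a posteriori class via Bayes rule, then drops the constant denominator and applies the bag-of-words and conditional-independence assumptions:

```latex
\begin{aligned}
c_{\mathrm{MAP}} &= \arg\max_{c \in C} P(c \mid d)
  && \text{most likely class} \\
 &= \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}
  && \text{Bayes rule} \\
 &= \arg\max_{c \in C} P(d \mid c)\, P(c)
  && P(d) \text{ is the same for every class} \\
c_{\mathrm{NB}} &= \arg\max_{c \in C} P(c) \prod_{i} P(w_i \mid c)
  && \text{bag of words + conditional independence}
\end{aligned}
```

In practice the product is computed as a sum of log probabilities, and each $P(w_i \mid c)$ is estimated from counts with add-one (Laplace) smoothing.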
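Putting the pieces together, here is a compact multinomial Naive Bayes with add-one smoothing on a toy sentiment corpus (data, labels, and function names are all illustrative, not from the lecture):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (token_list, label). Returns the count-based model."""
    prior = Counter(label for _, label in docs)   # class document counts
    counts = defaultdict(Counter)                 # per-class word counts
    vocab = set()
    for tokens, label in docs:
        counts[label].update(tokens)
        vocab.update(tokens)
    return prior, counts, vocab, len(docs)

def predict(tokens, prior, counts, vocab, n_docs):
    best, best_lp = None, float("-inf")
    for c in prior:
        lp = math.log(prior[c] / n_docs)          # log P(c)
        total = sum(counts[c].values())
        for w in tokens:
            if w in vocab:                        # ignore unseen words
                # add-one smoothed log P(w | c)
                lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [("good great fun".split(), "+"), ("boring bad plot".split(), "-"),
        ("great plot".split(), "+"), ("bad boring".split(), "-")]
model = train(docs)
print(predict("great fun".split(), *model))  # → '+'
```

Working in log space avoids underflow from multiplying many small probabilities, and the +1 in each numerator keeps a single unseen-in-class word from zeroing out an entire class.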
Binarized (Boolean feature) Multinomial Naive Bayes
Intuition:
- Word occurrence may matter more than word frequency
- The occurrence of the word fantastic tells us a lot
- The fact that it occurs 5 times may not tell us much more
Boolean Multinomial Naive Bayes clips all the word counts in each document at 1.
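The clipping step is a one-line change to the training pipeline: deduplicate each document's tokens before counting. A toy sketch:

```python
from collections import Counter

def binarize(tokens):
    """Clip per-document word counts at 1 by deduplicating tokens."""
    return sorted(set(tokens))

doc = "fantastic fantastic fantastic great fantastic".split()
raw = Counter(doc)        # raw counts: fantastic appears 4 times
print(binarize(doc))      # → ['fantastic', 'great']
```

Training the multinomial model on `binarize(doc)` instead of `doc` gives the Boolean variant; everything else in the algorithm stays the same.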
Neural Networks and Deep Learning: Next!
- http://nlp.stanford.edu/sentiment/
- java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.sentiment.SentimentPipeline -file input.txt