Sentiment Analysis
S. V. Giri ([email protected])
Sentiment analysis or Opinion Mining
Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document.
~ Wikipedia [1]
Levels [2] at which sentiments can be expressed: phrase, sentence, paragraph, document, about a subject.
Examples
User’s Opinions
Bob: It's a great movie. (Positive sentiment)
Alice: Nah!! I didn't like it at all. (Negative sentiment)
Bob: I am not so sure about the movie. You may like it, or maybe not! (Neutral, confused)
Sentiment on subjects from Twitter feeds - Ireland election, 2011
(Courtesy - IBM)
Sentiment on Aspects – Google product
Why Sentiment Analysis?
Understanding public opinion on products, movies, etc. Ex: there is 67% negative opinion on the color of Amazon's new version of the Kindle.
Using this knowledge to:
Make predictions about market trends, the results of election polls, etc.
Make decisions! Ex: changing the color in subsequent versions.
Personalization! Ex: recommending products depending on what your friends feel.
Opinions - Polarity [3]
Binary: positive / negative
Ordinal values. Ex: a rating from 1 to 5
Complex polarity: detect the source, target and attitude. Ex: "Obama offers comfort after Colorado shooting." Source: Obama, Target: people, Attitude: comfort
Approaches
NLP: uses semantics to understand the language; relies on lexicons, dictionaries, ontologies. Ex: "I feel great today." (Understands that the user's feeling is great.)
Machine learning: doesn't have to understand the meaning; uses classifiers such as Naïve Bayes, SVM, MaxEnt, etc. Ex: "I feel great today." (Doesn't have to understand what the user is feeling; the fact that the word "great" appears in the positive or the negative training set is enough to classify the sentence as positive or negative.)
NLP
Apple iPod review
Alice: Apple iPod is a great music player. It's better than any other product I have bought.
great - positive
better - positive
Total positives = 2, total negatives = 0
Net score = 2 - 0 = 2, hence the review is positive.
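A minimal sketch of this word-counting approach, assuming a toy two-entry-per-polarity lexicon in place of a real sentiment dictionary:

    # Toy lexicon; a real system would use a full sentiment dictionary.
    LEXICON = {"great": +1, "better": +1, "bad": -1, "worst": -1}

    def net_score(review):
        """Net score = number of positive words - number of negative words."""
        return sum(LEXICON.get(w, 0) for w in review.lower().split())

    review = "Apple iPod is a great music player. It's better than any other product I have bought."
    score = net_score(review)  # 2
    print("Positive" if score > 0 else "Negative" if score < 0 else "Neutral")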
NLP Contd…
Apple iPod review
Alice: Apple iPod is not bad at all. You can buy it.
not - negative
bad - negative
Total positives = 0, total negatives = 2
Net score = 0 - 2 = -2, hence the review is (wrongly) classified as negative.
Note: this can be solved by a preprocessing stage, such as converting "not bad" to "good", but preprocessing for NLP is complex. A small sketch follows.
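A minimal sketch of such a negation-rewriting stage, assuming a hand-made phrase table (real preprocessing must handle negation scope, sarcasm and much more):

    # Hand-made rewrite table; illustration only.
    REWRITES = {"not bad": "good", "not good": "bad"}

    def rewrite_negations(text):
        out = text.lower()
        for phrase, replacement in REWRITES.items():
            out = out.replace(phrase, replacement)
        return out

    print(rewrite_negations("Apple iPod is not bad at all."))
    # -> "apple ipod is good at all."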
Machine Learning [8]
Requires a good classifier.
Requires a training set for each class.
In our case: 2 classes, positive and negative, and a pre-classified training set for both of these classes.
Machine Learning Contd…
Training data for the movie domain
Positive class:
"Sleepy Hollow is an awesome movie. Everyone should watch it."
"Christopher Nolan is such a great director that he can convert any script into a blockbuster."
"Great actors, great direction and a great movie."
Negative class:
"Nothing can make this movie better."
"It can win the stupidest movie of the year award, if there is such a thing."
Machine Learning over NLP
Advantages Don’t have to create a sentiment lexicon (great is
80% positive, bad is 75% negative etc…) Categorization of proper nouns as well
(Ex: Cameron Diaz) Generic and can be applied for various domains Language independent models (Ex: J'aime le film "Amélie") Disadvantage:
Should have large sets of training data
Architecture
Data sources: Yelp and CityGrid
Pipeline: data collection → pre-processing → prepare training set / prepare test set → train classifier → test classifier
Data Collection
CityGrid Media is an online media company that connects web and mobile publishers with local businesses by linking them through CityGrid.
Provides a RESTful API with ratings (0-10) and reviews.
Domain: restaurants
Pre-Processing
Tokenization
Case conversion
Conversion of contractions to full forms ("don't" to "do not", "I'll" to "I will")
Removal of punctuation
Stop-word filtering using Lucene
Length filter, to remove words with fewer than 3 characters
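A compact sketch of this pipeline, assuming a small hand-made contraction map and stop-word list as stand-ins for the real resources (Lucene's stop set, in particular):

    import string

    # Stand-ins for a full contraction map and Lucene's stop-word set.
    CONTRACTIONS = {"don't": "do not", "i'll": "i will", "it's": "it is"}
    STOP_WORDS = {"a", "an", "the", "is", "and", "not", "do", "i", "will", "it"}

    def preprocess(review):
        text = review.lower()                                 # case conversion
        for short, full in CONTRACTIONS.items():              # expand contractions
            text = text.replace(short, full)
        text = text.translate(str.maketrans("", "", string.punctuation))
        tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop-word filter
        return [t for t in tokens if len(t) >= 3]             # length filter

    print(preprocess("Don't go there! It's the worst service."))
    # -> ['there', 'worst', 'service']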
Training and Test Data
Reviews with ratings > 8: positive class
Reviews with ratings < 3: negative class
Training: 20,000 positive reviews and 20,000 negative reviews (the same scale for both classes, to avoid bias)
Test set: 1,000 positive reviews and 1,000 negative reviews
Mahout’s Naïve Bayes Classifier [4]
Tokenization: splitting the sentences into words.
Vectorization: a vector for each review in the vector space model.
Training and test sets: store the files corresponding to the training and test sets on HDFS.
Train the classifier:
./bin/mahout trainclassifier -i /restaurants/bayes-train-input -o /restaurants/bayes-model -type bayes -ng 1 -source hdfs
N-Gram Model
Unigram: considers only one token.
Ex: "It is a good movie." → {It, is, a, good, movie}
Bigram: considers two consecutive tokens.
Ex: "It is not bad movie." → {It is, is not, not bad, bad movie}
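A short sketch of n-gram generation; note how the bigrams keep negation phrases such as "not bad" together as a single feature:

    def ngrams(tokens, n):
        """All runs of n consecutive tokens, joined into single features."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "It is not bad movie".split()
    print(ngrams(tokens, 1))  # ['It', 'is', 'not', 'bad', 'movie']
    print(ngrams(tokens, 2))  # ['It is', 'is not', 'not bad', 'bad movie']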
Naive Bayes Classification Example - Training Set
Reviews for sea food restaurants:
"This restaurant makes good crab dishes. Crab is a kind of sea food, isn't it? This is a good sea food restaurant."
"Nay!! Don't go there if you want sea food. Try going to Marina or some other restaurant."
Reviews for breakfast:
"The English breakfast is very good in this restaurant. Crepes are yummy."
"Eww! I hate sea food. I can survive the entire day on my breakfast."
Naive Bayes Example - Training Set
Considering the case of unigrams, the word frequencies in each class:

Word        Sea food    Breakfast
seafood     3           1
crabs       1           0
breakfast   0           1
crepes      0           1

Compute the prior probabilities and word likelihoods from this table.
Test Set
Query: Which place should I go to, to order crepes? A sea food or a breakfast place?
Naïve Bayes formula: p(c|w) = p(w|c) p(c) / p(w)
Solution: "crepes" is the important word extracted from the query (all other words being unimportant), so classify on it.
Probability:
For sea food: (0 × (4/7)) / (1/7) = 0
For breakfast: ((1/3) × (3/7)) / (1/7) = 1
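A sketch of this hand computation in code, applying Bayes' rule directly to the frequency table above (no smoothing, to match the slide):

    # Word counts per class, copied from the table above.
    counts = {
        "sea food":  {"seafood": 3, "crabs": 1, "breakfast": 0, "crepes": 0},
        "breakfast": {"seafood": 1, "crabs": 0, "breakfast": 1, "crepes": 1},
    }
    total = sum(sum(c.values()) for c in counts.values())  # 7 tokens overall

    def posterior(word, cls):
        cls_total = sum(counts[cls].values())
        p_w_given_c = counts[cls][word] / cls_total           # p(w|c)
        p_c = cls_total / total                               # p(c)
        p_w = sum(counts[k][word] for k in counts) / total    # p(w)
        return p_w_given_c * p_c / p_w                        # Bayes' rule

    print(posterior("crepes", "sea food"))   # 0.0
    print(posterior("crepes", "breakfast"))  # 1.0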
Results
N-gram 1 confusion matrix:
      a     b     <-- classified as
    964    36     | 1000  a = (positive)
     82   918     | 1000  b = (negative)
Results
N-gram 2 confusion matrix:
      a     b     c     <-- classified as
    969    31     0     | 1000  a = (positive)
     62   938     0     | 1000  b = (negative)
Results
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
F-score = 2PR / (P + R)
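A quick sketch computing these metrics for the positive class from the two confusion matrices above (tp and fn are read off the first row, fp off the first column):

    def prf(tp, fn, fp):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall, 2 * precision * recall / (precision + recall)

    print("unigram: P=%.3f R=%.3f F=%.3f" % prf(tp=964, fn=36, fp=82))
    print("bigram:  P=%.3f R=%.3f F=%.3f" % prf(tp=969, fn=31, fp=62))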
The results show that the bigram model does better than the unigram model.
Intensity Calculation
"Dark Knight Rises is a good movie."
"Dark Knight Rises is an awesome movie."
Both are positive, but the second expresses more positiveness.
Here NLP is better than machine learning: machine learning cannot understand the semantics, so a lexicon is needed.
Intensity also differentiates between:
"I like the food."
"The food is awesome and it's worth every penny of your money. The staff is very friendly and we received a very warm welcome."
(Twitter restricts tweets to 140 characters, while many review sites let users write as much as they like; intensity calculation is useful in such cases.)
Intensity Models
Review-level intensity: calculated from the number and type of senti-words in the review itself.
Corpus-level intensity: the intensity of the review with respect to the entire corpus of reviews; this depends on the corpus distribution.
Review Level Intensity
Uniform weightage model: each positive emotion word gets a positive score of 1 and each negative emotion word gets a negative score of 1.
Net score = Σ positive scores - Σ negative scores
Using a lexicon: weighted net score = Σ weighted positive scores - Σ weighted negative scores
The intensity values are obtained from SentiWordNet [5].
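A sketch of the weighted model; the word weights below are illustrative placeholders, not actual SentiWordNet scores:

    # Placeholder weights; a real system would look these up in SentiWordNet.
    WEIGHTS = {"good": 0.6, "awesome": 0.9, "friendly": 0.7, "bad": -0.7}

    def weighted_net_score(review):
        words = (w.strip(".,!").lower() for w in review.split())
        return sum(WEIGHTS.get(w, 0.0) for w in words)

    print(weighted_net_score("Dark Knight Rises is a good movie"))     # 0.6
    print(weighted_net_score("Dark Knight Rises is an awesome movie")) # 0.9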
Corpus Level Intensity - Gaussian Distribution
Applying a Gaussian distribution over the entire corpus of reviews. Note: the raw word frequencies do not follow a Gaussian distribution, but their log frequencies do.
Gaussian Distribution Contd…
Positive reviews: 4.1 positive words/review and 1.1 negative words/review on average.
Negative reviews: 1.7 positive words/review and 4.2 negative words/review on average.
Note: we use the property of the Gaussian distribution that a 1-sigma deviation from the mean covers 68% of the density, and a 2-sigma deviation covers 95%.
Corpus Level Intensity
The more positive senti-words a review contains, the higher its positive intensity; similarly, the more negative senti-words it contains, the higher its negative intensity.
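One way to realize this, sketched under assumptions: map a review's positive-word count to a percentile of the corpus Gaussian. The mean (4.1 positive words per positive review) comes from the slides, but the standard deviation of 1.5 is an assumed value, and for simplicity the CDF is applied to raw counts rather than the log frequencies the slides mention:

    import math

    MEAN_POS, STD_POS = 4.1, 1.5  # corpus mean from the slides; sigma assumed

    def corpus_intensity(n_positive_words):
        """Percentile of the corpus Gaussian (its CDF), on a 0-100 scale."""
        z = (n_positive_words - MEAN_POS) / STD_POS
        return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

    print(corpus_intensity(4))  # ~47: about average for a positive review
    print(corpus_intensity(7))  # ~97: more positive words than almost all reviews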
Total Intensity
Total intensity = (review-level intensity + corpus-level intensity) / 2
"I like the food."
Senti-words: (like)
Score = (100 + 1)/2 = 50.5
"The food is awesome and it's worth every penny of your money. The staff is very friendly and we received a very warm welcome."
Senti-words: (awesome, worth, friendly, warm)
Score = (100 + 80)/2 = 90
Aspects
Aspects [6] are the features which define a product or item.
Ex: Samsung Galaxy Prevail Android Smartphone (Boost Mobile) on Amazon
Features of the smartphone: design, size, speed, sound, music player, camera, battery
Aspect Extraction - POS Tagger
Aspects can be extracted with the help of a POS tagger (Stanford POS Tagger [7]):
"This restaurant has good ambiance"
Parse tree:
(ROOT (S (NP (DT This) (NN restaurant))
  (VP (VBZ has) (NP (JJ good) (NN ambiance)))))
NP = noun phrase, JJ = adjective, NN = noun
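A sketch of the adjective-noun extraction step, using NLTK's tagger as a stand-in for the Stanford tagger (flat tag sequences instead of a full parse tree):

    import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

    def adjective_noun_pairs(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # Keep (adjective, noun) pairs where the words are adjacent.
        return [(adj, noun) for (adj, t1), (noun, t2) in zip(tagged, tagged[1:])
                if t1 == "JJ" and t2.startswith("NN")]

    print(adjective_noun_pairs("This restaurant has good ambiance"))
    # -> [('good', 'ambiance')]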
Aspect Extraction
Extracting adjective-noun pairs from reviews (for the previous product) enables us to identify the aspects and their corresponding sentiments.
Reviews:
"Attractive design & compact size"
"Good speed, not the slowest nor the fastest"
"Clear sound for phone calls & decent music player"
"Fixed focus low res cam (2MP), no LED"
"Battery, this is an issue with all smart phones"
Aspects: {design (attractive), size (compact), speed (good), sound (clear), music player (decent), cam (low resolution), battery (negative)}
Automatic Aspect Extraction
Used the Stanford POS tagger to extract adjective-noun pairs from the corpus of all the restaurant reviews.
Restaurant domain noun frequencies: I - 2548, We - 1342, They - 955, It - 911, Food - 347, Services - 291, Place - 248, Foods - 229, Service - 210, experiences - 131, Waitress - 122, …, pizza - 51
Problem: apart from aspects/features of restaurants such as food, place and service, there is a high number of pronouns, and these pronouns can represent anything.
Pronoun De-reference
The high frequency counts of pronouns show that we need to de-reference them and extract the corresponding nouns.
"This restaurant has good ambiance, but it is not as good as described by my friends."
Replace every "it" in this sentence with "ambiance", and "this" with "restaurant".
Note: the Stanford NLP toolkit has a de-referencing (coreference resolution) API.
Freebase
Is-a relationships: another problem is that sentiments attach to sub-categories rather than to the main categories.
Ex: "The pizza in this restaurant is good." Here "good" is attached to "pizza"; pizza is a type of food; hence all the sentiments about pizza should be pointed to food.
These kinds of relationships are given by an entity-relationship graph database called Freebase.
Algorithm for Automatic Aspect Extraction
Algorithm
Use the POS tagger to extract nouns attached to adjectives.
De-reference the personal pronouns; remove the remaining pronouns.
Use a Freebase dump to find is-a relations.
Merge the frequencies of plural and singular words and use the singulars.
Find the adjectives associated with the nouns; this gives an indication of the sentiment.
Resultant Aspects
Restaurant - 816, Food - 719, Service - 613, experience - 219, Waitress - 122, Review - 91, Drink - 64
(We still have to establish a relationship between "waitress" and "service"; this needs an ontology for each domain, or WordNet can be used to find the distance between "waitress" and "service".)
References
[1] http://en.wikipedia.org/wiki/Sentiment_analysis
[2] R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar, "Structured models for fine-to-coarse sentiment analysis," in Proceedings of the Association for Computational Linguistics (ACL), pp. 432–439, Prague, Czech Republic, June 2007.
[3] T. Wilson, J. Wiebe, and P. Hoffmann, "Recognizing contextual polarity in phrase-level sentiment analysis," in Proceedings of HLT/EMNLP 2005, pp. 347–354, Vancouver, Canada, 2005.
[4] https://cwiki.apache.org/MAHOUT/naivebayes.html
[5] http://sentiwordnet.isti.cnr.it/search.php?q=greatest
[6] http://sentic.net/sentire/2011/ott.pdf
[7] http://nlp.stanford.edu:8080/parser/index.jsp
[8] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–86, 2002.
Thank You