Major presentation

Sentiment analysis and opinion mining on twitter Parnika Sharma (9911103501)


Sentiment analysis

Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document.

The attitude may be his or her judgment or evaluation, affective state (that is, the emotional state of the author when writing), or the intended emotional communication (that is, the emotional effect the author wishes to have on the reader).

Subtasks within sentiment analysis:

Determining document subjectivity: Often called subjectivity classification, this subtask determines whether a given text is objective (expressing a fact) or subjective (expressing an opinion or emotion).

Determining document orientation: Often called sentiment classification or document-level sentiment classification, this subtask determines the polarity of a given subjective text. In other words, it determines whether the text expresses a positive or a negative sentiment on its subject matter.

Determining the strength of document orientation: This subtask decides whether the positive sentiment expressed by a text on its subject matter is weakly positive, mildly positive or strongly positive.

Motivation

Consumer information
› Product reviews

Marketing
› Consumer attitudes
› Trends

Politics
› Politicians want to know voters’ views
› Voters want to know politicians’ stances and who else supports them

Social
› Find like-minded individuals or communities

Approaches

Machine learning
› Naïve Bayes
› Maximum Entropy Classifier
› SVM

Unsupervised methods
› K-means
› Otsu’s Threshold
› Fuzzy c-means

Literature survey

Title: Twitter for Sentiment Analysis: When Language Resources Are Not Available

Data to annotate is given, but no training data or additional resources are provided.

Aim: To create a lexical resource in an automated way, without any human intervention, for annotating data.

An affective lexicon is to be used for polarity classification.

To obtain training material, emoticons are used as indicators of the mood of a message.

The tweets are split into two sets:
› positive: ;) :} :] ...
› negative: :{ :’( ...

This gives a positive word list and a negative word list. If a word occurs more frequently in the positive set, it is treated as positive, and vice versa.
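The emoticon-based labelling described above can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the emoticon sets, the whitespace tokenisation and the function name `build_word_lists` are assumptions, and raw counts are compared without correcting for the two sets having different sizes.

```python
from collections import Counter

# Hypothetical emoticon sets; the slides list only a few examples of each.
POSITIVE_EMOTICONS = {";)", ":}", ":]", ":)"}
NEGATIVE_EMOTICONS = {":{", ":'(", ":("}
ALL_EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS


def build_word_lists(tweets):
    """Split tweets by emoticon, then label each word by the set it occurs in more often."""
    pos_counts, neg_counts = Counter(), Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
        has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
        if has_pos == has_neg:
            continue  # no emoticon, or mixed signals: not usable as training material
        words = [t for t in tokens if t not in ALL_EMOTICONS]
        (pos_counts if has_pos else neg_counts).update(words)
    positive = {w for w in pos_counts if pos_counts[w] > neg_counts[w]}
    negative = {w for w in neg_counts if neg_counts[w] > pos_counts[w]}
    return positive, negative
```

Words that occur equally often in both sets end up in neither list, which is one possible way of handling ties.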

Title: Opinion Mining and Sentiment Analysis on a Twitter Data Stream

Aim: To analyse the effectiveness of various popular classifiers and identify the most suitable classifier for Twitter, one that could ease the process of classifying sentiments in tweets.

Strategy: To use two or more classifiers chained one after the other. This resulted in a high yield and better accuracy of the mined data.

First stage: the incoming preprocessed data is classified into three categories – polar, neutral and irrelevant.

Second stage: the data classified under polar is fed to a second classifier for further segregation into positive and negative.

The classification algorithms used in the research are:

› Naive Bayes
› Random Forest
› Support Vector Machines (SVM)
› SMO
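The two-stage chaining described above can be sketched as a small wrapper around any two classifiers. The keyword-based `toy_stage_one` and `toy_stage_two` functions below are hypothetical stand-ins for trained models (they are not from the paper); only the control flow is the point of the example.

```python
def classify_two_stage(text, stage_one, stage_two):
    """Run stage_two only on texts that stage_one labels 'polar'."""
    first = stage_one(text)
    if first != "polar":
        return first          # 'neutral' or 'irrelevant'
    return stage_two(text)    # 'positive' or 'negative'


# Toy stand-ins for trained classifiers (hypothetical keyword rules).
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}


def toy_stage_one(text):
    words = set(text.lower().split())
    if not words:
        return "irrelevant"
    if words & (POSITIVE_WORDS | NEGATIVE_WORDS):
        return "polar"
    return "neutral"


def toy_stage_two(text):
    words = set(text.lower().split())
    pos = len(words & POSITIVE_WORDS)
    neg = len(words & NEGATIVE_WORDS)
    return "positive" if pos >= neg else "negative"
```

In practice each stage would be a trained model such as Naive Bayes or an SVM; the second stage only ever sees the data that the first stage marked as polar.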

Title: Text Mining Facebook Status Updates for Sentiment Classification

The research has been performed on Tunisian users’ statuses on Facebook during the “Arab Spring” era.

The aim is to extract useful information about users’ sentiments and behaviours during this sensitive and significant period.

For this purpose, a method based on Support Vector Machine (SVM) and Naive Bayes has been proposed.

The methodology used is collection of raw data, followed by lexicon development.

Three types of lexicons were created: a lexicon for social acronyms, a lexicon for emoticons and a lexicon for interjections.

Then data preprocessing is done – stop words removal and stemming, followed by feature extraction.

Finally, the machine learning algorithms are applied.

The performance of different feature sets using Naive Bayes (NB) and SVM classifiers was then compared.

Title: Mining Social Emotions from Affective Text

This paper is concerned with the problem of mining social emotions from text. The aim of this research is to discover the connection between different social emotions and affective terms and, based on it, to automatically predict the social emotion of a text.

An official Chinese news portal has been used for the dataset collection.

The proposed solution is to construct a joint emotion-topic model. Latent Dirichlet Allocation (LDA) has been used with an additional layer for emotion modelling.

A three step process has been used for generation of affective terms:

The first step is to generate an emotion from a document specific emotional distribution.

The second step is to generate a latent topic from a Multinomial distribution.

The final step is to develop an approximate inference method based on Gibbs sampling.

As a complete generative model, the proposed model makes it possible to infer a number of conditional probabilities for unseen documents, for example the probabilities of latent topics given an emotion and of terms given a topic. This method was found to be better than the emotion-term model and multiclass SVM, as the emotion assignments at the term level could be visualised.

Title: Aspect-Based Twitter Sentiment Classification

This paper proposes an aspect-based sentiment classification approach to analyze sentiments for tweets.

In previous studies, the overall sentiment of a tweet was determined. But this is not useful for the companies which need to monitor consumer opinion of their product/services. For them it would be more useful to have information as to which aspects of the product/service the users are happy or unhappy about.

The aspect-based sentiment classifier makes use of a POS tagger, a sentiment lexicon and a few gazetteer lists to produce results of the form [aspect, sentiment words, polarity]. This process consists of three main steps:

1. Aspect-sentiment extraction: Given a tweet, this step determines a list of possible aspect candidates along with their associated sentiments and polarity.

2. Aspect ranking and selection: A tweet can express many different opinions, and only the important aspects should be selected. For example, when classifying tweets on a telecommunication company, some of the aspects of interest include customer service, 3G connectivity, speed, etc. In this step the aspect candidates are ranked and the most significant ones are selected as the expected aspects.

3. Aspect classification: Using the set of expected aspects and results from the aspect-sentiment extraction step, we obtain the final list of aspects along with their polarity for each tweet.

The experimental results suggested that a layered classification approach which uses the aspect-based classifier as the first layer classification and the tweet-level classifier as the second layer classification is more effective than a classifier trained using target-dependent features. This approach is able to consistently improve the performance of existing sentiment classifiers.

Title: Opinion Mining on Social Media Data

The aim of this research was to automatically extract the set of messages which contain opinions, filter out non-opinion messages and determine their sentiment directions, that is positive or negative.

Manually labelled data has been used as training data to build the model.

The initial step is to preprocess the crawled tweets by removing usernames, hashtags, retweet tags and non-English words.

Three resources were constructed for the further preprocessing which included a stop word dictionary, an emotion dictionary and an acronym dictionary.

After preprocessing all words are transformed into the form (word, POS tag, English-word, Stop-word).

Thereafter, tweets containing opinions are extracted, filtering out the non-opinion tweets. Naive Bayes classifier is then used to classify the tweets based on sentiment.

Since a word may have different meanings in different domains, short text classification is done.

Two feature selection algorithms have been used for this purpose: Mutual Information (MI) and chi-square (χ²) feature selection. The short texts are classified into different domains, so that the classifier can then label the tweets as positive or negative with better performance.

Title: Sentiment Analysis using Sentiment Features

The main objective of this research was to compare state-of-the-art Sentiment Analysis methods against a novel hybrid method.

The Hybrid method adopts a combination of both the supervised methods and unsupervised methods.

It utilizes a Sentiment lexicon to generate a new set of features to train a linear SVM classifier.

In this paper, domain based Twitter Sentiment Analysis is done. The domain considered is smartphones.

The Hybrid Polarity Detection System has three modules:

The first module is the Preprocessing Module in which cleaning of data is done. The preprocessing steps include removal of usernames, URL tags etc.

The second module is the Sentiment Feature Generator Module. In this module slang terms are replaced with their proper language equivalents. The SentiStrength lexicon is then used to tag the words with their sentiment scores. Fourteen features are extracted from the text.

The third module is Machine Learning Classifier, in which a linear SVM takes the input feature set and classifies the tweets as positive or negative.

Title: The Summary of Differential Evolution Algorithm and its Improvements

In this paper, the authors summarise the differential evolution algorithm and its improvements in order to help researchers studying the topic. First, the basics of the algorithm and its operations (mutation, crossover and selection) are explained. Thereafter, the different improvements aimed at increasing optimisation performance are compared. These improvements make differential evolution more efficient in application. The improvement measures mainly cover the evolution operations, parameter settings and other refinements, with the main focus on the mutation operation.
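The mutation, crossover and selection operations mentioned above can be illustrated with a minimal sketch of the classic DE/rand/1/bin variant. This is a generic textbook form, not the specific variant used in the paper or in this project; the parameter values (`pop_size`, `f`, `cr`, `generations`) and the clipping of trial vectors to the bounds are assumptions made for the example.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=100, seed=0):
    """Minimise `objective` over box `bounds` with the DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [pop[a][d] + f * (pop[b][d] - pop[c][d]) for d in range(dim)]
            # Crossover: take each coordinate from the mutant with probability cr;
            # j_rand guarantees that at least one coordinate comes from the mutant.
            j_rand = rng.randrange(dim)
            trial = []
            for d in range(dim):
                use_mutant = rng.random() < cr or d == j_rand
                value = mutant[d] if use_mutant else pop[i][d]
                lo, hi = bounds[d]
                trial.append(min(max(value, lo), hi))  # clip to the search bounds
            # Selection: the trial replaces the parent only if it is no worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

For clustering, the objective would be some cluster-quality measure evaluated on candidate centroids; here any function that maps a list of floats to a number can be passed in.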

Title: An Automatic Data Clustering Algorithm based on Differential Evolution

Most traditional clustering algorithms simply assume that the number of clusters is given and focus on the quality of clustering results. This paper presents an algorithm that clusters the data and automatically determines the number of clusters as well. The proposed algorithm has two steps. First, a region splitting and merging (RSM) mechanism splits the data and then merges similar groups until a self-adaptive threshold is reached. Second, the number of clusters is fine-tuned using automatic clustering differential evolution (ACDE).

Methodology to be used

› Data Collection: Retrieval of Twitter status updates
› Lexicon development
› Data Pre-Processing
› Feature Extraction, Normalisation and Reduction
› K-means Clustering
› Differential Evolution

Overall architecture

Data Pre-Processing

› Case folding
› Removal of:
    › unnecessary punctuation
    › extra blank spaces
    › retweet tags
    › user tags
    › URLs
    › hashtags
› Removal of stopwords
› Replacement of emoticons:
    › Positive emoticons: EPOS
    › Negative emoticons: ENEG
    › Neutral emoticons: ENEUT
› Replacement of sentiment words:
    › Positive words: POS
    › Negative words: NEG
› Replacement of negation and intensity words:
    › Negation words: NEGATION
    › Intensity words: INTENSITY
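A rough sketch of these preprocessing steps is shown below. The miniature lexicons, the regular expressions and the function name `preprocess` are all assumptions made for illustration; the project's actual word lists are not given in the slides. Exclamation marks are kept as separate tokens, on the assumption that they are not "unnecessary" punctuation since they are counted later as a feature.

```python
import re

# Hypothetical miniature lexicons; the project's actual lists are not given.
EMOTICONS = {":)": "EPOS", ":(": "ENEG", ":|": "ENEUT"}
SENTIMENT = {"good": "POS", "great": "POS", "bad": "NEG", "awful": "NEG"}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very", "really", "so"}
STOPWORDS = {"the", "a", "an", "is", "was", "to", "of"}

# Tokens are emoticons, runs of word characters, or single exclamation marks.
TOKEN_RE = re.compile("|".join(re.escape(e) for e in EMOTICONS) + r"|\w+|!")


def preprocess(tweet):
    text = tweet.lower()                        # case folding
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\brt\b", " ", text)         # retweet tag
    text = re.sub(r"[@#]\w+", " ", text)        # user tags and hashtags
    tokens = []
    # findall skips other punctuation and extra blank spaces implicitly.
    for tok in TOKEN_RE.findall(text):
        if tok in EMOTICONS:
            tokens.append(EMOTICONS[tok])
        elif tok in STOPWORDS:
            continue
        elif tok in NEGATIONS:
            tokens.append("NEGATION")
        elif tok in INTENSIFIERS:
            tokens.append("INTENSITY")
        elif tok in SENTIMENT:
            tokens.append(SENTIMENT[tok])
        else:
            tokens.append(tok)
    return tokens
```

Negation words are deliberately left out of the stopword set here, since removing them before replacement would lose the NEGATION keyword.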

Feature Extraction, Normalisation and Reduction

Feature Extraction

Feature extraction is the process of extracting the main characteristics of the text. For a machine learning algorithm to perform well, it is essential to have features that are descriptive of the text. The total number of occurrences of each of the following features has been taken into account for each tweet:

› Words
› Exclamation marks (!)
› EPOS keyword
› ENEG keyword
› ENEUT keyword
› POS keyword
› NEG keyword
› NEGATION keyword
› INTENSITY keyword
› Random words (words left, which do not fall into any category)
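The feature counts listed above could be computed along the following lines. The input is assumed to be a list of tokens in which emoticons and sentiment, negation and intensity words have already been replaced by their keywords; the function name `extract_features` and the choice to count every non-"!" token as a word are assumptions, since the slides do not spell this out.

```python
from collections import Counter

KEYWORDS = ("EPOS", "ENEG", "ENEUT", "POS", "NEG", "NEGATION", "INTENSITY")


def extract_features(tokens):
    """Count the per-tweet features from an already preprocessed token list."""
    counts = Counter(tokens)
    features = {
        # Assumption: every token except '!' counts as a word.
        "words": sum(1 for t in tokens if t != "!"),
        "exclamations": counts["!"],
    }
    for key in KEYWORDS:
        features[key] = counts[key]
    # Random words: whatever is left after removing '!' and the keywords.
    features["random_words"] = sum(
        1 for t in tokens if t != "!" and t not in KEYWORDS
    )
    return features
```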


Normalisation

The values of all the features are normalised to the range of 0 to 1. The normalised value is given by

\[ \mathrm{Normalised}(e) = \frac{e - E_{\min}}{E_{\max} - E_{\min}} \]

where
$e$ is the original value,
$E_{\max}$ is the maximum value of the feature,
$E_{\min}$ is the minimum value of the feature.
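As a small illustration of the formula, the following sketch normalises one feature column. The function name `min_max_normalise` is hypothetical, and mapping a constant column (where the maximum equals the minimum) to zeros is an assumption, since the formula is undefined in that case.

```python
def min_max_normalise(values):
    """Scale one feature column to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, the word counts 2, 4 and 6 become 0.0, 0.5 and 1.0.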


Feature Reduction

Feature reduction is done by computing the cross-correlation between the features. Where two features are closely related, one of them is removed from the table.
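One way to carry out this kind of reduction is sketched below, using the Pearson correlation coefficient between feature columns. The threshold of 0.9, the rule of keeping the first feature of each correlated pair, and the function names are assumptions; the slides do not state which correlation measure or cut-off was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """columns maps feature name -> list of values; returns the names to keep."""
    kept = []
    for name in columns:
        # Keep a feature only if it is not closely correlated with any kept one.
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```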

Results

Algorithm                 Accuracy
K-means                   51%
Differential Evolution    59%

Findings

Through this project I have investigated the utility of sentiment classification on a collected Twitter dataset.

While exploring the topic, I observed that only a limited number of algorithms are useful for Twitter sentiment analysis.

Twitter statuses have unique characteristics compared to other corpora. Since they are limited to 140 characters, the usual data mining techniques used for movie reviews, etc. cannot be used.

Also, not many research papers were available on feature reduction.

Conclusion 

This project has been a great learning experience in the field of information retrieval and data mining. In this project, a Twitter dataset was collected for the purpose of sentiment analysis. Various data preprocessing techniques were applied to the dataset. Thereafter, features were extracted from each tweet and normalised. Feature reduction was then applied to remove one of each pair of closely related features. The quality of the features extracted from the training dataset affects the performance of the technique. The K-means clustering algorithm and Differential Evolution, an optimisation algorithm, were then applied to cluster the data into two classes, positive and negative. Finally, the accuracies of these two algorithms were compared. On the basis of these accuracies, Differential Evolution performs better than K-means on the Twitter dataset.

Future Work

As future work, three more clustering techniques will be applied as part of unsupervised learning: Otsu’s Threshold, Fuzzy c-means and the EM algorithm. The next step would be to compare them with supervised learning methods including SVM, Naive Bayes and LDA. The accuracies of the different algorithms will be calculated and compared.

A Twitter dataset for a particular product will be collected and opinion mining will be applied.