Twitter Sentiment Analysis using Various Classification Algorithms

Abstract

Twitter is an online news and social networking service where users post and interact with messages from anywhere in the world. Twitter posts are short (at most 140 characters) and are generated continuously by the public, which makes them well suited to opinion mining. Twitter messages can be classified as expressing positive or negative sentiment with respect to a term-based query. Past studies of sentiment classification are not conclusive about which features and supervised classification algorithms are best for designing an accurate and efficient sentiment classification system. We propose to combine several feature extraction techniques, namely emoticons, exclamation and question mark symbols, a word gazetteer, and unigrams, to design a more accurate sentiment classification system.

Keywords

Twitter; Sentiment Analysis; Opinion Mining; Natural Language Processing

Introduction

Human decision making is strongly influenced by the assessments and judgements of others. Before making a purchase, customers tend to gather as much information as possible about the product they want to buy. Investors analyse and predict the stock market movement of a company based on its popularity among its customers before investing their money in its shares. With the development of social media, gathering data for evaluation has become easier and less time consuming. Platforms such as Twitter, Facebook, and LinkedIn serve as repositories of useful data in the form of reviews, likes, comments, etc.

Opinions are linked to almost all human activities because they have a key impact on our decision making. We usually seek others' opinions before making a decision. In the real world, organizations and business entities always want to know public opinion about their services and products. Consumers, in turn, seek the opinions of existing users of a product or service before deciding to purchase the product or subscribe to the service. Public opinion about political candidates can be analysed to forecast the results of an election. In the past, organizations, governments, and business entities conducted surveys and opinion polls on focus groups to obtain citizens' opinions and sentiments [1].

Twitter is a social networking web application with a microblogging feature and a large, constantly growing user base. The application therefore provides a rich data set in the form of messages, usually short status updates from Twitter users, each of which must be expressed in no more than 140 characters. Millions of short messages and user status updates are generated on Twitter each day on hundreds of different topics. Extracting data from these short texts has become immensely useful for sorting and ranking the popularity of the topics mentioned in the updates. Twitter has emerged as one of the most popular platforms for expressing sentiments and thoughts on the Internet. It is therefore very useful to mine and analyse Twitter data for interesting information about major trending topics in the media and other spaces.

Methodology

Twitter Sentiment Analysis is generally divided into 3 major categories:

1. Machine Learning Approach
2. Lexicon Based Approach
3. Hybrid Approach

The Machine Learning Approach (ML) uses linguistic features and applies well known Machine Learning algorithms.

The Lexicon Based Approach is driven by an opinion lexicon, which is a collection of pre-compiled opinion terms. It is divided into two main approaches:

a) Dictionary Based approach
b) Corpus Based approach

The Hybrid Approach combines the above two approaches.

To increase the performance and efficiency of the sentiment classification system, a combination of well-known feature extraction methods is considered. The proposed method compares 6 supervised classification algorithms:

a) Naïve Bayes Algorithm
b) Bayes Net Algorithm
c) Discriminative Multinomial Naïve Bayes (DMNB) Algorithm
d) Sequential Minimal Optimization (SMO) Algorithm
e) Hyperpipes Algorithm
f) Random Forest Algorithm
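Before any of these classifiers can be applied, each tweet has to be turned into a feature vector. The following is a minimal sketch of how the feature types named in the abstract (emoticons, exclamation and question marks, a word gazetteer, and unigrams) might be combined. The word and emoticon lists, the function name, and the feature names are hypothetical placeholders, not the resources used in this study.

```python
import re
from collections import Counter

# Hypothetical, tiny resources for illustration only; a real system would use
# much larger emoticon lists and gazetteers.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", ";)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}
POSITIVE_WORDS = {"good", "great", "love", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "sad"}


def extract_features(tweet):
    """Map one tweet to a dictionary of named numeric features."""
    tokens = tweet.split()
    # Lowercased alphabetic words, allowing one inner apostrophe ("don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", tweet.lower())
    features = {
        "pos_emoticons": sum(t in POSITIVE_EMOTICONS for t in tokens),
        "neg_emoticons": sum(t in NEGATIVE_EMOTICONS for t in tokens),
        "exclamations": tweet.count("!"),
        "questions": tweet.count("?"),
        "pos_words": sum(w in POSITIVE_WORDS for w in words),
        "neg_words": sum(w in NEGATIVE_WORDS for w in words),
    }
    # Unigram counts, prefixed so they cannot collide with the names above.
    for word, count in Counter(words).items():
        features["uni=" + word] = count
    return features
```

Calling `extract_features("I love this phone :) !!")` yields one positive emoticon, two exclamation marks, one positive gazetteer word, and a unigram count for each word.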

1) Naïve Bayes (NB): This algorithm is a simple probabilistic classifier that counts the frequencies and combinations of values in the data set under consideration and uses them to compute a set of probabilities. It is based on Bayes' theorem and assumes that all attributes are independent of one another given the value of the class variable.
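The counting described above can be sketched in a few lines for categorical attributes. This is an illustrative sketch, not the Weka implementation used in the study; the function names are hypothetical, and the add-one (Laplace) smoothing is an assumption made here to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict_nb(model, row):
    """Return the class with the highest log-posterior for one row."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for i, value in enumerate(row):
            # Add-one smoothing: an assumption of this sketch.
            count = value_counts[(i, label)][value]
            score += math.log((count + 1) / (n_c + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space turns the product of per-attribute probabilities into a sum, which avoids numerical underflow when there are many attributes.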

2) Bayes Net (BN): Bayesian networks (BN) are network-based models mainly used for representing and analysing problems that involve uncertainty. A Bayesian network learns the causal relationships among variables and can use them to implement incremental learning. To perform classification, the input nodes are first set with the evidence, and the output nodes are then queried and analysed using standard Bayesian network inference.

3) Discriminative Multinomial Naïve Bayes (DMNB): Multinomial Naïve Bayes is a well-known and widely used classifier for document classification that has been shown to yield satisfactory performance. Discriminative Multinomial Naïve Bayes (DMNB) treats a document as a bag of words. For each class c, the training data is utilized to estimate P(w|c), the probability of observing the word w given the class. This is done by calculating the relative frequency of each word over the collection of training documents of that class. The classifier also needs the prior probability P(c), which is straightforward to estimate. If the word w occurs n_wd times in document d, then the probability of class c for a document under test is calculated from P(c), P(w|c), and n_wd using Bayes' rule.
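Assuming the standard multinomial Naïve Bayes formulation, which uses exactly the quantities defined above, the class posterior for a document d can be written as follows. Here P(d) is a normalizing constant that is the same for every class, so it can be ignored when only the most probable class is needed.

```latex
P(c \mid d) = \frac{P(c) \prod_{w \in d} P(w \mid c)^{n_{wd}}}{P(d)}
```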

4) SMO: Sequential Minimal Optimization (SMO) is generally used to train the Support Vector Machine (SVM) classification algorithm. The SMO algorithm includes many optimizations designed primarily to speed up the analysis of large datasets, and it is designed to converge even under degenerate conditions. It works by breaking the problem up into a set of smallest-possible sub-problems, which are solved analytically.
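The analytical sub-problem can be illustrated with the two-multiplier update from the standard SMO formulation. The sketch below covers only that single step and omits how pairs are chosen and how the bias term is updated. It assumes the prediction errors are defined as e_i = f(x_i) - y_i and that the kernel values are supplied by the caller; the function and parameter names are hypothetical.

```python
def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    """Analytically re-optimize two multipliers; returns (new_a1, new_a2).

    a1, a2: current Lagrange multipliers; y1, y2: labels in {-1, +1};
    e1, e2: prediction errors f(x_i) - y_i; k11, k12, k22: kernel values;
    c: the box constraint (upper bound on every multiplier).
    """
    # Bounds that keep 0 <= a2 <= c while preserving y1*a1 + y2*a2.
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)
    eta = k11 + k22 - 2.0 * k12  # curvature along the constraint line
    if eta <= 0 or low >= high:
        return a1, a2  # this sketch simply skips degenerate pairs
    new_a2 = a2 + y2 * (e1 - e2) / eta    # unconstrained optimum
    new_a2 = min(high, max(low, new_a2))  # clip to the feasible segment
    new_a1 = a1 + y1 * y2 * (a2 - new_a2)
    return new_a1, new_a2
```

Because only two multipliers change at a time, each step has a closed-form solution and no numerical optimizer is needed inside the loop.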

5) Hyperpipes: Hyperpipes is a technique that creates a “hyperpipe” for each class of a data set. Each hyperpipe collects the data of its class around a single object template. It can work extremely fast and effectively.
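One way to make this concrete is the sketch below. It assumes, as we understand the Weka HyperPipes classifier to work, that each pipe records the range of values observed for every attribute of its class, and that a new instance goes to the class whose pipe contains the most of its attribute values. It handles numeric attributes only, and the function names are hypothetical.

```python
def train_hyperpipes(rows, labels):
    """For each class, record the [min, max] range seen for every attribute."""
    pipes = {}
    for row, label in zip(rows, labels):
        if label not in pipes:
            pipes[label] = [[v, v] for v in row]
        else:
            for bounds, v in zip(pipes[label], row):
                bounds[0] = min(bounds[0], v)
                bounds[1] = max(bounds[1], v)
    return pipes


def predict_hyperpipes(pipes, row):
    """Pick the class whose pipe contains the most attribute values of row."""
    def contained(label):
        return sum(lo <= v <= hi for (lo, hi), v in zip(pipes[label], row))
    # Ties are broken in favour of the class seen first during training.
    return max(pipes, key=contained)
```

Training is a single pass that only updates bounds, which is consistent with the speed noted above.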

6) Random Forest: This algorithm builds many trees for the classification process. To classify a new object from an input vector, the vector is passed down each of the trees in the forest. Each tree produces a classification; in other words, the tree votes for that class. The random forest then chooses the classification that receives the most votes across all the trees. It also runs efficiently on large datasets.
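The voting step can be sketched as follows. The three "trees" here are hypothetical hand-written rules over made-up feature names, used only to show the vote; a real random forest would learn its trees from bootstrap samples of the training data.

```python
from collections import Counter


def forest_predict(trees, x):
    """Run x through every tree and return the class with the most votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for learned decision trees.
trees = [
    lambda x: "positive" if x["pos_words"] > x["neg_words"] else "negative",
    lambda x: "positive" if x["pos_emoticons"] > 0 else "negative",
    lambda x: "negative" if x["exclamations"] > 3 else "positive",
]
```

For an input with two positive words, no negative words, no emoticons, and one exclamation mark, two of the three trees vote "positive", so that class is returned.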

Results Obtained

The six selected classification algorithms were executed in the Weka tool on features extracted from the Sanders Twitter dataset. Building and testing of the system were carried out with Weka configured for 10-fold cross validation. Simulation results in empirical form are presented in Tables 1-9.
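For readers unfamiliar with the procedure, 10-fold cross validation splits the data into ten parts and uses each part once for testing while training on the other nine. The sketch below shows only the index bookkeeping. The function name and the fixed seed are hypothetical, and unlike tools such as Weka, which typically stratify folds by class, this sketch does not stratify.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n items."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Every instance appears in exactly one test fold, so the reported metrics are computed over the whole dataset without ever testing on training data.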

False Positive Rate (FPR), True Positive Rate (TPR), Precision (P), Recall (R), F-score (F), and Receiver Operating Characteristic (ROC) values are shown in the following tables.
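Apart from ROC, these measures follow directly from per-class confusion counts: true positives, false positives, false negatives, and true negatives. The sketch below uses the standard definitions and treats one class against the rest. The function name is hypothetical, and the counts in the usage note are made up for illustration, not taken from the study.

```python
def class_metrics(tp, fp, fn, tn):
    """Per-class metrics from one-vs-rest confusion counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0  # TPR is identical to recall
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * tpr / (precision + tpr)
               if precision + tpr else 0.0)
    return {"TPR": tpr, "FPR": fpr, "P": precision, "R": tpr, "F": f_score}
```

For example, `class_metrics(tp=30, fp=10, fn=20, tn=40)` gives a recall of 0.6, a false positive rate of 0.2, and a precision of 0.75.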

Table 1: Naïve Bayes Results

Table 2: Bayes Net Results

Table 3: Discriminative Multinomial Naïve Bayes (DMNB) Results

Table 4: Sequential Minimal Optimization (SMO) Results

Table 5: Hyperpipes Results

Table 6: Random Forest Results

Performance and Results Comparison

Based on the simulation results, Naïve Bayes performs worst of the six algorithms considered in this study. In general, precision and recall scores are fairly low for the Positive and Negative classes. This is due to the large number of instances in the class ‘other’ compared with the positive and negative classes; the Sanders dataset is highly imbalanced. Overall, the two most balanced and best-performing algorithms are DMNB and SMO, with overall F-scores of 0.769 and 0.75 respectively.

Fig 1: Precision Comparison

Fig 2: Recall Comparison

Fig 3: F-Measure Comparison

References

[1] Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. "Sentiment analysis algorithms and applications: A survey." Ain Shams Engineering Journal 5.4 (2014): 1093-1113.

[2] Liu, Bing. "Sentiment analysis and opinion mining." Synthesis lectures on human language technologies 5.1 (2012): 1-167.

[3] Agarwal, Apoorv, et al. "Sentiment analysis of twitter data." Proceedings of the workshop on languages in social media. Association for Computational Linguistics, 2011.

[4] Imran, Muhammad, et al. "Processing social media messages in mass emergency: A survey." ACM Computing Surveys (CSUR) 47.4 (2015): 67.

[5] Feldman, Ronen. "Techniques and applications for sentiment analysis." Communications of the ACM 56.4 (2013): 82-89.

[6] Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and trends in information retrieval 2.1-2 (2008): 1-135.

[7] Cambria, Erik, et al. "New avenues in opinion mining and sentiment analysis." IEEE Intelligent Systems 28.2 (2013): 15-21.

[8] Witten, Ian H., and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.

[9] Bifet, Albert, and Eibe Frank. "Sentiment knowledge discovery in twitter streaming data." International Conference on Discovery Science. Springer Berlin Heidelberg, 2010.

[10] Saif, Hassan, Yulan He, and Harith Alani. "Semantic sentiment analysis of twitter." International Semantic Web Conference. Springer Berlin Heidelberg, 2012.
