
An M Tech Dissertation Titled

Twitter Sentiment Analysis using Hybrid Naïve Bayes

Submitted in partial fulfilment towards the award of the degree of

MASTER OF TECHNOLOGY

IN

COMPUTER ENGINEERING

BY

Mr. Harsh Vrajesh Thakkar

Supervisor(s)

Dr. Dhiren Patel

2012-2013

Department of Computer Engineering

SARDAR VALLABHBHAI NATIONAL

INSTITUTE OF TECHNOLOGY,

SURAT


Declaration

I hereby declare that the work being presented in this dissertation report entitled “Twitter Sentiment Analysis using Hybrid Naive Bayes” by me, i.e. Harsh Vrajesh Thakkar, bearing Roll No: P11CO010, and submitted to the Computer Engineering Department at Sardar Vallabhbhai National Institute of Technology, Surat, is an authentic record of my own work carried out during the period of July 2012 to June 2013 under the supervision of Dr. Dhiren R. Patel. The matter presented in this report has not been submitted by me to any other University/Institute for any purpose.

Neither the source code therein, nor the content of the project report, has been copied or downloaded from any other source. I understand that my result grades would be revoked if this is later found to be untrue.

______________________

(Harsh V. Thakkar)


C E R T I F I C A T E

This is to certify that the dissertation report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes”, submitted by Harsh Vrajesh Thakkar, bearing Roll No: P11CO010, in partial fulfillment of the requirement for the award of the degree of MASTER OF TECHNOLOGY in Computer Engineering at the Computer Engineering Department of the Sardar Vallabhbhai National Institute of Technology, Surat, is a record of his own work carried out as part of the coursework for the year 2012-13. To the best of our knowledge, the matter embodied in the report has not been submitted elsewhere for the award of any degree or diploma.

Certified by

____________________

(Dr. Dhiren Patel)

Professor,

Department of Computer Engineering,

S V National Institute of Technology,

Surat – 395007

India

_______________________________

Mr. Udai Pratap Rao

PG Incharge,

M Tech in Computer Engineering,

SVNIT, Surat

Mr. Rakesh Gohil

Head,

Department of Computer Engineering,

S V National Institute of Technology,

Surat – 395007, India


Department of Computer Engineering

SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY,

SURAT

(2012-13)

Approval Sheet

This is to state that the Dissertation Report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes” submitted by Mr. Harsh Vrajesh Thakkar (Admission No: P11CO010) is approved for the award of the degree of Master of Technology in Computer Engineering.

Board of Examiners

Examiners

Supervisor(s)

Chairman

Head, Department of Computer Engineering

Date:___________

Place:______________


Acknowledgements

The journey of my Master of Technology has so far been very exciting and full of challenges. I would like to acknowledge that, apart from God’s and my parents’ blessings, the success of this dissertation work is the result of the constant encouragement and guidance of my guide, Dr. Dhiren R. Patel. I am very thankful to him for being there for me and empowering me with his methodology and expertise. He has given me ample space to explore research interests of my own, which I believe is essential for any research enthusiast.

I am also grateful to Mr. Rakesh Gohil, Head of the Computer Engineering Department, and his staff for giving me a lot of their valuable time, as well as to the staff of the Central Computer Centre for allowing me to stay in the laboratory till late.

Last but not the least, I would like to thank my friends, who stood by me like a shadow whenever I needed their assistance and gave me the strength to carry on in my weakest moments. Today I am here because of the efforts of all of these wonderful people.

Harsh V. Thakkar

(P11CO010)


Abstract

Millions of users share opinions on diverse aspects of life and politics every day using microblogging services over the internet. Microblogging websites are rich sources of data for opinion mining and sentiment analysis. In this dissertation work, we focus on using Twitter for sentiment analysis: extracting opinions about events, products and people, and using them to understand current trends.

Twitter allows its users a limit of only 140 characters; this restriction forces the user to be concise as well as expressive at the same moment. This ultimately makes Twitter an ocean of sentiments. Twitter also provides developer-friendly streaming APIs. We collected datasets of over 4 million tweets with a custom-designed crawler for sentiment analysis. We propose a hybrid naïve Bayes classifier that integrates an English lexical dictionary (SentiWordNet) with the existing machine learning naïve Bayes classifier. The hybrid naïve Bayes classifier assigns tweets to positive and negative classes. Experimental results on multi-sized datasets covering a variety of keywords demonstrate the superiority of hybrid naïve Bayes over existing approaches, yielding more than 90 percent accuracy in general and 98.59 percent accuracy in the best case. In our research we worked with English; however, the proposed technique can be used with any other language, provided a lexical dictionary for that language is available.


Table of Contents

Chapter 1 Introduction 1-5
  1.1 Motivation 1
  1.2 Problem description 2
    1.2.1 Why sentiment analysis? 2
    1.2.2 Why NLP? 2
    1.2.3 Applications of sentiment analysis 3
  1.3 The network: Twitter 4
  1.4 Contribution 4
  1.5 Thesis outline 5

Chapter 2 Theoretical Background and Literature Review 5-17
  2.1 General sentiment analysis 6
  2.2 Issues in sentiment analysis 7
  2.3 Classification of approaches 7
    2.3.1 Knowledge-based approach 7
    2.3.2 Relationship-based approach 8
    2.3.3 Language models approach 8
    2.3.4 Discourse structures and semantics approach 9
  2.4 Twitter specific approaches 9
    2.4.1 Lexical analysis approach 10
    2.4.2 Machine learning approach 11
    2.4.3 Hybrid approach 12
  2.5 Performance review 14
    2.5.1 Lexical approach performance 14
    2.5.2 Machine learning approach performance 15
    2.5.3 Hybrid approach performance 16
  2.6 Chapter conclusion 16

Chapter 3 Design and Analysis of Proposed Approach 18-33
  3.1 Problem statement 18
    3.1.1 Problem introduction 18
    3.1.2 Problem definition 19
  3.2 Proposed approach: Hybrid Naive Bayes 19
    3.2.1 Data collection: Twitter API 20
    3.2.2 Preprocessing 22
    3.2.3 Training data 23
    3.2.4 Sentiment analysis: The classifier 24

Chapter 4 Implementation Methodology 26-30
  4.1 Experimental setup 26
  4.2 Evaluation 30
  4.3 Test application 30

Chapter 5 Results and Analysis 34-42
  5.1 Results 34

Chapter 6 Conclusion and Future Work 43-44
  6.1 Conclusion 43
  6.2 Future work 44

Bibliography 45-47


List of Figures

Figure 2.1 Generic architecture of a lexical approach classifier. 11
Figure 2.2 Generic architecture of a machine learning approach classifier. 13
Figure 3.1 Process steps followed by hybrid naive bayes. 20
Figure 3.2 System architecture of hybrid naive bayes approach. 21
Figure 3.3 A sequence of intermediate preprocessing steps taking place at this level. 22
Figure 4.1 Polarity triangle of synset in SentiWordNet. 30
Figure 4.2 Class diagram of Tweet Sentiment Analyzer. 32
Figure 5.1 Comparison of classifier performances, dataset size (no. of tweets) vs accuracy (%). 36
Figure 5.2 The most informative features of hybrid naive bayes classifier. 37
Figure 5.3 The base naive bayes classifier in action on a Windows platform with 50k tweets dataset. 37
Figure 5.4 Accuracy of base naive bayes classifier with a 50k tweets dataset. 38
Figure 5.5 The hybrid naive bayes classifier in action on a Windows platform with 50k tweets dataset. 39
Figure 5.6 Accuracy of hybrid naive bayes classifier with a 50k tweets dataset. 40
Figure 5.7 Base naive bayes classifier in action on a Linux server with 10k tweets dataset. 41
Figure 5.8 Accuracy of base naive bayes classifier on a Linux server with a 10k tweets dataset. 42


List of Tables

Table 2.1 Performance of lexical approach variants. 14
Table 2.2 Performance of machine learning approach variants. 15
Table 2.3 Performance of hybrid approach variants. 17
Table 4.1 General system requirements for our approach. 26
Table 5.1 Performance of base naive bayes classifier. 34
Table 5.2 Performance of hybrid naive bayes classifier. 35


“I don’t know if I will have the time to write any more letters because I might be too busy trying to participate. So if this does end up being the last letter, I just want you to know that I was in a bad place before I started high school and you helped me. Even if you didn’t know what I was talking about or know someone who has gone through it, you made me not feel alone. Because I know there are people who say all these things don’t happen. And there are people who forget what it’s like to be 16 when they turn 17. I know these will all be stories someday. And our pictures will become old photographs. We’ll all become somebody’s mom or dad. But right now these moments are not stories. This is happening, I am here and I am looking at her. And she is so beautiful. I can see it. This one moment when you know you’re not a sad story. You are alive, and you stand up and see the lights on the buildings and everything that makes you wonder. And you’re listening to that song and that drive with the people you love most in this world.

And in this moment I swear, we are infinite.”

- Charlie’s last letter (The Perks of Being a Wallflower, 2012)


Chapter 1

Introduction

1.1 Motivation

With the proliferation of Web 2.0 applications such as microblogging, forums and social networks came reviews, comments, recommendations, ratings and feedback generated by users. This user-generated content can be about virtually anything, including politicians, products, people, events, etc. With the explosion of user-generated content came the need for companies, politicians, service providers, social psychologists, analysts and researchers to mine and analyze the content for different uses. The bulk of this user-generated content requires the use of automated techniques for mining and analysis. Examples of bulk user-generated content that have been studied are blogs [1] and product/movie reviews [2].

Microblogging has become a very popular communication tool among Internet users. Millions of messages appear daily on popular websites that provide microblogging services such as Twitter1, Tumblr2 and Facebook3. Users of these services write about their lives, share opinions on a variety of topics and discuss current issues. Because of the free format of messages and the easy accessibility of microblogging platforms, Internet users tend to shift from traditional communication tools to microblogging services. As more and more users post about the products and services they use and express their political and religious views, microblogging websites become valuable sources of people’s opinions and sentiments. Such data can be efficiently used for marketing and social studies.

1 http://www.twitter.com
2 http://www.tumblr.com
3 http://www.facebook.com

Sentiment analysis is an exhaustive research field that has been studied for decades. Initially it was used to analyze sentiment in long texts such as letters, emails and so on. It has also been deployed in the field of pre- and post-crime analysis of criminal activities. A variety of approaches have been applied to the task. Applying this field to the microblogging fraternity is a challenging job, and this challenge became our motivation. Needless to say, we are not the first ones to work in this area: there has been substantial research in both machine learning and lexical approaches to sentiment analysis for social networks. We try to improve the existing approaches by adding new variants to the research.

1.2 Problem Description

We propose a hybrid approach for sentiment analysis that is a combination of a machine learning algorithm and a special lexical dictionary.

1.2.1 Why sentiment analysis?

Every day an enormous amount of data is created on social networks, blogs and other media and diffused into the world wide web. This huge volume of data contains very crucial opinion-related information that can be used to benefit businesses and other aspects of commercial and scientific industries. Manual tracking and extraction of this useful information is not possible; thus, sentiment analysis is required. Sentiment analysis is the process of extracting sentiments or opinions from reviews expressed by users over a particular subject, area or product online. It is an application of natural language processing, computational linguistics and text analytics to identify subjective information in source data. It clubs the sentiments into categories like “positive” or “negative”, and thus determines the general attitude of the speaker or writer with respect to the topic in context.

1.2.2 Why NLP?

Natural language processing (NLP) is the technology dealing with our most ubiquitous product: human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media and scientific articles, in thousands of languages and varieties. In the past decade, successful natural language processing applications have become part of our everyday experience, from spelling and grammar correction in word processors to machine translation on the web, from email spam detection to automatic question answering, from detecting people’s opinions about products or services to extracting appointments from your email.

The greatest challenge of sentiment analysis is to design application-specific algorithms and techniques that can analyze the linguistics of human language accurately.

1.2.3 Applications of sentiment analysis

The following are the major applications of sentiment analysis in real-world scenarios.

• Product and Service reviews - The most common application of sentiment analysis is in the area of reviews of consumer products and services. There are many websites that provide automated summaries of reviews about products and about their specific aspects. A notable example is “Google Product Search”.

• Reputation Monitoring - Twitter and Facebook are a focal point of many sentiment analysis applications. The most common application is monitoring the reputation of a specific brand on Twitter and/or Facebook.

• Result prediction - By analyzing sentiments from relevant sources, one can predict the probable outcome of a particular event. For instance, sentiment analysis can provide substantial value to candidates running for various positions. It enables campaign managers to track how voters feel about different issues and how they relate to the speeches and actions of the candidates.

• Decision making - Another important application is that sentiment analysis can be used as an important factor assisting decision making systems, for instance in financial market investment. There are numerous news items, articles, blogs and tweets about each public company. A sentiment analysis system can use these various sources to find articles that discuss the companies and aggregate the sentiment about them into a single score that can be used by an automated trading system. One such system is The Stock Sonar4.

4 The Stock Sonar [http://www.thestocksonar.com/]: This system (developed by Digital Trowel) shows graphically the daily positive and negative sentiment about each stock alongside the graph of the price of the stock.


1.3 The network: Twitter

Twitter is an online social networking and microblogging service that enables its users to send and read text-based messages called “tweets”. Tweets are publicly visible by default, but senders can restrict message delivery to a limited crowd. Twitter is one of the largest microblogging services, with over 500 million registered users as of 2012. Statistics revealed by Infographics Labs5 suggest that back in the year 2012, 175 million tweets were communicated on a daily basis. There is a large mass of people using Twitter to express sentiments, which makes it an interesting and challenging choice for sentiment analysis. When so much attention is being paid to Twitter, why not monitor and cultivate methods to analyze these sentiments? Twitter has been selected with the following purposes in mind.

• Twitter is an open-access social network.

• Twitter is an ocean of sentiments (limited to 140 characters, i.e. high sentiment density).

• Twitter provides a user-friendly API, making it easier to mine sentiments in real time.

1.4 Contribution

In this thesis, we propose a Hybrid Naive Bayes classifier which is the combination of a machine learning algorithm (Naive Bayes) and a special lexical dictionary (SentiWordNet6). We crawled multi-sized datasets consisting of approximately 4 million tweets with a variety of popular keywords like “#ironman3”, “#amitabhbachhan”, “#Google”, “#twitter”, “#robertdowneyjr”, etc. for training and testing purposes. We test the proposed Hybrid Naive Bayes approach using the Natural Language Toolkit and observe that it outperforms the existing approaches, delivering competitive results with 98.59 percent accuracy in the best case.

5 http://infographiclabs.com/
6 http://www.sentiwordnet.isti.cnr.it


1.5 Thesis Outline

This thesis is organised as follows:

• Chapter 1: Introduces the work of this thesis.

• Chapter 2: An exhaustive survey of existing approaches.

• Chapter 3: Proposed approach and goals.

• Chapter 4: Details the implementation methodology.

• Chapter 5: Discussion of results and analysis.

• Chapter 6: Concludes the work and explores avenues for future work.


Chapter 2

Theoretical Background and Literature Review

Sentiment analysis caught attention as one of the most active research areas with the explosion of social networks. The enormous user-generated content resulting from these social media contains valuable information in the form of reviews, opinions, etc. about products, events and people. Most sentiment analysis studies use machine learning approaches, which require a large amount of user-generated content for training. The research on sentiment analysis so far has mainly focused on two things: identifying whether a given textual entity is subjective or objective, and identifying the polarity of subjective texts [3].

In the following sections, we briefly review the literature on both general and Twitter-specific sentiment analysis. Starting with general sentiment analysis, we also discuss the issues that make sentiment analysis a more difficult task than other text-classification tasks. Later on, we move to the focus area of this thesis, i.e. Twitter-specific approaches. In a study, [4] present a comprehensive review of the literature written before 2008; most of the material on general sentiment analysis is based on their review.

2.1 General Sentiment Analysis

Sentiment analysis has been carried out on a range of topics. For example, there

are sentiment analysis studies for movie reviews [4], product reviews [5], and news

and blogs ([3], [6]).


2.2 Issues in Sentiment Analysis

Research reveals that sentiment analysis is more difficult than traditional topic-based text classification, despite the fact that the number of classes in sentiment analysis is smaller than the number of classes in topic-based classification [4]. In sentiment analysis, the classes to which a piece of text is assigned are usually negative or positive. They can also be other binary classes or multi-valued classes, such as classification into “positive”, “negative” and “neutral”, but they are still fewer than the number of classes in topic-based classification. Sentiment analysis is tougher than topic-based classification because the latter relies on keywords for classification, whereas in sentiment analysis a variety of features beyond keywords have to be taken into account. The main reason that sentiment analysis is more difficult than topic-based text classification is that topic-based classification can be done with the use of keywords, while this does not work well in sentiment analysis [2].

Other reasons for the difficulty are: sentiment can be expressed in subtle ways without any perceived use of negative words; it is difficult to determine whether a given text is objective or subjective (there is always a fine line between objective and subjective texts); it is difficult to determine the opinion holder (for example, is it the opinion of the author or the opinion of a commenter); and there are other factors such as dependency on domain and on the order of words [3]. Other challenges of sentiment analysis are dealing with sarcasm, irony, and/or negation.

2.3 Classification of approaches

Sentiment analysis is formulated as a computational linguistics problem. The classification can be approached from different perspectives depending on the nature of the task at hand and the perspective of the person carrying out the sentiment analysis. The familiar approaches are discourse-driven, relationship-driven, language-model-driven and keyword-driven. We discuss these approaches in the subsequent subsections.

2.3.1 Knowledge-based approach

In this approach, sentiment is calculated as a function of keywords. The main task is the construction of sentiment discriminatory-word lexicons that indicate a particular class, such as the positive class or the negative class. The polarity of the words in the lexicon is determined prior to the sentiment analysis work. There are variations in how the lexicon is created. For example, lexicons can be created by starting with some seed words and then using linguistic heuristics to add more words to them, or by starting with some seed words and adding other words to them based on frequency in a text [2]. For certain applications, there are publicly available discriminatory-word lexicons for use in sentiment analysis. Twitrratr1 provides an opinion tracking service of public sentiments for Twitter sentiment analysis.

2.3.2 Relationship-based approach

In this approach, the classification task is approached from the different relationships that may exist between features2 and components. Such relationships include relationships between discourse participants and relationships between product features. For instance, if one wants to know the sentiment of customers about a product brand, one may compute it as a function of the sentiments on its different features or components.

2.3.3 Language models approach

In this approach, the classification is done by building n-gram language models. A gram is a token or lexicon item taken into consideration for training and classification; an n-gram represents a sequence of such chosen lexicon items. Generally, in this approach the frequency of n-grams is used. In traditional information retrieval and topic-oriented classification, the frequency of n-grams gives better results. The frequency is converted to TF-IDF3 to account for a term’s importance in the document to be classified. In a study, [3] show that in the sentiment classification of movie review blogs, term presence gives better results than term frequency, indicating that unigram presence is more suited to sentiment analysis. But later studies ([3], [5]) found that bigrams and trigrams worked better than unigrams for the sentiment classification of product reviews.
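To make the distinction between term frequency and term presence concrete, a minimal sketch is given below; the helper names and the toy text are illustrative only and are not drawn from the cited studies.

    from collections import Counter

    def frequency_features(tokens):
        # term frequency: how many times each unigram occurs in the text
        return dict(Counter(tokens))

    def presence_features(tokens):
        # term presence: a binary indicator, reported in [3] to suit sentiment analysis better
        return {token: 1 for token in set(tokens)}

    tokens = "good movie good acting".split()
    print(frequency_features(tokens))  # {'good': 2, 'movie': 1, 'acting': 1}
    print(presence_features(tokens))   # e.g. {'movie': 1, 'good': 1, 'acting': 1} (order may vary)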

1 http://twitrratr.com
2 Feature: A feature in terms of sentiment analysis is the chosen set of words used for training in the supervised learning technique, i.e. the “word dictionary” which is fed into the classifier for the classification of sentiments in the incorporated text.
3 TF-IDF: Represents the term frequency or the inverse document frequency, whichever is taken into account.


2.3.4 Discourse structures and semantics approach

This approach is dominant in applications where a prior definition of classes is not possible. When text is encountered, it is classified into the best category it fits (in the context of its objective). Based on the similarity of the semantics of words in the text, they are grouped together and tagged into classes. For example, in reviews the overall sentiment is usually expressed at the end of the text [2]. As a result the approach in this case might be discourse-driven, in which the sentiment of the whole review is obtained as a function of the sentiments of the different discourse components in the review and the discourse relations that exist between them. For instance, the sentiment of a paragraph at the end of the review might be given more weight in determining the sentiment of the whole review. Semantics can be used for role identification of agents where there is a need to do so; for example, “India beat Australia” is different from “Australia beat India”.

2.4 Twitter specific approaches

The main difference between sentiment analysis of Twitter and of documents is that Twitter-based approaches focus more on determining the polarity of words (mainly adjectives), whereas document-based approaches focus on the task of determining features in the text. There are three major approaches for Twitter-specific sentiment analysis.

• Lexical analysis approach

• Machine learning approach

• Hybrid approach

Using one or a combination of these different approaches, one can employ one or a combination of lexical and machine learning techniques; specifically, one can use unsupervised techniques, supervised techniques or a combination of them. We first review the lexical approaches, which focus on building successful dictionaries, then the machine learning approaches, which are primarily concerned with feature vectors, and finally the combination of both, i.e. the hybrid approach.


2.4.1 Lexical analysis approach

A lexical approach typically utilizes a dictionary or lexicon of pre-tagged words. Each word present in a text is compared against the dictionary. If a word is present in the dictionary, its polarity value is added to the “total polarity score” of the text. For example, if a match has been found with the word “excellent”, which is annotated in the dictionary as positive, then the total polarity score of the text is increased. If the total polarity score of a text is positive, the text is classified as positive; otherwise it is classified as negative. Although naive in nature, many variants of this lexical approach have been reported to perform well ([7], [11], [8]).
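A minimal sketch of this scoring scheme follows; the tiny pre-tagged lexicon is purely illustrative (practical systems use thousands of entries), and ties fall to the negative class exactly as in the rule described above.

    # A toy pre-tagged lexicon; the entries and weights are illustrative only.
    lexicon = {"excellent": +1, "good": +1, "great": +1,
               "bad": -1, "poor": -1, "terrible": -1}

    def lexical_polarity(text):
        # add the polarity value of every token that is present in the lexicon
        score = sum(lexicon.get(token, 0) for token in text.lower().split())
        # a positive total polarity score means a positive text, otherwise negative
        return "positive" if score > 0 else "negative"

    print(lexical_polarity("the movie was excellent and the cast was great"))  # positive
    print(lexical_polarity("the camera is good but the battery is terrible"))  # score 0, so negative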

Since the classification of a statement is dependent upon the scoring it receives, there is a large volume of work devoted to discovering which lexical information works best. As a starting point for the field, [9] demonstrated that the subjectivity of an evaluative sentence could be determined through the use of a hand-tagged lexicon comprised solely of adjectives. They report over 80 percent accuracy on single phrases. Extending this work, [7] utilized the same methodology and hand-tagged adjective lexicon as [9], but they tested the paradigm on a dataset composed of movie reviews; they reported a much lower accuracy rate of about 62 percent. Moving away from hand-tagged lexicons, Turney [2] utilized an Internet search engine to determine the polarity of words to be included in the lexicon [7]. Turney [2] performed two AltaVista4 search queries: one with a target word conjoined with the word “good”, and a second with the target word conjoined with the word “bad”. The polarity of the target word was determined by the search that returned the most hits. This approach improved accuracy to 65 percent.

In a study, [8] and [10] chose to use the WordNet5 database to determine the polarity of words. They compared a target word to two pivot words (usually “good” and “bad”) to find the minimum path distance between the target word and the pivot words in the WordNet hierarchy. The minimum path distance was converted to an incremental score and this value was stored with the word in the dictionary. The reported accuracy level of this approach was 64 percent [10]. An alternative to the WordNet metric, proposed by [11], was to compute the semantic orientation of a word [7]. By subtracting a word’s association strength to a set of negative words from its association strength to a set of positive words, [11] were able to achieve an accuracy rate of 82 percent using two different semantic orientation statistic metrics. Figure 2.1 shows the generic architecture of a lexical approach classifier.

4 AltaVista [in.altavista.com]: Was previously a popular internet search engine, which lost ground to Google after its expansion.
5 WordNet [wordnet.princeton.edu]: A lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.

Figure 2.1: Generic architecture of a lexical approach classifier.

2.4.2 Machine Learning approach

The other main avenue of research within this area has utilized supervised machine learning techniques. Within the machine learning approach, a series of feature vectors is chosen and a collection of tagged corpora is provided for training a classifier, which can then be applied to an untagged corpus of text. In a machine learning approach, the selection of features is crucial to the success rate of the classification. Most commonly, a variety of unigrams (single words from a document) or n-grams (two or more words from a document in sequential order) is chosen as feature vectors. Other proposed features include the number of positive words, the number of negating words, and the length of a document. Support Vector Machines (SVMs) ([14], [15]) and the Naive Bayes algorithm [16] are the most commonly employed classification techniques. The reported classification accuracy ranges between 63 percent and 84 percent, but these results are dependent upon the features selected.
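As an illustration of this supervised setup, the sketch below trains NLTK’s NaiveBayesClassifier on binary unigram-presence feature vectors; the two tiny labelled examples merely stand in for a real tagged corpus.

    from nltk.classify import NaiveBayesClassifier

    def unigram_features(tokens):
        # binary term-presence features over the unigrams of a document
        return {"contains({})".format(word): True for word in set(tokens)}

    # placeholder tagged corpus; a real experiment would use thousands of labelled texts
    train_set = [
        (unigram_features("i love this brilliant movie".split()), "positive"),
        (unigram_features("what a boring awful movie".split()), "negative"),
    ]

    classifier = NaiveBayesClassifier.train(train_set)
    print(classifier.classify(unigram_features("an awful and boring film".split())))  # negative
    classifier.show_most_informative_features(5)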

Since the majority of sentiment analysis approaches use machine learning techniques, the features of text are often represented as feature vectors. The following features are commonly used in sentiment analysis.

• TF-IDF (term frequency - inverse document frequency): as discussed in section 2.3.3, this represents the count of the “terms” or “words” taken into account. These terms may be unigrams, bigrams or higher-order n-grams. Which of these yields better results is still not settled: [4] claim the superiority of unigrams over bigrams in a movie review sentiment analysis, whereas [5] argue that “bi-grams and tri-grams give better results” on the basis of their product review classification analysis.

• POS (Part Of Speech) tags: By nature English is an ambiguous language; a particular word may have more than one meaning depending upon its context of use. POS tags are used to disambiguate sense, which in turn is used to guide feature selection [3]. For instance, these tags can be used to identify adjectives and adverbs, since they are generally used as sentiment indicators [2]. It was later realised that adjectives perform worse than the same number of unigrams chosen on the basis of frequency.

• Syntax and negation: The use of collocations and other syntactic features can improve performance. In the classification of short texts, algorithms using syntactic features and algorithms using n-gram features were reported to yield the same performance [3].

Figure 2.2 shows the generic architecture of a machine learning approach classifier.

Figure 2.2: Generic architecture of a machine learning approach classifier.

2.4.3 Hybrid approach

There are some approaches which use a combination of other approaches. One such combined approach is followed by [17]. They start with two word lexicons and unlabelled data. With the two discriminatory-word lexicons (negative and positive), they create pseudo-documents containing all the words of the chosen lexicon. After that, they compute the cosine similarity between these pseudo-documents and the unlabelled documents. Based on the cosine similarity, a document is assigned either positive or negative sentiment. They then use these automatically labelled documents to train a Naive Bayes classifier.

In a similar combined approach, [18] defined “a unified framework” that allows one to use background lexical information in terms of word-class associations, and renew this information for specific domains using any available training examples. They proposed and used a different approach which they called the polling multinomial classifier, which is another name for Multinomial Naive Bayes (based on the multinomial distribution function). They used manually labelled data for training, unlike [17]. They report that experiments reveal that their incorporation of lexical knowledge improves performance. They obtained better performance with their approach than approaches using lexical knowledge or training data in isolation, or other approaches that use combined techniques. There are also other types of combined approaches that are complementary, in that different classifiers are used in such a way that one classifier contributes to another [19].
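To make the pseudo-document idea of [17] concrete, the sketch below assigns a label to an unlabelled text by its cosine similarity to positive and negative pseudo-documents built from small illustrative lexicons; the word lists and the simple bag-of-words weighting are our assumptions, not the exact setup of [17].

    import math
    from collections import Counter

    def cosine(a, b):
        # cosine similarity between two bag-of-words vectors
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # pseudo-documents containing all the words of each (toy) discriminatory-word lexicon
    positive_pseudo = Counter("good great excellent love wonderful".split())
    negative_pseudo = Counter("bad poor terrible hate awful".split())

    def pseudo_label(text):
        vector = Counter(text.lower().split())
        pos, neg = cosine(vector, positive_pseudo), cosine(vector, negative_pseudo)
        return "positive" if pos >= neg else "negative"

    # documents labelled this way would then be used to train a Naive Bayes classifier
    print(pseudo_label("what a terrible and awful phone"))  # negative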


2.5 Performance review

Sentiment analysis has been applied to a variety of applications like movie reviews, blog reviews, etc. In their study, [20] carried out the task of evaluating the performance of the approaches discussed earlier on a news and movie review application for Twitter. For this purpose they employed the following resources.

1. Cornell Movie Review dataset of tagged blogs (1000 positive and 1000 negative) [2].

2. List of 2000 positive words and 2000 negative words from the General Inquirer

lists of adjectives [12].

3. Yahoo! Web Search API [13].

4. Porter Stemmer [21].

5. WordNet Java API [22].

6. Stanford Log Linear POS Tagger built with the Penn Treebank tag set [23].

7. WEKA Machine Learning Java API (only used for machine learning) [24].

8. SVM-Light Machine Learning Implementation [25].

The performance of various approaches reported by [20] is as follows.

2.5.1 Lexical approach performance

The performance of lexical approach variants is shown in Table 2.1.

Table 2.1: Performance of lexical approach variants [20].

Approach                    Accuracy
Baseline                    50.0
Baseline + Stemming         50.2
Baseline + Yahoo! words     57.7
Baseline + WordNet          60.4
Baseline + all              55.7

Combining all three variants (i.e. stemming, Yahoo! words and WordNet) led to a surprising drop in accuracy. By simultaneously increasing the number of words in the dictionary, applying a stemming algorithm, and assigning a weight to each dictionary word, they substantially increased the size of the dictionary. The proportion of positive and negative token matches did not change, and they hypothesize that for each positive match that was found, it was equally likely to find a negative match, thereby creating a neutral net effect.

Given these results, it appears that the selection of words included in the dictionary is very important for the lexical approach. If the dictionary is too sparse or too exhaustive, one risks over- or under-analyzing the results, leading to a decrease in performance. In accordance with previous findings, these results confirm that it is difficult to surpass the 65 percent accuracy level using a purely lexical approach.

2.5.2 Machine Learning approach performance

During the initial experimentation phase of the machine learning approach, [20] used the WEKA package [24] to obtain a general idea of ML algorithms. Their preliminary experiments indicate that the SVM and Naive Bayes algorithms are the most accurate (both giving each other tough competition).

Table 2.2: Performance of machine learning approach variants [20].

Approach                      SVM-Light   Naive Bayes
Unigram integer               77.4        77.1
Unigram binary                77.0        75.5
Unigram integer + aggregate   68.2        77.3
Unigram binary + aggregate    65.4        77.5

The Naive Bayes approach performs quite well in all variants (delivering greater than 75 percent). The unigram results classified by the SVM algorithm are on a similar level. After initially running the SVM-Light experiments, [20] observed that the confusion matrix results were heavily skewed to the negative side for all the feature representation vectors. To correct this imbalance, they tried introducing a threshold variable into the results. This threshold variable was set to -0.2, and whenever a blog was assigned a value greater than -0.2, it was classified as positive. Unfortunately, although this modification did succeed in evening out the confusion matrix, it did not increase the overall accuracy of the results.

The results from Tables 2.1 and 2.2 clearly indicate the superiority of machine learning approaches: even the worst of the ML results is superior to the best of the lexical results. It seems that the lexical approaches rely too heavily on semantic information. As both the lexical and ML approaches demonstrated, the inclusion of any type of “dictionary” information in an experiment does not automatically increase the method’s performance. Even though the ML results are superior, one must not forget that in order for an ML approach to be successful, a large corpus of tagged training data must first be collected and annotated, and this can be a challenging and expensive task.

2.5.3 Hybrid approach performance

Last but not least, a few researchers have proposed hybrid approaches for analyzing Twitter sentiments. For example, [26] and [17] propose a method to automatically create a training corpus using microblog-specific features like emoticons, which is subsequently used to train a classifier.

Distant learning is used by [27] to acquire sentiment data. They use tweets ending in positive emoticons like [“:)”, “:-)”, “:D”] as positive and tweets ending in [“:(”, “:-(”] as negative. They built models using Naive Bayes, Maximum Entropy and Support Vector Machines (SVMs), and they report that SVM outperforms the other classifiers. In terms of feature space, they try unigram and bigram models in conjunction with POS features. They note that the unigram model outperforms all other models; specifically, bigrams and POS features do not help. Table 2.3 shows the performance of hybrid approach variants.

Table 2.3: Performance of hybrid approach variants.

Approach                                                            Accuracy
SVM and NB classifiers trained with Usenet news groups data [19]    70.0
Class-two NB classifier trained with unlabelled data [17]           64.0
Class-two NB classifier trained with twitter data [18]              84.0

2.6 Chapter conclusion

It is clear from the analysis of the literature that machine learning approaches have so far proved to be outstanding in delivering accurate results. Depending upon the area of application, both approaches have an edge: for a small number of features the lexical analysis technique is respectable, whereas for a large number of features the machine learning technique dominates.

Both approaches have pros and cons. Lexical analysis is a ready-to-go technique which does not require any prior classification or training of datasets; it can be directly applied on live data, given that the feature set is large. In the machine learning technique, by contrast, the classifier needs to be initially fed or “trained” with raw datasets and tuned to cluster the sentiments into predefined classes, but it works efficiently on large texts with large feature support. Few features lead to lower accuracy for this technique.

In this chapter we presented a detailed literature review of the existing approaches

and techniques. The next chapter describes the detailed design and analysis of the

proposed hybrid naive bayes approach.


Chapter 3

Design and Analysis of Proposed Approach

We propose a hybrid approach for analysing the sentiments of Twitter users, drawing inspiration from nature’s inter-species hybridization of living organisms. The implementation methodology and the experimental setup are discussed in chapter 4.

3.1 Problem Statement

3.1.1 Problem Introduction

The field of sentiment analysis has always been a challenging area of research. The idea of applying sentiment analysis to Open Social Networks (OSNs) like Twitter, MySpace and Facebook is the current trend in research. Millions of people use microblogging services like Twitter to express their views and opinions on day-to-day affairs. Machine learning approaches have so far proved to outperform lexical approaches in terms of accuracy; however, the speed of lexical approaches is noteworthy. Almost all approaches aimed at classifying Twitter sentiments have so far fallen prey to the time vs performance dilemma. In the literature survey we summarized machine learning approaches yielding accuracy up to 80 percent. We firmly believe that by hybridizing lexical and machine learning approaches, results of 90 percent or more can be realised (see chapter 5).


3.1.2 Problem Definition

To propose a hybrid approach that yields competitive results by hybridizing machine learning and lexical approaches, and that captures and analyses the sentiments of users in an open social network like Twitter for exploring public opinion.

3.2 Proposed approach: Hybrid Naive Bayes

The superiority of both the lexical approach (for its speed) and the machine learning approach (for its accuracy) is well known. A lexical approach is fast because of the predefined features (e.g. a dictionary) it employs for extracting sentiments; having a dictionary to refer to at runtime reduces time consumption almost exponentially. To improve the performance of lexical approaches, the feature set has to be increased drastically, i.e. a very large dictionary of a variety of words with their frequencies has to be provided at runtime. This increases the overhead of the system and hence the performance suffers. Thus, there is a constant trade-off between performance and time. On the other hand, machine learning approaches recursively learn and tune their features; given large input datasets, this improves their performance well beyond what any lexical approach can achieve. However, due to this runtime tuning and learning, the system suffers drastically in terms of time.

Our goal is to propose an approach that is a combination of both lexical and machine learning, and hence exploits the best features of both in one. For this purpose we choose to employ a Naive Bayes classifier and empower it with an English lexical dictionary, SentiWordNet (refer chapter 4, section 1). Our hybrid naive bayes follows the usual four steps, namely: Data collection, Preprocessing, Training the classifier and Classification, shown in fig 3.1. Through the following sections we shall discuss each step in detail, one at a time. Figure 3.2 shows the system architecture of our proposed approach. The labels Phase I and Phase II shown in figure 3.2 are the deployment phases; they are discussed in the next chapter.
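Conceptually, the coupling of the two components can be pictured as in the sketch below: the Naive Bayes posterior is consulted first, and the SentiWordNet lexicon is consulted as a lexical signal. This is only one possible way to combine them, written with hypothetical helper names and an assumed confidence threshold; the actual integration used in this work is detailed in chapter 4.

    from nltk.corpus import sentiwordnet as swn   # requires the SentiWordNet corpus for NLTK

    def sentiwordnet_polarity(tokens):
        # lexical component: aggregate SentiWordNet positive/negative scores over the tokens
        score = 0.0
        for token in tokens:
            synsets = list(swn.senti_synsets(token))
            if synsets:  # crude approximation: use the first sense only
                score += synsets[0].pos_score() - synsets[0].neg_score()
        return "positive" if score > 0 else "negative"

    def hybrid_classify(nb_classifier, features, tokens, threshold=0.7):
        # machine learning component: Naive Bayes posterior over the two classes
        dist = nb_classifier.prob_classify(features)
        label = dist.max()
        if dist.prob(label) >= threshold:
            return label
        # low confidence: defer to the lexical (SentiWordNet) component
        return sentiwordnet_polarity(tokens)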


Figure 3.1: Process steps followed by hybrid naive bayes.

3.2.1 Data collection: Twitter API

For classification and for training the classifier we need Twitter data. For this purpose we make use of the APIs Twitter provides. Twitter provides two kinds of APIs: the Streaming API1 and the REST API2. The difference between them is that the Streaming API supports long-lived connections and provides data in almost real time, whereas the REST APIs support short-lived connections and are rate-limited (one can download a certain amount of data per day, about 150 tweets per hour, but no more). The REST APIs allow access to Twitter data such as status updates and user info regardless of time; however, Twitter does not make data older than about a week available, so REST access is limited to data tweeted within roughly the last week. Therefore, while the REST API allows access to these accumulated data, the Streaming API enables access to data as it is being tweeted.

1 Twitter Streaming API [https://dev.twitter.com/docs/streaming-apis]: Twitter allows developers and researchers to access realtime Twitter data easily through its Streaming API, subject to given constraints on API rate and other privacy policies.
2 http://dev.Twitter.com/do


Figure 3.2: System architecture of hybrid naive bayes approach.

The Search API provides users the ability to access Twitter search functionality. It uses GET requests and returns results formatted as ATOM or JSON; JSON is recommended due to its compactness. A maximum of 100 results are returned per page, for a maximum of 15 pages. Search requests take the general form shown below.

Notable arguments:

• q: The query string (required).

• lang: Restricts results to the specified language using its ISO 639-1 code.

• rpp: Results to return per page.

• page: Result page number to return.

An example URL request which returns 100 posts containing the word “stuff”, formatted using JSON, is given below. The tweets acquired in this step then pass through the preprocessing stage described next.
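As a sketch, assuming the legacy v1 Search API endpoint (search.twitter.com) that was current when this work was carried out and has since been retired, the general form and the example request look as follows:

    import urllib.parse

    # general form: http://search.twitter.com/search.json?q=<query>&lang=<code>&rpp=<n>&page=<n>
    params = {
        "q": "stuff",   # query string (required)
        "lang": "en",   # restrict results to English (ISO 639-1)
        "rpp": 100,     # results to return per page (maximum 100)
        "page": 1,      # result page number
    }
    url = "http://search.twitter.com/search.json?" + urllib.parse.urlencode(params)
    print(url)  # http://search.twitter.com/search.json?q=stuff&lang=en&rpp=100&page=1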


3.2.2 Preprocessing

The tweets gathered from Twitter are a mixture of URLs and other non-sentimental data such as hashtags ("#"), annotations ("@") and retweets ("RT"). To obtain n-gram features, we first have to tokenize the text input; tweets pose a problem for standard tokenizers, which are designed for formal and regular text. The figure below (fig. 3.3) displays the intermediate preprocessing steps; these are the features the classifier takes into account. We discuss each feature briefly, and a combined sketch of these steps is given after the list.

Figure 3.3: A sequence of intermediate preprocessing steps taking place at this level.

• Language detection - Since we are mainly interested in English text, all tweets are separated into English and non-English data. This is done using NLTK's language detection capability.

• Tokenize - For a sample input such as "Today the weather is sunny and beautiful", tokenizers divide the string into a list of substrings, also known as tokens. Tokenizing the text makes it easy to strip out unnecessary symbols and punctuation and to keep only those words that can contribute to the sentiment polarity score of the text.

• Constructing n-grams - We build a set of n-grams out of consecutive words, attaching a negation (such as "no" or "not") to the word which precedes or follows it. For example, the sentence "I do not like fish" forms the bigrams "I do+not", "do+not like" and "not+like fish". Such a procedure improves classification accuracy, since negation plays a special role in opinion and sentiment expression [28].

• Stop words - In information retrieval it is common practice to ignore very frequent words such as "a", "an" and "the", since their appearance in a post provides no useful information for classifying it. Since the query term itself should not be used to determine the sentiment of the post with respect to it, every query term is replaced with a QUERY keyword. Although this makes it somewhat of a stop word, it can still be useful when not using a bag-of-words model, where the location of the query in relation to other words becomes important.

• Strip smileys - Many microblogging posts use emoticons to convey emotion, which makes them very useful for sentiment analysis. A range of about 30 emoticons, including ":)", ":(", ":D", "=]", ":]", "=)", "=[" and "=(", are replaced with either a SMILE or a FROWN keyword. In addition, variations of laughter such as "haha" or "ahahaha" are all replaced with a single LAUGH keyword.

• Erratic casing and punctuation - To address the problem of posts containing erratic casing (e.g. "HeLLo"), we sanitize the input by lower-casing all words, which provides some consistency in the lexicon. In microblogging posts it is also common to use excessive punctuation in place of proper grammar to convey emotion; by identifying a series of exclamation marks, or a combination of exclamation and question marks, before removing all punctuation, the relevant features are retained while more consistency is maintained.
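A minimal sketch combining these preprocessing steps is given below. The regular expressions, the reduced stop-word list and the attachment of a negation to the following word only are simplifying assumptions for illustration, not the exact rules used in TSA.

import re

SMILEYS = [":-)", ":)", ":d", "=]", ":]", "=)"]
FROWNS = [":-(", ":(", "=[", "=("]
STOP_WORDS = set(["a", "an", "the", "is", "and"])   # illustrative subset only
NEGATIONS = set(["no", "not"])

def preprocess(tweet, query=None):
    text = tweet.lower()                                   # erratic casing
    for s in SMILEYS:
        text = text.replace(s, " SMILE ")                  # emoticons -> keywords
    for f in FROWNS:
        text = text.replace(f, " FROWN ")
    text = re.sub(r"\ba*(?:ha){2,}h?\b", " LAUGH ", text)  # laughter variants
    if query:
        text = text.replace(query.lower(), " QUERY ")      # mask the query term
    text = re.sub(r"http\S+|[@#]\w+|\brt\b", " ", text)    # urls, @, #, RT
    text = re.sub(r"[^\w\s]", " ", text)                   # remaining punctuation
    return [t for t in text.split() if t not in STOP_WORDS]

def negation_bigrams(tokens):
    # Attach a negation to the word that follows it, then form bigrams.
    merged, skip = [], False
    for i, tok in enumerate(tokens):
        if skip:
            skip = False
            continue
        if tok in NEGATIONS and i + 1 < len(tokens):
            merged.append(tok + "+" + tokens[i + 1])
            skip = True
        else:
            merged.append(tok)
    return zip(merged, merged[1:])

print(list(negation_bigrams(preprocess("I do not like fish :("))))
# -> [('i', 'do'), ('do', 'not+like'), ('not+like', 'fish'), ('fish', 'FROWN')]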

3.2.3 Training Data

To label the text precisely into its respective classes, and thus achieve the highest possible accuracy, we plan to train the classifier on pre-labelled Twitter data itself. Pre-labelled Twitter training data is not freely available, since this year Twitter changed its data privacy policies and no longer allows open sharing of Twitter content. However, Twitter states that using or downloading its content for individual research purposes is acceptable.

Labelled data

Since we do not have direct access to pre-labelled Twitter data, we planned to crawl it ourselves. We crawled datasets of various sizes, for various keywords, totalling approximately 4 million tweets, using a custom Python crawler (refer to chapter 4). Data obtained in this way is certainly not labelled, so to address this issue we crawl Twitter into two different datasets: the first consisting of all tweets containing positive-sentiment emoticons, i.e. [":)", ":-)", ":-D", ":D", "B-)"], and the second consisting of all tweets containing negative-sentiment emoticons, i.e. [":(", ":-(", ":'(", "X(", "X-("]. These datasets are then fed to the classifier for training, where they function almost like the hand-labelled datasets used in other sentiment analysis domains; a sketch of this labelling step is given below.
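The following minimal sketch illustrates this distant-supervision labelling; the input file name is hypothetical, and ambiguous tweets (carrying both kinds of emoticon, or none) are simply discarded.

POSITIVE_EMOTICONS = [":)", ":-)", ":-D", ":D", "B-)"]
NEGATIVE_EMOTICONS = [":(", ":-(", ":'(", "X(", "X-("]

def label_by_emoticon(tweet_text):
    has_pos = any(e in tweet_text for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet_text for e in NEGATIVE_EMOTICONS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None   # ambiguous or emoticon-free: not used for training

positive_set, negative_set = [], []
with open("crawled_tweets.txt") as f:    # one raw tweet per line (hypothetical file)
    for line in f:
        label = label_by_emoticon(line)
        if label == "positive":
            positive_set.append(line.strip())
        elif label == "negative":
            negative_set.append(line.strip())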

3.2.4 Sentiment Analysis - The classifier

Naive Bayes was our first choice, based on the inference from the literature review carried out in chapter 2. Naive bayes is an algorithm based on the Bayesian probability model. In general, all Bayesian models are derivatives of the well-known Bayes rule, which states that the probability of a hypothesis given certain evidence, i.e. the posterior probability of the hypothesis, can be obtained in terms of the prior probability of the evidence, the prior probability of the hypothesis, and the conditional probability of the evidence given the hypothesis. Mathematically,

P(H \mid E) = \frac{P(H)\, P(E \mid H)}{P(E)}    (3.1)

where P(H|E) is the posterior probability of the hypothesis, P(H) is the prior probability of the hypothesis, P(E) is the prior probability of the evidence, and P(E|H) is the conditional probability of the evidence given the hypothesis. Or, in a simpler form:

\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}}    (3.2)

To explain the concept, let us take an example. Suppose we have a new tweet to be classified into either the positive or the negative class, and that among the previously classified tweets the positive ones are twice as numerous as the negative ones. Since the new tweet's class is not known, the problem is to estimate correctly the class into which the tweet should be categorised. This can be done with Bayes rule, by calculating the likelihood of the tweet being positive or negative. Hence,


from eq. 3.1 we have:

P(n \mid p) = \frac{P(n)\, P(p \mid n)}{P(p)}    (3.3)

Since there are twice as many positive tweets as negative ones, it is reasonable to believe that a new case (which has not been observed yet) is twice as likely to belong to the positive class as to the negative one. In Bayesian analysis this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentages of positive and negative tweets, and are often used to predict outcomes before they actually happen. Thus we can write:

Prior probability of a positive tweet:  P(p) = \frac{\text{No. of positive tweets}}{\text{Total no. of tweets}}

Prior probability of a negative tweet:  P(n) = \frac{\text{No. of negative tweets}}{\text{Total no. of tweets}}

Let there be a total of 6k tweets (where k = 10^3), 4k of which are positive and 2k negative. Our prior probabilities for class membership are:

Prior probability of a positive tweet:  P(p) = \frac{4k}{6k} = \frac{4}{6} = \frac{2}{3}

Prior probability of a negative tweet:  P(n) = \frac{2k}{6k} = \frac{2}{6} = \frac{1}{3}

The likelihood of the tweet falling into either of the classes is taken to be equal, since we have only two classes, so the likelihood for X is 0.5. Calculating the (unnormalised) posterior probability of the new tweet X being positive or negative, we get:

• Posterior probability of X being positive = (prior probability of positive) × (likelihood of X being positive) = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3}, i.e. a 33.34% chance of X being positive.

• Posterior probability of X being negative = (prior probability of negative) × (likelihood of X being negative) = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}, i.e. a 16.67% chance of X being negative.

Thus this tweet falls into the positive class. In our case we have two hypotheses, together with many other features, and the class with the highest probability is chosen as the class of the tweet whose sentiment is being predicted. After every classification step all the probabilities are recalculated and updated accordingly.
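The arithmetic of the worked example above can be reproduced in a few lines; this sketch covers only the comparison of the two (unnormalised) posterior scores, not the full feature-based classifier.

# 4k positive and 2k negative previously classified tweets; equal likelihoods
# are assumed for the new tweet X, exactly as in the worked example.
positive_tweets, negative_tweets = 4000, 2000
total = positive_tweets + negative_tweets

prior_pos = positive_tweets / float(total)   # 2/3
prior_neg = negative_tweets / float(total)   # 1/3
likelihood_pos = likelihood_neg = 0.5        # equal for the two classes

score_pos = prior_pos * likelihood_pos       # ~0.3333 -> 33.34% in the text
score_neg = prior_neg * likelihood_neg       # ~0.1667 -> 16.67%
prediction = "positive" if score_pos > score_neg else "negative"
print("%s  pos=%.4f  neg=%.4f" % (prediction, score_pos, score_neg))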


Chapter 4

Implementation Methodology

In this chapter the experimental setup and evaluation methodology for the proposed

approach are discussed.

4.1 Experimental Setup

To test the proposed approach, we created a setup with the following system requirements. We tested our approach on both Linux and Windows platforms: a Dell Optiplex 980 Windows 7 Core i5 (64-bit) machine equipped with 4 GB of RAM, and a Linux server system with a quad-core processor equipped with 8 GB of RAM. The general requirements are shown in the table below.

Table 4.1: General system requirements for our approach.

Component                        Type
Operating system                 Windows XP/7/8, Linux (Ubuntu Server 12.04)
Processor                        C2D/i3/i5/i7 (32/64 bit)
Min. memory (RAM)                ≥ 4 GB
Min. storage                     20 GB
Bandwidth                        Uninterrupted high-speed Internet (1 Mbps connection)
Software and third-party tools   Linked Media Framework 2.3.5, NLTK 2.0, SentiWordNet 3.0

The tools and technologies used are as follows.

• Python 2.7 (implementation language): Python is a general-purpose, interpreted, high-level programming language whose design philosophy emphasizes code readability. Its syntax is clear and expressive, and it has a large, comprehensive standard library together with more than 25 thousand extension modules. We use Python for developing the backend of the test application and the crawler; these and the other modules implemented are discussed later.

• NLTK (language processing modules and validation): The Natural Language Toolkit (NLTK) is an open-source Python module for processing human language. It was created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources. NLTK was designed with four goals in mind:

1. Simplicity: provide an intuitive framework along with substantial building blocks, giving users practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data.

2. Consistency: provide a uniform framework with consistent interfaces and data structures, and easily guessable method names.

3. Extensibility: provide a structure into which new software modules can easily be accommodated, including alternative implementations and competing approaches to the same task.

4. Modularity: provide components that can be used independently without needing to understand the rest of the toolkit.

• LMF (persistent database): Unlike other database applications, this application requires a persistent database, because we have to query Twitter in real time and collect a certain number of tweets, so the database connection should stay open all the time. This feature is only available in some database server applications, namely Google App Engine¹ and the Linked Media Framework (LMF)². The core component of the Linked Media Framework is a Linked Data Server that exposes data following the Linked Data principles:

– Use URIs as names for things.

– Use HTTP URIs, so that people can look up those names.

– When someone looks up a URI, provide useful information, using the

standards (RDF, SPARQL).

– Include links to other URIs, so that they can discover more things.

The Linked Data Server implemented as part of the LMF goes beyond the Linked Data principles by extending them with Linked Data updates and by integrating the management of metadata and content, making both accessible in a uniform way. These extensions are described in more detail in the LMF documentation page LinkedMediaPrinciples. In addition to the Linked Data Server, the LMF core also offers a highly configurable semantic search service and a SPARQL endpoint; setting up and using the semantic search component is described in the SemanticSearch page, and accessing the SPARQL endpoint is described in the SPARQLEndpoint page. Whereas the extension of the Linked Data principles is already conceptually well described, a proper specification and extension of semantic search and the SPARQL endpoint for Linked Data servers is still being worked on.

LMF consists of several modules, some of them optional, that can be used to extend the functionality of the Linked Media Server:

– LMF Semantic Search - offers a highly configurable Semantic Search

service based on Apache SOLR. Several semantic search indexes can be

configured in the same LMF instance.

1 Google App Engine: Google App Engine (often referred to as GAE or simply App Engine) is a platform-as-a-service (PaaS) cloud computing platform for developing and hosting web applications in Google-managed data centers. Applications are sandboxed and run across multiple servers. App Engine offers automatic scaling for web applications: as the number of requests for an application increases, App Engine automatically allocates more resources to handle the additional demand.

2 LMF: The Linked Media Framework is an easy-to-set-up server application that bundles together several key open-source projects to offer advanced services for linked media management.


– LMF Linked Data Cache - implements a cache to the Linked Data Cloud that is transparently used when querying the content of the LMF using either LDPath, SPARQL (to some extent) or the semantic search component. In case a local resource links to a remote resource in the Linked Data Cloud and this relationship is queried, the remote resource is retrieved in the background and cached locally.

– LMF Reasoner - implements a rule-based reasoner that allows processing Datalog-style rules over RDF triples; the LMF Reasoner will be based on the reasoning component developed in the KiWi project, the predecessor of the LMF.

– LMF Text Classification - provides basic statistical text classification services; multiple classifiers can be created, trained with sample data and used to classify texts into categories.

– LMF Versioning - implements versioning of metadata updates; the module allows getting metadata snapshots of a resource for any time in its history and provides an implementation of the Memento protocol.

– LMF Stanbol Integration - allows integrating with Apache Stanbol for content analysis and interlinking; the LMF provides some automatic configuration of Stanbol for common tasks.

• SentiWordNet 3.0: SentiWordNet is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity. WordNet groups English words into sets of synonyms called "synsets", provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD-style license and can be downloaded and used freely; the database can also be browsed online³. SentiWordNet is the result of research carried out by Andrea Esuli and Fabrizio Sebastiani. Given that the sum of the opinion-related scores assigned to a synset is always 1.0, it is possible to display these values in a triangle whose vertices are the maximum possible values of the three dimensions. Figure 4.1 shows this graphical model for displaying the scores of a synset; it is used in the web-based graphical user interface through which SentiWordNet can be freely accessed at its website⁴. A minimal sketch of looking up these scores programmatically is given after the figure.

3 SentiWordNet is available online at: http://sentiwordnet.isti.cnr.it

4 SentiWordNet: http://patty.isti.cnr.it/esuli/software/SentiWordNet.

Figure 4.1: Polarity triangle of a synset in SentiWordNet [courtesy: http://sentiwordnet.isti.cnr.it].
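The sketch below shows how the three scores of a word's synsets can be looked up. It assumes the sentiwordnet corpus reader shipped with recent NLTK releases (after running nltk.download for "wordnet" and "sentiwordnet"); the NLTK 2.0 setup described above may instead parse the raw SentiWordNet 3.0 text file.

from nltk.corpus import sentiwordnet as swn

# Print positivity, negativity and objectivity for the adjective senses of
# "good"; the three scores of each synset sum to 1.0, as described above.
for s in swn.senti_synsets("good", "a"):
    print("%s  pos=%.3f  neg=%.3f  obj=%.3f"
          % (s.synset.name(), s.pos_score(), s.neg_score(), s.obj_score()))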

4.2 Evaluation

Accuracy is the ability of the classifier to correctly classify and label new tweets into their respective classes. To measure it we use the Natural Language Toolkit (NLTK). The classifier is first run on the crawled Twitter data, classifying tweets for the previously chosen keywords, and is thus trained. Thereafter, the classifier is run through NLTK's accuracy analysis function, which tests it on its own corpora; thus we obtain validated results. A minimal sketch of this check is given below.
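The following sketch illustrates this check with NLTK; the tiny labelled set stands in for the crawled training data, and the bag-of-words feature dictionaries are assumptions consistent with the preprocessing described in chapter 3.

import nltk

def features(tokens):
    # Bag-of-words feature dictionary, as expected by nltk.NaiveBayesClassifier.
    return dict((tok, True) for tok in tokens)

# Toy stand-in for the emoticon-labelled tweets produced earlier.
labelled_tweets = [
    (["SMILE", "love", "this", "movie"], "positive"),
    (["great", "day", "SMILE"], "positive"),
    (["FROWN", "hate", "waiting"], "negative"),
    (["terrible", "service", "FROWN"], "negative"),
    (["awesome", "concert", "SMILE"], "positive"),
]
data = [(features(toks), label) for toks, label in labelled_tweets]
cut = int(len(data) * 0.8)
train_set, test_set = data[:cut], data[cut:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy: %.4f" % nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(25)   # cf. the 25 features in chapter 5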

4.3 Test Application

The application "TSA" (Tweet Sentiment Analyzer) has been deployed in two phases (refer to chapter 3), namely:

• Phase I: deploying the TSA (alpha) system with the base naive bayes classifier and harvesting the analyzer results.

• Phase II: hybridizing the naive bayes classifier by embedding the SentiWordNet 3.0 lexical dictionary (beta). The analyzer is run through the NLTK tests and the results are extracted; later on we integrate LMF 2.3.5 for analyzing real-time Twitter data on the fly. One plausible way of embedding the lexicon is sketched after this list.
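One plausible way of embedding the lexicon, sketched below, is to aggregate SentiWordNet polarity over the tokens of a tweet and expose the result as extra features alongside the bag of words fed to the NLTK classifier. This is an illustrative hybridisation consistent with the description above, not necessarily the exact wiring implemented in TSA.

from nltk.corpus import sentiwordnet as swn

def hybrid_features(tokens):
    # Bag-of-words features plus aggregated SentiWordNet polarity features.
    pos_total = neg_total = 0.0
    for tok in tokens:
        synsets = list(swn.senti_synsets(tok))
        if synsets:                              # take the first listed sense
            pos_total += synsets[0].pos_score()
            neg_total += synsets[0].neg_score()
    feats = dict((tok, True) for tok in tokens)
    feats["swn_leans_positive"] = pos_total > neg_total
    feats["swn_leans_negative"] = neg_total > pos_total
    return feats

# hybrid_features(...) can then replace the plain bag-of-words features when
# training nltk.NaiveBayesClassifier, as in the evaluation sketch in section 4.2.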

Figure 4.2 shows the class diagram of the TSA application, generated using the PyNSource tool⁵.

5 PyNSource: PyNSource can convert Python source into Java files (just the class declarations, not the body of the code), so the resulting Java files can be imported into a more sophisticated UML modelling tool that understands Java, where the UML can be auto-laid-out and the resulting diagrams printed or captured as screenshots. The Java code generation can even be scripted, so that re-importing into the Java-based UML tool only brings in the changes to the diagram and the custom layout is not lost. PyNSource can also generate Delphi (Object Pascal) code.

Figure 4.2: Class diagram of the Tweet Sentiment Analyzer.

The classes in olive green are the root classes; the other, mixed-colour classes serve as intermediate functions for the root classes, while the ones in greyish-yellow are the terminal classes.

Chapter 5

Results and Analysis

In this chapter, the results of Phase I and Phase II are presented along with their analysis.

5.1 Results

The results of both the existing classifier and the proposed hybrid classifier are presented and compared in the order of deployment, i.e. Phase I and Phase II (refer to chapter 4).

• TSA Phase I: Tests were carried out using multiple Twitter datasets consisting of a mixture of new and old keywords such as "#ironman3", "#amitabhbachhan", "#Google", "#twitter" and "#robertdowneyjr". Our datasets vary in size from a few hundred tweets to a couple of million, with the goal of testing scalability while also ensuring performance. The performance of the base naive bayes classifier with respect to dataset size is shown in the table below (K = thousand, M = million).

Table 5.1: Performance of the base naive bayes classifier.

Dataset size    Accuracy (%)
1K              28.54
10K             29.57
50K             50.77
100K            59.96
1M              70.02

• TSA Phase II: After integrating the SentiWordNet lexical dictionary, the same procedure was carried out on the hybrid naive bayes classifier and the following results were harvested. Note that the same datasets were used for both classifiers.

Table 5.2: Performance of the hybrid naive bayes classifier.

Dataset size    Accuracy (%)
1K              63.95
10K             97.50
50K             98.60
100K            95.48
1M              94.50

Comparing the results of the base naive bayes classifier to those of the hybrid naive bayes classifier, we can straightforwardly conclude that the proposed hybrid classifier clearly outperforms the earlier approach. The main reason for the latter approach's higher accuracy is that polarity values are available beforehand. However, it can be observed that for the hybrid naive bayes classifier the accuracy gradually declines once the number of tweets grows above 50K. The main factors behind this phenomenon are:

– Quality of the dataset - while crawling a large dataset of tweets, periods of crawler inactivity resulting from network failures may lead to a broken or noisy dataset.

– Use of shorthand words - people generally tend to write short forms of words, e.g. "today" becomes "2day" and "that" is written as "dat". Such mixed literal lexicons, when looked up in the dictionary, do not count as a match or hit, thus degrading classifier accuracy.

• Most informative features: After training the naive bayes classifier with the tweets and comments, we asked the classifier to show the 25 most informative features it uses to decide whether a text should be classified as positive or negative. The ratio (neg:pos) or (pos:neg) indicates how much more often a particular word has been used with one sentiment than with the other. For instance, the word "neverknow" has been used 6.2 times more often with negative sentiment than with positive sentiment in the text (last word in figure 5.2).

Figure 5.1: Comparison of classifier performances, dataset size (no. of tweets) vs accuracy (%).

Some snapshots of the working system are presented below. Figures 5.3 and 5.4 show the base naive bayes classifier working on a Windows platform, figures 5.5 and 5.6 show the hybrid naive bayes classifier on a Windows platform, and figures 5.7 and 5.8 show the base naive bayes classifier on a Linux server platform.

Figure 5.2: The most informative features of the hybrid naive bayes classifier.

Figure 5.3: The base naive bayes classifier in action on a Windows platform with a 50k-tweet dataset.

Figure 5.4: Accuracy of the base naive bayes classifier with a 50k-tweet dataset.

Figure 5.5: The hybrid naive bayes classifier in action on a Windows platform with a 50k-tweet dataset.

Figure 5.6: Accuracy of the hybrid naive bayes classifier with a 50k-tweet dataset.

Figure 5.7: The base naive bayes classifier in action on a Linux server with a 10k-tweet dataset.

Figure 5.8: Accuracy of the base naive bayes classifier on a Linux server with a 10k-tweet dataset.

Chapter 6

Conclusion and Future Work

6.1 Conclusion

In the biological hybridization of species, it is well known that the limitations of both parent species can be overcome. Applying the same attitude in our proposal, the experimental studies performed through chapters 4 and 5 successfully show that hybridizing existing machine learning and lexical analysis techniques for sentiment classification yields comparatively more accurate results: for all datasets of 10K tweets or more, we recorded a consistent accuracy of ≥ 90%. Given the success of Hybrid Naive Bayes, it can positively be applied to other related sentiment analysis applications such as financial sentiment analysis (stock market opinion mining), customer feedback services, and so on.

6.2 Future Work

A substantial amount of work remains to be done; here we outline possible future avenues of research.

• Interpreting sarcasm: The proposed approach is currently incapable of interpreting sarcasm. In general, sarcasm is the use of irony to mock or convey contempt; in the context of the current work, sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. This limitation can be overcome by an exhaustive study of the fundamentals of "discourse-driven sentiment analysis", whose main goal is to empirically identify the lexical and pragmatic factors that distinguish sarcastic, positive and negative usage of words.

• Multi-lingual support: Due to the lack of a multi-lingual lexical dictionary, it is currently not feasible to develop a multi-language sentiment analyzer. Further research can be carried out into making the classifiers language-independent, as shown in [30], where the authors propose a language-independent sentiment analysis system based on support vector machines; a similar approach could be applied to make our system language-independent.


Bibliography

[1] M. Bautin, L. Vijayrenu, and S. Skiena, "International sentiment analysis for news and blogs", In Second International Conference on Weblogs and Social Media (ICWSM), 2008.

[2] P. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews", In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417-424, 2002.

[3] B. Pang and L. Lee, "Opinion mining and sentiment analysis", Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135, 2008.

[4] B. Pang and L. Lee, "Using very simple statistics for review search: An exploration", In Proceedings of the International Conference on Computational Linguistics (COLING), 2008.

[5] K. Dave, S. Lawrence, and D. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews", pp. 519-528, 2003.

[6] N. Godbole, M. Srinivasaiah, and S. Skiena, "Large-scale sentiment analysis for news and blogs", 2007.

[7] A. Kennedy and D. Inkpen, "Sentiment classification of movie and product reviews using contextual valence shifters", Computational Intelligence, pp. 110-125, 2006.

[8] J. Kamps, M. Marx, and R. Mokken, "Using WordNet to measure semantic orientation of adjectives", LREC 2004, vol. IV, pp. 1115-1118, 2004.

[9] V. Hatzivassiloglou and J. Wiebe, "Effects of adjective orientation and gradability on sentence subjectivity", Proceedings of the 18th International Conference on Computational Linguistics, New Brunswick, NJ, 2000.

[10] A. Andreevskaia, S. Bergler, and M. Urseanu, "All blogs are not made equal: Exploring genre differences in sentiment tagging of blogs", In International Conference on Weblogs and Social Media (ICWSM-2007), Boulder, CO, 2007.

[11] P. Turney and M. Littman, "Measuring praise and criticism: Inference of semantic orientation from association", ACM Transactions on Information Systems, pp. 315-346, 2003.

[12] P. Stone, J. Dunphy, and D. Smith, "The General Inquirer: A Computer Approach to Content Analysis", MIT Press, Cambridge, 1966.

[13] Yahoo! Search Web Services, online: http://developer.yahoo.com/search/, 2007.

[14] J. Akshay, "A framework for modeling influence, opinions and structure in social media", In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, Vancouver, BC, pp. 1933-1934, July 2007.

[15] K. Durant and M. Smith, "Mining sentiment classification from political web logs", In Proceedings of the Workshop on Web Mining and Web Usage Analysis of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (WebKDD-2006), Philadelphia, 2006.

[16] S. Prasad, "Micro-blogging sentiment analysis using bayesian classification methods", 2010.

[17] A. Pak and P. Paroubek, "Twitter based system: Using twitter for disambiguating sentiment ambiguous adjectives", pp. 436-439, 2010.

[18] B. Liu, X. Li, W. S. Lee, and P. S. Yu, "Text classification by labeling words", Proceedings of the National Conference on Artificial Intelligence, pp. 425-430, AAAI Press / MIT Press, 2004.

[19] P. Melville, W. Gryc, and R. D. Lawrence, "Sentiment analysis of blogs by combining lexical knowledge with text classification", In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1275-1284, 2009.

[20] R. Prabowo and M. Thelwall, "Sentiment analysis: A combined approach", Journal of Informetrics, vol. 3, no. 2, pp. 143-157, 2009.

[21] M. Annett and G. Kondrak, "A comparison of sentiment analysis techniques: Polarizing movie blogs", pp. 25-35, 2008.

[22] Porter Stemmer (Java implementation), online: http://tartarus.org/~martin/PorterStemmer/java.txt, 2007.

[23] C. Fellbaum, "WordNet: An Electronic Lexical Database", Language, Speech, and Communication Series, MIT Press, Cambridge, 1998.

[24] K. Toutanova and C. Manning, "Enriching the knowledge sources used in a maximum entropy part-of-speech tagger", In Proceedings of EMNLP/VLC-2000, Hong Kong, China, pp. 63-71, 2000.

[25] I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques", 2nd edition, Morgan Kaufmann, San Francisco, 2005.

[26] A. Go, R. Bhayani, and L. Huang, "Twitter sentiment classification using distant supervision", Technical report, Stanford, 2009.

[27] J. Read, "Using emoticons to reduce dependency in machine learning techniques for sentiment classification", In Proceedings of the Association for Computational Linguistics (ACL), 2005.

[28] T. Joachims, "Making large-scale SVM learning practical", In Advances in Kernel Methods - Support Vector Learning, pp. 169-184, MIT Press, Cambridge, 1999.

[29] J. Wiebe, T. Wilson, and C. Cardie, "Annotating expressions of opinions and emotions in language", Language Resources and Evaluation (formerly Computers and the Humanities), vol. 39, no. 2/3, pp. 164-210, 2005.

[30] S. Narr, M. Hülfenhaus, and S. Albayrak, "Language-independent Twitter sentiment analysis", The 5th SNA-KDD Workshop, 2011.