MingHsuanWu Graduate Report (sci.tamucc.edu/~cams/projects/431.pdf)


ABSTRACT

Social media has become an important venue for social networking and content sharing. Twitter, an online social network, allows users to post short text messages, known as tweets, of up to 140 characters. Sentiment analysis is widely applied to Twitter for opinion mining because its large user base, drawn from different sociocultural zones, makes it a good platform for the task. The objective of sentiment analysis is to identify any clue of positive or negative emotion in a piece of text that reflects the author’s opinion on a subject.

In this project, twitter4j, a Java library for the Twitter API, is used to search Twitter for selected popular electronic products. K-means clustering is used to find clusters of similar sentences, where similar sentences are those that share the same keywords. The tweets in a cluster therefore describe how people think about similar features of the selected products. Each cluster is entered into a feature-based sentiment analysis system to obtain a score. After that, the complete set of tweets is also processed by the sentiment analysis system to analyze how people think about each product overall. The system uses TF-IDF, the k-means algorithm, SentiWordNet, and the Stanford tool to handle the different steps.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Background and Rationale
    2.1 Sentiment Computing and Classification
    2.2 Clustering
        2.2.1 Twitter Clusters System
        2.2.2 K-means Algorithm
    2.3 Sentiment Analysis
    2.4 Feature-based Sentiment Analysis Systems
3. Clustering and Sentiment Analysis
    3.1 Problem Report
    3.2 Project Objective
    3.3 The Steps of Project
        3.3.1 TF-IDF
        3.3.2 K-means Algorithm
        3.3.3 Sentiment Analysis System
4. Implementation and Results
    4.1 Environment
        4.1.1 Microsoft Visual C#
        4.1.2 Java Swing
        4.1.3 Twitter4j
        4.1.4 NetBeans IDE
    4.2 Software Modules
    4.3 Clustering Tweets
    4.4 Sentiment Analysis
5. Testing and Evaluation
    5.1 iPhone 6
    5.2 Play Station 4
    5.3 Xbox One
6. Conclusion and Future Work
Bibliography and References

LIST OF FIGURES

Figure 2.1. Sentiment Computing and Classification
Figure 2.2. Clustering
Figure 2.3. Twitter Clusters System Design
Figure 2.4. K-means Algorithm
Figure 2.5. Flow Diagram of the Proposed System
Figure 3.1. The TF * IDF of Term t in Document d is Calculated
Figure 3.2. Project Steps
Figure 4.1. Twitter4j Output
Figure 4.2. Tweets after Human Inspection
Figure 4.3. Clustering Interface
Figure 4.4. Sentiment Analysis Interface
Figure 4.5. Cluster Interface: Enter Cluster Number
Figure 4.6. Cluster Interface: Enter Text Document
Figure 4.7. Cluster 1
Figure 4.8. Cluster 2
Figure 4.9. Sentiment Analysis: Score of the Cluster 1
Figure 4.10. Sentiment Analysis: Score of the Cluster 2
Figure 4.11. Sentiment Analysis: Tagging of the Cluster 1
Figure 4.12. Sentiment Analysis: Tagging of the Cluster 2
Figure 5.1. U.S. Sales of PS4 and Xbox One
Figure 5.2. System Output for All Data

LIST OF TABLES

Table 5.1. iPhone 6 Clusters and Score
Table 5.2. Evaluation Report of iPhone 6
Table 5.3. PS4 Clusters and Score
Table 5.4. Evaluation Report of PS4
Table 5.5. Xbox One Clusters and Score
Table 5.6. Evaluation Report of Xbox One
Table 5.7. Compare PS4 and Xbox One


1. INTRODUCTION

Twitter is a microblogging website that has become increasingly popular with the online community. Users post short messages, known as tweets, which are limited to 140 characters. Users frequently share personal opinions on many subjects, discuss current topics, and write about life events. The platform is favored by many users because it is free from political and economic limitations and is easily available to millions of people. As the number of users increases, microblogging platforms are becoming a place to find strong viewpoints and sentiment.

Twitter data has been used for prediction in many different areas. For example, it has been used to predict stock market movements [1] and to forecast box-office revenues for movies [2]. These case studies suggest that Twitter is useful for predicting how products, services, or markets will perform, which is one important reason why Twitter is chosen here to gauge how people think about popular electronic products. Another reason is that Twitter serves as a worthy platform for sentiment analysis due to its large user base from a variety of social and cultural regions worldwide. Twitter contains a vast number of tweets, with millions added every day. These tweets can be collected easily through its APIs (Application Programming Interfaces), which makes it easy to build a large training set.


2. BACKGROUND AND RATIONALE

2.1 Sentiment Computing and Classification

Sina Weibo is a Chinese microblogging website similar to Twitter. It allows users to post with a 140-character limit, to mention or talk to other people using the "@UserName" format, and to add hashtags using the "#HashName#" format. Weibo is one of the most popular sites in China, in use by well over 30% of Internet users, with a market penetration similar to that of Twitter in the United States [3].

The approach reviewed here builds a Sentiment Dictionary using the Word2vec tool, modeled after the Semantic Orientation Pointwise Similarity Distance (SO-SD) model [4]. Once this step is completed, the Emotional Dictionary is used to extract emotional trends from messages posted by users on Weibo. Weibo contents are categorized into three groups: positive, negative, and neutral. After the grouping, the approach uses the Paoding word-segmentation tool to split the Weibo contents into individual Chinese words. Next, 70% of the processed words are used to train the Word2vec tool, which yields an extended Weibo Sentiment Dictionary. The remaining 30% of the words are used to confirm the success of the approach. Finally, the Weibo Sentiment Dictionary is used to estimate Weibo sentiment trends. Figure 2.1 illustrates the steps in this approach.

Figure 2.1. Sentiment Computing and Classification [3]

An easy way to examine the resulting representations is to find the words most closely related to, or commonly synonymous with, a word specified by the user. The distance tool helps to complete this task. For example, if you enter 'Boston', the distance tool displays the most closely related words and their distances to 'Boston'.

This approach uses 70% of the collected words to train the Word2vec tool. The remaining 30% of the collected words are used to estimate the Weibo sentiment trends. A weakness is that the most useful data is insufficient, because so much of the data is used to extend the basic dictionary.

2.2 Clustering

One of the issues with Twitter is that users post many opinions, and these opinions are broad. Users discuss many different topics in their posts, so the posts cover more than just product reviews. Collecting such wide-ranging data would give inaccurate results, which is why clustering must be applied first to help discover data with similarities. Clustering can be considered one of the most important machine learning problems. It is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar to each other than to those in other clusters [5].

Figure 2.2. Clustering [6]

In Figure 2.2, the data can easily be separated into 3 clusters. Distance is the important criterion, because each object should belong to a cluster: two or more objects belong to the same cluster if they are close according to the distance.

2.2.1 Twitter Clusters System

Figure 2.3 shows the whole design of the method. To apply this method, a set of steps must be followed. First, eight Twitter feeds are selected so that all tweets are in English and likely to form clusters. Second, approximately 10,000 tweets are collected over 9 days out of a two-month time frame. Third, the tweets are organized, and tweets with a minimum of 60 characters that are similar are removed in order to prevent repetition of tweeted news.

Fourth, spaces are added around punctuation such as , ; : - but not around . or ’ because splitting words such as “U.S.” or “don’t” is not wanted. Fifth, basic stop words and Twitter-specific stop words such as “alert” and “breaking” are removed. Sixth, stemming should be avoided if the word clusters need to make sense and may later be used for search. Seventh, with these features, a word co-occurrence matrix W is created, in which Wij is set to n if there are n tweets that contain both the features i and j. After that, the weight matrix is used to perform spectral clustering on W to obtain word clusters. Last, in addition to the words, the reverse index is used to obtain tweet clusters [7].

Figure 2.3. Twitter Clusters System Design [7]
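The tokenization and co-occurrence steps described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the implementation from [7]. The function names `tokenize` and `cooccurrence`, the short stop-word list, and the sample tweets are all hypothetical. The sketch pads the punctuation marks , ; : - with spaces, leaves periods and apostrophes inside words, drops stop words, and counts, for every pair of features, the number of tweets that contain both.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list: a few basic words plus the Twitter-specific
# examples named in the text ("alert", "breaking").
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of", "and", "to",
              "alert", "breaking"}
PUNCTUATION = {",", ";", ":", "-"}


def tokenize(tweet):
    """Lowercase the tweet and pad , ; : - with spaces.

    Periods and apostrophes are left alone, so "U.S." and "don't"
    stay whole, as the text requires.
    """
    padded = re.sub(r"([,;:\-])", r" \1 ", tweet.lower())
    return [t for t in padded.split()
            if t not in STOP_WORDS and t not in PUNCTUATION]


def cooccurrence(tweets):
    """Return W as a Counter: W[(i, j)] = number of tweets containing both i and j."""
    w = Counter()
    for tweet in tweets:
        # Each feature counts once per tweet; pairs are stored in sorted order.
        features = sorted(set(tokenize(tweet)))
        for i, j in combinations(features, 2):
            w[(i, j)] += 1
    return w


tweets = [
    "Breaking: U.S. markets rally, tech leads",
    "U.S. tech stocks don't slow down",
    "Alert: markets rally again",
]
W = cooccurrence(tweets)
print(W[("tech", "u.s.")])  # number of tweets containing both "tech" and "u.s."
```

With these three sample tweets, the features "tech" and "u.s." appear together in two tweets, so the corresponding entry of W is 2. A spectral clustering step would then operate on W; that step is outside the scope of this sketch.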

Unfortunately, the negative side of this method is that too much time is taken for data collection. Furthermore, the method relies heavily on clustering, which adds to the amount of time used. Most of the time spent on clustering goes to finding a good center point. Therefore, less clustering is usually a better choice when trying to save time.

2.2.2 K-means Algorithm

The k-means clustering algorithm is known to be efficient in clustering large data sets, and is one of the simplest and best known machine learning algorithms that solve the well-known clustering problem [8].

Figure 2.4. K-means Algorithm [9]

Figure 2.4 shows the four steps of the k-means algorithm. The first step divides the items into k nonempty subgroups. The second step computes seed points as the centroids of the clusters of the current partition; the centroid is the center, which means the middle point, of the cluster. The third step assigns each object to the cluster with the nearest seed point. The fourth and last step goes back to Step 2 and stops when the assignment no longer changes [9].
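The four steps above can be sketched as follows. This is a minimal illustration in Python, not the report's own C#/Java implementation. The round-robin initial partition, the squared Euclidean distance, the `max_iter` safeguard, and the sample points are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Mean point of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100):
    # Step 1: divide the items into k nonempty subgroups
    # (round-robin split; an assumption, any nonempty partition would do).
    assignment = [i % k for i in range(len(points))]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the seed points as centroids of the current partition.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous center if a cluster empties out
                centers[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 4: go back to Step 2 unless the assignment did not change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = kmeans(points, k=2)
print(assignment)
```

On the sample points, the three points near the origin end up in one cluster and the three points near (10, 10) in the other, even though the initial round-robin partition mixes them.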

The positive side of k-means is its simplicity. All that is needed is to choose k and run the algorithm a number of times, and it works especially well if the clusters are circular in shape. Most people do not need a more complex clustering algorithm.

The k-means process has some weaknesses. First, it is difficult to compare the quality of the clusters produced. Second, because the number of clusters is fixed in advance, it can be hard to find out what k should be. Third, k-means only works well with circular cluster shapes. Fourth, different initial partitions may lead to different final clusters. It is therefore useful to run the program again with the same and with different values of k and compare the outcomes [9].

2.3 Sentiment Analysis

Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards things such as products, services, organizations, individuals, issues, events, topics, and their attributes. It represents a large problem space. There are also many names for slightly different tasks, e.g., sentiment analysis, opinion mining, opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis, review mining, etc. However, they now all fall under the umbrella of sentiment analysis or opinion mining [10].

With sentiment analysis, we can learn how users feel about a product or service, which can help corporations in particular with business decisions. Political parties and social organizations can also collect feedback about their programs. Furthermore, entertainers such as actors, musicians, and artists can connect with their fans and learn their views on their work. In most cases, sentiment analysis can act as an automatic surveying method that does not require manual entry [11].

2.4 Feature-based Sentiment Analysis

A document of people’s opinions is made up of paragraphs, a paragraph is made up of sentences, and a sentence is made up of words. Therefore, the first feature that feature-based sentiment analysis models examine is the word in a sentence. The analysis determines whether the opinions are positive, negative, or neutral. The opinions can be about a topic, event, product, service, etc. Sentiment analysis separates the document into paragraphs, then separates each paragraph into sentences, and finally separates each sentence into words. In the next step, sentiment analysis builds up features from the word level through the sentence level and paragraph level to the document level. Once this is complete, the positive, negative, or neutral score is calculated at each level and the scores are added together into a final score. In this way the opinion is converted to a number, and the number is analyzed to understand what people really think.

This feature-based sentiment analysis system uses the Stanford tool and SentiWordNet [12]. SentiWordNet is a resource for supporting opinion mining applications. It tags all the WordNet synsets according to positive, negative, and neutral opinion [13]. The system has two steps: preparing data and building processing components [14]. First, the system uses SentiWordNet to create lists of positive and negative words, as well as lists of words that can reverse, increase, or decrease the opinion. Second, the system uses the processing components, taking text files from Twitter as input, to find the product and the comments. The system uses an open source tool called Stanford for stemming and tagging the parts of speech.

Figure 2.5. Flow Diagram of the Proposed System [14]

First, in the stemming part, all data from the text document is collected. Second, the Stanford POS Tagger is used to do the POS tagging [15]. Third, SentiWordNet 3.0 is used to make the positive and negative word lists. Fourth, enriching tags are used as special tags for the reversed word lists. For example, a negated Neg is positive. The increase and decrease words are tagged to increase or decrease the opinion. Fifth, sentence-level opinion mining sets all opinion values to begin at 0. The tags lpos, pos, and vpos are worth +1, +2, and +3. The tags lneg, neg, and vneg are worth -1, -2, and -3. For example, “good” and “easy to use” are each +2, while “bad” and “hard to use” are each -2. Next, the score is calculated using sentence-level opinion combination methods. Last, the totals of all sentence-level opinions are added together. A table is used to verify whether the opinion text is positive or negative. For instance, if the final score is more than 60%, this shows a strong positive. However, if the final score is less than -60%, this shows a strong negative. For example, suppose we want to analyze the sentence “this phone is good and easy to use”. After processing, the sentence becomes:

This/[POS_DT|Stm_this] phone/[POS_NN|Stm_phone] is/[POS_VBZ|Stm_be] good/[POS_JJ|Stm_good|Opn_positive|pos] and/[POS_CC|Stm_and] easy/[POS_JJ|Stm_easy|pos] to/[POS_TO|Stm_to|pos] use/[POS_VB|Stm_use|pos].

The POS tag shows whether the word is an adjective, noun, or verb. The Stm tag holds the stem of each word separated from the sentence. If the word is useful for the opinion, pos is tagged at the end. In this sentence, pos = +4 (+2 for “good” and +2 for “easy to use”) and neg = 0, so result = (4*100)/(4+0+1) = 80%. The score of the sentence is therefore 80% after calculating the scores of the positive and negative words.
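The worked example above can be reproduced with a short sketch, written in Python for illustration only. The text shows the formula only for an all-positive sentence, (4*100)/(4+0+1), so the use of pos - neg in the numerator is an assumption made here so that negative sentences yield negative percentages. The function names, the `vpos` spelling of the strongest positive tag, and the label for scores between the two thresholds are also hypothetical.

```python
# Tag weights as listed in the text (the +3 tag is assumed to be "vpos").
WEIGHTS = {"lpos": 1, "pos": 2, "vpos": 3, "lneg": -1, "neg": -2, "vneg": -3}


def sentence_score(opinion_tags):
    """Percentage score for one sentence, given one tag per opinion phrase."""
    pos = sum(WEIGHTS[t] for t in opinion_tags if WEIGHTS[t] > 0)
    neg = sum(-WEIGHTS[t] for t in opinion_tags if WEIGHTS[t] < 0)
    # Assumption: the numerator is pos - neg. With neg == 0 this reduces to
    # the formula in the text, (pos * 100) / (pos + neg + 1).
    return (pos - neg) * 100 / (pos + neg + 1)


def label(score):
    """Apply the two thresholds mentioned in the text."""
    if score > 60:
        return "strong positive"
    if score < -60:
        return "strong negative"
    return "weak or neutral"  # hypothetical label; the full table is not given


# "this phone is good and easy to use": "good" and "easy to use" are each pos (+2).
score = sentence_score(["pos", "pos"])
print(score, label(score))
```

For the example sentence the sketch gives 80.0, matching the 80% computed in the text, which falls above the 60% threshold for a strong positive.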

The negative side of this method is that it cannot manage wide-ranging opinions from users. The data needs to be pre-processed at the beginning, because this allows the sentiment analysis system to make better judgments about which opinions are useful and whether they are positive or negative.


3. CLUSTERING AND SENTIMENT ANALYSIS

3.1 Problem Report

The feature-based sentiment analysis system already extends word-level and sentence-level analysis to the text level. It is acceptable to use this system on product reviews on Amazon, because people focus on their experience of using the product when they post a product review. On Twitter, people talk not only about the experience of using a product but also about many different things. Tweets are very noisy and more spread out than product reviews on Amazon. Therefore, we need to use clustering to separate all tweets into clusters in order to check how people think about particular features of products. This can make the approach more accurate and a better fit for Twitter.

3.2 Project Objective

The objective of this project is to achieve high-accuracy sentiment analysis. First, the Twitter API is used to collect content from Twitter that includes the name of a popular electronic product and save it to a text document. In this paper, iPhone 6, Play Station 4, and Xbox One are chosen as case studies. Second, clustering is used to pre-process the text document and separate all tweets into clusters. Each cluster has similar sentences or words. Third, each cluster is processed by the feature-based sentiment analysis system to obtain a score for that cluster. Fourth, the complete set of tweets is also processed by the feature-based sentiment analysis system.


3.3 The Steps of the Project

Sentiment analysis has become a popular method for opinion mining on social networks. Generally, this method is good enough to do the job. However, the opinions on Twitter are complicated, and as a result clustering is needed to organize tweets into clusters that have similarities. The Twitter API library twitter4j is used to get the tweets and save them to a text document [16]. K-means is chosen to do the clustering, to see what people think about different features of the products. Each cluster, whose sentences are similar and closely related, is entered into the feature-based sentiment analysis system. In addition, the complete set of tweets is also processed by the feature-based sentiment analysis system. Before k-means can be run on a series of text documents, the documents must be represented as mutually comparable vectors. To accomplish this, the TF-IDF score is computed for the documents.

3.3.1 TF-IDF

TF-IDF is short for term frequency-inverse document frequency. The main idea of TF-IDF is this: if a word or phrase appears with high frequency (TF) in one article and rarely appears in other articles, then the word or phrase is considered to have a good ability to distinguish between categories [17].

TF: the term frequency measures how many times a term occurs in a document. We can calculate the term frequency of a word as the ratio of the number of times the word occurs in the document to the total number of words in the document.

IDF: the inverse document frequency is a way to measure if the term is common

or not for all documents. It is taken by dividing the total number of documents by the

number of documents containing the term, and then taking the logarithm of that quotient [18]. Figure 3.1 shows how to calculate TF and IDF. First, the weight is highest when term t occurs many times within a small number of documents. Second, the weight is lower when the term occurs fewer times in a document, or occurs in many documents. Third, the weight is lowest when the term occurs in almost all documents [19].

Figure 3.1. The TF * IDF of Term t in Document d is Calculated

3.3.2 K-means Algorithm

The k-means algorithm consists of several steps. First, choose k, the number of clusters to be determined. Second, choose k objects randomly as the initial cluster centers. Third, assign each object to its closest cluster center and recompute the centers. The assignment and recomputation are repeated until the cluster centers no longer change.

3.3.3 Sentiment Analysis System

Figure 3.2 demonstrates the project steps. Clustering produces several clusters, and each cluster contains similar sentences. Each cluster is then put into the sentiment analysis system to find out how people think about particular features of the product. In addition, the full set of tweets is also processed by the sentiment analysis system. The sentiment analysis system has five steps. First, POS tagging decides whether a word is a verb, adjective, or noun. Second, SentiWordNet is used for word-level opinion tagging. Third, enriching tags increases or decreases the positive or negative score. For example, “very good” is stronger than “good”. Fourth, sentence-level opinion mining calculates all positive and negative scores in the sentence. Fifth, document-level opinion mining is similar to sentence-level opinion mining, but at the document level it calculates the score of all documents.

Figure 3.2. Project Steps
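As a minimal sketch of the TF-IDF weighting described in Section 3.3.1, the Python snippet below computes a weight for every term of every tweet. It is an illustration, not the project's own code (which is written in C# and Java): the function name `tf_idf`, the sample tweets, and the choice to normalize term frequency by document length are assumptions. The IDF part follows the definition above, the logarithm of the total number of documents divided by the number of documents containing the term.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF (assumed: count / document length) times IDF (log of quotient).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample tweets, already lower-cased and split into words.
tweets = [
    "iphone screen is big".split(),
    "iphone battery is weak".split(),
    "iphone screen looks great".split(),
]
weights = tf_idf(tweets)
```

In this toy data "iphone" occurs in every tweet, so its weight is zero, while "battery" occurs in only one tweet and receives the largest weight there. This matches the three cases listed for Figure 3.1.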

4. IMPLEMENTATION AND RESULTS

4.1 Environment

The proposed system is implemented in C# and Java. Java Swing and the Twitter4J library are the main components used. Microsoft Visual C# and the NetBeans IDE are the programming environments used, because they are well suited to C# and Java development, respectively.

4.1.1 Microsoft Visual C#

Microsoft Visual C# is Microsoft's implementation of the C# specification, and is part of the Microsoft Visual Studio product suite [20]. C# was created by Microsoft and is a multi-paradigm programming language encompassing strong typing and the imperative, declarative, functional, generic, object-oriented, and component-oriented programming disciplines [21].

4.1.2 Java Swing

Java Swing, which is part of Oracle's Java platform, is a Graphical User Interface (GUI) toolkit [22]. It lets programmers build GUIs for Java applications. Its components are described as lightweight and highly flexible. Swing offers many advanced components, including lists, tables, scroll panes, and tabbed panels. It also offers more familiar components, such as labels, checkboxes, and buttons. In addition, some of its components have drag and drop features for further ease of use.

4.1.3 Twitter4j

Twitter4J is an unofficial Java library for the Twitter API. With Twitter4J, you can easily integrate your Java application with the Twitter service.

4.1.4 NetBeans IDE

NetBeans is an integrated development environment (IDE) that is used mainly with Java, but it is also used with other languages, such as PHP, C/C++, and HTML5 [23]. Additionally, NetBeans is an application platform framework for Java desktop applications and others. The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris, and other platforms supporting a compatible JVM.

4.2 Software Modules

For this module, Twitter4j is used to collect the tweets. The important part is the text, so the user name, location, and time are all ignored. Figure 4.1 shows the results of this process.

Figure 4.1. Twitter4j Output

Unfortunately, there are a lot of noisy tweets from Twitter, so it is beneficial to use a combination of computer and human inspection to sort through them. The noisy tweets are checked manually to identify and eliminate outliers. Tweets that include a #HashName, an @UserName, or a website link are deleted. Figure 4.2 displays the tweets after human inspection.
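The computer-assisted part of this filtering could look roughly like the following Python sketch. It assumes one reading of the rule above, namely that a whole tweet is dropped when it contains a hashtag, a user mention, or a link. The pattern `NOISE`, the function `drop_noisy`, and the sample tweets are hypothetical, and the manual inspection for outliers is not modeled.

```python
import re

# Assumed noise markers: #HashName, @UserName, or an http(s) link.
NOISE = re.compile(r"#\w+|@\w+|https?://\S+")

def drop_noisy(tweets):
    """Keep only the tweets that contain none of the noise markers."""
    return [tweet for tweet in tweets if not NOISE.search(tweet)]

# Hypothetical sample input.
sample = [
    "Just held an iPhone6 +",
    "Love my #iPhone6",
    "@bob the battery is weak",
    "Review at http://example.com/x",
]
clean = drop_noisy(sample)
```

Here only the first sample tweet survives. A human reviewer would still need to check the remaining tweets for outliers.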

Figure 4.2. Tweets after Human Inspection

The interface for clustering uses C# and can be seen in Figure 4.3. At the beginning, the number of clusters must be chosen. Then there are two ways to enter the text into the interface. The first way is to enter the text in the text box field, where each entry represents a new document. Click the Add button once the text is entered, then click the Start button after all text has been entered. If these steps are followed, the clustering results appear on the right side.

Another way to enter tweets is from a text document. Click the File button to choose the text document. Then click the Add button to enter the data from the text document. After entering the data, click the Start button.

Figure 4.3. Clustering Interface

Figure 4.4 illustrates the User-Interface module and input handler. To use it, first enter the text in the text space above the slider bar. The text space under the slider bar displays the sentence-level opinion mining output, and the slider bar displays the document-level opinion mining output for the entire text.

Figure 4.4. Sentiment Analysis Interface

4.3 Clustering Tweets

Figure 4.5 illustrates entering 3 as the number of clusters.

Figure 4.5. Cluster Interface: Enter Cluster Number

Figure 4.6 displays clicking the File button to choose the text document. After that, click the Add button to add the tweets from the text document to the clustering.

Figure 4.6. Cluster Interface: Enter Text Document

Figure 4.7 displays cluster 1 once all tweets are entered and the clustering is completed.

Figure 4.7. Cluster 1

Figure 4.8 shows cluster 2 once all data is entered and the clustering is completed.

Figure 4.8. Cluster 2

4.4 Sentiment Analysis

Figure 4.9 shows how cluster 1 is selected and how its tweets are input into the sentiment analysis to receive a score. The score ranges from 100% to -100%. 100% means the most positive opinion, and -100% means the most negative opinion. The score of each sentence is shown at the end of the sentence. After that, the system adds all scores together and outputs the final score.

Figure 4.9. Sentiment Analysis: Score of the Cluster 1
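The scoring idea (word-level opinion scores, adjusted by enriching tags, added up per sentence and then over the whole input) can be sketched in Python as follows. This is an illustration under stated assumptions, not the system's implementation: `LEXICON` and `INTENSIFIERS` are tiny hypothetical stand-ins for SentiWordNet and the enriching-tag rules, and dividing the sum by the number of sentences is an assumed normalization that keeps the result between -100% and 100%.

```python
# Hypothetical word scores; the real system takes them from SentiWordNet.
LEXICON = {"good": 0.5, "great": 0.75, "bad": -0.5, "expensive": -0.5}
# Hypothetical enriching tags that strengthen or weaken the next word.
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}

def sentence_score(sentence):
    """Sum the word scores of one sentence, clipped to [-1, 1]."""
    score, boost = 0.0, 1.0
    for word in sentence.lower().split():
        if word in INTENSIFIERS:
            boost = INTENSIFIERS[word]  # applies to the following word only
            continue
        score += boost * LEXICON.get(word, 0.0)
        boost = 1.0
    return max(-1.0, min(1.0, score))

def document_score(sentences):
    """Add all sentence scores and express the result as a percentage."""
    if not sentences:
        return 0.0
    total = sum(sentence_score(s) for s in sentences)
    return 100.0 * total / len(sentences)  # assumed normalization
```

With these assumed values, `sentence_score("very good")` is larger than `sentence_score("good")`, which mirrors the enriching-tags example. Punctuation handling and POS tags are left out to keep the sketch short.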

Figure 4.10 illustrates how cluster 2 is selected and how its tweets are input into the sentiment analysis to receive a score.

Figure 4.10. Sentiment Analysis: Score of the Cluster 2

Figure 4.11 illustrates stemming, POS tagging, word-level opinion tagging, and enriching tags. For example, the POS tagging of the sentence “Just held an iPhone6 +” is “Just/[RB] held/[VBN] an/[DT] iPhone/[NNP] 6/[CD] +/[CC]”.

Figure 4.11. Sentiment Analysis: Tagging of the Cluster 1
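To show how tagged output in the `word/[TAG]` form above could be consumed by later steps, here is a small Python sketch that splits it into (word, tag) pairs. The regular expression `TAGGED` and the function `parse_tagged` are hypothetical helpers written for this example; they are not part of the Stanford tagger or of the project.

```python
import re

# Matches one token of the form word/[TAG], e.g. held/[VBN].
TAGGED = re.compile(r"(\S+)/\[([A-Z$]+)\]")

def parse_tagged(text):
    """Return the (word, tag) pairs found in a tagged sentence."""
    return TAGGED.findall(text)

pairs = parse_tagged("Just/[RB] held/[VBN] an/[DT] iPhone/[NNP] 6/[CD] +/[CC]")
# Select proper nouns, which are candidates for product names.
proper_nouns = [word for word, tag in pairs if tag == "NNP"]
```

For the example sentence, `proper_nouns` contains only "iPhone". The same filtering could select adjectives or verbs before word-level opinion tagging.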

Figure 4.12 shows stemming, POS tagging, word-level opinion tagging, and enriching tags.

Figure 4.12. Sentiment Analysis: Tagging of the Cluster 2

5. TESTING AND EVALUATION

iPhone 6, PlayStation 4, and Xbox One were chosen as keywords to search on Twitter. Tweets with these keywords were collected and saved to a text document. Once the tweets are collected, they are clustered, and each cluster is then processed by the sentiment analysis system. Clustering comes first because the tweets relate to different features of the products. For clustering, the k-means algorithm is used, and k is set to 3.
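The clustering step with k = 3 can be sketched in Python as below. This is a minimal illustration of the standard k-means procedure and not the project's C# implementation. The function names, the fixed random seed, and the 2-D sample points (standing in for the TF-IDF vectors of the tweets) are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; return (labels, centers)."""
    rng = random.Random(seed)
    # Choose k objects randomly as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each object to its closest cluster center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers no longer change
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical 2-D points standing in for tweet vectors.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
labels, centers = k_means(points, 3)
```

Because the initial centers are random, different seeds can lead to different clusterings. In practice the run could be repeated several times and the most compact result kept.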

5.1 iPhone 6

For the iPhone 6, the data set has a total of 88 tweets after human inspection. Clustering produces 3 clusters. Cluster 1 has 31 tweets, cluster 2 has 37 tweets, and cluster 3 has 20 tweets. The clusters are entered into the sentiment analysis system to compute the score. Table 5.1 shows the result of this computation.

Table 5.1. iPhone 6 Clusters and Score

Cluster   Tweets   Score (%)   Feature
1         31       77          screen
2         37       63          battery
3         20       71          price
Total     88       71

In cluster 1, 80.6% of the tweets relate to screen size (25 out of 31 tweets). In cluster 2, 86.5% of the tweets relate to battery life (32 out of 37 tweets). In cluster 3, 85% of the tweets mention price (17 out of 20 tweets). Judging by the scores, people are more satisfied with the iPhone 6 screen size than with the battery life. The score of the iPhone 6 screen size is 77%, and the score of the battery life is only 63%.

A few people were asked to manually judge whether each tweet is positive or negative. After that, classifier evaluation metrics and a confusion matrix were used to compare the score from this project with the judgment of the people who reviewed the content [24].

Table 5.2 shows the evaluation report for the iPhone 6. True positive (TP) means the human check and the system output are both positive. True negative (TN) means the human check and the system output are both negative. TP and TN mean the system made the correct determination. False negative (FN) means the human check is positive, but the system output is negative. False positive (FP) means the human check is negative, but the system output is positive. FN and FP mean the system made the wrong determination. ~FN and ~FP mean the system scored the tweet as neutral (0%), neither positive nor negative; these tweets are not counted in the formulas that follow.

Table 5.2. Evaluation Report of iPhone 6

Manual (human) / System Output   Positive (Score > 0%)   Neutral (Score = 0%)   Negative (Score < 0%)
Positive                         42 (TP)                 15 (~FN)               2 (FN)
Negative                         3 (FP)                  14 (~FP)               12 (TN)

The accuracy of the system is the percentage of test set tuples that are correctly classified. It is calculated by using the following formula.

Opinion Extraction Accuracy = (TP + TN) / (TP + TN + FP + FN)
                            = (42 + 12) / (42 + 12 + 3 + 2)
                            = 91.5 %

Precision is the percentage of tuples that the classifier labeled as positive that are actually positive. It is calculated by using the following formula.

Precision = TP / (TP + FP)
          = 42 / (42 + 3)
          = 93.3 %

Recall is the percentage of positive tuples that the classifier labeled as positive. It is calculated by using the following formula.

Recall = TP / (TP + FN)
       = 42 / (42 + 2)
       = 95.5 %
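The three formulas can be checked with a few lines of Python. The sketch below uses the iPhone 6 counts from Table 5.2 and leaves out the neutral (~FN, ~FP) tweets, as the formulas do. The function name `evaluate` is an assumption made for this example.

```python
def evaluate(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) as fractions between 0 and 1."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Counts for the iPhone 6 from Table 5.2: TP = 42, TN = 12, FP = 3, FN = 2.
accuracy, precision, recall = evaluate(tp=42, tn=12, fp=3, fn=2)
report = [round(100 * value, 1) for value in (accuracy, precision, recall)]
```

Rounded to one decimal place, `report` reproduces the 91.5%, 93.3%, and 95.5% given above. The same function applies to the counts reported for the other products.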

5.2 PlayStation 4

For the PlayStation 4 (PS4), the data set has a total of 92 tweets after human inspection. Clustering produces 3 clusters. Cluster 1 has 34 tweets, cluster 2 has 21 tweets, and cluster 3 has 37 tweets. Each cluster is entered into the sentiment analysis system to calculate the score. Table 5.3 shows the result.

Table 5.3. PS4 Clusters and Score

Cluster   Tweets   Score (%)   Feature
1         34       51          controller
2         21       67          game
3         37       72          price
Total     92       64

In cluster 1, 82.4% of the tweets relate to the PS4 controller (28 out of 34 tweets). In cluster 2, 81% of the tweets are about PS4 games (17 out of 21 tweets). In cluster 3, 78.4% of the tweets mention price (29 out of 37 tweets). Based on the scores, people are less satisfied with the PS4 controller than with the price. The score of the PS4 controller is just 51%, whereas the score of the price is 72%.

Table 5.4 shows the evaluation report of PS4.

Table 5.4. Evaluation Report of PS4

Manual (human) / System Output   Positive (Score > 0%)   Neutral (Score = 0%)   Negative (Score < 0%)
Positive                         30 (TP)                 22 (~FN)               3 (FN)
Negative                         9 (FP)                  11 (~FP)               17 (TN)

Opinion Extraction Accuracy = (30 + 17) / (30 + 17 + 9 + 3)
                            = 79.7 %

Precision = 30 / (30 + 9)
          = 76.9 %

Recall = 30 / (30 + 3)
       = 90.9 %

5.3 Xbox One

For the Xbox One, the data set has a total of 109 tweets after human inspection. Clustering produces 3 clusters. Cluster 1 has 38 tweets, cluster 2 has 23 tweets, and cluster 3 has 48 tweets. Each cluster is entered into the sentiment analysis system to calculate the score. Table 5.5 shows the result.

Table 5.5. Xbox One Clusters and Score

Cluster   Tweets   Score (%)   Feature
1         38       60          game
2         23       -59         price
3         48       55          controller
Total     109      53

In cluster 1, 86.8% of the tweets relate to Xbox One games (33 out of 38 tweets). In cluster 2, 78.3% of the tweets are about price (18 out of 23 tweets). In cluster 3, 79.2% of the tweets mention the Xbox One controller (38 out of 48 tweets). People are not satisfied with the price of the Xbox One and think it is too expensive. The score of the price is negative (-59%).

Table 5.6 shows the evaluation report of Xbox One.

Table 5.6. Evaluation Report of Xbox One

Manual (human) / System Output   Positive (Score > 0%)   Neutral (Score = 0%)   Negative (Score < 0%)
Positive                         27 (TP)                 22 (~FN)               10 (FN)
Negative                         4 (FP)                  17 (~FP)               29 (TN)

Opinion Extraction Accuracy = (27 + 29) / (27 + 29 + 4 + 10)
                            = 80 %

Precision = 27 / (27 + 4)
          = 87.1 %

Recall = 27 / (27 + 10)
       = 73 %

Table 5.7. Compare PS4 and Xbox One

Feature      PS4 score (%)   Xbox One score (%)
game         67              60
price        72              -59
controller   51              55
total        64              53

Table 5.7 shows a comparison of the PS4 and Xbox One. For games, people are more satisfied with the PS4 than with the Xbox One. For price, most people think the price of the PS4 is fine (72%), but they think the Xbox One is too expensive (-59%). For the controller, people like the Xbox One controller a little more (55% versus 51%). In fact, the PS4 has had better sales than the Xbox One in the USA. Figure 5.1 shows the cumulative U.S. sales since the release of Sony’s PS4 and Microsoft’s Xbox One.

Figure 5.1. U.S. Sales of PS4 and Xbox One [25]

Figure 5.2 shows the system output for all data.

Figure 5.2. System Output for All Data

6. CONCLUSION AND FUTURE WORK

This project finds out how people think about specific popular electronic products. It converts people’s words into numbers, and these numbers can then be analyzed to understand what different people think. The difficulty is making sure that this conversion is correct. Therefore, I use clustering together with a feature-based sentiment analysis system to improve the accuracy of the conversion.

The clustering and feature-based sentiment analysis system processes text documents collected from Twitter. Because the opinions on Twitter are complex and dispersed, clustering is needed to separate the data into clusters. In this project, Twitter4J, a Java library for the Twitter API, is used to get the data and save it to a text document. Then the k-means algorithm is used for clustering. After that, the feature-based sentiment analysis system processes the data. The sentiment analysis system works in seven main steps: stemming, POS tagging, word-level opinion tagging, enriching tags, sentence-level opinion mining, document-level opinion mining, and time-level opinion mining. The Stanford tool is used for stemming and POS tagging. SentiWordNet is used for the word-level tags and the enriching tags.

Apart from the work done on this system, future work mainly comprises the following objectives.

To handle the noisy data without human inspection.

To improve the speed with a large number of sentences and handle huge data.

To run this project on cloud computing with Hadoop and Mahout.

To run sentiment analysis in Chinese on Weibo.

BIBLIOGRAPHY AND REFERENCES

[1] Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1-8.

[2] Asur, S., & Huberman, B. A. (2010, August). Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on (Vol. 1, pp. 492-499). IEEE.

[3] Weibo. http://en.wikipedia.org/wiki/Sina_Weibo

[4] Xue, B., Fu, C., & Shaobin, Z. (2014, June). A Study on Sentiment Computing and

Classification of Sina Weibo with Word2vec. In Big Data (BigData Congress),

2014 IEEE International Congress on (pp. 358-363). IEEE.

[5] Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall, Inc.

[6] Text Documents Clustering using K-Means Algorithm. http://www.codeproject.com/Articles/439890/Text-Documents-Clustering-using-K-Means-Algorithm

[7] Tushar Khot, Clustering Twitter Feeds using Word Co-occurrence. CS769 Project Report. http://pages.cs.wisc.edu/~tushar/projects/cs769.pdf

[8] Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering

algorithm. Applied statistics, 100-108.

[9] Han, J., & Kamber, M. (2006). Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann.

[10] Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on

Human Language Technologies, 5(1), 1-167.

[11] Bora, N. N. (2011). Feature Based Sentiment Analysis on Twitter (Doctoral

dissertation, Indian Institute of Technology Guwahati).

[12] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (2008). “Introduction to Information Retrieval,” Cambridge University Press. http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

[13] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani (2010). “SentiWordNet

3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining.”

[14] Srividya Venumbaka (Spring 2013). “An Enhanced Feature-Based Sentiment

Analysis System.” Graduate Project Report. Texas A&M University Corpus

Christi.

[15] The Stanford Natural Language Processing Group. (n.d.) “Stanford log-linear Part-of-Speech Tagger.” http://nlp.stanford.edu/software/tagger.shtml

[16] Twitter4J. (2013). http://twitter4j.org/en/index.html

[17] Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge

University Press.

[18] TF-IDF means. http://www.tfidf.com/

[19] The Stanford Natural Language Processing Group. TF-IDF weighting. http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

[20] Microsoft Visual C#. http://en.wikipedia.org/wiki/Microsoft_Visual_C_Sharp

[21] C#. http://en.wikipedia.org/wiki/C_Sharp_(programming_language)

[22] Java Swing. http://en.wikibooks.org/wiki/Java_Swings

[23] NetBeans IDE. http://en.wikipedia.org/wiki/NetBeans

[24] Kohavi and Provost. (1998). Confusion Matrix. http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html

[25] Wall Street Journal. http://iknow.stpi.narl.org.tw/post/Read.aspx?PostID=9775