Sentiment Analysis of Social Media Content using N-Gram Graphs
Authors: Fotis Aisopos, George Papadakis, Theodora Varvarigou
Presenter: Konstantinos Tserpes, National Technical University of Athens, Greece
International ACM Workshop on Social Media (WSM11)
Social Media and Sentiment Analysis
• Social Networks enable users to:
– Chat about everyday issues
– Exchange political views
– Evaluate services and products
• Useful to estimate the average sentiment for a topic (e.g. for social analysts)
• Sentiments are expressed:
– Implicitly (e.g. through emoticons or specific words)
– Explicitly (e.g. the “Like” button on Facebook)
In this work we focus on content-based patterns for detecting sentiments.
30/11/2011
Intricacies of Social Media Content
Inherent characteristics that render established, language-specific methods inapplicable:
– Sparsity: each message comprises just 140 characters on Twitter
– Multilinguality: many different languages and dialects
– Non-standard vocabulary: informal textual content (i.e., slang) and neologisms (e.g. “gr8” instead of “great”)
– Noise: misspelled words and incorrect use of phrases
Solution: a language-neutral method that is robust to noise
Focus on Twitter
We selected the Twitter micro-blogging service due to:
– Popularity (200 million users, 1 billion posts per week)
– Strict rules of social interaction (i.e., sentiments are expressed through short, self-contained text messages)
– Data publicly available through a handy API
Polarity Classification problem
• Polarity: expression of a non-neutral sentiment
– Polarized tweets: tweets that express either a positive or a negative sentiment (polarity is explicitly denoted by the respective emoticons)
– Neutral tweets: tweets lacking any polarity indicator
• Binary Polarity Classification: decide the polarity of a tweet on a binary scale (i.e., negative or positive)
• General Polarity Classification: decide the polarity of a tweet on a three-class scale (i.e., negative, positive, or neutral)
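The emoticon-based labelling described above can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing: the emoticon sets are hypothetical placeholders, since the slides do not list the exact emoticons used.

```python
# Hypothetical emoticon sets; the paper's exact list is not given in the slides.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", ";("}

def label_tweet(text):
    """Label a tweet as 'positive', 'negative', or 'neutral'
    based on the polarity emoticons it contains."""
    tokens = text.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return "neutral"
```

Tweets carrying emoticons of both polarities fall back to "neutral" here; how the original pipeline handled such conflicts is not stated in the slides.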
Representation Model 1: Term Vector Model
Aggregates the set of distinct words (i.e., tokens) contained in a set of documents. Each tweet ti is then represented as a vector:
vti = (v1, v2, ..., vj), where vj is the TF-IDF value of the j-th term. The same model applies to polarity classes.
Drawbacks:
• It requires language-specific techniques that correctly identify semantically equivalent tokens (e.g., stemming, lemmatization, P-o-S tagging)
• High dimensionality
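The term vector construction can be sketched as follows, assuming raw term frequency times logarithmic inverse document frequency; the exact TF-IDF variant used in the paper is not specified in the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a TF-IDF vector per document over the shared vocabulary.
    Minimal sketch: raw term frequency times log inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # document frequency of each token
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors
```

Note how terms that appear in every document get a zero weight, while the vector length grows with the vocabulary, illustrating the high-dimensionality drawback.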
Representation Model 2: Character n-grams
Each document and polarity class is represented as the set of substrings of length n of the original text.
For n = 2: bigrams; n = 3: trigrams; n = 4: four-grams.
Example: “home phone” consists of the following trigrams: {hom, ome, “me ”, “e p”, “ ph”, pho, hon, one}.
Advantages: language-independent method.
Disadvantages: high dimensionality
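The trigram example above can be reproduced with a one-line extraction helper (a minimal sketch; the original framework's implementation may differ):

```python
def char_ngrams(text, n):
    """Return the list of character n-grams (all substrings of length n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Applied to “home phone” with n = 3, this yields the eight trigrams listed above, spaces included.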
Representation Model 3: Character n-gram graphs
Each document and polarity class is represented as a graph, where:
• the nodes correspond to character n-grams,
• the undirected edges connect neighboring n-grams (i.e., n-grams that co-occur in at least one window of n characters), and
• the weight of an edge denotes the co-occurrence rate of the adjacent n-grams.
Typical value space for n: n=2 (i.e., bigram graphs), n=3 (i.e., trigram graphs), and n=4 (i.e., four-gram graphs).
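Graph construction can be sketched as below. This is an assumption-laden simplification: it links each n-gram to the n-grams starting within the next n characters and uses raw co-occurrence counts as weights, whereas the original framework (the authors' n-gram graph toolkit) may define windows and weights slightly differently.

```python
from collections import defaultdict

def ngram_graph(text, n):
    """Build a character n-gram graph: nodes are n-grams, undirected edges
    link n-grams co-occurring within a window of n characters, weighted by
    their co-occurrence count. Returned as a dict: (gram_a, gram_b) -> weight."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)
    for i, g in enumerate(grams):
        # neighbours: n-grams starting within the next n characters
        for j in range(i + 1, min(i + 1 + n, len(grams))):
            a, b = sorted((g, grams[j]))   # canonical order for undirected edges
            edges[(a, b)] += 1
    return dict(edges)
```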
Example of n-gram graphs.
The phrase “home_phone” is represented as a trigram graph (graph figure not reproduced in this transcript).
Features of the n-gram graphs model
To capture textual patterns, n-gram graphs rely on the following graph similarity metrics (computed between the polarity class graphs and the tweet graphs):
– Containment Similarity (CS): portion of common edges, regardless of their weights
– Size Similarity (SS): ratio of the sizes of the two graphs
– Value Similarity (VS): portion of common edges, taking into account their weights
– Normalized Value Similarity (NVS): value similarity without the effect of the relative graph size (i.e., NVS = VS/SS)
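The four metrics can be sketched directly from the slide's definitions. The normalisations below (dividing by the larger graph's edge count, and scoring each common edge by the ratio of its smaller to larger weight) are plausible assumptions; the paper's exact formulas may differ.

```python
def graph_similarities(g1, g2):
    """Compute CS, SS, VS, NVS between two weighted n-gram graphs,
    each a dict mapping an edge (gram_a, gram_b) to its weight."""
    common = set(g1) & set(g2)
    larger = max(len(g1), len(g2))
    if larger == 0:
        return {"CS": 0.0, "SS": 0.0, "VS": 0.0, "NVS": 0.0}
    cs = len(common) / larger                     # common edges, weights ignored
    ss = min(len(g1), len(g2)) / larger           # size ratio
    # VS: weight agreement of common edges, scaled by the larger graph
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) / larger
    nvs = vs / ss if ss else 0.0                  # VS without the size effect
    return {"CS": cs, "SS": ss, "VS": vs, "NVS": nvs}
```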
Feature Extraction
• Create Gpos, Gneg (and Gneu) by aggregating half of the training tweets of the respective polarity.
• For each tweet of the remaining training set:
– create the tweet n-gram graph Gti
– derive a feature “vector” from the graph comparisons
• The same procedure is applied to the testing tweets.
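The comparison step above can be sketched generically. `extract_features` and `similarity_fn` are hypothetical names introduced here for illustration; the latter is assumed to return a dict of the four graph similarities for one tweet/class pair.

```python
def extract_features(tweet_graph, class_graphs, similarity_fn):
    """Derive a feature vector for one tweet by comparing its n-gram graph
    against each polarity-class graph (e.g. {"pos": Gpos, "neg": Gneg})."""
    features = {}
    for label, class_graph in class_graphs.items():
        for name, value in similarity_fn(tweet_graph, class_graph).items():
            features[f"{name}_{label}"] = value
    return features
```

With two class graphs and four similarities, this yields the small, fixed-size feature set that the efficiency analysis later relies on.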
Discretized Graph Similarities
Discretized similarity values offer higher classification efficiency. They are created by a discretization function dsim (its definition is not reproduced in this transcript).
• Binary classification has three nominal features:
dsim(CSneg, CSpos)
dsim(NVSneg, NVSpos)
dsim(VSneg, VSpos)
• General classification has six more nominal features:
dsim(CSneg, CSneu), dsim(NVSneg, NVSneu), dsim(VSneg, VSneu)
dsim(CSneu, CSpos), dsim(NVSneu, NVSpos), dsim(VSneu, VSpos)
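Since the slides omit the dsim definition, the sketch below is only a plausible assumption: a nominal feature recording which class graph the tweet's graph resembles more. The actual function in the paper may use different labels or thresholds.

```python
def dsim(sim_a, sim_b):
    """Discretize a pair of similarity values into a nominal feature.
    Assumed semantics: report which of the two class graphs is closer."""
    if sim_a > sim_b:
        return "first"
    if sim_b > sim_a:
        return "second"
    return "equal"
```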
Data set
• Initial dataset:
– 475 million real tweets, posted by 17 million users
– polarized tweets:
• 6.12 million negative
• 14.12 million positive
• Data set for Binary Polarity Classification: random selection of 1 million tweets from each polarity category.
• Data set for General Polarity Classification: the above + random selection of 1 million neutral tweets.
Experimental Setup
• 10-fold cross-validation.
• Classification algorithms (default configuration of Weka):
– Naive Bayes Multinomial (NBM)
– C4.5 decision tree classifier
• Effectiveness metric: classification accuracy (correctly_classified_documents / all_documents).
• Frequency threshold for the term vector and n-grams models: only features that appear in at least 1% of all documents were considered.
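The 1% frequency threshold can be sketched as a simple filter over per-document feature sets (`frequent_features` is a hypothetical helper name, not from the paper):

```python
from collections import Counter

def frequent_features(docs_features, threshold=0.01):
    """Keep only features appearing in at least `threshold` of all documents.
    `docs_features` is a list of per-document feature collections."""
    counts = Counter(f for feats in docs_features for f in set(feats))
    n = len(docs_features)
    return {f for f, c in counts.items() if c / n >= threshold}
```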
Evaluation results
• n-grams outperform the Vector Model for n = 3 and n = 4 in all cases (language-neutral, noise-tolerant)
• n-gram graphs:
– low accuracy for NBM, higher values overall for C4.5
– incrementing n by 1 increases performance by 3%-4%
Efficiency Performance Analysis
• n-grams involve by far the largest feature set -> high computational load
• four-grams: fewer features than trigrams (their numerous substrings are rather rare)
• n-gram graphs: significantly fewer features in all cases (<10) -> much higher classification efficiency!
Improvements (work under submission)
• We lowered the frequency threshold to 0.1% for tokens and n-grams, to increase the performance of the term vector and n-grams models (at the cost of even lower efficiency).
• We included in the training stage the tweets that were used for building the polarity classes.
• Outcomes:
– Higher performance for all methods.
– N-gram graphs again outperform all other models.
– Accuracy reaches significantly higher values (>95%).
Thank you!
• SocIoS project: www.sociosproject.eu