Sentiment Analysis of Social Media Content using N-Gram Graphs
Authors: Fotis Aisopos, George Papadakis, Theodora Varvarigou
Presenter: Konstantinos Tserpes, National Technical University of Athens, Greece
International ACM Workshop on Social Media (WSM11)
Social Media and Sentiment Analysis
• Social Networks enable users to:
– Chat about everyday issues
– Exchange political views
– Evaluate services and products
• Useful to estimate the average sentiment for a topic (e.g. for social analysts)
• Sentiments are expressed:
– Implicitly (e.g. through emoticons or specific words)
– Explicitly (e.g. the “Like” button on Facebook)
In this work we focus on content-based patterns for detecting sentiments.
30/11/2011
Intricacies of Social Media Content
Inherent characteristics that render established, language-specific methods inapplicable:
– Sparsity: each message comprises just 140 characters on Twitter
– Multilinguality: many different languages and dialects
– Non-standard vocabulary: informal textual content (i.e., slang) and neologisms (e.g. “gr8” instead of “great”)
– Noise: misspelled words and incorrect use of phrases
Solution: a language-neutral method that is robust to noise
Focus on Twitter
We selected the Twitter micro-blogging service due to:
– Popularity (200 million users, 1 billion posts per week)
– Strict rules of social interaction (i.e., sentiments are expressed through short, self-contained text messages)
– Data publicly available through a handy API
Polarity Classification problem
• Polarity: expression of a non-neutral sentiment
– Polarized tweets: tweets that express either a positive or a negative sentiment (polarity is explicitly denoted by the respective emoticons)
– Neutral tweets: tweets lacking any polarity indicator
• Binary Polarity Classification: decide the polarity of a tweet on a binary scale (i.e., negative or positive)
• General Polarity Classification: decide the polarity of a tweet on a three-class scale (i.e., negative, positive, or neutral)
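The emoticon-based labelling described above can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing: the emoticon sets are hypothetical placeholders, since the slides do not list the exact emoticons used.

```python
# Hypothetical emoticon sets; the paper's exact list is not given in the slides.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", ";("}

def label_tweet(text):
    """Label a tweet as 'positive', 'negative', or 'neutral'
    based on the polarity emoticons it contains."""
    tokens = text.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return "neutral"
```

Tweets carrying emoticons of both polarities fall back to "neutral" here; how the original pipeline handled such conflicts is not stated in the slides.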
Representation Model 1: Term Vector Model
Aggregates the set of distinct words (i.e., tokens) contained in a set of documents. Each tweet ti is then represented as a vector:
vti = (v1, v2, ..., vj), where vj is the TF-IDF value of the j-th term. The same model applies to polarity classes.
Drawbacks:
• It requires language-specific techniques that correctly identify semantically equivalent tokens (e.g., stemming, lemmatization, P-o-S tagging)
• High dimensionality
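The term vector construction can be sketched as follows, assuming raw term frequency times logarithmic inverse document frequency; the exact TF-IDF variant used in the paper is not specified in the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a TF-IDF vector per document over the shared vocabulary.
    Minimal sketch: raw term frequency times log inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # document frequency of each token
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors
```

Note how terms that appear in every document get a zero weight, while the vector length grows with the vocabulary, illustrating the high-dimensionality drawback.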
Representation Model 2: Character n-grams
Each document and polarity class is represented as the set of substrings of length n of the original text.
For n = 2: bigrams; n = 3: trigrams; n = 4: four-grams.
Example: “home phone” consists of the following trigrams: {hom, ome, “me ”, “e p”, “ ph”, pho, hon, one}.
Advantages: language-independent method.
Disadvantages: high dimensionality
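The trigram example above can be reproduced with a one-line extraction helper (a minimal sketch; the original framework's implementation may differ):

```python
def char_ngrams(text, n):
    """Return the list of character n-grams (all substrings of length n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Applied to “home phone” with n = 3, this yields the eight trigrams listed above, spaces included.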
Representation Model 3: Character n-gram graphs
Each document and polarity class is represented as a graph, where:
• the nodes correspond to character n-grams,
• the undirected edges connect neighboring n-grams (i.e., n-grams that co-occur in at least one window of n characters), and
• the weight of an edge denotes the co-occurrence rate of the adjacent n-grams.
Typical value space for n: n=2 (i.e., bigram graphs), n=3 (i.e., trigram graphs), and n=4 (i.e., four-gram graphs).
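Graph construction can be sketched as below. This is an assumption-laden simplification: it links each n-gram to the n-grams starting within the next n characters and uses raw co-occurrence counts as weights, whereas the original framework (the authors' n-gram graph toolkit) may define windows and weights slightly differently.

```python
from collections import defaultdict

def ngram_graph(text, n):
    """Build a character n-gram graph: nodes are n-grams, undirected edges
    link n-grams co-occurring within a window of n characters, weighted by
    their co-occurrence count. Returned as a dict: (gram_a, gram_b) -> weight."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)
    for i, g in enumerate(grams):
        # neighbours: n-grams starting within the next n characters
        for j in range(i + 1, min(i + 1 + n, len(grams))):
            a, b = sorted((g, grams[j]))   # canonical order for undirected edges
            edges[(a, b)] += 1
    return dict(edges)
```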
Example of n-gram graphs.
The phrase “home_phone” is represented as a trigram graph (graph figure not reproduced in this transcript).
Features of the n-gram graphs model
To capture textual patterns, n-gram graphs rely on the following graph similarity metrics (computed between the polarity class graphs and the tweet graphs):
– Containment Similarity (CS): portion of common edges, regardless of their weights
– Size Similarity (SS): ratio of the sizes of the two graphs
– Value Similarity (VS): portion of common edges, taking into account their weights
– Normalized Value Similarity (NVS): value similarity without the effect of the relative graph size (i.e., NVS = VS/SS)
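The four metrics can be sketched directly from the slide's definitions. The normalisations below (dividing by the larger graph's edge count, and scoring each common edge by the ratio of its smaller to larger weight) are plausible assumptions; the paper's exact formulas may differ.

```python
def graph_similarities(g1, g2):
    """Compute CS, SS, VS, NVS between two weighted n-gram graphs,
    each a dict mapping an edge (gram_a, gram_b) to its weight."""
    common = set(g1) & set(g2)
    larger = max(len(g1), len(g2))
    if larger == 0:
        return {"CS": 0.0, "SS": 0.0, "VS": 0.0, "NVS": 0.0}
    cs = len(common) / larger                     # common edges, weights ignored
    ss = min(len(g1), len(g2)) / larger           # size ratio
    # VS: weight agreement of common edges, scaled by the larger graph
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) / larger
    nvs = vs / ss if ss else 0.0                  # VS without the size effect
    return {"CS": cs, "SS": ss, "VS": vs, "NVS": nvs}
```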
Feature Extraction
• Create Gpos, Gneg (and Gneu) by aggregating half of the training tweets of the respective polarity.
• For each tweet of the remaining training set:
– create the tweet n-gram graph Gti
– derive a feature “vector” from the graph comparisons
• The same procedure is applied to the testing tweets.
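The comparison step above can be sketched generically. `extract_features` and `similarity_fn` are hypothetical names introduced here for illustration; the latter is assumed to return a dict of the four graph similarities for one tweet/class pair.

```python
def extract_features(tweet_graph, class_graphs, similarity_fn):
    """Derive a feature vector for one tweet by comparing its n-gram graph
    against each polarity-class graph (e.g. {"pos": Gpos, "neg": Gneg})."""
    features = {}
    for label, class_graph in class_graphs.items():
        for name, value in similarity_fn(tweet_graph, class_graph).items():
            features[f"{name}_{label}"] = value
    return features
```

With two class graphs and four similarities, this yields the small, fixed-size feature set that the efficiency analysis later relies on.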
Discretized Graph Similarities
Discretized similarity values offer higher classification efficiency. They are created by a discretization function dsim (its definition is not reproduced in this transcript).
• Binary classification has three nominal features:
dsim(CSneg, CSpos)
dsim(NVSneg, NVSpos)
dsim(VSneg, VSpos)
• General classification has six more nominal features:
dsim(CSneg, CSneu), dsim(NVSneg, NVSneu), dsim(VSneg, VSneu)
dsim(CSneu, CSpos), dsim(NVSneu, NVSpos), dsim(VSneu, VSpos)
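Since the slides omit the dsim definition, the sketch below is only a plausible assumption: a nominal feature recording which class graph the tweet's graph resembles more. The actual function in the paper may use different labels or thresholds.

```python
def dsim(sim_a, sim_b):
    """Discretize a pair of similarity values into a nominal feature.
    Assumed semantics: report which of the two class graphs is closer."""
    if sim_a > sim_b:
        return "first"
    if sim_b > sim_a:
        return "second"
    return "equal"
```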
Data set
• Initial dataset:
– 475 million real tweets, posted by 17 million users
– polarized tweets:
• 6.12 million negative
• 14.12 million positive
• Data set for Binary Polarity Classification: random selection of 1 million tweets from each polarity category.
• Data set for General Polarity Classification: the above + random selection of 1 million neutral tweets.
Experimental Setup
• 10-fold cross-validation.
• Classification algorithms (default configuration of Weka):
– Naive Bayes Multinomial (NBM)
– C4.5 decision tree classifier
• Effectiveness metric: classification accuracy (correctly_classified_documents / all_documents).
• Frequency threshold for the term vector and n-grams models: only features that appear in at least 1% of all documents were considered.
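The 1% frequency threshold can be sketched as a simple filter over per-document feature sets (`frequent_features` is a hypothetical helper name, not from the paper):

```python
from collections import Counter

def frequent_features(docs_features, threshold=0.01):
    """Keep only features appearing in at least `threshold` of all documents.
    `docs_features` is a list of per-document feature collections."""
    counts = Counter(f for feats in docs_features for f in set(feats))
    n = len(docs_features)
    return {f for f, c in counts.items() if c / n >= threshold}
```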
Evaluation results
• n-grams outperform the Vector Model for n = 3 and n = 4 in all cases (language-neutral, noise-tolerant)
• n-gram graphs:
– low accuracy for NBM, higher values overall for C4.5
– incrementing n by 1 increases performance by 3%-4%
Efficiency Performance Analysis
• n-grams involve by far the largest feature set -> high computational load
• four-grams: fewer features than trigrams (their numerous substrings are rather rare)
• n-gram graphs: significantly fewer features in all cases (<10) -> much higher classification efficiency!
Improvements (work under submission)
• We lowered the frequency threshold to 0.1% for tokens and n-grams, to increase the performance of the term vector and n-grams models (at the cost of even lower efficiency).
• We included in the training stage the tweets that were used for building the polarity classes.
• Outcomes:
– Higher performance for all methods.
– N-gram graphs again outperform all other models.
– Accuracy reaches significantly higher values (>95%).
Thank you!
• SocIoS project: www.sociosproject.eu