Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani. Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom. 1st Workshop on Emotion and Sentiment in Social and Expressive Media.

description

Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the public's feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter, a small set of evaluation datasets has been released in the last few years. In this paper we present an overview of eight publicly available and manually annotated evaluation datasets for Twitter sentiment analysis. Based on this review, we show that a common limitation of most of these datasets, when assessing sentiment analysis at the target (entity) level, is the lack of distinct sentiment annotations for the tweets and for the entities contained in them. For example, the tweet “I love iPhone, but I hate iPad” can be annotated with a mixed sentiment label, but the entity iPhone within this tweet should be annotated with a positive sentiment label. Aiming to overcome this limitation, and to complement current evaluation datasets, we present STS-Gold, a new evaluation dataset where tweets and targets (entities) are annotated individually and may therefore carry different sentiment labels. This paper also provides a comparative study of the various datasets along several dimensions, including total number of tweets, vocabulary size and sparsity. We also investigate the pair-wise correlation among these dimensions, as well as their correlation with the sentiment classification performance on the different datasets.

Transcript of Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Page 1: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani

Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom

1st Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and perspectives from AI

Page 2: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Outline

• Definition & Background

• Evaluation Datasets for Twitter Sentiment Analysis

• STS-Gold

• Comparative Study

• Conclusion

Page 3: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Sentiment Analysis – Definition

“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”

• Positive: “The main dish was delicious”

• Neutral: “It is a Syrian dish”

• Negative: “The main dish was salty and horrible”

Page 4: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Twitter Sentiment Analysis (Background)

• Sentiment Approaches: Supervised, Unsupervised, Hybrid

• Sentiment Tasks: Subjectivity, Polarity, Sentiment Strength, Emotion/Mood

• Sentiment Levels: Tweet-level, Phrase-level, Entity-level

Page 5: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Evaluation Datasets for Twitter Sentiment Analysis

Each dataset is described along the following dimensions:

• SA Task

• SA Level

• No. of Tweets

• Vocabulary Size

• Class Distribution

• Sparsity

• Construction & Annotation

Page 6: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Evaluation Datasets – Overview

Dataset | SA Level | SA Task | Annotation/Agreement
Stanford Twitter Corpus (STS) | Tweet | Subjectivity | Manual/UD
Health Care Reform (HCR) | Tweet/Target | Subjectivity | Manual/UD
Obama-McCain Debate (OMD) | Tweet | Polarity* | Manual/α=0.655
Sentiment Strength Twitter Dataset (SS-Tweet) | Tweet | Strength/Subjectivity** | Manual/α≈0.56
Sanders Twitter Dataset | Tweet | Subjectivity | Manual/UD
Dialogue Earth Twitter Corpus (WAB, GASP) | Tweet/Target | Subjectivity | Manual/UD
SemEval-2013 Dataset | Tweet/Expression | Subjectivity | Manual/UD

Page 7: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

What is Missing?

• Details about the annotation methodology (STS, HCR, Sanders)

• Entity-level sentiment evaluation:

– Most works focus on assessing the performance of sentiment classifiers at the tweet level (STS, OMD, SS-Tweet, Sanders)

– Datasets that allow sentiment evaluation at the entity level assign similar sentiment labels to the tweet and the entities within it (HCR, WAB, GASP)

Page 8: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

STS-Gold

• Enables evaluation at both the entity and tweet levels

• Tweets and entities are annotated independently

• Contains 58 Entities & 3000 Tweets
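To make the idea of independent tweet and entity annotation concrete, here is a minimal sketch of how one annotated record could be represented. The class name `AnnotatedTweet`, its field names and the lowercase label strings are illustrative assumptions, not the actual STS-Gold file format; the five labels mirror the sentiment classes named in the annotation slide, and the iPad label is inferred from the wording of the example tweet in the abstract.

```python
from dataclasses import dataclass, field

# Sentiment classes used in the annotation (lowercased here by assumption).
LABELS = {"positive", "negative", "neutral", "mixed", "other"}


@dataclass
class AnnotatedTweet:
    """Hypothetical record: one tweet label plus one label per entity."""
    text: str
    tweet_label: str
    entity_labels: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject labels outside the annotation scheme.
        for label in [self.tweet_label, *self.entity_labels.values()]:
            if label not in LABELS:
                raise ValueError(f"unknown sentiment label: {label!r}")


# The example tweet from the abstract: the tweet as a whole is mixed,
# while each entity carries its own label.
example = AnnotatedTweet(
    text="I love iPhone, but I hate iPad",
    tweet_label="mixed",
    entity_labels={"iPhone": "positive", "iPad": "negative"},
)
```

Keeping the entity labels in a separate mapping means an entity-level classifier can be scored against `entity_labels` without the tweet-level label leaking into the evaluation.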

Page 9: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

STS-Gold: Data Collection

• STS Corpus: select 180K tweets

• Entity extraction (Alchemy API)

• Identify frequent concepts

• Select top and mid frequent entities: 28 entities

• Select 100 tweets per entity: 2800 tweets

• +200 tweets: 3000 tweets

• Entity extraction: 147 entities
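The selection steps above (count entity frequencies, then sample a fixed number of tweets per chosen entity) can be sketched roughly as follows. This is only an illustration under stated assumptions: `extract_entities` is a hypothetical callable standing in for whatever entity extractor is used (the slides name Alchemy API, which is not called here), `keyword_extractor` is a toy stand-in, and the slides do not say how the top and mid frequent entities were picked, so that choice is left to the caller.

```python
import random
from collections import Counter, defaultdict


def entity_frequencies(tweets, extract_entities):
    """Count in how many tweets each extracted entity appears."""
    counts = Counter()
    for tweet in tweets:
        counts.update(set(extract_entities(tweet)))
    return counts


def sample_per_entity(tweets, extract_entities, entities, per_entity=100, seed=0):
    """Randomly pick up to `per_entity` tweets for each chosen entity."""
    rng = random.Random(seed)
    wanted = set(entities)
    by_entity = defaultdict(list)
    for tweet in tweets:
        for entity in set(extract_entities(tweet)) & wanted:
            by_entity[entity].append(tweet)
    return {
        entity: rng.sample(by_entity[entity], min(per_entity, len(by_entity[entity])))
        for entity in entities
    }


def keyword_extractor(tweet):
    """Toy stand-in for a real entity extractor (hypothetical)."""
    known = {"iphone": "iPhone", "obama": "Obama"}
    lowered = tweet.lower()
    return [name for key, name in known.items() if key in lowered]


toy_tweets = ["I love my iPhone", "Obama speaks today", "new iPhone case", "nothing here"]
freqs = entity_frequencies(toy_tweets, keyword_extractor)
picked = sample_per_entity(toy_tweets, keyword_extractor, ["iPhone", "Obama"], per_entity=1)
```

With `per_entity=100` and 28 chosen entities, this kind of sampling yields at most 2800 tweets, which matches the count on the slide.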

Page 10: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

STS-Gold: Entity Categories

• Person: Taylor Swift, Oprah, LeBron, Obama

• Country: US, Brazil, England

• Company: YouTube, Starbucks, McDonalds, Facebook

• City: Vegas, Sydney, Seattle, London

• Organization: Cavas, NASA, UN, Lakers

• Health Condition: Flu, Cancer, Fever, Headache

• Technology: iPod, Xbox, PSP, iPhone

Page 11: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

STS-Gold: Data Annotation

• Annotation tool: Tweenator.com

• Annotated data: 3000 Tweets, 147 Entities

• Sentiment classes: Positive, Negative, Neutral, Mixed, Other

• Inter-annotation agreement: Tweet α=0.765; Entity α1=0.416, α2=0.964

• After filtering: 2205 Tweets, 58 Entities
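The agreement figures above are reported as α. The slides do not name the coefficient, so the following sketch assumes it is Krippendorff's alpha for nominal labels; the function name and the input layout (one list of labels per annotated item) are illustrative choices.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of label lists, one inner list per annotated item,
    holding the labels given by the annotators who rated that item.
    """
    # Coincidence matrix: each ordered pair of labels from different
    # annotators of the same item contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so there is no disagreement
    return 1.0 - observed / expected
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement.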

Page 12: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Comparative Study

• Vocabulary Size

• Number of Tweets

• Data Sparsity

• Classification Performance

– Polarity Classification

– Naïve Bayes & Maximum Entropy

Page 13: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Comparative Study.1

Vocabulary Size vs. No. of Tweets

- There is a high correlation between the vocabulary size and the number of tweets (ρ = 0.95)

- However, increasing the number of tweets does not always lead to increasing the vocabulary size. (OMD)
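A correlation like the one above can be computed from per-dataset statistics. The sketch below assumes ρ is a Pearson correlation coefficient (the slides do not say which coefficient is used) and uses plain whitespace tokenisation with lowercasing to count vocabulary; both are simplifying assumptions, and the function names are illustrative.

```python
import math


def corpus_stats(tweets):
    """Return (number of tweets, vocabulary size) for a list of tweet strings."""
    vocabulary = set()
    for tweet in tweets:
        vocabulary.update(tweet.lower().split())
    return len(tweets), len(vocabulary)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In use, `corpus_stats` would be called once per dataset, and the resulting lists of tweet counts and vocabulary sizes passed to `pearson`.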

Page 14: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Comparative Study.2

Data Sparsity

- Twitter datasets are generally very sparse

- Increasing either the number of tweets or the vocabulary size increases the sparsity degree of the dataset:

- ρ_no_of_tweets = 0.71

- ρ_vocabulary_size = 0.77
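The slides do not give the formula for the sparsity degree. One common reading, assumed here, is the fraction of empty cells in the tweet-by-term matrix: one minus the number of (tweet, distinct word) pairs divided by the number of tweets times the vocabulary size. The tokenisation (lowercased whitespace split) and the function name are likewise assumptions.

```python
def sparsity_degree(tweets):
    """Fraction of zero cells in the binary tweet-by-term matrix."""
    tokenized = [set(tweet.lower().split()) for tweet in tweets]
    vocabulary = set().union(*tokenized)
    if not tokenized or not vocabulary:
        raise ValueError("need at least one non-empty tweet")
    # Each distinct word in a tweet fills exactly one cell of its row.
    nonzero_cells = sum(len(tokens) for tokens in tokenized)
    total_cells = len(tokenized) * len(vocabulary)
    return 1.0 - nonzero_cells / total_cells
```

Because tweets are short, each row fills only a handful of cells while every new tweet or new word enlarges the matrix, which is consistent with the positive correlations reported above.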

Page 15: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Comparative Study.3

Classification Performance vs. Dataset Sparsity (1)

According to Makrehchi et al. (2008) and Saif et al. (2012), the classification performance and the sparsity degree of a given dataset are negatively correlated, i.e., increasing the dataset sparsity hinders the classification performance.

Page 16: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Comparative Study.3

Classification Performance vs. Dataset Sparsity (2)

- There is no correlation between the classification performance and the sparsity degree across the datasets (ρ_acc = −0.06, ρ_f1 = 0.23)

- The sparsity-performance correlation is intrinsic, meaning that it might exist within a dataset itself, but not necessarily across datasets.

Page 17: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Conclusion

• Current datasets to evaluate Twitter sentiment classifiers:

– Focus on the tweet level.

– Assign similar sentiment labels to the tweets and the entities within them.

• STS-Gold allows for sentiment evaluation at both the tweet and the entity levels.

• A correlation between the vocabulary size and the number of tweets does not always exist.

• The sparsity-performance correlation is intrinsic, i.e., it only exists within the dataset itself, but not across the different datasets.

Page 18: Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold

Thank You

Email: [email protected]

hrsaif

Website: tweenator.com