UNIVERSIDAD DE ALICANTE
FACULTAD DE CIENCIAS ECONÓMICAS Y EMPRESARIALES
DEGREE IN ECONOMICS
Academic year 2017-2018
USING SOCIAL MEDIA TO MEASURE THE CONSUMER CONFIDENCE:
THE TWITTER CASE IN SPAIN
Manuel García Corbí
Pedro Albarrán Pérez
Departamento de Fundamentos del Análisis Económico
Alicante, May 2018
The Twitter case in Spain University of Alicante
Resumen
La finalidad de este proyecto es recopilar tuits, analizar su sentimiento, crear un índice a
partir de esa información y comprobar si ésta es útil para predecir la confianza del
consumidor.
Para crear dicho índice, se sigue un proceso de minería de opinión. En primer lugar, se
estudia la precisión de los métodos de análisis para escoger el más preciso, clasificar los
mensajes y obtener su sentimiento. El resultado es una serie temporal mensual,
comprendida entre 2012 y 2017, que se denominará “Índice Español de Sentimiento en
Twitter” (IEST).
Finalmente, se comprueba si la información obtenida puede ser útil para predecir el
índice de confianza del consumidor. Los resultados indican que ambos índices tienen
correlación positiva (r = 0,81), pero con diferente comportamiento en dos períodos
diferenciados, lo que podría implicar un cambio estructural en la serie.
Abstract
The goal of this project is to collect tweets, analyse their sentiment, build an index from
that information and check whether it is useful for predicting consumer confidence.
To build the index, an opinion mining process is followed. First, the accuracy of different
analysis methods is studied in order to choose the best-performing one. Next, sentiment
classification is applied to a cleaned dataset of tweets. The result is a monthly time
series covering 2012 to 2017, called the “Spanish Twitter Sentiment Index” (STSI).
The final step is to check whether this information can be useful for predicting the
consumer confidence index. The results suggest that the two indexes are positively
correlated (r = 0.81), but behave differently in two distinct periods, which could indicate
a structural break in the time series.
Keywords: Social Media, Big Data, Machine Learning, Sentiment Analysis, Official
Statistics.
Index
1. Introduction
2. Literature review
   a. Big Data
   b. Social Media
3. Data sources
   a. Tweets Dataset
   b. Consumer Confidence Dataset
4. Opinion mining
   a. Methodology
   b. Valence Lexicon
   c. Supervised Machine Learning
      i. Algorithms description
         1. Naive Bayes
         2. Support Vector Machines (SVM)
         3. Regularized Logistic Regression
      ii. Model validation
   d. Accuracy comparison and model selection
   e. Sentiment classification
5. Building the Time Series
   a. Calendar effect correction
   b. Aggregating the data
   c. Filtering the volatility
6. Results
   a. Correlation Analysis
   b. Structural break
7. Conclusions
8. Appendix
9. Acknowledgments
10. References
1. Introduction
The purpose of this project is to build an index that reflects consumer confidence, as an
alternative indicator to the official statistics. To do so, I use the information contained
in the messages of Spanish Twitter users, through a process of opinion mining. I then check
whether the index is useful for predicting the consumer confidence index in Spain, in order
to assess whether the goal of the project has been accomplished.
The consumer confidence index is an important indicator for policy makers, central banks,
investors, manufacturing companies and marketing researchers, among others, because it
helps them to evaluate demand and make decisions. For these reasons, it may be worthwhile
to create an indicator that uses social media as its source of information, in order to
save money and supplement the official statistics.
The link between the information found in social media and the official statistics is that
both reflect the same emotion. This idea comes from the Appraisal-Tendency Framework of
Han et al. (2007), which basically says that human beings have two emotions concerning
consumption decisions, called the integral and the incidental emotion. The difference is
that the incidental emotion reflects the “intention” to buy a product, whereas the integral
emotion reflects the “final decision” to purchase it. The consumer confidence survey
consists of questions about the “intention to buy something”, so it reflects the incidental
emotion. The same emotion is reflected in the messages of active social media users, as
found by Daas & Puts (2014).
To create this indicator, an opinion mining process is carried out. It consists in
extracting the incidental emotion from social media messages, in this case by measuring the
sentiment of the tweets. This process requires Twitter messages and a sentiment classifier.
The tweets dataset is not available on the internet, but it can be created. It is built
from a list of Twitter users obtained through the Twitter API, and the period chosen for
downloading the tweets is 2012 to 2017. This period was chosen because it contains events
that make it interesting and that may have affected consumer confidence. In 2012, Spain
suffered debt downgrades and the highest risk premium in its recent history.
It is also a period with trend changes in some economic indicators, for instance private
consumption or the unemployment rate, as shown in the Bank of Spain article “The Recovery
of the Spanish Economy” by Hernández de Cos (2018: p. 2-10).
Furthermore, there are limitations on collecting social media data from before this date:
the older the tweets are, the more likely it is that their authors have deleted them.
In addition, it is necessary to extract from the tweets the sentiment that reflects the
incidental emotion. This task requires a classifier, which is chosen through a selection
process between two approaches: the valence lexicon dictionary and supervised machine
learning methods. Comparing the accuracy of different classification methods is a common
task in an opinion mining process, because some factors can affect the accuracy of the
classifiers, for instance the quality of the data used to train the algorithms.
Once the tweets dataset and the classifier have been created, they are used to build the
sentiment index. The result is a monthly time series for the period mentioned above.
Finally, I check whether it reflects the same information as the consumer confidence index
(ICC), using Pearson’s correlation as a measure of the predictive power of the proposed
indicator.
The results suggest that this information can be found in social media and that, despite
the limitations of the project, the sentiment index performs quite well as a predictor of
consumer confidence, with a high correlation (r = 0.81). However, there are some clear
divergences, which might be explained by a structural break in the sentiment time series.
In the following section, I discuss the potential of Big Data as a source of information
and, in particular, why social media can be an alternative source of data to supplement the
official statistics.
2. Literature review
a. Big Data
Big Data is a source of information that consists of large datasets and needs a specific
infrastructure and analysis methods to be processed.
During the last decade, traditional data management methods have been pushed to their
limits by, among other things, the effects of e-commerce. Business activity rose thanks to
the sharp increase in sales and in the speed of trade transactions, producing a volume of
data that is difficult to process. Recent improvements in technology infrastructure have
expanded this volume of information even further: new generations of smaller and more
energy-efficient processors, high-volume storage devices, cloud storage services, invisible
nanosensors and faster networks such as Bluetooth 3.0, WiFi or 4G. According to an update
of the data traffic forecast for the period 2016-2021, published by Cisco (2017) in its
Visual Networking Index, “Global mobile data traffic reached 7.2 exabytes per month at
the end of 2016” (p.1).
Furthermore, the market penetration of intelligent devices, the consolidation of
e-commerce and the use of social media are generating more and more data every day
(Figure 1). Stored data and data traffic are also growing faster every year.
Figure 1: Annual size of global stored data and 2025 forecast. Source: Reinsel, et al. white paper (2017).
This progressive rise in data and the improvement in technology have allowed new sources of
information to appear: business apps, public repositories, social media and sensors.
However, these sources of information are not useful without the proper storage and
analysis techniques. Labrinidis & Jagadish (2012) suggest five stages in the process of
extracting valuable information from Big Data.
There are certainly advantages to Big Data. Laney (2001) defined it with the 3Vs concept
(volume, variety and velocity): there is a large amount of information available, and it is
fast and cheap to obtain. Companies and governments have an economic interest in Big Data,
based on targeting and understanding the behaviour of their customers or citizens. It can
also be used to understand and optimise business processes, or even to improve the accuracy
of official statistics (Daas, et al. 2014). Finally, it is used in the financial sector,
where trading algorithms are common (Marr, 2013).
However, Big Data also has some problems to consider. Data sets from Big Data sources
already exist rather than being designed by the analyst, so they must be understood before
they are analysed (Hassani et al. 2014). A typical task in Big Data is aggregating data, in
which hidden missing data can be replaced with incorrect values. Also, there is no random
sampling in Big Data, so the information may come from particular subsets of the
population, which creates selectivity problems. Furthermore, the frequency of the incoming
data causes volatility, although a possible solution is to apply filtering techniques to
the results, e.g. a Kalman filter or moving averages. Other problems are legal: social
media data, for instance, is subject to terms and conditions on how developers may use the
information. Storage, data management and acquisition can also be costly in the long term.
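As a rough illustration of the filtering idea mentioned above, the following sketch smooths a
volatile series with a trailing moving average. The function name `moving_average`, the
window length and the sample values are assumptions made for this example; they are not
taken from the project.

```python
def moving_average(series, window=3):
    """Trailing moving average.

    The first window-1 points are averaged over the values available so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        chunk = series[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical noisy values (illustrative only, not project data).
noisy = [1.0, 5.0, 3.0, 7.0, 5.0]
print(moving_average(noisy, window=3))  # [1.0, 3.0, 3.0, 5.0, 5.0]
```

A longer window gives a smoother series but reacts more slowly to real changes, so the
window length is a trade-off rather than a fixed rule.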
Finally, high-performance computing hardware and techniques are needed to analyse such
amounts of data at once or even as a stream (Daas et al. 2015: p. 256-257).
b. Social Media
Social media are defined as internet platforms where users exchange personal opinions about
certain topics, using text messages. They have become an important part of public opinion.
This information is posted in comments, likes, etc., which are actions that people perform
every day, mainly through their mobile phones. In Spain, 95% of the population that uses a
mobile phone accesses the internet and social media via this device (Ditrendia 2017: p.4;
Elogia 2017: p.4). Furthermore, around 40% of the Spanish population uses social media,
and this share is increasing every year (Table 1).
            2014          2015          2016          2017
Facebook    20 million    22 million    24 million    23 million
Twitter     3.5 million   4.4 million   4.5 million   4.9 million
Instagram   -             7.4 million   9.6 million   13 million
Table 1: Evolution of the number of users. Source: The social media family, social media report (2017).
Accordingly, social media usage is expanding and can be used as a source of information,
and many studies confirm its value. For instance, some have been carried out to improve
official statistics (Daas et al. 2013), to find relations with consumption indicators
(Brakel, et al. 2016) or to predict unemployment rates (Llorente et al. 2015). The main
reasons for using it are that it can reduce survey costs and allow information to be
released faster.
3. Data sources
There are two main data sources in this project. The first is the tweets dataset, which
consists of Twitter messages from Spanish users within the period 2012-2017 and is used to
create the sentiment index. To build this dataset, a list of Spanish Twitter users must
first be created in order to get the tweets. With this list as a reference, all the tweets
in the selected period are downloaded and then cleaned, filtered and prepared for sentiment
classification. The second dataset is the consumer confidence index, the official Spanish
statistic produced by the Research Center of Sociology, which is compared with the
sentiment index to check whether the latter reflects consumer confidence.
All the processes used to create the datasets, such as manipulation, transformation or
cleansing, are carried out with the R programming language, Python, Shell (Bash commands)
or Microsoft Excel as supporting platforms. The datasets are stored as compressed R
objects, as well as in .csv, .xlsx or .json formats, as convenient.
The code written to create these datasets can be found in the following GitHub
repository: (https://github.com/manugaco/SpanishSocialMediaIndex/tree/master).
Because Twitter information is privacy-sensitive for its users, the recommended procedure
in the Twitter privacy guide is followed. The guide can be found at:
(https://developer.twitter.com)
a. Tweets Dataset
This is the main source of information in the project: the reference dataset from which the
sentiment is extracted from the social media platform, Twitter in this case. Other sources
of information are also needed. The process is shown in Figure 2.
Figure 2: Process to build the tweets dataset. Source: prepared by the author.
First of all, a list of Spanish Twitter users is needed as a reference for where to extract
the tweets. Unfortunately, no such list is available on the internet. However, it can be
created using the Twitter API, subject to technical restrictions. Only fifteen server calls
are allowed each hour, with a maximum download of 5000 friends/followers per call;
otherwise the server breaks the connection and all the information is lost. To avoid this
problem, only users with fewer than 75000 and more than 5000 friends/followers are
selected.
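One possible reading of these thresholds is that 15 calls of 5000 entries each give 75000
entries, the most that can be fetched for a single user within one rate-limit window. The
sketch below only works through that arithmetic; the constant and function names are
hypothetical, and the limits are the ones quoted in the text, not values checked against
the current Twitter API.

```python
import math

CALLS_PER_WINDOW = 15   # calls allowed per window, as stated in the text
IDS_PER_CALL = 5000     # friends/followers returned per call, as stated in the text
MAX_PER_WINDOW = CALLS_PER_WINDOW * IDS_PER_CALL  # 75000


def calls_needed(n_connections):
    """Number of API calls needed to list all connections of one user."""
    return math.ceil(n_connections / IDS_PER_CALL)


def is_eligible(n_connections, lower=5000, upper=MAX_PER_WINDOW):
    """Keep users with more than `lower` and fewer than `upper` connections."""
    return lower < n_connections < upper


print(MAX_PER_WINDOW)        # 75000
print(calls_needed(60000))   # 12
print(is_eligible(60000))    # True
print(is_eligible(120000))   # False
```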
According to the findings of Morales, et al. (2010), the large majority of Twitter users
follow a small group of highly active members (influencers). Querying the followers and
friends of these users is therefore a good starting point for building a list of users
within a region. For this purpose, a list of the most followed personalities in politics
and economic activities on Twitter in Spain, found on the web page of the Spanish marketing
company blademedia (2018), is used as the preliminary list, called “Spanish most followed
users” in Figure 2.
An iterative loop called the “users loop” is applied to this list. It consists in getting
the friends/followers of the most followed users to make a new list, merging both lists and
deleting the duplicate entries. The process is then repeated on the new list, as shown in
Figure 3.
Figure 3: Iterative loop to get the users dataset. Source: prepared by the author.
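The users loop can be sketched as a snowball expansion over a set of user identifiers. This
is only an illustration under stated assumptions: `fetch_connections` is a hypothetical
callable standing in for the Twitter API requests, `max_rounds` is an invented stopping
rule, and the toy graph replaces real data.

```python
def expand_users(seed_users, fetch_connections, max_rounds=2):
    """Snowball expansion: repeatedly add the connections of newly found users.

    `fetch_connections(user)` is a hypothetical function returning the
    friends/followers of `user`; using sets removes duplicate entries.
    """
    known = set(seed_users)
    frontier = set(seed_users)
    for _ in range(max_rounds):
        discovered = set()
        for user in frontier:
            discovered.update(fetch_connections(user))
        frontier = discovered - known   # only users not seen before
        if not frontier:
            break
        known |= frontier
    return known


# Toy in-memory graph standing in for the API (made-up data).
toy_graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["e"], "e": []}
users = expand_users(["a"], lambda u: toy_graph.get(u, []), max_rounds=2)
print(sorted(users))  # ['a', 'b', 'c', 'd']
```

Tracking only the newly discovered users in `frontier` avoids querying the same account
twice, which matters when every query counts against a rate limit.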
The resulting dataset contains around 1.4 million Twitter users, with information about
each user, for instance location or language.
This dataset can include users from other countries, so it must be properly filtered. To
perform this task, a list with the names of the municipalities of Spain and their regional
translations (Instituto Nacional de Estadística 2018), called the “Location Filter list”,
is used (Figure 2). It contains 8,064 names of municipalities, capitals, counties and their
regional translations. Basically, the filtering task consists in comparing the location
column of the Twitter users dataset with the location filter list and keeping those users
whose location matches.
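A minimal sketch of this kind of location matching is given below. The normalisation rules
(lower-casing, accent stripping, splitting on commas), the field name `location` and the
sample records are assumptions for illustration, not the project's actual implementation.

```python
import unicodedata


def normalize(text):
    """Lower-case, strip accents and surrounding whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.lower().strip()


def filter_by_location(users, place_names):
    """Keep users whose free-text location contains a known place name."""
    places = {normalize(p) for p in place_names}
    kept = []
    for user in users:
        # Compare each comma-separated part of the location field.
        parts = [normalize(p) for p in user.get("location", "").split(",")]
        if any(part in places for part in parts):
            kept.append(user)
    return kept


# Made-up place list and user records.
places = ["Alicante", "Alacant", "Guadalajara"]
users = [
    {"name": "u1", "location": "Alicante, España"},
    {"name": "u2", "location": "in the land of the living"},
    {"name": "u3", "location": "Guadalajara, México"},
    {"name": "u4", "location": ""},
]
print([u["name"] for u in filter_by_location(users, places)])  # ['u1', 'u3']
```

In this toy run `u3` is kept even though the location is in Mexico, because the place name
also exists in Spain. A name-only match cannot tell the two apart.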
There are some problems with this method. For instance, Spanish users with no location
information or with fake locations such as “in the land of the living” are not selected.
Also, some names in the location list coincide with foreign places, for instance
“Guadalajara, México”. Because of this selectivity problem, the results may be biased.
However, since the advantages of Big Data come from massive amounts of information, these
negative effects can be considered to cancel out, as mentioned in the work of Daas, et al.
(2015).
After applying this filter, the original list of 1.4 million users is reduced to around
600,000 users, who can be assumed to be Spanish. Downloading all the tweets from this list
of users is not possible within the time available for this project, because of download
speed and processing limitations. To avoid this problem, a random sample of 240,000 users
is drawn, resulting in the final dataset of “Twitter Spanish Users” (Figure 2).
From the resulting list of users, the tweets in the selected period are obtained. Because
of the storage and time limitations of the project, it is not possible to get the tweets
from every day in the period 2012-2017 either.
Fortunately, thanks to the findings of Daas & Puts (2014), on which this project is based,
it is possible to select some days of the month instead of all of them. In their work, they
considered monthly, weekly and daily aggregates of the sentiment-classified tweets and
compared them with the consumer confidence index in the Netherlands; the highest
correlation was found for the weekly aggregate. In their study, they took aggregates of
seven days counted from the publication day of the survey, as shown in Figure 4.
Figure 4: Daas & Puts (2014) aggregates selection. Source: prepared by the author.
According to their work, the maximum correlation between the consumer confidence index and
the sentiment index was found for the seven-day aggregate covering the week before the
publication of the survey. This week corresponds to the days from the 8th to the 14th. It
is when 70% of the survey is carried out, which makes sense, because the social media
messages collected and the information from the survey then reflect the same emotion.
So, in order to save time and storage and in line with these findings, the same week is
selected, but using the dates of the Spanish survey, as shown in Figure 5.
Figure 5: Aggregates comparison: Daas & Puts (2014) above and the selected in this project below.
*Values in parentheses correspond to the correlation coefficients found in Daas & Puts (2014).
Source: prepared by the author.
The previous figure compares the sentiment aggregates and the consumer confidence survey in
the Netherlands (above) and Spain (below). In the Spanish case, the consumer confidence
survey is carried out in the second half of the month instead of the first, so the
aggregate that matches the referenced study runs from the 21st to the 27th, which is also
when the majority of the Spanish survey is done and is the week before the survey
publication.
Because of the limitations of the project mentioned above, it is not possible to download
seven days for each month over six years. The selected aggregate therefore has three days
instead of seven, from the 21st to the 23rd of each month.
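The day selection described above can be written down as a short date generator. This is a
sketch under the stated choices (days 21 to 23 of every month from 2012 to 2017); the
function name `selected_days` and its parameters are invented for the example.

```python
from datetime import date


def selected_days(first_year=2012, last_year=2017, days=(21, 22, 23)):
    """Return the dates to download: the given days of every month in the period."""
    return [
        date(year, month, day)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
        for day in days
    ]


dates = selected_days()
print(dates[0], dates[-1])  # first and last selected day
```

Passing a different `days` tuple, for example `tuple(range(21, 28))`, would reproduce the
full seven-day aggregate instead of the three-day one.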
Once the days to download and the reference list of Spanish Twitter users have been
selected, the tweets are downloaded. The result is the “Raw Tweets Dataset” shown in
Figure 2.
This dataset contains an estimated 100 million Twitter messages, with an average of
270,000 messages per day. It includes the name of the user, the date of the tweet, the text
of the tweet and other information such as hashtags or magnet links.
In a Big Data project, a significant proportion of the time is spent transforming, cleaning
and filtering the data, with the aim of improving the processing speed and the accuracy of
the classification methods. This process is called “Data Tidying”, also shown in Figure 2.
It consists in cleaning and filtering the tweets, and has been performed as suggested in
Kawa (2016). Basically, the tidying tasks in this project can be divided into two groups,
as represented in Figure 6.
Figure 6: Tidying tasks. Source: prepared by the author.
The reason for stemming and for removing stop-words and semantic features such as
usernames, links, punctuation symbols, emoticons, mentions and other information is that it
significantly improves the processing speed and also allows better accuracy in the
classification process.
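A few of these cleaning steps can be illustrated with regular expressions. The sketch below
is an assumption-laden example, not the project's code: the stop-word list is a tiny
made-up excerpt, hashtags are kept as plain words by choice, and stemming and emoticon
removal are left out to keep the example short.

```python
import re
import string

# Tiny illustrative stop-word list; a real run would use a full Spanish list.
STOP_WORDS = {"de", "la", "el", "en", "y", "a", "que", "los", "las", "un", "una", "es"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Character class with ASCII punctuation plus the Spanish opening marks.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + "¿¡") + "]")


def tidy_tweet(text):
    """Lower-case a tweet, drop links, mentions, punctuation and stop-words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # remove links
    text = MENTION_RE.sub(" ", text)   # remove mentions / usernames
    text = PUNCT_RE.sub(" ", text)     # remove punctuation (also the '#' sign)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(tidy_tweet("¡El paro baja en España! @usuario https://t.co/abc #economia"))
# ['paro', 'baja', 'españa', 'economia']
```

Links and mentions are removed before punctuation on purpose: stripping punctuation first
would break the URL and the `@` sign apart and leave fragments behind.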
The tweets dataset may contain messages in other languages. So, the first filtering task
consists in keeping only those tweets written in regional languages, in order to avoid
problems with the classification. For this purpose, two text categorization algorithms,
textcat and cldr (Ramasundaram & Victor, 2013), are used. Because the text has previously
been stemmed, the algorithms find Spanish, Catalan and Galician very similar and classify
them as the same language.
The second filter consists in removing the messages with irrelevant information. This task
follows Daas & Puts (2014), who use a list of economic words and synonyms as a filter with
the aim of finding a higher correlation with the ICC. In this project, the filter is a list
of 120 economic words and synonyms, and the filtering process consists in keeping only
those messages in which at least one word of the tweet matches one in the list.
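The relevance filter amounts to a set-membership test on the tokens of each tweet. In the
sketch below the term list is a made-up excerpt of six words (the project's list has 120
terms and is not reproduced here), and the tweets are invented examples that are assumed
to be already tokenized.

```python
# Hypothetical excerpt of an economic word list (illustrative only).
ECONOMIC_TERMS = {"paro", "empleo", "crisis", "precio", "compra", "ahorro"}


def is_relevant(tokens, terms=ECONOMIC_TERMS):
    """True if at least one token of the tweet appears in the term list."""
    return not terms.isdisjoint(tokens)


tweets = [
    ["paro", "baja", "españa"],
    ["buen", "partido", "ayer"],
    ["precio", "luz", "sube"],
]
relevant = [t for t in tweets if is_relevant(t)]
print(len(relevant))  # 2
```

If the tweets are stemmed, the term list has to be stemmed in the same way, otherwise
inflected forms would not match.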
The result is the final dataset on which the sentiment classification is performed. After
filtering and cleaning, it contains about 40 million tweets, with an average of 111,000
tweets per day, and is called the “Final Tweets Dataset” in Figure 2.
b. Consumer Confidence Dataset
This dataset comes from the Spanish Research Center of Sociology (Centro de
Investigaciones Sociológicas, 2018), also known as CIS. It is a time series that contains
the values resulting from the monthly consumer confidence survey in Spain for the period
2012-2017. It contains three indexes: the current situation index (ISA), the economic
expectations index (IEE) and the consumer confidence index (ICC). The consumer confidence
index is computed from the other two indexes (ISA and IEE), with the following formula:
\[ ICC = \frac{ISA + IEE}{2} \]
It is constructed in approximately the same way as the consumer confidence index in the
Netherlands, as described in Daas & Puts (2014). The survey is conducted from the 14th to
the last day of each month and is published monthly in arrears (Figure 5). It consists of
questions about improvements in the economy and expectations about the future, as well as
purchase intentions, and the goal of this indicator is to predict future consumer
behaviour.
This time series is used as the reference for consumer confidence and is correlated with
the sentiment index.
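For reference, Pearson's correlation between two monthly series of equal length can be
computed as in the sketch below. The function name `pearson_r` and the sample values are
assumptions for illustration; the numbers are made up and are not the project's data or
results.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up monthly values, for illustration only.
sentiment = [0.10, 0.20, 0.15, 0.30, 0.35]
icc = [60.0, 70.0, 68.0, 85.0, 90.0]
print(round(pearson_r(sentiment, icc), 3))
```

The coefficient only measures linear co-movement over the whole sample, so a high value can
coexist with sub-periods in which the two series behave differently.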
4. Opinion mining
Once the tweets dataset has been cleaned and filtered, an opinion mining process is
performed, with the goal of classifying the tweets by their level of positiveness.
a. Methodology
Opinion mining consists in extracting relevant information from the subjectivity in texts
and labelling a sentence with a given strength of opinion, according to Pang & Lee (2008).
Even for a human, it is not easy to measure the overall sentiment of a sentence by building
a list of keywords as a reference. The goal of this part of the project is to create a
model able to detect and classify automatically whether a tweet is positive or negative.
Some facts make this task difficult. Textual information involves complex language
structures such as irony, sarcasm or the context of a sentence, which are not easy to
detect.
These problems can be addressed with the right classification method. In this work, I
consider classifiers often used in the opinion mining literature: the “valence lexicon
dictionary” method and the following supervised machine learning algorithms: “Naive
Bayes”, “Support Vector Machines” and “Regularized Logistic Regression”. Different
classification methods are compared because performance depends on the characteristics of
the data, the language (particularly in an opinion mining study) and the way each algorithm
classifies the data, so it is recommended to check which one performs best (Salzberg &
Fayyad, 1997).
b. Valence Lexicon
This method consists in measuring the overall sentiment of a sentence from the values of
its individual words, using a dictionary called a lexicon, which is a list of positive and
negative words, each labelled with a numeric value.
In this project, the dictionary used is the valence lexicon from the study by
Stadthagen-Gonzalez, et al. (2017). It is composed of 14,031 Spanish words, classified by
emotional valence.
All the words in the lexicon are stemmed and the duplicates are removed, in the same way as
for the tweets dataset, so that the same language terms are compared.
Emotional valence means that each word in the dictionary has a numeric value given by the
degree of its associated emotion. If the word is related to a negative emotion, it has a
low valence value in the lexicon, and if it is related to a positive emotion, it has a high
valence value.
This classifier works by giving a value to each word of the tweet, using the lexicon as a
reference. It does not include the connectors of the text, because they do not have any
associated emotion. It then sums the valence values of all the words in a tweet, producing
a number that measures its overall sentiment, as represented, for instance, in Figure 7.
Figure 7: Lexicon method classification examples. Source: prepared by the author.
The problem with this method is that it classifies a sentence such as “it is not bad” as
being as negative as “it is bad”, because “bad” is a word related to a negative emotion. In
context, the first sentence does not indicate something negative, and yet both are
classified with the same value. The same problem occurs when it scores a non-positive
sentence as positive, as described in Figure 7.
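The scoring rule and its weakness with negation can be reproduced with a toy example. The
lexicon below is invented for illustration, with made-up scores centred on zero (negative
below zero, positive above); it is not the lexicon used in the project, whose scale is not
reproduced here.

```python
# Toy lexicon with made-up, zero-centred valence scores.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}


def lexicon_score(tokens, lexicon=TOY_LEXICON):
    """Sum the valence of every token found in the lexicon; other words count as 0."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def classify(tokens, lexicon=TOY_LEXICON):
    """Label a tokenized sentence from the sign of its summed valence."""
    score = lexicon_score(tokens, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("it is bad".split()))      # negative
print(classify("it is not bad".split()))  # negative as well: "not" is ignored
```

Because each word is scored on its own, the word "not" contributes nothing and both
sentences receive exactly the same score.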
c. Supervised Machine Learning
Machine learning is the discipline in which a computer learns without being explicitly
programmed to do so. Generally, machine learning methods fall into two categories:
supervised and unsupervised learning. In this project, I use supervised methods. Supervised
machine learning consists in classifying unlabelled data with a model based on a sample of
labelled data. In other words, the model is trained and validated on a dataset that has
previously been classified, and once it has learnt from this data, it can classify new
incoming data. In this project, the source of information is called corpus linguistics and
consists of a pre-labelled dataset of tweets. The process of building the machine learning
model and classifying the tweets is described in Figure 8:
Figure 8: Model creation process and classification. Source: prepared by the author.
The first step in this process is to obtain the corpus linguistics, in order to have a
source of information to feed the machine learning models. In this project it is composed
of tweets and is called the “TASS corpus linguistics”, courtesy of the workshop on semantic
analysis at the SEPLN (2017).
It consists of 70,000 tweets, written in Spanish by 200 personalities, from November 2011
to March 2012, as described in Villena-Román, et al. (2012). It has been used in other
studies to test the accuracy of both methods, lexicon and machine learning (Anta, et al.
2013; Moreno, et al. 2013). It has four levels of polarity: positive, negative, neutral and
none. Because of time and data processing limits, and to make training and classification
faster, I only use the positive and negative tweets. The tweets are also stemmed and
stop-words are removed, in the same way as for the tweets dataset.
Once the cleaning task is done, the corpus linguistics has around 40,000 tweets. It is
split randomly into two datasets, the train set (80%) and the test set (20%); these
percentages were selected because they are the most commonly used. The train set is used to
train the model and optimize its parameters. The test set is used to check the accuracy of
the model and thus to compare the different classification methods. The corpus is split
because, if the model is trained and tested on the entire dataset, overfitting can go
undetected. In other words, the model can perform well on the labelled data but not as well
on new unlabelled data. Once the corpus linguistics has been cleaned and split, it is ready
to be used.
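A random 80/20 split can be sketched as follows. The helper name `train_test_split`, the
fixed seed and the ten-item toy corpus are assumptions made for this example.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of `items` and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed for a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy labelled corpus (made-up data).
corpus = [("tweet %d" % i, "positive" if i % 2 else "negative") for i in range(10)]
train, test = train_test_split(corpus)
print(len(train), len(test))  # 8 2
```

Shuffling before cutting matters when the corpus is ordered, for example by date or by
author, because otherwise the test set would not resemble the training set.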
The next step is to train the algorithms selected in this project on the training set of
the corpus linguistics and optimize their parameters. The input of each algorithm is the
set of features (tokens) of the tweets in the corpus, which are associated with a given
class, positive or negative. These features come from tokenization, which consists in
splitting the text into smaller parts called tokens, which can be words, keywords, phrases,
symbols or other elements. Each algorithm behaves differently depending on the structure of
the features (tokens) and the amount of training data. Each algorithm and how it works are
described below.
i. Algorithms description
1. Naive Bayes
The Naive Bayes algorithm is commonly used in data science projects because of its
simplicity and its power as a predictive algorithm. It is based on Bayes’ theorem, with the
following formula:
\[ P(c \mid x) = \frac{P(c)\, P(x \mid c)}{P(x)} \]
Where “c” is the class and “x” are the attributes. It gives the probability of a class
conditional on a given set of attributes. In this project, the classes are positive and
negative and the attributes are the text features. The parameter to optimize is the weight
given to the features of each class.
The main problem of this method in text classification is that it assumes that the
subjectivity of each feature is independent of the others in the sentence. This is reflected
in the accuracy of the classifier, as mentioned by Brownlee (2016).
The reason for the lack of accuracy of this algorithm in sentiment classification is that the
polarity of a sentence is not captured by measuring the subjectivity of each word
independently. Instead, the polarity depends strongly on the interrelation of its features.
Basically, it has the same problem as the lexicon-based classification method.
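To make the independence assumption concrete, the sketch below implements a small multinomial
Naive Bayes classifier over tokens. It is only an illustration: the function names, the
Laplace (add-one) smoothing and the toy training tweets are assumptions, not the
implementation used in this project.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (tokens, label). Returns class counts, token counts, vocabulary."""
    class_counts = Counter(label for _, label in documents)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def predict_class(tokens, model):
    """Pick the class with the highest log P(c) + sum of log P(token | c)."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        log_prob = math.log(n_docs / total_docs)  # prior P(c)
        total_tokens = sum(token_counts[label].values())
        for token in tokens:
            # Each token contributes independently of the others (the "naive" assumption).
            count = token_counts[label][token] + 1  # add-one smoothing
            log_prob += math.log(count / (total_tokens + len(vocabulary)))
        scores[label] = log_prob
    return max(scores, key=scores.get)


# Hypothetical training tweets, already tokenized.
training = [
    (["buena", "noticia"], "positive"),
    (["gran", "mejora"], "positive"),
    (["mala", "noticia"], "negative"),
    (["crisis", "grave"], "negative"),
]
model = train_naive_bayes(training)
print(predict_class(["buena", "mejora"], model))
```

Because every token is scored on its own, a sentence such as "no es buena" would still be
pushed towards the positive class by the token "buena", which is the weakness discussed above.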
2. Support Vector Machines (SVM)
Support Vector Machines are widely used in sentiment classification. First of all, the
algorithm represents each observation of the incoming data as a point in a space where each
axis corresponds to one feature, and each point belongs to one of the classes, in this case
positive or negative (Figure 9).
Then, the algorithm builds a frontier by finding the maximum-margin hyperplane that divides
the groups of each class; the observations closest to this frontier are called support
vectors. In other words, it is the frontier (hyperplane or line) which best segregates the two
classes.
New data are then classified by the resulting non-probabilistic binary linear model. There are
two parameters to optimize: the cost parameter “C”, which controls the softness of the margin,
and the kernel parameter “𝛾”.
Figure 9: Support vector machines in binary classification. Source: prepared by the author.
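The idea of a maximum-margin frontier can be illustrated with a small linear soft-margin SVM
trained by subgradient descent on the hinge loss. This is a sketch under stated assumptions:
it uses a linear kernel only (so there is no kernel parameter), and the function names, the
learning rate and the toy points are hypothetical rather than taken from this project.

```python
def train_linear_svm(points, labels, cost=1.0, learning_rate=0.01, epochs=500):
    """Minimize 0.5 * ||w||^2 + cost * sum(max(0, 1 - y * (w.x + b))) by subgradient descent."""
    n_features = len(points[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # the point is inside the margin: the hinge loss is active
                for j in range(n_features):
                    grad_w[j] -= cost * y * x[j]
                grad_b -= cost * y
        w = [wi - learning_rate * g for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def svm_predict(x, w, b):
    """Return +1 or -1 depending on the side of the frontier where x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical two-feature observations: +1 is the positive class, -1 the negative class.
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print([svm_predict(x, w, b) for x in points])
```

A larger cost penalizes points inside the margin more heavily, which gives a harder margin; a
smaller cost tolerates more violations.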
3. Regularized Logistic Regression
The Regularized Logistic Regression is a special case of the Generalized Linear Models. It is
a classification algorithm which estimates discrete values from a group of features by
minimizing the variance, with the following equation:
$$\tilde{y}(x_1, \ldots, x_p) = \Lambda(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p)$$
where Λ denotes the logistic function. The output of the function is an estimated probability
“p”, in the interval 0 ≤ p ≤ 1, which is rounded to 0 or 1, whichever is closest. In this
project, the features are the words of the sentence (tokens), and the discrete values are
positive if the output of the function is 1 and negative if it is 0. It is regularized because
it finds the estimators by minimizing the “loss-penalty” problem (Friedman, et al. 2010):
$$\text{variance} + \lambda \cdot \text{bias}$$
where the bias corresponds to the estimation error of the model, and the variance of the
estimation is the error produced by fluctuations in the model. Both are related through the
complexity of the model, which is the number of features used to build it: as the complexity
increases, the bias decreases and the variance increases.
In other words, regularization means finding the optimal complexity of the model with the
minimum value of the parameter λ, which is the parameter to be optimized.
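The role of λ can be illustrated with a small penalized logistic regression fitted by gradient
descent. This is a sketch under assumptions: it uses a ridge (squared) penalty and plain
gradient descent, whereas Friedman et al. (2010) describe coordinate descent for a more
general penalty; the function names, the toy data and the parameter values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lam=0.01, learning_rate=0.5, epochs=500):
    """Minimize the mean log-loss plus (lam / 2) * sum(beta_j^2); the intercept is not penalized."""
    n, p = len(rows), len(rows[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(epochs):
        grad0 = 0.0
        grad = [lam * bj for bj in beta]  # gradient of the penalty term
        for x, y in zip(rows, labels):
            error = sigmoid(beta0 + sum(bj * xj for bj, xj in zip(beta, x))) - y
            grad0 += error / n
            for j in range(p):
                grad[j] += error * x[j] / n
        beta0 -= learning_rate * grad0
        beta = [bj - learning_rate * gj for bj, gj in zip(beta, grad)]
    return beta0, beta


def predict_proba(x, beta0, beta):
    """Estimated probability of the positive class."""
    return sigmoid(beta0 + sum(bj * xj for bj, xj in zip(beta, x)))


# Hypothetical features: counts of the tokens "mejora" and "crisis" in each tweet.
rows = [[1, 0], [2, 0], [0, 1], [0, 2]]
labels = [1, 1, 0, 0]
beta0, beta = fit_logistic(rows, labels)
print(round(predict_proba([1, 0], beta0, beta)), round(predict_proba([0, 1], beta0, beta)))
```

Fitting the same data with a larger `lam` shrinks the coefficients towards zero, which is the
sense in which λ controls the complexity of the model.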
ii. Model validation
The technique used to select the optimal parameter and validate the model is called
“cross-validation”. There are different cross-validation techniques, but this project uses the
k-fold method, because it is widely used according to Friedman, et al. (2010: p. 8-18). It is
a useful technique to avoid the overfitting problems mentioned before. The dataset used for
the validation process is the training dataset of the linguistic corpus.
Figure 10: 10-fold cross-validation procedure. Source: prepared by the author.
The technique consists in randomly splitting the dataset into “k” subsets, choosing one for
testing and the rest for training, and repeating the process “k” times, using different
parameters each time. The number of folds depends on the amount of data available for
training. Although there is no formal rule, 10-fold cross-validation is used in this project,
as recommended in the work of Kohavi (1995: p.75).
Specifically, in this case the “training” dataset of the linguistic corpus is randomly split
into 10 equal subsets. One of them is selected to test the selected algorithms, which are
trained on the rest of the data, and the process is repeated 10 times, selecting different
parameters (Figure 10). So each time a parameter is selected in the cross-validation process,
the algorithm is trained on the training data and tested on the respective test set, giving a
result.
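The procedure can be sketched as follows. The sketch follows the common convention of
evaluating every candidate parameter on all “k” folds and averaging the scores, which is an
assumption about the exact scheme; `fit` and `score` are hypothetical callables standing for
any of the algorithms and for the evaluation criterion.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets of similar size
    for i in range(k):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, folds[i]


def cross_validate(items, candidate_params, fit, score, k=10):
    """Return the mean validation score obtained by each candidate parameter."""
    results = {}
    for param in candidate_params:
        fold_scores = []
        for train_idx, test_idx in k_fold_indices(len(items), k):
            model = fit([items[j] for j in train_idx], param)
            fold_scores.append(score(model, [items[j] for j in test_idx]))
        results[param] = sum(fold_scores) / k
    return results


splits = list(k_fold_indices(25, k=5))
print([len(test_idx) for _, test_idx in splits])
```

The parameter with the best mean score is then kept, for instance with
`max(results, key=results.get)`.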
All the results are stored and compared using the “area under the receiver operating
characteristic” criterion, also known as the area under the ROC curve, in order to select the
optimal parameter (Bradley 1997; Provost & Fawcett, 2001).
The ROC curve quantifies the ability of the classifier to discriminate between positives and
negatives, in this case. It is built by plotting the true positive rate against the false
positive rate of a given classification.
Figure 11: The ROC curve
Figure 12: Area under the ROC curve
Source: prepared by the author.
It is represented in Figure 11: the closer the curve is to the point “A”, the better the
accuracy (at “A” all the tweets are correctly classified), while the straight line “D”
represents random classification. Accordingly, a high value of the area under the ROC curve
(AUC) is a sign of better classification performance (Weiss & Provost, 2001: p. 11). If
AUC = 1, all the tweets are correctly classified. This area is represented in Figure 12.
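The AUC can also be computed without drawing the curve, since it equals the probability that a
randomly chosen positive tweet receives a higher score than a randomly chosen negative one.
The sketch below uses this pairwise definition; the function name and the example scores are
hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for two positive and two negative tweets.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

A classifier that gives the same score to every tweet obtains 0.5, the value associated with
random classification.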
Finally, all the stored parameter values used in the cross-validation and the resulting AUC of
each classification are plotted. The next chart (Figure 13) shows the relation between the AUC
and the selected parameters of the regularized logistic regression, which is used as an
example because this algorithm has the best performance.
Each red point represents the AUC value obtained with a given parameter in the model, and the
vertical dotted lines represent the optimal range for the parameter selection. In this case,
the AUC is 0,927 and the corresponding parameter value, λ = 0,00229, is chosen as the optimal
parameter.
Figure 13: Relation of the AUC and the lambda parameters. Source: prepared by the author.
d. Accuracy comparison and model selection
In this section, the accuracy of the lexicon-based classifier and that of the optimized
machine learning models are compared. Specifically, the performance of both methods is
compared by classifying the test set of the linguistic corpus, which is already labeled, and
comparing the predicted values of each classification with the observed values.
The results of this comparison are presented in a table, following Pak & Paroubek (2010). This
table is called the confusion matrix and it summarizes the result of the classification. It
has two dimensions, with identical sets of classes on each dimension, where the observed and
the predicted values of each classification are presented.
For instance, the next table represents the confusion matrix, where the cells in the positions
“True positive” and “True negative” are the correctly classified tweets, and the cells in the
positions “False positive” and “False negative” are the incorrectly classified tweets:
Confusion Matrix                    Predicted
                          Positive           Negative
Observed    Positive      True positive      False negative
            Negative      False positive     True negative
Table 2: Confusion matrix example. Source: prepared by the author.
Using this information, there are some metrics that help to choose the most accurate method.
In this study, I use the “accuracy” and the “error of prediction”, which are calculated with
the following formulas:
$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{Total observations}}$$
$$\text{Error} = 1 - \text{Accuracy}$$
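Both metrics follow directly from the counts of the confusion matrix, as the sketch below
shows. The function names and the example labels are hypothetical; the sketch uses the usual
convention in which a false negative is an observed positive predicted as negative, and a
false positive is an observed negative predicted as positive.

```python
from collections import Counter


def confusion_matrix(observed, predicted):
    """Count the four combinations of observed and predicted classes."""
    counts = Counter(zip(observed, predicted))
    return {
        "true_positive": counts[("positive", "positive")],
        "false_negative": counts[("positive", "negative")],
        "false_positive": counts[("negative", "positive")],
        "true_negative": counts[("negative", "negative")],
    }


def accuracy(matrix):
    """Share of correctly classified observations."""
    correct = matrix["true_positive"] + matrix["true_negative"]
    return correct / sum(matrix.values())


# Hypothetical labels of five test tweets.
observed = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
matrix = confusion_matrix(observed, predicted)
print(matrix, accuracy(matrix), 1 - accuracy(matrix))
```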
The results of the testing process show that the best-performing algorithm is the logistic
regression, although the difference with the support vector machines algorithm is small. The
logistic regression has been chosen as the final classifier also because its processing time
is lower than that of the SVM. The next table shows the results of the testing process:
            Bayes      Lexicon    SVM        LogReg
Accuracy    56,25%     77,15%     85,12%     85,83%
Error       43,75%     22,85%     14,88%     14,17%
Table 3: Accuracy and error comparison of all the classifiers. Source: prepared by the author.
e. Sentiment classification
Once the tweets dataset is properly cleaned and filtered and the classification model is built
and optimized, the sentiment classification can be performed as follows. It basically consists
in classifying the tweets using the optimized model, which gives a positive or negative label
to each tweet. Once all the tweets are classified, the overall sentiment of a given day is
measured with the following criterion:
$$\text{sentiment}_t = \left( \frac{\text{positive tweets}_t}{\text{total tweets}_t} - \frac{\text{negative tweets}_t}{\text{total tweets}_t} \right) + 100$$
This formula follows the same procedure that the CIS uses to build the consumer confidence
index, where each value of the consumer confidence is obtained by calculating the difference
between the percentages of positive and negative answers in the monthly survey and adding 100.
These results are stored as a new dataset where each day has an associated sentiment value.
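A minimal sketch of this daily index is given below. It assumes that the shares of positive
and negative tweets are expressed as percentages (multiplied by 100) before adding 100,
following the description of the CIS procedure; the function name and the example labels are
hypothetical.

```python
def daily_sentiment(labels):
    """Balance of positive and negative tweets of one day, in percentage points, plus 100."""
    total = len(labels)
    positive = sum(1 for label in labels if label == "positive")
    negative = sum(1 for label in labels if label == "negative")
    balance = 100.0 * (positive - negative) / total  # assumption: shares as percentages
    return balance + 100.0


# Hypothetical day with 6 positive and 4 negative tweets.
print(daily_sentiment(["positive"] * 6 + ["negative"] * 4))
```

Under this assumption the index ranges from 0 (all tweets negative) to 200 (all tweets
positive), with 100 as the neutral value.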
5. Building the Time Series
a. Calendar effect correction
At this point, the sentiment classification of the tweets dataset has already been performed
and stored in a new dataset with panel data structure, that is, the sentiment of days 21, 22
and 23 of each month, from January 2012 to December 2017. Selecting three days can cause a
calendar effect problem: when a day falls on a weekend or a bank holiday, the sentiment may be
different, in particular more positive. The problem can be corrected using the following
regression:
$$\text{sentiment} = \alpha_0 + \alpha_1 \, \text{dummy}$$
where “sentiment” represents the time series of the classified tweets, and “dummy” is a binary
variable that takes the value 1 when the day falls on a weekend, a bank holiday or the Spanish
“Semana Santa” (Holy Week), and 0 otherwise. Then, the residuals of the regression are saved
as the time series corrected for the “calendar effect”.
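With a single binary regressor, the least squares estimates have a closed form, so the
correction can be sketched without any statistical library. The function name and the example
values are hypothetical, and the sketch assumes that the dummy takes both values in the sample.

```python
def calendar_residuals(sentiment, dummy):
    """Fit sentiment = alpha0 + alpha1 * dummy by least squares and return the residuals."""
    n = len(sentiment)
    mean_y = sum(sentiment) / n
    mean_d = sum(dummy) / n
    cov = sum((d - mean_d) * (y - mean_y) for d, y in zip(dummy, sentiment))
    var = sum((d - mean_d) ** 2 for d in dummy)  # zero if the dummy never changes
    alpha1 = cov / var
    alpha0 = mean_y - alpha1 * mean_d
    residuals = [y - (alpha0 + alpha1 * d) for y, d in zip(sentiment, dummy)]
    return alpha0, alpha1, residuals


# Hypothetical daily values: the last two days fall on a weekend.
alpha0, alpha1, residuals = calendar_residuals([100, 104, 110, 112], [0, 0, 1, 1])
print(alpha0, alpha1, residuals)
```

Here `alpha0` is the mean sentiment of ordinary days and `alpha1` is the average difference
observed on weekends and holidays, so the residuals are centred on zero within each group.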
b. Aggregating the data
The previous dataset has a panel data structure, with three sentiment-classified and
calendar-corrected days for each month, from 2012 to 2017. In order to obtain a monthly time
series that can be compared with the consumer confidence index, the observations of the three
days must be aggregated into a single observation.
The aggregated value is the weighted mean of the sentiment values of the three days, according
to the following equation:
$$s = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3}$$
where “s” is the resulting aggregated value, “w_p” is the weight factor and “x_p” are the
sentiment values of each day. The weight factor is based on the total number of tweets,
because the usage of the platform is closely related to the discussion in which the sentiment
is reflected. The weights are the total number of tweets of each day, normalized with the
following formula:
$$w_p = \frac{n_p}{N}, \quad \text{where} \quad \sum_{p=1}^{3} w_p = 1$$
where “w_p” is the weight factor, “n_p” is the total number of tweets on day “p” and “N” is
the total number of tweets in the three days considered. Once the data is aggregated, the
result is the monthly time series of the calculated sentiment of the tweets (see Figure 1 in
Appendix).
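The aggregation amounts to a mean weighted by the number of tweets of each day, as in the
following sketch. The function name and the example figures are hypothetical.

```python
def aggregate_month(daily_values, daily_tweet_counts):
    """Weighted mean of the daily sentiment values, with weights n_p / N."""
    total = sum(daily_tweet_counts)
    weights = [n / total for n in daily_tweet_counts]  # the weights sum to 1
    return sum(w * x for w, x in zip(weights, daily_values))


# Hypothetical month: sentiment of days 21, 22 and 23 and their tweet counts.
print(aggregate_month([100, 110, 120], [100, 200, 100]))
```

Since the weights already sum to one, dividing by their sum leaves the result unchanged.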
c. Filtering the volatility
The resulting aggregated time series is highly volatile, as expected in Big Data projects and
mentioned before as one of the problems to be solved.
In their previous, similar work for the Netherlands, Daas & Puts (2014) recommend filtering
methods to smooth the results, such as moving averages or the Kalman filter. Moving averages
have the disadvantage of using future observations to smooth each observation. When the aim is
to create an index that predicts another one, it is meaningless to smooth the time series with
future observations, since these are unknown. The Kalman filter is clearly beyond the scope
and level of this project.
There are other options to filter the time series, such as the simple exponential smoothing
filter or the Holt-Winters smoothing filter. As an advantage over the previously proposed
methods, these filters only need past observations to smooth the time series, using the
following equation:
$$\hat{y}_t = \alpha \, y_t + (1 - \alpha) \, \hat{y}_{t-1}$$
where “α” is the smoothing factor, which is used to give weight to the past observations. This
parameter is selected automatically by the program in which the filter is run, which can cause
overfitting problems. This is because the model uses all the available data in the sample to
choose the parameter, and there is no guarantee that the selected parameter performs as well
out of sample as in sample.
In order to correct this problem, the split method can be applied in time series validation:
the time series is split into two subsets, the training and test sets respectively. Because of
the time series structure, this split cannot be performed randomly; one period has to be used
as training data, where the parameter is selected, and the following period as the test set.
In this case, the training set corresponds to the period 2012-2015 and the test set to the
period 2016-2017. The resulting filtered time series is shown in Figure 14.
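The filter and the selection of the smoothing factor on the training period can be sketched as
follows. How the program chooses α is not detailed here, so the criterion used in the sketch
(minimizing the sum of squared one-step-ahead errors over a grid of candidate values) is an
assumption, as are the function names and the example series.

```python
def exponential_smoothing(series, alpha):
    """Apply y_hat_t = alpha * y_t + (1 - alpha) * y_hat_(t-1), starting at the first value."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def one_step_sse(series, alpha):
    """Sum of squared errors when each value is forecast with the previous smoothed value."""
    smoothed = exponential_smoothing(series, alpha)
    return sum((y - f) ** 2 for y, f in zip(series[1:], smoothed[:-1]))


def select_alpha(train, grid=None):
    """Pick the smoothing factor with the lowest one-step error on the training period."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return min(grid, key=lambda a: one_step_sse(train, a))


print(exponential_smoothing([10, 20, 30], 0.5))
```

In the split scheme described above, `select_alpha` would be run on the observations of the
training period only, and the chosen value would then be applied to the whole series.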
Figure 14: Filtered sentiment index. Source: prepared by the author
6. Results
a. Correlation Analysis
Once the sentiment time series is built, it is necessary to check whether it contains
information about the consumer confidence. There are different methods to measure the relation
between two variables; in this case, the tool used is Pearson's correlation coefficient.
The test of correlation between the sentiment time series and the consumer confidence index
gives a value of r = 0.81 (see Figure 2 in Appendix). This is evidence of a high correlation
between both time series and suggests that the sentiment index is a good predictor.
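For reference, Pearson's coefficient is the covariance of the two series divided by the
product of their standard deviations. The sketch below computes it directly; the function name
and the example values are hypothetical and are not the series of this project.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two series of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly values of a sentiment index and a confidence index.
print(pearson([98, 101, 105, 103], [80, 84, 95, 90]))
```

The same function can be applied to sub-periods of both series to compare how the relation
changes over time.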
Other analyses beyond the scope of this project (such as cointegration or structural models)
could be performed, as suggested by Daas et al. (2015). The following chart shows both time
series:
Figure 15: Filtered sentiment index. Source: prepared by the author
The behaviour of the sentiment index is similar to that of the consumer confidence index.
Apparently, and despite the limitations of the project, it performs reasonably well. However,
there are some obvious divergences: from August 2014 onwards, the sentiment index is more
negative than expected.
b. Structural break
The different behaviour may be explained by a structural break in the time series. There are
several reasons that could explain this change. First of all, the filtering parameter may not
have been chosen correctly. The training set, which feeds the model with information and
determines the parameter selection, covers 2012 to 2015. This can produce a better fit in the
training sample than in the test sample (2016-2017). In order to check whether the split
method could have been the reason, another split of the time series into two subsets is
performed, as shown in the next figure:
Figure 16: Splitting scheme. Source: prepared by the author
The reason for this procedure is to allow the filtering model to have information from both
periods, before and after the structural break. This may solve the overfitting problems in
both periods of the estimated values of the filtered time series.
In order to check whether the problems have been solved, the correlation between the filtered
sentiment time series and the consumer confidence index is computed for each period. The next
table shows this comparison:
2012 - 2014 2015 - 2017
First method r = 0,9 r = 0,2
Second method r = 0,79 r = 0,38
Table 4: Correlation comparison. Source: prepared by the author
Here, the first and second methods are the two different splitting procedures followed. These
results suggest that there is evidence of overfitting with the first method, which is
apparently mitigated with the second. Even so, the correlation in the second period is still
lower than expected. The following charts represent the filtered time series with the
optimized parameter in each period:
Figure 17: Time series charts comparison by periods. Source: prepared by the author
Since the time series still behaves differently, there are other reasons to be discussed that
could explain this result. Even though the number of Twitter accounts has increased, the
number of active users of the platform in Spain evolved negatively over the period 2014-2017
(Table 5).
                2014           2015           2016           2017
Active Users    1,4 million    1,5 million    1,4 million    1,08 million
Growth          -              100.000        -100.000       -320.000
Rate (%)        -              7,14           -6,67          -22,86
Table 5: Number of Twitter active users evolution in Spain. TSMF, social media report (2017).
Keeping this in mind, and according to the findings of Barberá & Rivero (2012: p. 725), the
followers of the leaders of political parties are more active than the rest. If the activity
on the platform has decreased and the most active users are radical followers of the political
parties, this may affect the sensitivity of the aggregated messages.
Furthermore, the trends in the number of followers of the political leaders changed within the
selected period because of the creation, in 2014, of the party that is currently the “most
followed” political party on the platform (Blademedia 2018). This may mean that the majority
of the messages about political and economic topics are strongly polarized, which could have
changed the trend.
In line with these arguments, and considering which messages were selected for the sentiment
classification, the economic filter could have caused this, because it selected all the
messages with economic (and also political) content.
7. Conclusions
The main purpose of this project was to check whether the social media platform Twitter
provides useful information to predict consumer confidence. The results show that this
information can be found in social media messages. Despite the limitations of this project,
the sentiment index performs quite well as a predictor of consumer confidence, with a high
correlation (r = 0,81).
The model selection, the different filtering techniques applied to the tweets dataset and the
volatility filters are the main aspects that need further work in order to properly extract
valuable information from social media. Furthermore, it has been interesting to study the
possible reasons for the different behaviour of the sentiment index with respect to the
consumer confidence.
It is possible that, due to the loss of active users and of usage of the platform, the
extraction of sentiment with the aim of predicting economic indicators should be studied at
greater length and the filtering techniques carefully selected.
Finally, for future analysis, it might be interesting to exploit the potential of these
findings. Some applications could be to build weekly, daily or even streaming indexes of
consumer confidence, to create a regional index, or even to combine both approaches.
These indexes might be useful for marketing researchers, in order to predict the current
behaviour of consumers; for economic policy researchers, to have faster information with which
to develop new economic models; and for investors or manufacturing companies, to gauge demand
when making their decisions.
8. Appendix
Figure 1 - Sentiment time series without filtering
Figure 2 - Correlation matrix
9. Acknowledgments
I am extremely grateful to my girlfriend and future wife Estefanía, my parents Manolo and
Maite, and my colleague and friend Marcos, who have supported me at every moment during this
project.
A special mention goes to my tutor Pedro Albarrán Pérez and to the Faculty of Economics and
Business Science of the University of Alicante, for their guidance and for providing me with
the necessary resources to complete this project.
Finally, thanks to Manuel Garrido Peña, Alfonsa Denia Cuesta and Yoan Gutiérrez Vázquez for
their cooperation and kindness.
10. References
blademedia.co (2018). Twitteros más populares en España. [online] twitter-espana,
available at: http://twitter-espana.com/
Brownlee, J. (2016). Naive Bayes for Machine Learning . [online] Machine Learning
Mastery, available at: https://machinelearningmastery.com
Centro de investigaciones Sociológicas (2018). La construcción del indicador de confianza
del consumidor (ICC) . [online] CIS, available at: http://www.cis.es
Cisco white paper (February, 2017) Visual Networking Index: Global Mobile Data Traffic
Forecast Update, 2016–2021 [online] Cisco, available at https://www.cisco.com
McCandless, M., Sanford, M. and Firat, A (2013). cldr: Language Identifier based on CLD
library . R package version 1.1.0.
Daas, P. and Puts, M. (2014). “Social Media Sentiment and Consumer Confidence”,
Statistics Paper Series (5). [online] European Central Bank, available at:
https://www.ecb.europa.eu
Daas, P. and van der Loo, M. (2013). Computational Statistics & Data Analysis. [online]
Unesce. Available at: http://www.unescap.org
Daas, P., Puts, M., Buelens, B. and Hurk, P. (2015). “Big Data as a Source for Official
Statistics”. Journal of Official Statistics , 31(2). pp.249-262
Elogia digital marketing (2017). Estudio Anual Redes Sociales 2017. [online] Elogia,
available at: https://iabspain.es
Fernández Anta, A., Morere, P., Núñez Chiroque, L. and Santos, A. (2013). “Sentiment
Analysis and Topic Detection of Spanish Tweets: A Comparative Study of NLP
Techniques”, Procesamiento del Lenguaje Natural , Revista nº 50 marzo de 2013, pp 45-52.
[online] RUA, available at: https://rua.ua.es
Friedman, J., Hastie, T. and Tibshirani, R. (2010). “Regularization Paths for Generalized
Linear Models via Coordinate Descent”. Journal of Statistical Software , 33(1).
García Corbí, M. (2018). Spanish Social Media Index. [online] GitHub, Available at:
https://github.com
Hassani, H., Saporta, G. and Silva, E. (2014). “Data Mining and Official Statistics: The
Past, the Present and the Future”. Big Data , 2(1), pp.34-43.
Hernández de Cos, P. (2018). La recuperación de la economía española, Evolución
reciente y perspectivas del mercado inmobiliario [online] Banco de España, available at:
https://www.bde.es
Hornik K, Mair P, Rauch J, Geiger W, Buchta C and Feinerer I (2013). “The textcat
Package for n-Gram Based Text Categorization in R.” Journal of Statistical Software ,
52(6), pp. 1-17.
Instituto Nacional de Estadística (2018). Cifras de población [online] INE, available at:
http://www.ine.es/
Instituto Nacional de Estadística (2018). Relación de municipios y códigos por provincias
[online] INE, available at: http://www.ine.es
Kawa, N. (2016). Text Classification. [online] Berkeley University, available at:
https://www.stat.berkeley.edu
Kohavi, R. (1995). Wrappers for performance enhancement and oblivious decision graphs .
[online] Stanford, available at: http://robotics.stanford.edu
Labrinidis, A. and Jagadish, H. (2012). “Challenges and opportunities with big data”.
Proceedings of the VLDB Endowment , 5(12), pp.2032-2033
Laney, D. (2001) Application Delivery Strategies [online] Gartner, available at
https://blogs.gartner.com
Llorente, A., Garcia-Herranz, M., Cebrian, M. and Moro, E. (2015). “Social Media
Fingerprints of Unemployment”. PLOS ONE , 10 (5).
Marr, B. (2013), The Awesome Ways Big Data is used Today to Change Our World.
[online] LinkedIn, available at: https://www.linkedin.com
Morales, A., Borondo, J., Losada, J. and Benito, R. (2014). “Efficiency of human activity
on information spreading on Twitter”. Social Networks , 39, pp.1-11.
Moreno-Ortiz, A. and Pérez Hernández, C. (2018). “Lexicon-Based Sentiment Analysis of
Twitter Messages in Spanish”, Procesamiento del Lenguaje Natural , núm. 50, marzo, 2013,
pp. 93-100 [online] Redalyc, available at: http://www.redalyc.org
Pak, A. and Paroubek, P. (2010). “Twitter as a corpus for sentiment analysis and opinion
mining”. In proceedings of the seventh conference on international language resources and
Evaluation : pp. 1320-1326
Pang, B. and Lee, L. (2008). “Opinion Mining and Sentiment Analysis”. Foundations and
Trends in Information Retrieval , 2(1–2), pp.1-135.
Provost, F. and Fawcett, T. (2001). “Robust Classification for Imprecise Environments”.
Machine Learning , 42, 203–231 [online] Springer, available at: https://link.springer.com
Ramasundaram, S., and Victor S.P. (2013) “Algorithms for Text Categorization : A
Comparative Study”, World Applied Sciences , Journal 22 (9): pp. 1232-1240.
Reinsel, D, John Gantz and John Rydning white paper (March 2017) Total WW Data to
Reach 163ZB by 2025 [online] Storagenewsletter, available at:
https://www.storagenewsletter.com
Rivero, F. (2016). Informe mobile en España y en el Mundo 2016. [online] Amic, available
at: http://www.amic.media
Salzberg, S. and Fayyad, U. (1997), “On comparing classifiers: Pitfalls to avoid and a
recommended approach”. Data Mining and Knowledge Discovery , vol 1, no. 3, pp. 317-328
Stadthagen-Gonzalez, H., Imbault, C., Pérez Sánchez, M. and Brysbaert, M. (2016). Norms
of valence and arousal for 14,031 Spanish words. [online] Springer, available at:
https://link.springer.com
The social media family (2018). IV Estudio sobre los usuarios de Facebook, Twitter e
Instagram en España . [online] thesocialmediafamily.com, available at: http://www.abc.es.
Twitter Developers (2018). Developer Agreement and Policy . [online] Twitter, available at:
https://developer.twitter.com
Van der Brakel, J., Söhler, E., Daas, P. and Buelens, B. (2016). “Social media as a data
source for official statistics; the Dutch Consumer Confidence Index”. Statistics
Netherlands , Discussion Paper , 2016 (01).
Villena Román, J., García Morera, J., Moreno García, C., Ferrer Ureña, L., Lana Serrano,
S., González Cristóbal, J., Westerski, A., Martínez Cámara, E., Martínez Cumbreras, M.,
Martín Valdivia, M. and Ureña López, L. (2012). Workshop on Sentiment Analysis at
SEPLN. [online] Researchgate, Available at: https://www.researchgate.net
Weiss, G. and Provost, F. (2001). The Effect of Class Distribution on Classifier Learning:
An Empirical Study . [online] Researchgate, available at: https://www.researchgate.net