UNIVERSIDAD DE ALICANTE

FACULTAD DE CIENCIAS ECONÓMICAS Y EMPRESARIALES

GRADO EN ECONOMÍA

Curso académico 2017-2018

USING SOCIAL MEDIA TO MEASURE THE CONSUMER CONFIDENCE:

THE TWITTER CASE IN SPAIN

Manuel García Corbí

Pedro Albarrán Pérez

Departamento de Fundamentos del Análisis Económico

Alicante, mayo de 2018


Resumen

La finalidad de este proyecto es recopilar tuits, analizar su sentimiento, crear un índice a

partir de esa información y comprobar si ésta es útil para predecir la confianza del

consumidor.

Para crear dicho índice, se sigue un proceso de minería de opinión. En primer lugar, se

estudia la precisión de los métodos de análisis para escoger el más preciso, clasificar los

mensajes y obtener su sentimiento. El resultado es una serie temporal mensual,

comprendida entre 2012 y 2017, que se denominará “Índice Español de Sentimiento en

Twitter” (IEST).

Finalmente, se comprueba si la información obtenida puede ser útil para predecir el

índice de confianza del consumidor. Los resultados indican que ambos índices tienen

correlación positiva (r = 0,81), pero con diferente comportamiento en dos períodos

diferenciados, lo que podría implicar un cambio estructural en la serie.

Abstract

The goal of this project is to collect tweets, analyze their sentiment, build an index from that information and check whether it is useful for predicting consumer confidence.

To build the index, an opinion mining process is followed. First, the accuracy of different analysis methods is studied in order to choose the best-performing one. Next, the sentiment classification is applied to a processed dataset of tweets. The result is a monthly time series covering 2012 to 2017, called the “Spanish Twitter Sentiment Index” (STSI).

The final step is to check whether this information can be useful for predicting the consumer confidence index. The results suggest that the two indexes are positively correlated (r = 0.81), but behave differently in two distinct periods, which could indicate a structural break in the time series.

Keywords: Social Media, Big Data, Machine Learning, Sentiment Analysis, Official

Statistics.


Index

1. Introduction
2. Literature review
   a. Big Data
   b. Social Media
3. Data sources
   a. Tweets Dataset
   b. Consumer Confidence Dataset
4. Opinion mining
   a. Methodology
   b. Valence Lexicon
   c. Supervised Machine Learning
      i. Algorithms description
         1. Naive Bayes
         2. Support Vector Machines (SVM)
         3. Regularized Logistic Regression
      ii. Model validation
   d. Accuracy comparison and model selection
   e. Sentiment classification
5. Building the Time Series
   a. Calendar effect correction
   b. Aggregating the data
   c. Filtering the volatility
6. Results
   a. Correlation Analysis
   b. Structural break
7. Conclusions
8. Appendix
9. Acknowledgments
10. References


1. Introduction

The purpose of this project is to build an index that reflects consumer confidence, as an alternative indicator to the official statistics. To do so, I use the information contained in the messages of Spanish Twitter users, through a process of opinion mining. I then examine whether the index is useful for predicting the consumer confidence index in Spain, in order to check whether the goal of the project has been accomplished.

The consumer confidence index is an important indicator for policy makers, central banks, investors, manufacturing companies and marketing researchers, among others, because it helps them evaluate demand and make decisions. For these reasons, it is worth exploring an indicator that uses social media as its source of information, in order to save money and supplement the official statistic.

The link between the information found in social media and the official statistics is that both reflect the same emotion. This idea comes from the Appraisal-Tendency Framework of Han et al. (2007), which distinguishes two kinds of emotion in consumption decisions, integral and incidental. According to this distinction, the incidental emotion reflects the “intention” to buy a product, whereas the integral emotion reflects the “final decision” to purchase it. The consumer confidence survey asks about the intention to buy, so it reflects the incidental emotion. The same emotion is reflected in the messages of active social media users, as found by Daas & Puts (2014).

To create this indicator, an opinion mining process is carried out. It consists in extracting the incidental emotion from social media messages, in this case by measuring the sentiment of the tweets. This process requires Twitter messages and a sentiment classifier.

A suitable tweets dataset is not available on the internet, but it can be created. It is built from a list of Twitter users obtained through the Twitter API, and the tweets are downloaded for the period from 2012 to 2017. This period is chosen because it contains events that make it interesting and that may have affected consumer confidence. In 2012, Spain faced debt downgrades and the highest risk premium in its recent history.


It is also a period with trend changes in some economic indicators, for instance private consumption or the unemployment rate, as shown in the Bank of Spain article “The Recovery of the Spanish Economy” by Hernández de Cos (2018: p. 2-10).

Furthermore, there are limitations to collecting social media data from before this date: the older the tweets are, the more likely it is that their authors have deleted them.

In addition, the sentiment that reflects the incidental emotion must be extracted from the tweets. This task requires a classifier, chosen through a selection process between two approaches: the valence lexicon dictionary and supervised machine learning methods. Comparing the accuracy of different classification methods is a common task in an opinion mining process, because several factors can affect the accuracy of the classifiers, for instance the quality of the data used to train the algorithms.

Once the tweets dataset and the classifier are created, they are used to make the

sentiment index. The result is a monthly time series for the period mentioned above.

Finally, I check whether the sentiment index reflects the same information as the consumer confidence index (ICC), using Pearson’s correlation as a measure of the predictive power of the proposed indicator.

The results suggest that this information can be found in social media and that, despite the limitations of the project, the sentiment index performs quite well as a predictor of consumer confidence, with a high correlation (r = 0.81). However, there are some clear divergences, which might be explained by a structural break in the sentiment time series.

In the following section, I discuss the potential of Big Data as a source of information and, in particular, why social media can be an alternative source of data to supplement the official statistics.


2. Literature review

a. Big Data

Big Data is a source of information that consists of large datasets and requires specific infrastructure and analysis methods to be processed.

During the last decade, traditional data management methods have been pushed to their limits by, among other things, the effects of e-commerce. Business activity rose thanks to the sharp increase in sales and in the speed of trade transactions, producing a volume of data that is difficult to process. Recent improvements in technology infrastructure have allowed this volume of information to expand even further: new generations of smaller and more energy-efficient processors, high-volume storage devices, cloud storage services, invisible nanosensors and faster networks such as Bluetooth 3.0, WiFi or 4G. According to an update of the data traffic forecast for the period 2016-2021 published by Cisco (2017) in its Visual Networking Index, “Global mobile data traffic reached 7.2 exabytes per month at the end of 2016” (p. 1).

Furthermore, the market penetration of intelligent devices, the consolidation of e-commerce and the use of social media are generating more and more data every day (Figure 1). Stored data and data traffic also increase faster every year.

Figure 1: Annual size of global stored data and 2025 forecast. Source: Reinsel, et al. white paper (2017).


This progressive rise in data and the improvement in technology have allowed new sources of information to appear: business apps, public repositories, social media and sensors. However, these sources of information are not useful without proper storage and analysis techniques. Labrinidis & Jagadish (2012) suggest five stages in the process of extracting valuable information from Big Data.

There are certainly advantages in Big Data. It was defined by Laney (2001) through the concept of the 3 V's: volume, variety and velocity. A large amount of information is available, and it is fast and cheap to obtain. Companies and governments have an economic interest in Big Data, based on targeting and understanding the behaviour of their customers or citizens, on optimising and understanding business processes, and even on improving the accuracy of official statistics (Dass, et al. 2014). Finally, it is used in the financial sector, where trading algorithms are common (Marr, 2013).

However, Big Data has some problems to consider. Datasets from Big Data sources already exist rather than being designed by the analyst, so they must be understood before they are analysed (Hassani et al. 2014). A typical task in Big Data is aggregation, during which hidden missing data can be replaced with incorrect values. In addition, Big Data is not obtained by random sampling, so it may describe only subsets of a given population, which creates selectivity problems. Furthermore, the frequency of the incoming data makes the results volatile, although a possible solution is to apply filtering techniques to them, e.g., a Kalman filter or moving averages. There are also legal considerations: social media data, for instance, is subject to terms and conditions on how developers may use the information. Storage, data management and acquisition can also be costly in the long term.

Finally, high-performance computing hardware and techniques are needed to analyse such amounts of data at once or even in streaming (Daas et al. 2015: p. 256-257).


b. Social Media

Social media is defined as the set of internet platforms where users exchange personal opinions about certain topics using text messages. It has become an important part of public opinion. This information is posted in comments, likes, etc., which are common actions that people share every day, mainly through their mobile phones. In Spain, 95% of mobile phone users access the internet and social media via this device (Ditrendia 2017: p. 4; Elogia 2017: p. 4). Furthermore, around 40% of the Spanish population uses social media, and the share is increasing every year (Table 1).

            2014           2015           2016           2017
Facebook    20 million     22 million     24 million     23 million
Twitter     3.5 million    4.4 million    4.5 million    4.9 million
Instagram   -              7.4 million    9.6 million    13 million

Table 1: Evolution of the number of users. Source: The social media family, social media report (2017).

According to these figures, social media usage is expanding and can be used as a source of information, and many studies confirm its value. For instance, some have been carried out to improve official statistics (Daas et al. 2013), to find relations with consumption indicators (Brakel, et al. 2016) or to predict unemployment rates (Llorente et al. 2015). The main reasons for using it are that it can reduce survey costs and allow a faster release of information.


3. Data sources

There are two main data sources in this project. The first is the tweets dataset, which consists of Twitter messages from Spanish users in the period 2012-2017 and is used to create the sentiment index. Building this dataset first requires a list of Spanish Twitter users. With this list as a reference, all the tweets in the selected period are downloaded and then cleaned, filtered and prepared for the sentiment classification. The second dataset is the consumer confidence index, the official Spanish statistic produced by the Research Center of Sociology, which is compared with the sentiment index to check whether the latter reflects consumer confidence.

All dataset creation processes, such as manipulation, transformation or cleansing, are carried out using the R programming language, Python, Shell (Bash commands) or Microsoft Excel as a support platform. The datasets are stored as compressed R objects, as well as in .csv, .xlsx or .json formats, as convenient.

The code written to create these datasets can be found in the following GitHub repository: (https://github.com/manugaco/SpanishSocialMediaIndex/tree/master).

Because Twitter information is privacy-sensitive for its users, the procedure recommended in the Twitter privacy guide is followed: (https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases)


a. Tweets Dataset

This is the main source of information in the project. It is the reference dataset from which the sentiment expressed on the social media platform, Twitter in this project, is extracted. Other sources of information are also needed to build it. The process is shown in Figure 2.

Figure 2: Process to perform the tweets dataset. Source: prepared by the author.

First of all, a list of Spanish Twitter users is needed as a reference for where to extract the tweets from. Unfortunately, no such list is available on the internet. It can, however, be created using the Twitter API, subject to technical restrictions. Only fifteen server calls are allowed each hour, with a maximum of 5,000 friends/followers downloaded per call; otherwise the server breaks the connection and all the information is lost. To avoid this problem, only users with fewer than 75,000 and more than 5,000 friends/followers are selected.
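The 75,000 upper bound appears to follow from the two limits quoted above, since 15 calls of 5,000 entries each retrieve at most 75,000 friends/followers. The short sketch below only reproduces that arithmetic; the constant and function names are illustrative and are not taken from the project repository.

```python
import math

CALLS_PER_WINDOW = 15     # server calls allowed per hour, as stated in the text
ENTRIES_PER_CALL = 5000   # friends/followers returned by one call, as stated in the text


def calls_needed(n_connections):
    """Number of calls required to page through n_connections entries."""
    return math.ceil(n_connections / ENTRIES_PER_CALL)


def fits_in_one_window(n_connections):
    """True if the whole list can be fetched before the call limit is reached."""
    return calls_needed(n_connections) <= CALLS_PER_WINDOW


print(calls_needed(75000), fits_in_one_window(75000))
print(calls_needed(75001), fits_in_one_window(75001))
```

Under these assumptions, an account with 75,000 connections uses exactly the 15 available calls, while one more connection would require a sixteenth call.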

According to the findings of Morales, et al. (2010), the large majority of Twitter users follow a small group of highly active members (influencers). Querying the followers and friends of these users is therefore a good starting point for building a list of users within a region. For this purpose, a list of the most followed personalities in politics and economic activities on Twitter in Spain, found on the web page of the Spanish marketing company blademedia (2018), is used as the preliminary list, called “Spanish most followed users” in Figure 2.


An iterative loop called the “users loop” is applied to this list. It consists in getting the friends/followers of the most followed users to form a new list, merging both lists and deleting the duplicate entries. The process is then repeated starting from the new list, as shown in Figure 3.

Figure 3: Iterative loop to get the users dataset. Source: prepared by the author.
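The users loop can be sketched as a snowball expansion over a set of user identifiers. This is a minimal illustration under stated assumptions, not the code of the project: `get_connections` is a hypothetical callable standing in for the Twitter API request, and the number of rounds is an arbitrary parameter.

```python
def snowball_users(seed_users, get_connections, max_rounds=3):
    """Expand a seed list by repeatedly adding friends/followers.

    get_connections is a hypothetical stand-in for the API request: it takes
    a user id and returns an iterable of connected user ids.
    """
    known = set(seed_users)      # every user collected so far, without duplicates
    frontier = set(seed_users)   # users whose connections have not been queried yet
    for _ in range(max_rounds):
        discovered = set()
        for user in frontier:
            discovered.update(get_connections(user))
        frontier = discovered - known   # only unseen users are queried next round
        if not frontier:
            break
        known |= frontier
    return known


# Toy graph used in place of real API calls.
toy_graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
users = snowball_users(["a"], lambda u: toy_graph.get(u, []))
print(sorted(users))
```

Keeping the collected users in a set removes duplicate entries automatically, which corresponds to the merge-and-deduplicate step described above.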

The resulting dataset contains around 1.4 million Twitter users, with information about each user, for instance location or language.

This dataset can include users from other countries, so it must be properly filtered. For this task, a list of the names of Spanish municipalities and their regional translations (Instituto Nacional de Estadística 2018), called the “Location Filter list”, is used (Figure 2). It contains 8,064 names of municipalities, capitals, counties and their regional translations. The filtering task consists in comparing the location column of the Twitter users dataset with the location filter list and keeping those users whose location matches an entry.

There are some problems with this method. For instance, Spanish users with no location information or with fake locations such as “in the land of the living” are not selected. In addition, some entries in the location list coincide with foreign places, for instance “Guadalajara, México”. Because of this selectivity problem, the results may be biased. However, since the advantages of Big Data come from massive amounts of information, these negative effects can be considered to cancel out, as mentioned in the work of Daas, et al. (2015).
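The location filter and its two weaknesses can be illustrated with a small sketch. The matching rule used here (accent-insensitive comparison of the comma-separated parts of the location field) is an assumption made for the example; the text does not specify how the comparison is implemented, and the place names and users are toy data.

```python
import unicodedata


def normalize(text):
    """Lowercase and strip accents so that 'ALICANTE' and 'Alicante' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower().strip())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def filter_by_location(users, place_names):
    """Keep users whose location field contains one of the listed place names."""
    places = {normalize(p) for p in place_names}
    kept = []
    for user in users:
        location = normalize(user.get("location") or "")
        # Split on commas so that 'Alicante, España' matches the entry 'Alicante'.
        parts = [part.strip() for part in location.split(",")]
        if any(part in places for part in parts):
            kept.append(user)
    return kept


toy_places = ["Alicante", "Alacant", "Guadalajara"]
toy_users = [
    {"id": 1, "location": "Alicante, España"},
    {"id": 2, "location": "in the land of the living"},
    {"id": 3, "location": None},
    {"id": 4, "location": "Guadalajara, México"},
]
kept_ids = [u["id"] for u in filter_by_location(toy_users, toy_places)]
print(kept_ids)
```

In this toy data, users 2 and 3 are dropped even if they are Spanish, and user 4 is kept because the Mexican city shares its name with a Spanish municipality, which are the two selectivity problems described above.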


After applying this filter, the original list of 1.4 million users is reduced to around 600,000 users, who can be assumed to be Spanish. Downloading all the tweets from this list of users is not possible within the time available for this project, because of download speed and processing limitations. To work around this problem, a random sample of 240,000 users is drawn, resulting in the final dataset of “Twitter Spanish Users” (Figure 2).

The tweets of the users in the resulting list are then obtained for the selected period. Because of the storage and time limitations of the project, it is not possible to get the tweets from every day in the period 2012-2017 either.

Fortunately, thanks to the findings of Daas & Puts (2014), on which this project is based, it is possible to select some days instead of all the days of the month. In their work, they considered monthly, weekly and daily aggregates of the sentiment-classified tweets and compared them with the consumer confidence index in the Netherlands; the highest correlation was found for the weekly aggregate. They took aggregates of seven days relative to the publication day of the survey, as shown in Figure 4.

Figure 4: Daas & Puts (2014) aggregates selection. Source: prepared by the author.

According to their work, the maximum correlation between the consumer confidence index and the sentiment index was found for the seven-day aggregate covering the week before the survey publication. This week corresponds to the days from the 8th to the 14th, when 70% of the survey is carried out. This makes sense, because the social media messages collected and the information from the survey then reflect the same emotion.


So, in order to save time and storage, and in line with these findings, the same week is selected but using the dates of the Spanish survey, as shown in Figure 5.

Figure 5: Aggregates comparison: Daas & Puts (2014) above and those selected in this project below.

*Values in parentheses correspond to the correlation coefficients found in Daas & Puts (2014).

Source: prepared by the author.

The previous figure compares the sentiment aggregates and the consumer confidence survey in the Netherlands (above) and Spain (below). In the Spanish case, the consumer confidence survey is carried out in the second half of the month instead of the first, so the aggregate that matches the referenced study runs from the 21st to the 27th, which is also when the majority of the Spanish survey is done and is the week before the survey publication.

Because of the limitations of the project mentioned above, it is not possible to download seven days for each month over six years. The selected aggregate therefore has three days instead of seven, from the 21st to the 23rd of each month.
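The selected download days can be enumerated with a few lines of code. This sketch only reproduces the rule stated above (days 21 to 23 of every month from 2012 to 2017); the function names are illustrative.

```python
from datetime import date


def in_monthly_window(day, first=21, last=23):
    """True if the date falls inside the selected days of its month."""
    return first <= day.day <= last


def window_dates(start_year, end_year, first=21, last=23):
    """All selected dates between January of start_year and December of end_year."""
    return [
        date(year, month, day)
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
        for day in range(first, last + 1)
    ]


dates = window_dates(2012, 2017)
print(len(dates), dates[0], dates[-1])
```

With three days per month over six years, this rule yields 216 download days.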

Once the days to download and the reference list of Spanish Twitter users have been selected, the tweets are downloaded. The result is the “Raw Tweets Dataset” shown in Figure 2.


This dataset contains an estimated 100 million Twitter messages, with an average of 270,000 messages per day. It contains the name of the user, the date of the tweet, the text of the tweet and other information such as hashtags or magnet links.

In a Big Data project, a significant proportion of the time is spent transforming, cleaning and filtering the data, with the aim of improving the processing speed and the accuracy of the classification methods. This process is called “Data Tidying” and is also shown in Figure 2. It consists in cleaning and filtering the tweets, and has been performed as suggested in Kawa (2016). The tidying tasks in this project can be divided into two groups, as represented in Figure 6.

Figure 6: Tidying tasks. Source: prepared by the author.

Stemming and removing stop-words and semantic features such as usernames, links, punctuation symbols, emoticons, mentions and other information significantly improves the processing speed and also allows better accuracy in the classification process.
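A minimal sketch of the cleaning step, using only regular expressions from the standard library. The stop-word list is a tiny illustrative sample rather than the list used in the project, and stemming is deliberately left out because it would require an external stemmer.

```python
import re

# Tiny illustrative stop-word list; a real Spanish list is much longer.
STOP_WORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "un", "una", "es"}


def tidy_tweet(text):
    """Return the cleaned tokens of a tweet (stemming is omitted in this sketch)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # links (removed first, they contain punctuation)
    text = re.sub(r"@\w+", " ", text)           # usernames and mentions
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation, symbols and emoticons
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(tidy_tweet("@banco La economía mejora :) https://t.co/abc123 #empleo"))
```

Each removed element shortens the vocabulary the classifier has to handle, which is where the gain in processing speed comes from.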

The tweets dataset can potentially contain messages in other languages. The first filtering task therefore consists in keeping only those tweets written in regional languages, in order to avoid problems with the classification. For this purpose, two text categorization algorithms, textcat and cldr (Ramasundaram & Victor, 2013), are used. Because the text has previously been stemmed, the algorithms find Spanish, Catalan and Galician very similar and classify them as the same language.


The second filter consists in removing messages with irrelevant information. This task follows Daas & Puts (2014), who use a list of economic words and synonyms as a filter with the objective of finding a higher correlation with the ICC. In this project, the filter is a list of 120 economic words and synonyms, and the filtering process consists in keeping only those messages in which at least one word of the tweet matches one in the list.
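The keyword filter reduces to a set-membership test on the tokens of each tweet. In the sketch below the five terms are hypothetical examples, since the actual list of 120 words is not reproduced in the text.

```python
# Hypothetical sample of economic terms; the project uses a list of 120 words.
ECONOMIC_TERMS = {"economía", "empleo", "paro", "precios", "crisis"}


def is_relevant(tokens, terms=ECONOMIC_TERMS):
    """True if at least one token of the tweet appears in the term list."""
    return not terms.isdisjoint(tokens)


def filter_relevant(tweets, terms=ECONOMIC_TERMS):
    """Keep only the tokenized tweets that mention an economic term."""
    return [tokens for tokens in tweets if is_relevant(tokens, terms)]


toy_tweets = [
    ["sube", "paro", "otra", "vez"],
    ["gran", "partido", "anoche"],
    ["bajan", "precios", "luz"],
]
print(filter_relevant(toy_tweets))
```

Storing the terms in a set keeps each lookup constant-time, which matters when the test is repeated over millions of tweets.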

The result is the final dataset on which the sentiment classification is performed. After filtering and cleaning, it contains about 40 million tweets, with an average of 111,000 tweets per day, and is called the “Final Tweets Dataset” in Figure 2.

b. Consumer Confidence Dataset

This dataset comes from the Spanish Research Center of Sociology (Centro de Investigaciones Sociológicas, 2018), also known as CIS. It is a time series containing the values resulting from the monthly consumer confidence survey in Spain for the period 2012-2017. It contains three indexes: the current situation index (ISA), the economic expectations index (IEE) and the consumer confidence index (ICC). The consumer confidence index is computed from the other two indexes (ISA and IEE) with the following formula:

$$ICC = \frac{ISA + IEE}{2}$$

It is constructed in approximately the same way as the consumer confidence index in the Netherlands, as described in Daas & Puts (2014). The survey is conducted from the 14th to the last day of each month and is published monthly in arrears (Figure 5). It consists of questions about improvements in the economy and expectations about the future, as well as purchase intentions, and the goal of this indicator is to predict future consumer behaviour.

This time series is used as the reference for consumer confidence and to compute the correlation with the sentiment index.


4. Opinion mining

Once the tweets dataset is cleaned and filtered, an opinion mining process is performed, with the goal of classifying the tweets by their level of positiveness.

a. Methodology

Opinion mining consists in extracting relevant information from the subjectivity in texts and labelling a sentence with a given strength of opinion, according to Pang & Lee (2008). Even for a human, it is not easy to measure the overall sentiment of a sentence using a list of keywords as a reference. The goal of this part of the project is to create a model able to automatically detect and classify whether a tweet is positive or negative.

Several factors make this task difficult. Textual information involves complex language structures such as irony, sarcasm or the context of a sentence, which are not easy to detect.

These problems can be addressed with the right classification method. In this work, I consider classifiers often used in the opinion mining literature: the “valence lexicon dictionary” method and the following supervised machine learning algorithms: “Naive Bayes”, “Support Vector Machines” and “Regularized Logistic Regression”. Different classification methods are considered because, depending on the characteristics of the data, the language (especially in an opinion mining study) and the way each algorithm classifies the data, it is recommended to check which one performs best (Salzberg & Fayyad, 1997).

b. Valence Lexicon

This method consists in measuring the overall sentiment of a sentence from the values of its individual words, using a dictionary called a lexicon, which is a list of positive and negative words each labelled with a numeric value.

In this project, the dictionary used is the valence lexicon from the study by Stadthagen-Gonzalez, et al. (2017). It is composed of 14,031 Spanish words, classified by emotional valence.


All the words in the lexicon are stemmed and duplicates are removed in the same way as in the tweets dataset, so that the same language terms are compared.

Emotional valence means that each word in the dictionary has a numeric value given by the degree of its associated emotion. If the word is related to a negative emotion, it has a low valence value in the lexicon, and if it is related to a positive emotion, it has a high valence value.

This classifier assigns a value to each word of the tweet, using the lexicon as a reference. It does not include the connectors of the text, because they do not have any associated emotion. It then sums the valence values of the words within a tweet, producing a score that measures the overall sentiment, as illustrated in Figure 7.

Figure 7: Lexicon method classification examples. Source: prepared by the author.

This method has the problem of classifying a sentence such as “it is not bad” as being as negative as “it is bad”, because “bad” is a word related to a negative emotion. The context of the first sentence does not indicate anything negative, and yet both are classified with the same value. The same problem arises when a non-positive sentence is scored as positive, as described in Figure 7.
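The scoring rule and its blind spot for negation can be reproduced with a toy lexicon. The words and values below are hypothetical and are centred on zero, an assumption made here so that the sign of the sum can be read as the polarity; they are not taken from the lexicon used in the project.

```python
# Hypothetical valence values centred on zero (negative emotion < 0 < positive emotion).
TOY_LEXICON = {"bad": -2.0, "good": 2.0, "great": 3.0, "crisis": -3.0}


def lexicon_score(tokens, lexicon):
    """Sum the valence of every token found in the lexicon; other tokens add nothing."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


negated = lexicon_score("it is not bad".split(), TOY_LEXICON)
plain = lexicon_score("it is bad".split(), TOY_LEXICON)
print(negated, plain)
```

Both sentences receive the same score because “not” carries no valence of its own, so the negation never reaches the sum.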


c. Supervised Machine Learning

Machine learning is the discipline in which a computer learns without being explicitly programmed to do so. There are generally two categories of machine learning methods: supervised and unsupervised learning. In this project, I use supervised methods. Supervised machine learning consists in classifying unlabeled data with a model built from a sample of labeled data. In other words, the model is trained and validated on a dataset that has been classified previously and, once it has learnt from this data, it can classify new incoming data. In this project, this source of information is called the corpus linguistics and consists of a prelabeled dataset of tweets. The process of building the machine learning model and classifying the tweets is described in Figure 8:

Figure 8: Model creation process and classification. Source: prepared by the author.

The first step in this process is to obtain the corpus linguistics, in order to have a source of information to feed the machine learning models. In this project it is composed of tweets and is called the “TASS corpus linguistics”, courtesy of the workshop on semantic analysis at the SEPLN (2017).


It consists of 70,000 tweets, written in Spanish by 200 personalities from November 2011 to March 2012, as described in Villena-Román, et al. (2012). It has been used in other studies to test the accuracy of both methods, lexicon and machine learning (Anta, et al. 2013; Moreno, et al. 2013). It has four levels of polarity: positive, negative, neutral and none. Because of time and data processing limits, and to make training and classification faster, I only used the positive and negative tweets. The tweets are also stemmed and stop-words are removed in the same way as in the tweets dataset.

Once the cleaning task is done, the corpus linguistics has around 40,000 tweets. It is split randomly into two datasets, the train set (80%) and the test set (20%); these percentages were selected because they are the most commonly used. The train set is used to train the model and optimize its parameters. The test set is used to check the accuracy of the model and thus to compare the different classification methods. The corpus is split because training and testing the model on the entire dataset can lead to overfitting. In other words, the model can perform well on the labeled data but not as well on new unlabeled data. Once the corpus linguistics is cleaned and split, it is ready to be used.
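A random 80/20 split can be written with the standard library alone. The sketch below uses a fixed seed so the split is reproducible; the seed value and the toy corpus are assumptions made for the example.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and cut it into a train part and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy labeled corpus standing in for the cleaned tweets.
corpus = [("tweet %d" % i, "positive" if i % 2 else "negative") for i in range(10)]
train, test = train_test_split(corpus)
print(len(train), len(test))
```

Every labeled tweet ends up in exactly one of the two parts, so the accuracy measured on the test set comes from examples the model has not seen during training.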

The next step is to train the different algorithms selected in this project on the training set of the corpus linguistics and to optimize their parameters. The input of each algorithm is the set of features (tokens) of the tweets in the corpus, each associated with a given class, positive or negative. These features come from the process of tokenization, which consists in splitting the text into smaller parts called tokens, which can be words, keywords, phrases, symbols or other elements. Depending on the structure of the features (tokens) and the amount of training data, each algorithm behaves differently. Each algorithm and how it works are described below.
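As an illustration of tokenization, the sketch below turns a sentence into counts of single words and of two-word phrases. The choice of unigrams plus bigrams is an assumption made for the example; the text does not state which token types were used in the project.

```python
from collections import Counter


def ngrams(tokens, n):
    """Consecutive groups of n tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Count every n-gram of the text for n = 1 .. max_n."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


f = features("no es malo")
print(sorted(f))
```

Two-word tokens such as “es malo” keep part of the word order, which single-word tokens discard.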


i. Algorithms description

1. Naive Bayes

The Naive Bayes algorithm is commonly used in data science projects because of its simplicity and its strength as a predictive algorithm. It is based on Bayes' theorem, given by the following formula:

$$P(c \mid x) = \frac{P(c)\,P(x \mid c)}{P(x)}$$

Here “c” is the class and “x” are the attributes. The formula gives the probability of a class conditional on a given set of attributes. In this project, the classes are positive and negative, and the attributes are the text features. The parameter to optimize is the weight given to the features of each class.

The main problem of this method in text classification is that it assumes that the probability of subjectivity of each feature is independent of the others in the sentence. This is reflected in the accuracy of the classifier, as mentioned by Brownlee (2016).

The reason of the lack of accuracy in this algorithm in sentiment classification is because

the polarity of a sentence is not reflected by the measurement of the subjectivity of each

word independently. Otherwise, there is a deep relation between the positiveness and the

interrelation of its features. Basically, it has the same problem that the lexicon based

classification method.
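The independence assumption can be seen in a minimal sketch of a Naive Bayes text classifier. The function names, the toy tweets and the use of Laplace (add-one) smoothing are assumptions made for illustration, not the implementation used in this project.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns class counts, token counts per class, vocabulary."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict(tokens, model):
    """Return the class with the highest log P(c) + sum_i log P(x_i | c)."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for tok in tokens:
            # Each token contributes independently of its neighbours (the "naive" part).
            # Assumption: Laplace smoothing avoids zero probabilities for unseen tokens.
            score += math.log((token_counts[label][tok] + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus
model = train_naive_bayes([
    (["buen", "dia"], "positive"),
    (["gran", "trabajo"], "positive"),
    (["mal", "dia"], "negative"),
    (["crisis", "terrible"], "negative"),
])
```

Because every token is scored on its own, word order and negation play no role, which is the limitation discussed above.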

2. Support Vector Machines (SVM)

Support Vector Machines are widely used in sentiment classification. First of all, the
algorithm represents the features of the incoming data as points in a space, where each point
belongs to one of the classes, in this case positive or negative (Figure 9).

Then, the algorithm builds a frontier by finding the maximum-margin hyperplane that divides
the groups of each class; the points closest to this frontier are called support vectors. In
other words, it is the frontier (a hyperplane, or a line in two dimensions) which best
separates the two classes.

New data are classified by the algorithm with a non-probabilistic binary linear model. There
are two parameters to optimize: the cost parameter “C”, which controls the softness of the
margin, and the kernel parameter “𝛾”.

Figure 9: Support vector machines in binary classification. Source: prepared by the author.
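For reference, the roles of the two parameters can be read from the standard soft-margin formulation. This is a textbook statement and not a description of the exact variant used in this project; in particular, the radial basis kernel is an assumption suggested by the presence of the parameter 𝛾.

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0

K(x, x') = \phi(x)^{\top} \phi(x') = \exp\left( -\gamma \lVert x - x' \rVert^{2} \right)
```

Here $y_i \in \{-1, +1\}$ is the class of tweet $i$ and $\xi_i$ measures how far the point violates the margin. A larger C penalizes violations more heavily (a harder margin), while a larger 𝛾 makes the kernel more local.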

3. Regularized Logistic Regression

The Regularized Logistic Regression is a special case of the Generalized Linear Models.
It is a classification algorithm which consists in estimating discrete values from a group of
features by minimizing the variance, with the following equation:

$$\tilde{y}(x_1, \dots, x_p) = \Lambda(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)$$

The output of the function is an estimated probability “p” in the interval 0 ≤ p ≤ 1,
which is rounded to 0 or 1, whichever is closer. In this project, the features are the words
of the sentence (tokens), and the discrete value is positive if the output of the function is 1
or negative if it is 0. It is regularized because it finds the estimators by minimizing the
“loss-penalty” problem (Friedman, et al. 2010):

$$\text{variance} + \lambda \cdot \text{bias}$$

Where the bias corresponds to the estimation error of the model, and the variance of the
estimation is the error produced by fluctuations in the model. The two are related through the
complexity of the model, which is the number of features used to build it: as the complexity
increases, the bias decreases and the variance increases.

In other words, regularization means finding the optimal complexity of the model with
the minimum value of the parameter λ, which is the parameter to be optimized.
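The effect of λ can be illustrated with a minimal sketch. It assumes a ridge (L2) penalty fitted by plain gradient descent, which is a simplification: Friedman, et al. (2010) describe elastic-net penalties fitted by coordinate descent. The function names, the learning rate and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_l2(X, y, lam=0.01, lr=0.5, epochs=2000):
    """Gradient descent on mean log-loss + (lam / 2) * sum(beta_j ** 2).

    Assumption: the intercept beta0 is not penalized.
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(epochs):
        grad0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(beta0 + sum(b * v for b, v in zip(beta, xi))) - yi
            grad0 += err / n
            for j in range(p):
                grad[j] += err * xi[j] / n
        beta0 -= lr * grad0
        beta = [b - lr * (g + lam * b) for b, g in zip(beta, grad)]
    return beta0, beta


def predict_label(x, beta0, beta):
    """Round the estimated probability to the closest class (1 = positive, 0 = negative)."""
    prob = sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))
    return 1 if prob >= 0.5 else 0


# Hypothetical indicator features: [contains "bueno", contains "malo"]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
b0_weak, beta_weak = fit_logistic_l2(X, y, lam=0.01)
b0_strong, beta_strong = fit_logistic_l2(X, y, lam=1.0)
```

In this sketch a larger λ shrinks the coefficients towards zero, which is how the penalty limits the complexity of the fitted model.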

ii. Model validation

The technique used to select the optimal parameter and validate the model is called
“cross-validation”. There are different cross-validation techniques, but this project uses
the k-fold method because it is widely used, according to Friedman, et al. (2010: p. 8-18). It
is a useful technique to avoid the overfitting problems mentioned before. The validation
process is performed on the training dataset of the corpus linguistics.

Figure 10: 10-fold cross-validation procedure. Source: prepared by the author.

The technique consists in randomly splitting the dataset into “k” subsets, choosing one for
testing and the rest for training, and repeating the process “k” times, using different
parameters each time. The number of folds depends on the amount of data available for
training. Although there is no formal rule, 10-fold cross-validation is used in this project,
as recommended in the work of Kohavi (1995: p.75).

Specifically, in this case the “training” dataset of the corpus linguistics is randomly split
into 10 equal subsets; one of them is selected to test the selected algorithms, which are
trained on the rest of the data, and the process is repeated 10 times with different
parameters (Figure 10). Thus, each time a parameter is selected in the cross-validation
process, the algorithm is trained on the training data and tested on the respective test set,
giving a result.
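The procedure can be sketched as follows. The sketch assumes the common variant in which every candidate parameter is evaluated on all k folds and the scores are averaged; `evaluate` is a hypothetical callback standing in for "train on the training folds, score on the test fold".

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n, k, evaluate, params):
    """Return the mean score of each parameter over the k folds.

    evaluate(train_idx, test_idx, param) -> score is supplied by the caller.
    """
    folds = k_fold_indices(n, k)
    results = {}
    for param in params:
        scores = []
        for i, test_idx in enumerate(folds):
            # Every fold except the i-th one is used for training.
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            scores.append(evaluate(train_idx, test_idx, param))
        results[param] = sum(scores) / k
    return results
```

The parameter with the best mean score would then be kept, with the score being the area under the ROC curve in this project.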

All the results are stored and compared using the “area under the receiver operating
characteristic” criterion, also known as the area under the ROC curve, in order to select the
optimal parameter (Bradley 1997; Provost & Fawcett, 2001).

The ROC curve quantifies the ability of the classifier to discriminate between classes, in
this case positives and negatives. It is built by plotting the true positive rate against the
false positive rate of a given classification.

Figure 11: The ROC curve

Figure 12: Area under the ROC curve

Source: prepared by the author.

It is represented in Figure 11, where a curve closer to the point “A” means better accuracy
(all the tweets are correctly classified) and the straight line “D” represents random
classification. Accordingly, a high value of the area under the ROC curve (AUC) signals
better classification performance (Weiss & Provost, 2001: p. 11). If AUC = 1, all the tweets
are correctly classified. This area is represented in Figure 12.
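As an illustration, the AUC can be computed without drawing the curve, using its equivalent interpretation as the probability that a randomly chosen positive tweet receives a higher score than a randomly chosen negative one. The function name and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])    # 1.0
imperfect = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])  # 0.75
```

A classifier that assigns the same score to every tweet obtains 0.5 with this function, which corresponds to the random classification line.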

Finally, all the stored parameter values used in the cross-validation and the resulting AUC
of each classification are plotted. The next chart (Figure 13) shows the relation between the
AUC and the selected parameters of the regularized logistic regression, which is used as an
example because this algorithm has the best performance.

Each red point represents the AUC obtained with a given parameter value in the model, and
the vertical dotted lines represent the optimal range for the parameter selection. In this
case, the AUC is 0.927 and the respective parameter value, λ = 0.00229, is chosen as the
optimal parameter.

Figure 13: Relation of the AUC and the lambda parameters. Source: prepared by the author.

d. Accuracy comparison and model selection

In this section, the accuracy of the lexicon-based classifier and that of the optimized
machine learning models are compared. Specifically, the performance of the methods is
compared by classifying the test set of the corpus linguistics, which is already labeled, and
comparing the predicted values of each classification with the observed values.

The results of this comparison are presented in a table, following Pak & Paroubek (2010).
This table is called the confusion matrix and it summarizes the result of the classification.
It has two dimensions, with identical sets of classes on each dimension, where the observed
and the predicted values of each classification are presented.

For instance, the next table represents the confusion matrix, where the cells “True
positive” and “True negative” contain the correctly classified tweets, and the cells “False
positive” and “False negative” contain the incorrectly classified tweets:

Confusion Matrix                  Predicted
                        Positive          Negative
Observed   Positive     True positive     False negative
           Negative     False positive    True negative

Table 2: Confusion matrix example. Source: prepared by the author.

Using this information, some metrics help to choose the most accurate method. In this
study, I use the “accuracy” and the “error of prediction”, which are calculated with the
following formulas:

$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{Total observations}}$$

$$\text{Error} = 1 - \text{Accuracy}$$
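The two metrics can be computed directly from the observed and predicted labels, as in the following minimal sketch. The function names and the five example labels are hypothetical.

```python
def confusion_counts(observed, predicted, positive="positive"):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        if obs == positive and pred == positive:
            tp += 1
        elif obs == positive:
            fn += 1  # observed positive, predicted negative
        elif pred == positive:
            fp += 1  # observed negative, predicted positive
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of correctly classified observations."""
    return (tp + tn) / (tp + fp + fn + tn)


observed = ["positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "negative", "negative", "positive", "negative"]
counts = confusion_counts(observed, predicted)
acc = accuracy(*counts)
err = 1 - acc
```

With these labels, three of the five tweets are correctly classified, so the accuracy is 0.6 and the error is 0.4.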

The results of the testing process show that the best-performing algorithm is the logistic
regression, although the difference with the support vector machines algorithm is small.
The logistic regression has been chosen as the final classifier because, in addition, its
processing time is lower than that of the SVM. The next table shows the results of the testing
process:

            Bayes     Lexicon   SVM       LogReg
Accuracy    56.25%    77.15%    85.12%    85.83%
Error       43.75%    22.85%    14.88%    14.17%

Table 3: Accuracy and error comparison of all the classifiers. Source: prepared by the author.

e. Sentiment classification

Once the tweets dataset is properly cleaned and filtered and the classification model is
built and optimized, the sentiment classification can be performed. It basically consists in
classifying the tweets with the optimized model, which gives a positive or negative label to
each tweet. Once all the tweets are classified, the overall sentiment on a given day is
measured with the following criterion:

$$sentiment_t = \left( \frac{positive\ tweets_t}{total\ tweets_t} - \frac{negative\ tweets_t}{total\ tweets_t} \right) \times 100 + 100$$

This formula follows the same procedure that the CIS uses to build the consumer
confidence index, where each value of the consumer confidence is calculated as the
difference between the percentages of positive and negative answers in the monthly survey,
plus 100. These results are stored as a new dataset where each day has an associated
sentiment value.
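A minimal sketch of the daily calculation follows. It assumes that every tweet receives exactly one of the two labels, so that the total is the sum of positive and negative tweets; the function name and the counts are hypothetical.

```python
def daily_sentiment(positive, negative):
    """Difference between the percentages of positive and negative tweets, plus 100."""
    total = positive + negative  # assumption: the classifier is binary
    if total == 0:
        raise ValueError("no classified tweets for this day")
    return (positive - negative) / total * 100 + 100


balanced = daily_sentiment(50, 50)    # 100.0
optimistic = daily_sentiment(60, 40)  # about 120
```

Values above 100 indicate more positive than negative tweets, and the index ranges from 0 (all negative) to 200 (all positive).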

5. Building the Time Series

a. Calendar effect correction

At this point, the sentiment classification of the tweets dataset has been performed and
stored in a new dataset with a panel data structure, which contains the sentiment of days
21, 22 and 23 of each month, from January 2012 to December 2017. The selection of three
days can cause a calendar effect problem: when a day falls on a weekend or a bank holiday,
the sentiment may be different, in particular more positive. The problem can be corrected
with the following regression:

$$sentiment = \alpha_0 + \alpha_1\, dummy$$

Where “sentiment” represents the time series of the classified tweets, and “dummy” is a
binary variable that takes the value 1 when the day falls on a weekend, a bank holiday or the
Spanish “semana santa” (Holy Week), and 0 otherwise. Then, the residuals of the regression
are saved as the time series corrected for the “calendar effect”.
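Since the only regressor is a binary variable, the least squares estimates reduce to group means, which allows a minimal sketch without any regression library. The function name and the four example values are hypothetical.

```python
def calendar_residuals(sentiment, dummy):
    """OLS of sentiment on an intercept and a 0/1 dummy.

    With a single binary regressor, alpha0 is the mean of the ordinary days and
    alpha1 is the difference between the mean of the special days and alpha0.
    Returns (alpha0, alpha1, residuals).
    """
    ordinary = [s for s, d in zip(sentiment, dummy) if d == 0]
    special = [s for s, d in zip(sentiment, dummy) if d == 1]
    if not ordinary or not special:
        raise ValueError("both ordinary and special days are required")
    alpha0 = sum(ordinary) / len(ordinary)
    alpha1 = sum(special) / len(special) - alpha0
    residuals = [s - (alpha0 + alpha1 * d) for s, d in zip(sentiment, dummy)]
    return alpha0, alpha1, residuals


alpha0, alpha1, resid = calendar_residuals([100, 104, 110, 114], [0, 0, 1, 1])
# alpha0 = 102.0, alpha1 = 10.0, resid = [-2.0, 2.0, -2.0, 2.0]
```

Note that the residuals are centered on zero, so the corrected series keeps the day-to-day variation but not the original level.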

b. Aggregating the data

The previous dataset has a panel data structure with three sentiment-classified and
calendar-corrected days for each month, from 2012 to 2017. In order to obtain a monthly time
series comparable with the consumer confidence index, the observations of the three days
must be aggregated into a single observation.

The aggregated value is the weighted mean of the sentiment values of the three days,
according to the following equation:

$$s = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3}$$

Where “$s$” is the resulting aggregated value, “$w_p$” is the weight factor and “$x_p$” are
the sentiment values of each day. The weight factor is based on the total number of tweets,
because the usage of the platform is closely related to the discussion in which the sentiment
is reflected. The weights are the total number of tweets of each day, calculated and
normalized with the following formula:

$$w_p = \frac{n_p}{N}, \quad \text{where} \quad \sum_{p=1}^{3} w_p = 1$$

Where “$w_p$” is the weight factor, “$n_p$” is the total number of tweets on day “p” and “N”
is the total number of tweets in the three days considered. Once the data is aggregated, the
result is the monthly time series of the calculated sentiment of the tweets (see Figure 1 in
Appendix).
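The aggregation can be sketched in a few lines. The function name and the example values are hypothetical.

```python
def aggregate_month(sentiments, tweet_counts):
    """Weighted mean of the daily sentiment values, weighted by each day's share of tweets."""
    total = sum(tweet_counts)
    if total == 0:
        raise ValueError("no tweets in the selected days")
    weights = [n / total for n in tweet_counts]  # normalized, so they sum to 1
    return sum(w * x for w, x in zip(weights, sentiments)) / sum(weights)


monthly = aggregate_month([100, 110, 120], [100, 200, 100])
# weights are 0.25, 0.5 and 0.25, so the monthly value is 110
```

Because the weights already sum to one, the denominator of the weighted mean has no effect here; it is kept only to mirror the equation.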

c. Filtering the volatility

The resulting aggregated time series is highly volatile, as expected in Big Data projects
and mentioned before as one of the problems to be solved.

In their previous similar work for the Netherlands, Daas & Puts (2014) recommend
filtering methods to smooth the results, such as moving averages or the Kalman filter.
Moving averages have the disadvantage of using future observations to smooth each
observation. When the aim is to create an index that predicts another one, it is meaningless
to smooth the time series with future observations, since those are unknown. The Kalman
filter is clearly beyond the scope and level of this project.

There are other options to filter the time series, such as the simple exponential smoothing
filter or the Holt-Winters smoothing filter. Their advantage over the previously proposed
methods is that they only need past observations to smooth the time series, using the
following equation:

$$\hat{y}_t = \alpha\, y_t + (1 - \alpha)\, \hat{y}_{t-1}$$

Where “α” is the smoothing factor, which is used to give weight to the past observations.
This parameter is selected automatically by the program in which the filter is run, which can
cause overfitting problems: the model uses all the available data in the sample to choose the
parameter, but there is no guarantee that the selected parameter will perform as well out of
sample as in sample.
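The recursion of the simple exponential smoothing filter can be written as a short sketch. Initializing the filter with the first observation is an assumption made here for illustration; statistical packages may use a different starting value.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: y_hat[t] = alpha * y[t] + (1 - alpha) * y_hat[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: start from the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


smoothed = exponential_smoothing([10, 20, 30], 0.5)
# [10, 15.0, 22.5]
```

Only current and past observations enter each smoothed value, and a smaller α gives a smoother series.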

In order to correct this problem, the dataset split method can be used to validate the time
series: the series is split into two subsets, the training set and the test set. Because of
the time series structure, this split cannot be random; one period has to be used as training
data, where the parameter is selected, and the following period as the test set.

In this case, the training set corresponds to the period 2012-2015 and the test set to
2016-2017. The resulting filtered time series is shown in Figure 14.
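The chronological split and the parameter selection can be sketched as follows. The selection criterion (the sum of squared one-step-ahead errors on the training period) and the grid of candidate values are assumptions made for illustration; the text does not state which criterion the program uses.

```python
def chronological_split(series, n_train):
    """Split a time series in order: the first n_train points train, the rest test."""
    return series[:n_train], series[n_train:]


def one_step_sse(series, alpha):
    """Sum of squared errors when the smoothed level at t-1 is used to forecast y[t]."""
    level, sse = series[0], 0.0
    for y in series[1:]:
        sse += (y - level) ** 2
        level = alpha * y + (1 - alpha) * level
    return sse


def select_alpha(train, grid):
    """Pick the candidate alpha with the lowest one-step-ahead error on the training period."""
    return min(grid, key=lambda a: one_step_sse(train, a))


# Hypothetical monthly series: 72 months, of which the first 48 are used for training.
series = list(range(72))
train, test = chronological_split(series, 48)
best_alpha = select_alpha(train, [0.1, 0.5, 1.0])
out_of_sample_error = one_step_sse(test, best_alpha)
```

Comparing the error on the training period with `out_of_sample_error` gives an indication of whether the selected parameter generalizes to the later period.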

Figure 14: Filtered sentiment index. Source: prepared by the author

6. Results

a. Correlation Analysis

Once the sentiment time series is built, it is necessary to check whether it contains
information about the consumer confidence. There are different methods to measure the
relation between two variables, but in this case the tool used is Pearson's correlation
coefficient.

The correlation test between the sentiment time series and the consumer confidence index
gives a value of r = 0.81 (see Figure 2 in Appendix). This is evidence of high correlation
between both time series and suggests that the sentiment index is a good predictor.
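For reference, Pearson's coefficient between two series of equal length can be computed as in this minimal sketch. The function name and the example data are hypothetical and unrelated to the series of this project.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation: covariance divided by the product of the standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("two series of equal length (at least 2) are required")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r_up = pearson_r([1, 2, 3], [2, 4, 6])    # perfectly aligned series
r_down = pearson_r([1, 2, 3], [3, 2, 1])  # perfectly opposed series
```

The coefficient lies between -1 and 1, and it measures linear association only.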

Other analyses beyond the scope of this project (such as cointegration or structural
models) could be performed, as suggested by Daas et al. (2015). The following chart shows
both time series:

Figure 15: Filtered sentiment index. Source: prepared by the author

The behaviour of the sentiment index is similar to that of the consumer confidence index;
despite the limitations of the project, it apparently works reasonably well. However, there
are some obvious divergences: from August 2014 onwards, the sentiment index is more negative
than expected.

b. Structural break

The different behaviour may be explained by a structural break in the time series.
Several reasons could explain this change. First of all, the filtering parameter may not have
been chosen correctly, because the training set, which feeds the model with information and
determines the parameter selection, covers 2012 to 2015. This can produce a better fit in the
training sample than in the test sample (2016-2017). In order to check whether the split
method is the reason, another split into two subsets of the time series is performed, as
shown in the next figure:

Figure 16: Splitting scheme. Source: prepared by the author

The purpose of this procedure is to give the filtering model information from both periods,
before and after the structural break, which may solve the overfitting problems in both
periods of the estimated values of the filtered time series.

In order to check whether the problems have been solved, the correlation between the
filtered sentiment time series and the consumer confidence index is computed for each period.
The next table shows this comparison:

                 2012 - 2014    2015 - 2017
First method     r = 0.9        r = 0.2
Second method    r = 0.79       r = 0.38

Table 4: Correlation comparison. Source: prepared by the author

Where the first and second methods are the two different splitting procedures followed.
These results suggest that there is evidence of overfitting with the first method, which is
apparently solved with the second. Even so, the correlation in the second period is still
lower than expected. The following charts represent the filtered time series with the
optimized parameter for each period:

Figure 17: Time series charts comparison by periods. Source: prepared by the author

Since the time series still behaves differently, other reasons that could explain this
result need to be discussed. Even though the number of Twitter accounts has increased, the
number of active users of the platform in Spain declined over the period 2014-2017 (Table 5).

                2014           2015           2016           2017
Active users    1.4 million    1.5 million    1.4 million    1.08 million
Growth          -              100,000        -100,000       -320,000
Rate (%)        -              7.14           -6.67          -22.86

Table 5: Evolution of the number of active Twitter users in Spain. TSMF, social media report (2017).

Keeping this in mind, and according to the findings of Barberá & Rivero (2012: p. 725),
the followers of the leaders of political parties are more active than the rest. If the
activity on the platform has decreased and the most active users are radical followers of the
political parties, this may affect the sensitivity of the aggregated messages.

Furthermore, the trends in the number of followers of the political leaders changed within
the selected period because of the creation in 2014 of the political party that is currently
the “most followed” on the platform (Blademedia 2018). This may mean that most of the
messages about political and economic topics are strongly polarized, which could have changed
the trend.

In line with these arguments, and considering which messages were selected for the
sentiment classification, the economic filter could have caused this, because it selected all
the messages with economic (and also political) content.

7. Conclusions

The main purpose of this project was to check whether the social media platform Twitter
provides useful information to predict the consumer confidence. The results show that this
information can be found in social media messages. Despite the limitations of this project,
the sentiment index performs quite well as a predictor of the consumer confidence, with a
high correlation (r = 0.81).

The model selection, the different filtering techniques applied to the tweets dataset and
the volatility filters are the main aspects that require further work in order to properly
extract valuable information from social media. Furthermore, it has been interesting to
study the possible reasons for the different behaviour of the sentiment index with respect to
the consumer confidence.

It is possible that, due to the loss of active users and the usage of the platform, the
extraction of sentiment with the aim of predicting economic indicators should be studied at
greater length, and the filtering techniques carefully selected.

Finally, for future analysis, it might be interesting to exploit the potential of these
findings. Some applications could be to build weekly, daily or even streaming indexes of
consumer confidence, to create a regional index, or even to combine both approaches.

These indexes might be useful for marketing researchers in order to predict the current
behaviour of consumers; for economic policy researchers, to have faster information with
which to develop new economic models; and for investors or manufacturing companies, to know
the willingness of demand in order to make their decisions.

8. Appendix

Figure 1 - Sentiment time series without filtering

Figure 2 - Correlation matrix

9. Acknowledgments

I am extremely grateful to my girlfriend and future wife Estefanía, my parents Manolo
and Maite, and my colleague and friend Marcos, who have supported me at every moment
during this project.

A special mention goes to my tutor Pedro Albarrán Pérez and the Faculty of Economics and
Business Science of the University of Alicante, for their guidance and for providing me with
the necessary resources to complete this project.

Finally, thanks to Manuel Garrido Peña, Alfonsa Denia Cuesta and Yoan Gutiérrez
Vázquez, for their cooperation and kindness.

10. References

blademedia.co (2018). Twitteros más populares en España. [online] twitter-espana,

available at: http://twitter-espana.com/

Brownlee, J. (2016). Naive Bayes for Machine Learning . [online] Machine Learning

Mastery, available at: https://machinelearningmastery.com

Centro de investigaciones Sociológicas (2018). La construcción del indicador de confianza

del consumidor (ICC) . [online] CIS, available at: http://www.cis.es

Cisco white paper (February, 2017) Visual Networking Index: Global Mobile Data Traffic

Forecast Update, 2016–2021 [online] Cisco, available at https://www.cisco.com

McCandless, M., Sanford, M. and Firat, A (2013). cldr: Language Identifier based on CLD

library . R package version 1.1.0.

Daas, P. and Puts, M. (2014). “Social Media Sentiment and Consumer Confidence”,

Statistics Paper Series (5). [online] European Central Bank, available at:

https://www.ecb.europa.eu

Daas, P. and van der Loo, M. (2013). Computational Statistics & Data Analysis. [online]
Unescap, available at: http://www.unescap.org

Daas, P., Puts, M., Buelens, B. and Hurk, P. (2015). “Big Data as a Source for Official

Statistics”. Journal of Official Statistics , 31(2). pp.249-262

Elogia digital marketing (2017). Estudio Anual Redes Sociales 2017. [online] Elogia,

available at: https://iabspain.es

http://twitter-espana.com/

https://machinelearningmastery.com/naive-bayes-for-machine-learning/

http://www.cis.es/cis/export/sites/default/-Archivos/NotasdeInvestigacion/NI006_ICC_Informe.pdf

https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html

https://www.ecb.europa.eu/pub/pdf/scpsps/ecbsp5.en.pdf

http://www.unescap.org/sites/default/files/1-Big%20Data%20%28and%20official%20statistics%29-Netherlands.pdf

https://iabspain.es/wp-content/uploads/iab_estudioredessociales_2017_vreducida.pdf


Fernández Anta, A., Morere, P., Núñez Chiroque, L. and Santos, A. (2013). “Sentiment

Analysis and Topic Detection of Spanish Tweets: A Comparative Study of NLP

Techniques”, Procesamiento del Lenguaje Natural , Revista nº 50 marzo de 2013, pp 45-52.

[online] RUA, available at: https://rua.ua.es

Friedman, J., Hastie, T. and Tibshirani, R. (2010). “Regularization Paths for Generalized

Linear Models via Coordinate Descent”. Journal of Statistical Software , 33(1).

García Corbí, M. (2018). Spanish Social Media Index. [online] GitHub, Available at:

https://github.com

Hassani, H., Saporta, G. and Silva, E. (2014). “Data Mining and Official Statistics: The

Past, the Present and the Future”. Big Data , 2(1), pp.34-43.

Hernández de Cos, P. (2018). La recuperación de la economía española, Evolución

reciente y perspectivas del mercado inmobiliario [online] Banco de España, available at:

https://www.bde.es

Hornik K, Mair P, Rauch J, Geiger W, Buchta C and Feinerer I (2013). “The textcat

Package for n-Gram Based Text Categorization in R.” Journal of Statistical Software ,

52(6), pp. 1-17.

Instituto Nacional de Estadística (2018). Cifras de población [online] INE, available at:

http://www.ine.es/

Instituto Nacional de Estadística (2018). Relación de municipios y códigos por provincias

[online] INE, available at: http://www.ine.es

Kawa, N. (2016). Text Classification. [online] Berkeley University, available at:

https://www.stat.berkeley.edu

Kohavi, R. (1995). Wrappers for performance enhancement and oblivious decision graphs .

[online] Stanford, available at: http://robotics.stanford.edu

Labrinidis, A. and Jagadish, H. (2012). “Challenges and opportunities with big data”.

Proceedings of the VLDB Endowment , 5(12), pp.2032-2033

Laney, D. (2001) Application Delivery Strategies [online] Gartner, available at

https://blogs.gartner.com

Llorente, A., Garcia-Herranz, M., Cebrian, M. and Moro, E. (2015). “Social Media

Fingerprints of Unemployment”. PLOS ONE , 10 (5).

Marr, B. (2013), The Awesome Ways Big Data is used Today to Change Our World.

[online] LinkedIn, available at: https://www.linkedin.com

https://rua.ua.es/dspace/bitstream/10045/27863/1/PLN_50_05.pdf

https://github.com/manugaco/SpanishSocialMediaIndex

https://www.bde.es/f/webbde/GAP/Secciones/SalaPrensa/IntervencionesPublicas/DirectoresGenerales/economia/Arc/Fic/eco160218.pdf

http://www.ine.es/dyngs/INEbase/es/operacion.htm?c=Estadistica_C&cid=1254736176951&menu=ultiDatos&idp=1254735572981

http://www.ine.es/daco/daco42/codmun/codmunmapa.htm

https://www.stat.berkeley.edu/~aldous/Research/Ugrad/Nura_Kawa_report.pdf

http://robotics.stanford.edu/~ronnyk/teza.pdf

https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

https://www.linkedin.com/today/


Morales, A., Borondo, J., Losada, J. and Benito, R. (2014). “Efficiency of human activity

on information spreading on Twitter”. Social Networks , 39, pp.1-11.

Moreno-Ortiz, A. and Pérez Hernández, C. (2018). “Lexicon-Based Sentiment Analysis of

Twitter Messages in Spanish”, Procesamiento del Lenguaje Natural , núm. 50, marzo, 2013,

pp. 93-100 [online] Redalyc, available at: http://www.redalyc.org

Pak, A. and Paroubek, P. (2010). “Twitter as a corpus for sentiment analysis and opinion

mining”. In proceedings of the seventh conference on international language resources and

Evaluation : pp. 1320-1326

Pang, B. and Lee, L. (2008). “Opinion Mining and Sentiment Analysis”. Foundations and

Trends in Information Retrieval , 2(1–2), pp.1-135.

Provost, F. and Fawcett, T. (2001). “Robust Classification for Imprecise Environments”.

Machine Learning , 42, 203–231 [online] Springer, available at: https://link.springer.com

Ramasundaram, S., and Victor S.P. (2013) “Algorithms for Text Categorization : A

Comparative Study”, World Applied Sciences , Journal 22 (9): pp. 1232-1240.

Reinsel, D, John Gantz and John Rydning white paper (March 2017) Total WW Data to

Reach 163ZB by 2025 [online] Storagenewsletter, available at:

https://www.storagenewsletter.com

Rivero, F. (2016). Informe mobile en España y en el Mundo 2016. [online] Amic, available

at: http://www.amic.media

Salzberg, S. and Fayyad, U. (1997), “On comparing classifiers: Pitfalls to avoid and a

recommended approach”. Data Mining and Knowledge Discovery , vol 1, no. 3, pp. 317-328

Stadthagen-Gonzalez, H., Imbault, C., Pérez Sánchez, M. and Brysbaert, M. (2016). Norms

of valence and arousal for 14,031 Spanish words. [online] Springer, available at:

https://link.springer.com

The social media family (2018). IV Estudio sobre los usuarios de Facebook, Twitter e

Instagram en España . [online] thesocialmediafamily.com, available at: http://www.abc.es.

Twitter Developers (2018). Developer Agreement and Policy . [online] Twitter, available at:

https://developer.twitter.com

Van der Brakel, J., Söhler, E., Daas, P. and Buelens, B. (2016). “Social media as a data

source for official statistics; the Dutch Consumer Confidence Index”. Statistics

Netherlands , Discussion Paper , 2016 (01).

http://www.redalyc.org/pdf/5157/515751576011.pdf

https://link.springer.com/article/10.1023/A:1007601015854

https://www.storagenewsletter.com/2017/04/05/total-ww-data-to-reach-163-zettabytes-by-2025-idc/

http://www.amic.media/media/files/file_352_1050.pdf

https://link.springer.com/content/pdf/10.3758%2Fs13428-015-0700-2.pdf

http://www.abc.es/gestordocumental/uploads/internacional/Informe_RRSS_2018_The_Social_Media_Family.pdf

https://developer.twitter.com/en/developer-terms/agreement-and-policy


Villena Román, J., García Morera, J., Moreno García, C., Ferrer Ureña, L., Lana Serrano,

S., González Cristóbal, J., Westerski, A., Martínez Cámara, E., Martínez Cumbreras, M.,

Martín Valdivia, M. and Ureña López, L. (2012). Workshop on Sentiment Analysis at

SEPLN. [online] Researchgate, available at: https://www.researchgate.net

Weiss, G. and Provost, F. (2001). The Effect of Class Distribution on Classifier Learning:

An Empirical Study . [online] Researchgate, available at: https://www.researchgate.net

https://www.researchgate.net/publication/230771215_TASS-Workshop_on_Sentiment_Analysis_at_SEPLN

https://www.researchgate.net/publication/2364670_The_Effect_of_Class_Distribution_on_Classifier_Learning_An_Empirical_Study
