Forecasting with Twitter data Presented by : Thusitha Chandrapala 20064923 MARTA ARIAS, ARGIMIRO...

Post on 23-Dec-2015

Forecasting with Twitter data

Presented by: Thusitha Chandrapala 20064923

MARTA ARIAS, ARGIMIRO ARRATIA, and RAMON XURIGUERA

What information do Twitter messages carry?

•Twitter information
▫Sentiment analysis: Are people happy or unhappy about a certain topic?
▫Volume: Number of tweets about a given topic

•Does Twitter really help in predicting time series data?
▫Moving stream of info.

The motivation of the paper

•Use three different forecasting model families, vary parameters systematically, and analyze under which conditions Twitter information is actually useful

•Test for non-linearity and causality between the Twitter data and the target

•Introduce the summary tree

Related work

• Stock market prediction
▫Bollen et al.: Twitter -> sentiment -> predict Dow Jones Industrial Average
▫Wolfram et al.: Twitter as an additional source of features, no sentiment analysis

• Movie box office income
▫Mishne et al.: correlation, blog posts
▫Asur et al.: predict sales

Workflow

1) Collecting data

2) Cleaning and preprocessing

3) Sentiment analysis

4) Prediction model

Preprocessing:

•Language detection

•Negation handling: “I like this…” and “I don’t like this…” are treated as two distinct features

•Relevance filtering and topic classification using LDA (Latent Dirichlet Allocation)
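One simple way to make negated and non-negated phrases produce different features is to prefix every token after a negation cue with a marker until the next punctuation mark. The sketch below assumes that rule; the function name `mark_negation`, the `NOT_` prefix, and the list of negation cues are illustrative choices, not details taken from the paper.

```python
import re

# Hypothetical list of negation cues; the paper's actual list is not given here.
NEGATIONS = {"not", "no", "never", "don't", "dont", "can't", "cannot", "isn't"}
PUNCTUATION = set(".,!?;")

def mark_negation(text):
    """Prefix tokens that follow a negation cue with NOT_ until punctuation."""
    # Normalise curly apostrophes so "don’t" and "don't" are the same token.
    text = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+|[.,!?;]", text)
    out, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # punctuation ends the negation scope
            continue
        if tok in NEGATIONS:
            negated = True           # negate the tokens that follow
            out.append(tok)
            continue
        out.append("NOT_" + tok if negated else tok)
    return out

print(mark_negation("I like this"))
print(mark_negation("I don't like this"))
```

With this scheme "like" and "NOT_like" are separate vocabulary entries, so a bag-of-words classifier can assign them opposite polarities.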

Sentiment classification

•Whether the text contains negative or positive impressions on a given subject

•Approach 1:
▫Automatic tagging to extract training instances: :) and :D mark happy sentiment, :( marks unhappy sentiment
▫Binary classification problem: use naïve Bayes to train the classifier
▫Use different dictionaries as features
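The emoticon-based tagging and naïve Bayes training can be sketched roughly as follows. This is a minimal pure-Python illustration, assuming plain bag-of-words features as a stand-in for the dictionaries mentioned on the slide; the function names (`auto_label`, `train_nb`, `classify`) and the toy tweets are invented for the example.

```python
import math
from collections import Counter

HAPPY = (":)", ":D")      # emoticons treated as positive labels
UNHAPPY = (":(",)         # emoticons treated as negative labels

def auto_label(tweet):
    """Return (tokens, label) if the tweet carries exactly one polarity, else None."""
    pos = any(e in tweet for e in HAPPY)
    neg = any(e in tweet for e in UNHAPPY)
    if pos == neg:                      # no emoticon, or mixed signals
        return None
    text = tweet
    for e in HAPPY + UNHAPPY:
        text = text.replace(e, " ")     # strip the emoticon so it is not a feature
    return text.lower().split(), ("pos" if pos else "neg")

def train_nb(examples):
    """Collect the counts needed for multinomial naive Bayes."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    class_counts = Counter()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, class_counts, vocab

def classify(model, tokens):
    """Pick the class with the highest log posterior (add-one smoothing).

    Assumes both classes appeared at least once in the training data.
    """
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in ("pos", "neg"):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:            # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

tweets = [
    "love the new phone :)",
    "great movie :D",
    "battery died again :(",
    "terrible service :(",
    "just landed",               # no emoticon: not used for training
]
examples = [ex for ex in map(auto_label, tweets) if ex is not None]
model = train_nb(examples)
print(classify(model, "great phone".split()))
print(classify(model, "terrible battery".split()))
```

The appeal of emoticon tagging is that it yields labelled training data without manual annotation; the trade-off is that only tweets containing an emoticon contribute to training.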

Sentiment index

•A time series of sentiment values
▫The daily value is calculated from the daily percentage of positive/negative tweets over the total number of messages on a specific topic
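A daily sentiment index of this kind could be computed along the following lines. The exact formula is an assumption here: the sketch uses the percentage of positive tweets among all tweets on the topic for each day, and the function name and toy data are made up for illustration.

```python
from collections import defaultdict

def daily_sentiment_index(labelled_tweets):
    """Map each day to the percentage of positive tweets among all tweets that day."""
    pos = defaultdict(int)
    total = defaultdict(int)
    for day, label in labelled_tweets:
        total[day] += 1
        if label == "pos":
            pos[day] += 1
    return {day: 100.0 * pos[day] / total[day] for day in sorted(total)}

# Toy data: (day, classifier label) pairs, invented for illustration only.
labelled = [
    ("2012-03-01", "pos"), ("2012-03-01", "neg"),
    ("2012-03-01", "pos"), ("2012-03-01", "pos"),
    ("2012-03-02", "neg"), ("2012-03-02", "pos"),
]
index = daily_sentiment_index(labelled)
print(index)  # {'2012-03-01': 75.0, '2012-03-02': 50.0}
```

Counting the entries of `total` alone gives the volume series, so both Twitter series can come from the same pass over the data.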

Training the model

•ARMA: Auto Regressive Moving Average (y: target series, x: Twitter series)
▫y[t] = a·x[t] + b·x[t-1] + … + m·y[t-1] + n·y[t-2] + …

•Simplified prediction:
▫A binary prediction: is y[t] > y[t-1]?
▫Uses past values of the target series itself and of the Twitter time series
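To make the binary setup concrete, the sketch below builds training rows from a target series and a Twitter series: each row holds the previous `lags` values of both series, and the label says whether the target rose. The helper name `make_dataset`, the choice of two lags, and the toy numbers are assumptions for illustration; any of the three model families could then be fitted to these rows.

```python
def make_dataset(y, x, lags=2):
    """Build (features, label) pairs for the binary task 'is y[t] > y[t-1]?'.

    Features at time t are the previous `lags` values of the target y and of
    the Twitter series x; the label is 1 if y rose from t-1 to t, else 0.
    Both series are expected to be plain lists of equal length.
    """
    if len(y) != len(x):
        raise ValueError("y and x must have the same length")
    rows = []
    for t in range(lags, len(y)):
        features = y[t - lags:t] + x[t - lags:t]   # past values only, no look-ahead
        label = 1 if y[t] > y[t - 1] else 0
        rows.append((features, label))
    return rows

y = [1.0, 1.2, 1.1, 1.3, 1.3]   # toy target series
x = [10, 12, 9, 15, 11]         # toy Twitter series (e.g. daily volume)
for features, label in make_dataset(y, x):
    print(features, label)
```

Dropping the `x` part of the feature vector gives the baseline without Twitter data, which is what the comparison of classification accuracy needs.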

Model parameters

•Target time series
▫Stock market: returns
▫Movie box office: revenue

•Twitter series
▫Volume
▫Sentiment index

•Forecasting model family
▫Linear models
▫Support vector machines
▫Neural networks

•Result: Does including Twitter data increase classification accuracy by 5%?

Study details

•Stock market prediction targets
▫Companies: Apple, Google, …
▫General market indices: S&P100, S&P500

•Box office data
▫Daily sales revenue series

Summary Tree

•Helps to identify model parameters that lead to consistently positive or negative results

•Decision tree structure
▫Nodes are the different parameters
▫Leaves: result

Results: Stock market data

•Summary of prediction results:
▫Generally, linear models do not provide a significant performance improvement with either Twitter volume or sentiment-based information
▫Non-linear models can give an improvement!
▫Neural network based models gave the best performance

Results: Movie box office

•Summary:
▫Sentiment analysis did not have a positive impact
▫Volume information had a positive impact with linear regression and SVM

Conclusion

•In general, Twitter information used with non-linear models increases the prediction accuracy for long-term stock market predictions

•Twitter volume had a linear relationship with movie sales, but sentiment had none

Appendix

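•Logarithmic returns of the series (the simple return below is close to the logarithmic return ln(P_t / P_{t-1}) when price changes are small)

$$R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$$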
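The return series R_t = (P_t - P_{t-1}) / P_{t-1} can be computed from a price series in a few lines. The sketch below also computes the logarithmic return ln(P_t / P_{t-1}) for comparison; the function names and the toy prices are illustrative assumptions.

```python
import math

def simple_returns(prices):
    """R_t = (P_t - P_{t-1}) / P_{t-1} for consecutive prices."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

def log_returns(prices):
    """r_t = ln(P_t / P_{t-1}) for consecutive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

prices = [100.0, 101.0, 99.0]   # toy closing prices
for s, l in zip(simple_returns(prices), log_returns(prices)):
    print(f"simple={s:.6f} log={l:.6f}")
```

For daily price moves of a percent or two the two definitions differ only in the fourth decimal place, so either gives nearly the same up/down labels for the binary prediction task.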

Testing model adequacy

•Testing the relationship between the Twitter time series and the time series to be forecast

•Neglected nonlinearity
▫Are the two time series non-linearly related?

•Granger causality
▫Does X -> Y, or Y -> X?