Data Mining Methods in Twitter

Data Mining Methods in Trading Strategies

Wendi Zhuwendizhu1991gmailcom

An analysis based on news sentiment

The Age of Big Data

8 Terabytes

Twitter 8000000000000 Bytes

Take Twitter SPY in 2010 as a simple example

Question Mining news data from Social Media to enhance trading

Yes

1 A Wall Street news analytics company Sentiment

data is a determinant of market moves after Federal Open Market Committee (FOMC) rate announcements

with a 75 accuracy rate in 2014

2 A Hedge Fund report We capture a burst of

negative sentiment of ResMed at 1114AM October 9 2014 Despite the serious allegations and the seeming

validity of the report it took the market over 60minutes to react

3 An Institutional Investor News sentiment Open-

to-Close (OTC) strategy on SPY returned 2976(before cost) over 2014 with a Sharpe Ratio of 31

Claims

1

2

3

4

PreviewFirst look at social media data

ImplemetationParsing twitter news sentiment

ImprovementA brief summary of Advanced methods

Trade the newsTentative trading practices

News Mining Step 1

Whatis a typical Social Media news like

A typical twitter user interface


bull 2010-01-19T151452Z $SPY looks strong riding 5EMA - large gap from SMA50 - concern about 113 level gone for now

bull 2010-12-09T132849Z $SPY managed to reclaim the 1227 support level which should bode well for further price appreciation

bull 2010-12-10T155950Z $SPY long

bull 2010-01-21T205721Z $SPY closing the lows

bull 2010-09-07T001055Z Last Sunday strength in patterns showing a bearish market move

bull 2010-12-08T162517Z $SPY has now failed a breakout Could recover but for now this is a perfect picture of a failed breakout

bull 2010-12-16T152007Z this weeks patterns $SPY see here

Positive

Positive

Positive

Negative

Negative

Negative

What does a financial twitter news look like

Neutral

News Mining Step 2

How can we interpret the news sentiment by machine

Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification

An introduction to NLTKNLTK is a platform for building Python programs to work with human

language It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries for classification tokenization stemming tagging parsing and semantic reasoning

Twitter text Database descriptionSource httpstocktwitscomFormat JSONSize over 15 millionData entries Id body create at user name followers following hellipid918510bodyOptions Trade in Nordstrom Today $JWN - httpwwwcnbccomid34644850site14081545forcnbccreated_at2010-01-01T000902Z userid6328usernameOptionsHawknameJoe Kunkleavatar_urlhttpavatarsstocktwitsnetproduction6328thumb-1290207489png avatar_url_sslhttpss3amazonawscomst-avatarsproduction6328thumb-1290207489png officialfalse identityUserclassification[] ldquojoin_date2009-11- 01followers7072 following31ideas18866 following_stocks0locationBoston bioActive Options Trader - OptionsHawkcom Founder website_url httpwwwOptionsHawkcom trading_strategy assets_frequently_tradedldquo [Equities OptionsForexFutures]approachTechnicalholding_periodSwing TraderexperienceProfessionalsourceid1titleStockTwitsurlhttpstocktwitscomsymbols[id6039symbolJWNtitleNordstrom IncexchangeNYSEsectorServicesindustryApparel Storestrendingfalse]entitiessentimentnull

bull Starting Set10000 manually labeled twitter news items

bull Distribution of sentiment

Initial training set sample test

SENTIMENT TRAININGSET TESTING TOTAL

POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000

1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import

bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip

bull neg_tweets = $SPY closing the lows negativehellip

2) Split the text sentence into word features

bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip

3) Build a dictionary

A collection of all the recognized word features in the training set

4) Map text onto word features

bull contains(spy) True

bull contains(support) False

bull contains(strong) True helliphellip

Parsing News Sentiment Dictionary Mapping

5) Apply this mapping into all the news texts and get the following form

Parsing News Sentiment

lsquospyrsquo supportrsquo lsquogonersquo hellip

Twitter_1 True False True hellip

Twitter_2 True False False hellip

Twitter_3 hellip hellip hellip hellip

SentimentWordFeature

_1WordFeature

_2WordFeature

_3hellip

Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip

Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip

Twitter_3 0 (Neu) hellip hellip hellip hellip

A typical classification problem

6) Classification

Naive Bayes classifier

6) A simple description of Naive Bayes

Bayes Formula

The naive assumptions come into play assume that each feature is conditionally independent

of every other feature

In this twitter example it means the word features independently affect the sentiment of the text


k

ki

C class with items news of

C class and x feature word with items news of )C|x(p ki

7) Model trained We got 14356 word features Most Informative Features include

8) In sample test and out-of-sample test

bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout

bull Negative Prob(lsquonegative)= 085525

bull Tweet= lsquoSPY UP I like that

bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936

bull TOTAL in-sample accuracy 792

bull TOTAL out-of-sample accuracy 363

bull With a large enough training set the accuracy rate would get very high

Result

NEWS ITEM CONTAINS RATIO

lsquowidelyrsquo positi negati = 2198 10

lsquoheldrsquo positi negati = 1724 10

lsquomostrsquo positi negati = 457 10

lsquofallrsquo negati positi = 454 10

lsquomightrsquo negati neutra = 306 10

Pros 1 A basic approach in sentiment analysis Easy to use 2 Effective if the training set is large enough3 Ability to learn as the training set gets larger the results get

more and more accurate(intelligence)Cons 1 Failure in grasping the connection between words 2 Doesnrsquot consider the sequence of words3 Non-relevant word features

A simple summary Nltk and Naive Bayes method

Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words

Currently working on themhellip

News Mining Step 3

Other Advanced Methods in measuring news sentiment

Other advanced approaches

Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other

Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment

More advanced approaches

bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014

bullIndico-httpsindicoio

-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011

indico vader spy_cum_return

Daily averaged news sentiment

A plot of sentiment engines based on 2010 SPY 580000 twitter news


bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde

bullResearch still on goinghellipbullVisit my personal sitehttps publictableausoftwarecom viewsSPYVadernewssentiment2010SPYshowVizHome=no1

Thank you

The Age of Big Data

8 Terabytes




Yes









Claims

1

2

3

4





News Mining Step 1






bull 2010-12-10T155950Z $SPY long





Positive

Positive

Positive

Negative

Negative

Negative


Neutral

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you

8 Terabytes




Yes









Claims

1

2

3

4





News Mining Step 1






bull 2010-12-10T155950Z $SPY long





Positive

Positive

Positive

Negative

Negative

Negative


Neutral

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you



Yes









Claims

1

2

3

4





News Mining Step 1






bull 2010-12-10T155950Z $SPY long





Positive

Positive

Positive

Negative

Negative

Negative


Neutral

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you









Claims

1

2

3

4





News Mining Step 1






bull 2010-12-10T155950Z $SPY long





Positive

Positive

Positive

Negative

Negative

Negative


Neutral

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you

1

2

3

4





News Mining Step 1






bull 2010-12-10T155950Z $SPY long





Positive

Positive

Positive

Negative

Negative

Negative


Neutral

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you

News Mining Step 1






bull 2010-12-10T155950Z $SPY long





Positive

Positive

Positive

Negative

Negative

Negative


Neutral

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you





bull 2010-12-10T155950Z $SPY long





Positive

Positive

Positive

Negative

Negative

Negative


Neutral

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you

News Mining Step 2










POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you









POSITIVE 2379 807 3186

NEUTRAL 3849 1248 5097

NEGATIVE 1214 428 1642

SUBTOTAL 7442 2483 9925

NULL 58 17 75

TOTAL 7500 2500 10000




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you




















_1WordFeature

_2WordFeature

_3hellip





6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you

6) Classification



Bayes Formula





k

ki












Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you










Result












News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you






News Mining Step 3








-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you







-01

-005

0

005

01

015

02

-04

-02

0

02

04

06

08

1

1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011







Thank you




Thank you

Data Mining Methods in Twitter

Data & Analytics

Transcript of Data Mining Methods in Twitter