Using MATLAB for Sentiment Analysis and Text Analytics2 Outline Sentiment Analysis Strings in MATLAB...

29
1 © 2018 The MathWorks, Inc. Using MATLAB for Sentiment Analysis and Text Analytics By Liliana Medina MathWorks UK Software Engineer MATLAB Text Analytics Toolbox

Transcript of Using MATLAB for Sentiment Analysis and Text Analytics2 Outline Sentiment Analysis Strings in MATLAB...

1© 2018 The MathWorks, Inc.

Using MATLAB for Sentiment Analysis and Text Analytics

By Liliana Medina

MathWorks UK – Software Engineer

MATLAB Text Analytics Toolbox

2

Outline

▪ Sentiment Analysis

▪ Strings in MATLAB

▪ Introduction to Text Analytics Toolbox

▪ Using the toolbox for a sentiment analysis task

– Overview

– Step-by-step

– Insights

▪ Additional capabilities and Resources

3

Sentiment Analysis

• Identify sentiment expressed in documents and social media• Microblogging platforms, news articles, company reports, e-mails

Applications

• Volatility and Risk analytics

• Inform trading strategies

• Correlate with stock movements

• Economics research

• Litigation

4

StringsThe better way to work with text

▪ Manipulate, compare, and store text data efficiently

>> "image" + (1:3) + ".png"

1×3 string array

"image1.png" "image2.png" "image3.png"

▪ Simplified text manipulation functions

– Example: Check if a string is contained within another string

▪ Previously: if ~isempty(strfind(textdata,"Dog"))

▪ Now: if contains(textdata,"Dog")

▪ Performance improvement

– Up to 50x faster using contains with string than strfind with cellstr

– Up to 2x memory savings using string over cellstr

5

Text Analytics ToolboxExtract Value from Text Data

Model and Derive InsightsAccess and Explore Data Preprocess Data

Cleanup Text Convert to Numeric

cat dog run two

doc1 1 0 1 0

doc2 1 1 0 1

• Word Docs

• PDF’s

• Text Files

• HTML

• Tables/Spreadsheets

• Stop Words

• Stemming

• Tokenization

• Find special tokens

• Bag of Words

• Bag of N-grams

• TF-IDF

• Word Embeddings

• Latent Dirichlet Allocation (LDA)

• Latent Semantic Analysis (LSA)

• Word clouds

• Text scatter plots

RT @wsv Dry, warm

and sunny for most

today.#weatherRepor

t http://bit.ly/2vcZa8w

Dry warm sunny today

Neural Network

Toolbox

Statistics & Machine

Learning Toolbox

6

Text Analytics Toolbox A Quick Example

• Convert text into input

for numerical analysis

• Visualize text

▪ Applications• Sentiment Analysis: discover sentiment in news, reports, e-mails

• Maintenance: identify hidden groups of issues in maintenance logs

• Document Classification: automatically tag unread documents (e.g. for triaging or routing)

• Explore, manipulate,

and transform text data

7

Compare Sentiment of Tweets with Stock Prices

Goal

▪ Analyze the sentiment of tweets for several

tech companies and compare the sentiment

scores to stock prices

Approach

▪ Access data from Twitter

▪ Preprocess to clean-up text

▪ Identify domain-specific sentiment terms

▪ Map text to numbers with a word embedding

▪ Build a sentiment classification model

▪ Compare sentiment scores to prices

8

Workflow Overview

Tweets Positive & Negative Word List

Word

Embedding

Model

Machine

Learning

Model

Predict Sentiment

Sentiment(neutral,positive,negative)

Create Target

Create Predictors

Preprocess Text

predict(model,newTweet)

9

Access and Import Data

▪ Use Datafeed Toolbox to search

Twitter for keywords and import tweets

▪ Extract the text of the tweets

Search for more keywords:

‘GOOG’, ‘AMZN’, ‘APPL’,

10

Preprocessing Text: Clean and Tokenize

• Remove urls and html tags• eraseTags

• eraseURLs

• Remove tickers with a regular

expressiontorep = {'(?:\$+[\w_]+)’}

strip(regexprep(tweettext,torep,''))

• Split text into tokens to better

support preprocessing and

analysis• tokenizedDocument

11

▪ Find and remove hashtags, at-mentions, digits, punctuation, and stopwords

Stop words: “the”, “a”, “an”, …- English stopWords list

included- removeWords

- erasePunctuation

tokenDetails token information

Preprocessing Text: Clean and Tokenize

12

Preprocessing Text: Explore and Visualize

▪ Explore the text with wordcloud charts:

– Use a bagOfWords model to find the most frequent words

Further cleaning:

- Remove very frequent words

- Remove very short words

13

Modelling: Use Sentiment Word Lists

▪ Import external list of words

– Positive and negative words

– Specific to finance domain (Loughran & McDonald)

– Useful to create a target sentiment variable

>> domainWords = load('domainSentimentWords.mat');

>> domainWords.pos(1:10)

"a+"

"able"

"abound"

"abounds"

"abundance"

"abundant"

"accessable"

"accessible"

"acclaim"

"acclaimed"

>> domainWords.neg(1:10)

"2-faced"

"2-faces"

"abandon"

"abandoned"

"abandoning"

"abandonment"

"abandonments"

"abandons"

"abdicated"

"abdicates"

Positive/Negative

Word List

Machine

Learning

Model

Predictors

Word Embedding

Model

Target

14

▪ Build the sentiment target variable– Subtract number of negative words from number of positive

words

target = number_of_positives – number_of_negatives

– Three sentiment classes

▪ target > 0 tweet belongs to positive class

▪ target = 0 tweet belongs to neutral class

▪ target < 0 tweet belongs to negative class

Are Stocks Overweight In Active Funds At Risk For Correction?

$MSFT moderately bullish in premarket - had a good day yesterday - testing important breakout areas

Positive/Negative

Word List

Machine

Learning

Model

Predictors

Word Embedding

Model

Target

Modelling: Use Sentiment Word Lists

15

▪ Load or train a wordEmbedding model

– Each word represented by a numeric vector

– Embeddings preserve information about word order and context

– Vector operations enable discovery of word similarities and analogies

Brother – Man + Woman ≈ Sister fastTextWordEmbedding support package

precomputed vectors for 16B English

tokens

Woman

Man

Sister

Brother

Brother - Man

Positive/Negative

Word List

TargetPredictors

Machine

Learning

Model

Word Embedding

Model

Modelling: Word Embeddings

16

▪ Use the word embedding model to create a vector per

tweet

– Take advantage of the word2vec function

– Combine individual word vectors into one

▪ Average the individual word vectors in the tweet

>> emb = fastTextWordEmbedding;

>> tweetWords = ["my" "example" "tweet"];

>> tweetVector = mean(word2vec(emb,tweetWords));

Modelling: Word Embeddings

Positive/Negative

Word List

TargetPredictors

Machine

Learning

Model

Word Embedding

Model

17

Modelling: Train a Sentiment Classification Model

>> TrainData = array2table(tweetVectors)

>> TrainData.ClassResponse = target;

>> classificationLearner

• Cubic SVM model: 96.5%

Requires:

Statistics & Machine

Learning Toolbox

18

Derive Insights: Score Tweets for Sentiment

3. Derive a vector with the wordEmbedding

model

1. Use model to predict sentiment scores for new tweets

4. Use classification model to predict the sentiment of the new tweet

>> newTweet = "$AMZN: Here’s Amazon’s Plan to Make Students Better Writers: https://t.co/SVfi0eDHt0"

tokenizedDocument:

7 tokens: heres amazons plan make students better

writers

>> tweetVector = mean(word2vec(emb,string(t)));

>> pred = predict(svmModel,tweetVector)

pred =

1 positive!

2. Preprocess new tweet

Sentiment

(neutral,positive,negative)

predict(model,tweetVector)

Machine Learning Model

19

Derive Insights: Sentiment and Price Movements

▪ Intraday price data

– use synchronize

▪ Smooth sentiment scores

▪ Align with price data

20

Key Takeaways

▪ Text data is everywhere, and contains valuable information

▪ Text Analytics Toolbox has tools to help you extract the signal from the noise

▪ Combine text with other data sources to take advantage of all your data

21

Text Analytics with Deep Neural Networks

▪ Use deep learning to model text data

• Neural Network

Toolbox

• Classification and

regression with

LSTMs

• MATLAB Deep

Learning Onramp

• Free two hour

DL tutorial

22

Additional Text Analytics Toolbox CapabilitiesExtract text from different file formats

▪ Extract text from pdf files, word documents, and HTML

• extractFileText

• extractHTMLText

• readPDFFormData

23

Additional Text Analytics Toolbox CapabilitiesBag of words, Bag of n-grams, and TF-IDF

• Discover words and multi-words

24

Additional Text Analytics Toolbox CapabilitiesLSA and LDA

▪ Unsupervised machine learning techniques

▪ Well-suited to wide feature matrices

▪ Both can be used for dimensionality reduction

▪ LDA is also useful for topic modeling

– Discover latent topics in the data

25

Learn More – MATLAB Discovery Pages for Text

• Sentiment Analysis: Analyze and predict sentiment with machine learning

• Natural Language Processing: Data analytics with human language data

• Text Mining with MATLAB: Deriving insight from text data

Ask the Expert: Sarah Palfreyman

26

Learn More – Documentation Examples

https://www.mathworks.com/help/textanalytics/examples.html

27

Learn More about Word Embeddings

https://blogs.mathworks.com/loren/2017/09/21/math-with-words-

word-embeddings-with-matlab-and-text-analytics-toolbox/

Math with Words – Word Embeddings with MATLAB and Text Analytics Toolbox

28

Learn More about Topic Models

https://blogs.mathworks.com/loren/2018/03/20/what-is-president-

trump-tweeting-about-that-gets-our-attention/

What is President Trump tweeting about that gets our attention?

29© 2018 The MathWorks, Inc.

Questions?