Paper 2708-2018

Sentiment Analysis of Netflix and Competitor Tweets to Classify Customer Opinions

Rucha Jadhavar, Oklahoma State University; Agastya Kumar Komarraju, Sam’s Club

ABSTRACT

With more than 310 million users worldwide, Twitter is an important source of data for social media analytics. Each day, thousands of customers share their opinions and reviews about Netflix and its competitors in the video on-demand space. By analyzing the content of these tweets, companies can learn more about their customers' preferences. This can help businesses become more profitable by supporting the design of effective social media campaigns and the conversion of prospects into customers. It can also help companies make better-informed decisions to maintain their competitive edge.

In this study, we analyze tweets for Netflix and its competitors such as Amazon Prime Video, Hulu, and HBO NOW. We demonstrate the use of multiple SAS® tools to analyze these tweets, generate quick summaries, identify different categories of tweets, and classify reviews. Over 32,000 tweets were captured over five days and SAS® Enterprise Miner™ was used to identify commonly used terms and to categorize similar tweets into groups. It was also used to analyze customer sentiment by classifying tweets into positive, neutral, and negative categories. The sentiment analysis feature in SAS® Visual Analytics gives us a quick overview through word clouds for each text topic helping us understand customer opinions.

INTRODUCTION

Twitter, a microblogging site, continues to gain popularity across the world. Most large corporations consider it an integral part of their digital strategy, and customers use Twitter to actively voice their opinions about products and services. Netflix is the most popular online streaming service; it lets customers watch a wide variety of award-winning TV shows, movies and documentaries without any commercials.

Using text mining, we can identify the opinions of customers most closely related to Netflix as well as its competitors and their shows. We can identify terms in tweets that correspond to specific shows. Using R packages and code, we can classify tweets as positive, negative or neutral with a score given to each tweet based on the number of positive or negative words in each tweet. Using SAS® Visual Analytics, we can create word clouds for each text topic and analyze sentiments to get a quick summary of most relevant terms and related sentiments.

DATA ACCESS

Two approaches were used to extract Twitter data. The first approach involved writing code in RStudio to capture tweets for specific time frames and create a JSON file, which was later parsed and converted to CSV format. Packages such as ‘ScheduleR’, ‘RCurl’, ‘streamR’, ‘ROAuth’ and ‘twitteR’ were used for downloading tweets. A certificate file to access tweets was downloaded, and the API key, API password, access token and password provided by Twitter were obtained. These credentials provide the authentication needed to extract tweets using the userTimeline function from the twitteR package. This approach recorded as many as 30 input variables for further analysis. With this method, tweets were captured only for as long as the system was running.

In the second approach, the ‘Twitter Archiver’ application for Google Chrome was used. Here a rule was written to capture tweets that contained terms such as "Netflix", "HBO", "hbonow", "Hulu", "amazon_movies", "AmazonChannels" and "AmazonVideo". Once the rule was set and the application was running, tweets matching the defined keywords were saved to a Google Sheet. The application pulled matching tweets into the spreadsheet every hour. Using Twitter Archiver, 16 input variables were recorded; this is fewer than in the first approach, but all the inputs required for our analysis were available. The Twitter Archiver application captured tweets over 5 days from 19th April to 25th April 2017. For our study, the tweets collected with Twitter Archiver were analyzed; an overview is provided in Figure 1. The data was then prepared for analysis by keeping only complete and unique tweets.
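The keyword rule and the "complete and unique" preparation step can be sketched in a few lines. This is a minimal illustration in Python, not the Twitter Archiver rule itself; it assumes that "complete" means a non-empty text field and that "unique" means no exact duplicates after whitespace and case normalization. The function names and the sample tweets are hypothetical.

```python
# Keywords quoted in the capture rule described above, lower-cased for matching.
KEYWORDS = ["netflix", "hbo", "hbonow", "hulu",
            "amazon_movies", "amazonchannels", "amazonvideo"]

def matches_rule(text, keywords=KEYWORDS):
    """Return True if the tweet text mentions any keyword (case-insensitive)."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)

def prepare(tweets):
    """Keep non-empty tweets that match the rule, dropping exact duplicates."""
    seen = set()
    kept = []
    for text in tweets:
        if text is None:
            continue                      # incomplete record (assumed meaning)
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if not cleaned or not matches_rule(cleaned):
            continue
        key = cleaned.lower()
        if key in seen:
            continue                      # duplicate of an earlier tweet
        seen.add(key)
        kept.append(cleaned)
    return kept

# Hypothetical sample data, for illustration only.
sample = [
    "Binge watching Netflix tonight",
    "binge watching  netflix tonight",
    "",
    None,
    "New season on Hulu!",
    "Nothing to do with streaming",
]
prepared = prepare(sample)
print(prepared)
```

Only the first Netflix tweet and the Hulu tweet survive here: the second line is a duplicate once normalized, the empty and missing records are incomplete, and the last line matches no keyword.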


Figure 1: Data access and preparation with Twitter Archiver data

DATA PREPARATION

Of the 32,000 tweets, 23,700 (~75%) were related to Netflix and the competitor tweets made up the remaining 25%. Only tweets and retweets were considered for the purpose of this analysis. Publicly available ‘good’ and ‘bad’ word dictionaries were used, and each tweet was given a score based on the number of positive and negative words identified. Tweets with a score below 0 were considered negative, tweets with a score of 0 were considered neutral and tweets with a score above 0 were considered positive.
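The scoring rule can be illustrated with a short sketch. It is written in Python rather than the R code used in the study, and it assumes that the score is the count of positive words minus the count of negative words; the paper states only that the score is based on both counts. The word lists below are tiny hypothetical stand-ins for the public dictionaries.

```python
import re

# Hypothetical miniature dictionaries; the study used publicly available ones.
POSITIVE = {"good", "great", "love", "amazing"}
NEGATIVE = {"bad", "hate", "sad", "boring"}

def score_tweet(text):
    """Score = positive word count minus negative word count (assumed formula)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

def label(score):
    """Map a score to the three classes using the thresholds given in the text."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

for tweet in ["I love this show, so good",
              "This season is boring and sad",
              "Good acting but bad ending"]:
    s = score_tweet(tweet)
    print(tweet, "->", s, label(s))
```

Note that a tweet with one positive and one negative word scores 0 and is labeled neutral, which is a known limitation of simple dictionary counting.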

For the Netflix data, the scores and their distribution in a histogram can be seen in Figure 2.

Figure 2: Frequency distribution for scores and Histogram of Netflix Tweets

The positive and negative tweet trends can be found in the chart below. Figure 3 shows that Twitter Archiver also collected tweets from previous days, which is why there is a very high number of tweets on the first day. The yellow line is the summary line, which is the sum of positive, negative and neutral tweets.

Figure 3: Positive, Neutral and Negative Netflix tweet distribution by Date

Similar analysis was done for the competitor tweets. Figure 5 shows the increase in positive competitor tweets on 22nd April 2017.


Figure 4: Frequency distribution for scores and Histogram of Competitor’s Tweets

Figure 5: Positive, Neutral and Negative Competitor tweet distribution by Date

Files with the scores and sentiment of the tweets were exported and used for further analysis in SAS® Enterprise Miner. A snapshot of the Netflix data is shown in Figure 6. A similar format was used for the competitor tweets dataset. An ID variable, the actual text string and the sentiment associated with the tweet (for Text Rule Builder only) were used in the analysis. A data dictionary explaining the variable roles is given in Table 1.

Figure 6: Netflix data ready for import into SAS®

Variable Name    Level    Description
ID               ID       Identifier variable
Text             Text     Customer comments posted on Twitter
Tweet            Input    Positive, neutral, negative classification (only used for Text Rule Builder)

Table 1: Data Dictionary

METHODOLOGY

SAS® Enterprise Miner was used for text clustering, as shown in Figure 7, and for creating rule-based models, as shown in Figure 21. This section goes through a step-by-step implementation of both analyses. Following the analysis in SAS® Enterprise Miner, SAS® Visual Analytics was used to create text topics and word clouds.

TEXT CLUSTER

Figure 7: Enterprise Miner™ Diagram for Netflix and Competitor Tweets

FILE IMPORT

As shown in Figure 8, a CSV file was selected from the “Import File” option in the Import node in SAS® Enterprise Miner.

Figure 8: Property Panel for File Import node

TEXT PARSING

After importing the CSV file, a “Text Parsing” node is attached to the diagram and a few modifications are made to better analyze the data. Figure 9 and Figure 10 show the settings used in the Text Parsing node.

'Different Parts of Speech' under Detect was set to 'No' to eliminate repetitive terms.

'Find Entities' under Detect was set to 'Standard'.

Apart from the default options, the 'Abbr', 'Num' and 'Prop' parts of speech were ignored.

For entities, 'Address', 'Currency', 'Date', 'Internet', 'Percent', 'Phone', 'SSN', 'Time', 'Time Period', 'Title' and 'Vehicle' were ignored.

For attributes, the 'Num' and 'Punct' types were ignored.


Figure 9: Property panel for Text Parsing node

Figure 10: Ignore settings for parts of speech, entities and attributes

The Text Parsing node produces the document-term matrix shown in Figure 11, which is used to identify the frequency of the occurring terms. Following the reasoning usually associated with Zipf's law, terms of moderate frequency tend to carry the most information.

Terms with maximum frequency such as ‘Netflix’, ‘show’, ‘series’ and ‘season’ do make sense as we are reading twitter comments on Netflix and its competitors. However, they are not very informative.
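To make the idea of term frequencies concrete, the sketch below counts, for a toy collection, how often each term occurs in total and in how many documents it appears. It is a Python illustration of what a document-term matrix summarizes, not the output of the Text Parsing node; the tokenizer and the three sample documents are hypothetical.

```python
from collections import Counter
import re

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_frequencies(docs):
    """Return (total occurrences per term, number of documents per term)."""
    total = Counter()
    doc_freq = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        total.update(tokens)          # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per term
    return total, doc_freq

# Hypothetical documents, for illustration only.
docs = [
    "Netflix show is good",
    "New season of the Netflix show",
    "Netflix series season two",
]
total, doc_freq = term_frequencies(docs)
for term, count in total.most_common(3):
    print(term, count, doc_freq[term])
```

In this toy collection 'netflix' appears in every document, so it says nothing about what distinguishes one tweet from another, which mirrors the observation above about the most frequent terms.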


Figure 11: Text parsing node output

TEXT FILTER

As there are many uninformative terms in the data, we use the “Text Filter” node to filter them out. Figure 12 shows the options used in the node’s property panel.

'Check Spelling' was set to 'Yes' to correct possible spelling mistakes in the dataset.

'Minimum Number of Documents' was set to 15, so that only terms occurring in at least 15 documents are kept.
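The effect of a minimum-document threshold can be sketched as follows. This Python example is an illustration under the assumption that the threshold applies to the number of documents containing a term; it uses a threshold of 2 on hypothetical toy data instead of the value of 15 used in the study, and it does not reproduce the node's spell checking.

```python
from collections import Counter

def filter_terms(tokenized_docs, min_docs):
    """Return the set of terms that occur in at least min_docs documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term for term, n in doc_freq.items() if n >= min_docs}

# Hypothetical tokenized tweets.
tokenized = [
    ["netflix", "show", "good"],
    ["netflix", "season"],
    ["netflix", "show", "show"],
]
kept = filter_terms(tokenized, min_docs=2)
print(sorted(kept))
```

Here 'show' is kept because it appears in two documents; its repeated use inside the third document does not count twice.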

Figure 12: Text filter properties panel


Additionally, the interactive text filter tool was used to drop irrelevant terms, keep important terms and group all synonyms together. For example, the terms 'amaze' and 'amazing' are grouped together.
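The drop, keep and synonym decisions made in the interactive filter amount to a simple mapping over tokens. The Python sketch below shows the idea; the mapping of 'amaze' to 'amazing' comes from the example above, while the drop list and the function name are hypothetical.

```python
# Synonym map: child term -> parent term ('amaze' -> 'amazing' is from the text).
SYNONYMS = {"amaze": "amazing"}
# Hypothetical drop list of irrelevant terms.
DROP = {"rt", "http"}

def normalise(tokens, synonyms=SYNONYMS, drop=DROP):
    """Remove dropped terms and replace each synonym by its parent term."""
    return [synonyms.get(t, t) for t in tokens if t not in drop]

tokens = ["rt", "netflix", "amaze", "show", "amazing"]
normalised = normalise(tokens)
print(normalised)
```

After normalization both spellings are counted under the single parent term 'amazing'.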

Figure 13: Synonyms defined for 'Amazing' term

CONCEPT LINKS

Under the interactive filter, we used concept links to check word associations and their strengths. A stronger association is shown by a thicker line connecting the terms. The concept link for Netflix is shown in Figure 14. Terms such as Selena Gomez, who is one of the producers of the famous ‘13 Reasons Why’ series, can be seen in the concept link.
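A rough feel for word association can be obtained from co-occurrence counts. The Python sketch below ranks the terms that appear together with a focus term and reports the share of the focus term's documents in which each one occurs. This is a simplified stand-in: Enterprise Miner computes its own strength statistic, which this example does not reproduce, and the function name and sample data are hypothetical.

```python
from collections import Counter

def concept_links(tokenized_docs, focus, top_n=3):
    """Rank terms co-occurring with `focus` by the share of its documents."""
    with_focus = [set(toks) for toks in tokenized_docs if focus in toks]
    if not with_focus:
        return []
    co = Counter()
    for terms in with_focus:
        co.update(terms - {focus})
    n = len(with_focus)
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(co.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, count / n) for term, count in ranked[:top_n]]

# Hypothetical tokenized tweets.
docs = [
    ["netflix", "13", "reasons", "why"],
    ["netflix", "selena", "gomez", "reasons"],
    ["hbo", "thrones"],
    ["netflix", "reasons", "binge"],
]
links = concept_links(docs, "netflix")
print(links)
```

In the figure, the thicker lines play the role that the larger shares play in this list.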

Figure 14: Concept Link for Netflix

Figure 15 shows that, for HBO, the ‘Game of Thrones’ series and ‘Henrietta Lacks’ are being talked about.


Figure 15: Concept Link for HBO from competitor’s dataset

Looking at the concept link in Figure 16, we can see that a negative word is associated with the series ‘Bob’s Burgers’, which was recently taken off Netflix.

Figure 16: Concept Link for Negative word

TEXT CLUSTER

The Text Cluster node, as the name suggests, is used to group similar comments. For the Netflix dataset, in the properties panel we selected ‘Expectation Maximization’ as the clustering algorithm, set the number of clusters to 6 and fixed the number of terms in each cluster at 8. Running the Text Cluster node gives the 6 clusters shown in Figure 17. Looking at the text clustering output in Figure 18, we can see that each cluster has a similar percentage of the term distribution and that each cluster is far away from the other clusters.
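For readers unfamiliar with Expectation Maximization, the sketch below shows its two alternating steps on one-dimensional toy data with two Gaussian components. It is written in Python purely to illustrate the algorithm; it is not the Text Cluster node's implementation, which works on a numeric representation of the documents that is not described in this paper. The function name, starting means and data are hypothetical.

```python
import math

def em_gmm_1d(xs, means, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means (n_iter >= 1)."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a zero variance
    # Hard assignment: the component with the largest responsibility.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels

# Hypothetical data with two well-separated groups.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
fitted_means, labels = em_gmm_1d(data, means=[0.0, 6.0])
print(fitted_means, labels)
```

Unlike a hard partition, EM keeps a soft membership for every point during fitting; the cluster label is only taken at the end.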


Figure 17: Clusters terms from Text cluster Node

Figure 18: Text Clustering Output: Pie chart and distance between clusters

Figure 17 above lists the text clusters generated, with their frequencies. Table 2 below shows the cluster descriptions.

Cluster ID    Description
1             Negative sentiments regarding shows leaving Netflix suddenly and the change to the new rating system.
2             Tweets mentioning plans people have, such as sitting at home in ‘bed’, to ‘eat’ good food and ‘binge’ ‘watch Netflix’.
3             New ‘Netflix show’ titles such as ‘Bill Nye’, which people could be finding ‘good’.
4             Mainly words about new seasons that are added, will be added or are requested to be added, such as ‘series’, ‘season’ and original ‘Netflix series’.
5             Tweets regarding the series ‘13 Reasons Why’.
6             Tweets about people who want to spend time with loved ones watching episodes on Netflix.

Table 2: Netflix data cluster ID descriptions

For the Competitors (Hulu, Amazon, HBO) dataset, in the properties panel we chose the ‘Expectation Maximization’ clustering algorithm with exactly 5 clusters and 5 terms in each cluster. The text cluster descriptions in Table 3 below can help give a sense of the discussions going on in the social media space and facilitate the design of better marketing campaigns.

Figure 19: Clusters terms from Text cluster Node


Figure 20: Text Clustering Output: Pie chart and distance between clusters

Cluster ID    Description
1             Cluster talking about ‘The Handmaid’s Tale’ on Hulu
2             Cluster with the famous HBO series ‘Game of Thrones’ and ‘Silicon Valley’
3             Cluster talking about ‘HBO’ ‘movies’ to ‘watch’ which are ‘good’
4             Cluster talking about the famous ‘Immortal life’ of ‘Henrietta Lacks’, ‘oprah’ on Hulu
5             Cluster of tweets regarding Beauty and the Beast ‘batb’ and customers asking Hulu to save batb with ‘hulusavebatb’

Table 3: Competitor data cluster ID descriptions

RULE BASED MODEL

METHODOLOGY

Figure 21: Rule Based Model created for Netflix and Competitors

With over 13,500 unique Netflix tweets and 4,700 competitor tweets, rule-based models were built and their classification accuracy was checked. The ‘tweet’ variable, which classifies each tweet as positive, negative or neutral, was used as the target variable.

Initially, the data was partitioned using the “Data Partition” node into 60% training and 40% validation. “Text Parsing” and “Text Filter” nodes were added and their properties were set in the same way as previously discussed. We then tried different property combinations to obtain the best possible Text Rule Builder model.
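A 60/40 split can be sketched as a seeded random shuffle followed by a cut. This Python example is a simplification: it performs plain random sampling without stratifying on the target, which may differ from what the Data Partition node does, and the function name and seed are hypothetical.

```python
import random

def partition(rows, train_fraction=0.6, seed=42):
    """Shuffle rows reproducibly and split them into training and validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable splits
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, valid = partition(range(10))
print(len(train), len(valid))
```

Fixing the seed matters when several model settings are compared, because each model should see the same training and validation rows.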

The “Text Rule Builder” node was run with low, medium and high settings for generalization error, purity of rules and exhaustiveness. For the Netflix dataset, the “Model Comparison” node in Figure 22 shows that the model with the “high” setting had the lowest misclassification rate on the validation data (26%), so this model was chosen.
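The selection criterion is the share of validation tweets whose predicted class differs from the actual class. A minimal Python sketch of that comparison follows; the labels and predictions are hypothetical toy values, not results from the study.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases where the predicted class differs from the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

def best_model(actual, predictions_by_model):
    """Return the name of the model with the lowest misclassification rate."""
    return min(
        predictions_by_model,
        key=lambda name: misclassification_rate(actual, predictions_by_model[name]),
    )

# Hypothetical validation labels and model outputs.
actual = ["positive", "negative", "neutral", "positive"]
candidates = {
    "low": ["positive", "positive", "positive", "positive"],
    "medium": ["positive", "negative", "positive", "negative"],
    "high": ["positive", "negative", "neutral", "negative"],
}
chosen = best_model(actual, candidates)
print(chosen, misclassification_rate(actual, candidates[chosen]))
```

On this scale, the 26% reported for the Netflix model corresponds to a return value of 0.26.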


Figure 22: Model Comparison of different rules in model comparison node

Figure 23: Cumulative Lift comparison of the Text Rule Builder models

For the competitor dataset, the model with the “low” setting gave the lowest misclassification rate on the validation data (30%).

Figure 24: Model Comparison of different rules in model comparison node

Figure 25: Cumulative Lift comparison of the Text Rule Builder models


To understand how the rule-based model works, the positive classification rules were explored. The most important rule contains the term “good”, with 82.71% precision, rising to 84.62% on the validation dataset. The most important rules are listed in sorted order in Figure 26 below.
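The precision of a single-term rule is the share of tweets matching the rule that actually belong to the target class. The Python sketch below computes it for a toy example; the function name and the data are hypothetical, and real Text Rule Builder rules can combine several terms.

```python
def rule_precision(tokenized_docs, labels, term, target):
    """Precision of the rule 'contains term => target', or None if no match."""
    matched = [lab for toks, lab in zip(tokenized_docs, labels) if term in toks]
    if not matched:
        return None
    return sum(lab == target for lab in matched) / len(matched)

# Hypothetical tokenized tweets with their sentiment labels.
docs = [
    ["good", "show"],
    ["so", "good"],
    ["not", "good", "hate"],
    ["good", "season"],
    ["sad", "ending"],
]
labels = ["positive", "positive", "negative", "positive", "negative"]
precision_good = rule_precision(docs, labels, "good", "positive")
print(precision_good)
```

Three of the four tweets containing 'good' are positive here, so the rule's precision is 0.75; the 82.71% in the text is the same quantity computed on the study data.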

Figure 26: Positive classification rules for Netflix dataset

Among the negative rules, the term “suicide” has a precision of 83.45%. The next most important rules consist of the words “hate”, “sad”, “murder” and “mental”.

Figure 27: Text Clustering Output: Pie chart and distance between clusters

Figure 28: Text Clustering Output: Pie chart and distance between clusters

For the neutral target value, the rule containing ‘watch American’ classified tweets correctly every time. Similarly, the models for the competitor dataset with the low setting were studied to understand how the rule-based model worked.

TEXT TOPIC: SENTIMENT ANALYSIS IN SAS® VISUAL ANALYTICS

For further analysis, the SAS® Visual Analytics tool was used to get a sense of the most important text topics. The software is intuitive and easy to use, giving quick insights into text topics along with the associated sentiment. The word clouds make it easy to identify the most important terms within each topic by displaying them in larger fonts.


METHODOLOGY

The Netflix and competitor datasets were imported into the SAS® Visual Analytics software and the text variable was prepared for sentiment analysis. Word clouds were generated for the document collection of the text data. Sentiment analysis was carried out on the word clouds to get text topics. Text topics for the Netflix dataset are shown in Figure 29. Word clouds for 2 of the topics are shown in Figure 30.

Figure 29: Text topics for Netflix dataset

Figure 30: Word clouds on text topic for Netflix dataset

For the competitor dataset, 8 text topics were created. As an example, selecting the ‘hulusavebatb’ term in the word cloud pulls up the actual tweet text along with the sentiment and relevance values associated with it. In these tweets, customers are requesting that Hulu continue streaming ‘Beauty and the Beast’, a teen drama series.

Figure 31: Word Cloud on text topic for Competitor dataset


Figure 32: Tweets with term ‘hulusavebatb’

FUTURE SCOPE

This analysis could be further enhanced by identifying the emotion associated with each tweet, such as ‘Happiness’, ‘Sadness’, ‘Anger’ and ‘Disgust’. We could also use SAS® Viya to integrate open source software and code within the SAS® environment.

Another possible next step is to assess the success of marketing promotions by analyzing retweets and identifying influencers, in order to optimize marketing initiatives.

A daily visual report that identifies daily trends, influencers and customer response to ad campaigns could be useful for effective marketing strategies.

Using dictionaries for multiple languages, the scope could be extended to all languages used on Twitter, capturing a larger share of the population.

CONCLUSION

Companies value all the information they can get about their customers. Twitter is a platform where customers express their opinions freely, and an estimated 80% of the data in the world is unstructured. We can use the models discussed in this paper to get a quick overview of customer sentiment. Companies can use this information to make informed decisions in the future. We can analyze the trend for competitors and identify sudden changes in customer opinions by running simple procedures in the SAS® environment. In SAS® Enterprise Miner, the raw data needs to be parsed and filtered before being analyzed, in order to correct spelling mistakes, group synonyms together and drop terms that do not help in making sense of the data. The concept links help in identifying associations. Using the text rule-based models, we can identify terms and rules that help us classify tweets into positive, neutral or negative categories. The recently added sentiment analysis functionality in SAS® Visual Analytics can help generate word clouds and identify sentiments with little effort.



ACKNOWLEDGMENTS

We thank SAS® Global Forum 2018 conference committee for providing us with a chance to present our work. We also thank Dr. Miriam McGaugh for her continuous support and guidance.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Rucha Jadhavar
Oklahoma State University
Phone: 405-612-8187

Rucha Jadhavar is a graduate student enrolled in the Business Analytics program at the Spears School of Business, Oklahoma State University. She has more than 4 years of professional and academic experience in data analytics and project management in prominent global companies with a value driven, dynamic and challenging culture. She is a SAS® Certified Base Programmer, SAS® Certified Advanced Programmer and has earned the SAS® and OSU Data Mining Certificate. She presented a poster at the SAS® Analytics X conference in 2017.

Agastya Kumar Komarraju
Phone: 925-353-8954

Agastya Kumar Komarraju is a senior manager of the decision sciences team at Sam's Club. With a master's degree from Oklahoma State University, Agastya has worked for several CPG, financial, technology & telecommunication clients while working as a consultant at Nielsen. He is a SAS® certified professional with expertise in market mix modeling, social media analytics, predictive modeling and retail analytics.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.