Text mining

Wednesday, November 25, 2015
TEXT MINING
AU FOOTBALL FACEBOOK PAGE
Sai Praneeth Reddy
Auburn University

Executive Summary

The Auburn athletic department is interested in analyzing the Facebook posts on the AU football page to identify the topics that people discuss most. The department would also like to know whether public sentiment towards Auburn football is positive or negative.

In addition, the department wants to know which players and coaches are discussed most and whether the comments about them are positive or negative. The athletic department would like to use these answers to improve its football team.

We use text analytics to answer the above questions and then perform sentiment analysis to determine whether the general outlook towards the team is positive or negative.

Table of Contents

Introduction: Description of Business Problem

Methodology: Text Mining

Analysis and Results: Tables and Figures, Analysis

Conclusions

Introduction

In order to improve the performance of Auburn’s football team, the Auburn athletic department is interested in analyzing the Facebook posts, comments and replies on the AU football page. The athletic department would like to know the general outlook of the public towards its football team. In addition, the department wants to know which players are talked about the most and the public’s opinion of these players.

Methodology

Text mining is one of the most commonly used methodologies for analyzing text. In our case, we analyze the Facebook posts using SAS Enterprise Miner. We first import the posts, comments and replies on the AU football Facebook page into an Excel file using a web crawler. The Excel file is then converted into a SAS-readable format using the File Import node.

We then use the Text Parsing node to remove unwanted words, followed by the Text Filter node, where we group words that are synonyms and drop certain words that we are not interested in. In the Text Filter node we can also get text snippets for any word we are interested in analyzing. The Text Cluster node groups the terms into clusters, where each cluster represents terms that occur together.

Analysis and Results

DATA PREPARATION

The Facebook posts are imported into an Excel file using web crawling. The Excel file is then imported into SAS and converted into a SAS-readable format using the File Import node.

TEXT PARSING

All the variables in the input data are set to Rejected except for the post id, whose role is set to ID, and message, whose role is set to Text. The Text Parsing node enables us to parse the text and analyze the number of terms and documents by frequency. In our Text Parsing node we dropped all words except for nouns, proper nouns and adjectives.
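The term and document counts that the parsing step reports can be illustrated with a minimal Python sketch. This is not SAS code and not the procedure used in this study: the sample posts, the `STOP_WORDS` set and the `parse_posts` helper are all hypothetical, and a small stop list stands in for the part-of-speech filter described above.

```python
import re
from collections import Counter

# Hypothetical stop list; it stands in for the part-of-speech filter.
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "our", "was"}

def parse_posts(posts):
    """Return (term_freq, doc_freq) counters for a list of post strings."""
    term_freq = Counter()  # total occurrences of each term
    doc_freq = Counter()   # number of posts that contain each term
    for text in posts:
        tokens = [t for t in re.findall(r"[a-z']+", text.lower())
                  if t not in STOP_WORDS]
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per post
    return term_freq, doc_freq

# Made-up posts, for illustration only.
posts = [
    "The defense was great and the defense is improving",
    "Our offense is struggling",
    "Great defense",
]
term_freq, doc_freq = parse_posts(posts)
print(term_freq["defense"], doc_freq["defense"])
```

Keeping the two counters separate matters because a term repeated many times in one post has a high term frequency but still counts as a single document.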

Fig 1

The above Zipf plot shows that Gus Malzahn is one of the most widely discussed topics, along with Jeremy Johnson and Will Muschamp.
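A Zipf plot orders terms by frequency and plots frequency against rank. As a rough sketch of the underlying table, assuming term counts are already available, the hypothetical `zipf_table` helper below ranks made-up counts; none of the numbers come from the Facebook data.

```python
from collections import Counter

def zipf_table(term_freq):
    """Return (rank, term, frequency) rows, most frequent term first."""
    ordered = sorted(term_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, freq)
            for rank, (term, freq) in enumerate(ordered, start=1)]

# Made-up counts, for illustration only.
counts = Counter({"game": 40, "team": 20, "season": 13, "win": 10})
rows = zipf_table(counts)
for rank, term, freq in rows:
    # Under an ideal Zipf distribution, rank * freq stays roughly constant.
    print(rank, term, freq, rank * freq)
```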

Some of the most widely discussed players and coaches are as follows:

Table 1

Fig 2

The above number of documents by weight plot shows that Jeremy Johnson has a relatively high weight compared to the other players.

Names               Weight
Malzahn             0.354
Jonathan Wallace    0.534
Rhett Lashlee       0.614
Jeremy Johnson      0.618
Carl Lawson         0.608
Will Muschamp       0.510
Sean White          0.6
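The report does not state which term weighting produced the values above. One common scheme in text mining is entropy weighting, which gives a term a weight near 1 when it is concentrated in a few documents and near 0 when it is spread evenly over all of them. The sketch below assumes that scheme purely for illustration; the `entropy_weight` helper and the sample frequencies are hypothetical.

```python
import math

def entropy_weight(freqs_per_doc):
    """Entropy weight of one term, given its frequency in each of n documents.

    Computes 1 + sum_j(p_j * log2(p_j)) / log2(n), where p_j is the share
    of the term's occurrences that fall in document j.
    """
    n = len(freqs_per_doc)
    total = sum(freqs_per_doc)
    if n < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence")
    entropy_sum = sum((f / total) * math.log2(f / total)
                      for f in freqs_per_doc if f > 0)
    return 1 + entropy_sum / math.log2(n)

# Made-up frequencies of a single term across four documents.
print(entropy_weight([1, 1, 1, 1]))  # spread evenly over all documents
print(entropy_weight([4, 0, 0, 0]))  # concentrated in one document
print(entropy_weight([2, 2, 0, 0]))  # in between
```

Under this assumption, a widely used name would receive a lower weight than a name that appears in fewer posts.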

Some of the most widely discussed topics / words are:

Table 2

TEXT FILTERING

The Text Filter node is used to choose which terms to keep or drop. Terms that are either too frequent or highly infrequent are dropped, as they are not of much use in grouping topics. The node also helps us group words that are similar to one another (i.e., synonyms).
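The filtering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the node's actual algorithm: the `SYNONYMS` map, the `filter_terms` helper, the thresholds `min_docs` and `max_doc_share`, and the sample documents are all hypothetical.

```python
from collections import Counter

# Hypothetical synonym map: variant -> representative term.
SYNONYMS = {"quarterback": "qb", "defence": "defense", "offence": "offense"}

def filter_terms(doc_tokens, min_docs=2, max_doc_share=0.8):
    """Merge synonyms, then keep terms that are neither too rare nor too common."""
    merged = [[SYNONYMS.get(t, t) for t in tokens] for tokens in doc_tokens]
    doc_freq = Counter()
    for tokens in merged:
        doc_freq.update(set(tokens))
    n_docs = len(merged)
    kept = {t for t, df in doc_freq.items()
            if df >= min_docs and df / n_docs <= max_doc_share}
    return [[t for t in tokens if t in kept] for tokens in merged]

# Made-up tokenized posts, for illustration only.
docs = [
    ["auburn", "qb", "good"],
    ["auburn", "quarterback", "bad"],
    ["auburn", "defence", "good"],
    ["auburn", "defense", "solid"],
]
filtered = filter_terms(docs)
print(filtered)
```

Here "auburn" is dropped because it appears in every post, and "bad" and "solid" are dropped because each appears only once, while the synonym merge lets "qb" and "defense" reach the minimum document count.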

Using the interactive text filter we can also see the context in which people use someone’s name or a word we are interested in. This helps us understand the sentiment of people towards a particular person or topic. Using text filtering it is also possible to see which words are strongly associated, based on the concept link diagram, which shows relationships between terms.
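The snippet view amounts to a keyword-in-context search. A minimal sketch follows, assuming the posts are plain strings; the `snippets` helper, its `width` parameter and the sample posts are hypothetical and not part of SAS.

```python
import re

def snippets(posts, term, width=30):
    """Return text windows of up to `width` characters on each side of `term`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    found = []
    for text in posts:
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            found.append(text[start:end].strip())
    return found

# Made-up posts, for illustration only.
posts = [
    "I think the defense played well tonight",
    "Offense needs work",
    "Our DEFENSE is the reason we are still in games",
]
result = snippets(posts, "defense", width=12)
for s in result:
    print(s)
```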

Topics      Number of Documents
Defense     27
Offense     18
good        30
QB          163
Running     12
Receiver    10

Sentiment Analysis:

• Gus Malzahn

Table 3

The above text snippets indicate that many people have a negative perception of coach Gus Malzahn; many seem to blame him for the defeat.

• Jeremy Johnson

Table 4

From the above text snippets it appears that even though Jeremy Johnson did not have a great year, many people still seem to trust his abilities. It also appears that people think Auburn’s offense is better when Jeremy Johnson is the quarterback rather than Sean White.

• Sean White

From the above text snippets it appears that there is no clear favorite quarterback, as opinion is divided on who the starting quarterback should be.

• Offense

Table 6

It appears that many people blame the offense for Auburn’s bad performance. There seems to be a general opinion that the defense is doing better and the offense is letting the team down.

• Defense

Table 7

From the above text snippets it appears that there is a generally positive outlook on Auburn’s defense. People think that the defense has improved a lot under Muschamp and that it is the offense that is letting the team down.
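The sentiment readings above appear to come from reading the snippets directly. A simple way to cross-check such a reading is to count positive and negative words in each snippet. The sketch below is only an illustration: the `POSITIVE` and `NEGATIVE` word lists, the helper names and the example snippets are hypothetical, and the approach ignores negation and sarcasm.

```python
import re

# Tiny hypothetical lexicons; a real analysis would need larger, validated lists.
POSITIVE = {"good", "great", "improved", "trust", "better"}
NEGATIVE = {"bad", "blame", "terrible", "worse", "disappointed"}

def sentiment_score(text):
    """Positive minus negative word count; above 0 leans positive, below 0 negative."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def summarize(snippets):
    """Count snippets that lean positive, negative or neutral."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for s in snippets:
        score = sentiment_score(s)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        counts[label] += 1
    return counts

# Made-up snippets, for illustration only.
examples = [
    "the defense looks great and much improved",
    "i blame the play calling, it was terrible",
    "kickoff is at six",
]
summary = summarize(examples)
print(summary)
```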

TEXT CLUSTER

The Text Cluster node groups the terms into clusters, where each cluster represents related terms that occur together. This is particularly useful because the biggest sector of the circle represents the topic that most people are talking about.

Table 8

The words defense, explosive and Muschamp are placed in a single cluster, indicating that people are generally happy with the defense and attribute this improvement in performance to Will Muschamp.

The words don’t, improvement and Lashlee are frequently used together, indicating that people in general want the offense and the offensive coach Rhett Lashlee to do better.
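The notion of terms that occur together can be made concrete by counting, for every pair of terms, the number of posts that contain both. The sketch below is a much simpler stand-in for the clustering performed by the Text Cluster node, not a reproduction of it; the `cooccurrence` helper and the tokenized posts are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(doc_tokens):
    """Count, for each unordered term pair, the documents containing both terms."""
    pairs = Counter()
    for tokens in doc_tokens:
        # sorted(set(...)) gives each pair a single canonical ordering per document
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

# Made-up tokenized posts, for illustration only.
docs = [
    ["defense", "explosive", "coach"],
    ["defense", "explosive"],
    ["offense", "improvement"],
    ["offense", "improvement", "coach"],
]
pairs = cooccurrence(docs)
print(pairs.most_common(2))
```

Pairs with high counts are candidates for the same cluster; a full clustering method would also account for how common each term is on its own.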

Fig 3

Fig 4

TEXT TOPIC

From the Text Topic node output, we can find the terms that are grouped together and their cutoffs. The results of the Text Topic node can be refined further using the Text Cluster node. The Text Topic node performs cluster analysis to combine words that are interesting to analysts.

Table 9
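One way to read the cutoffs is as thresholds: a term belongs to a topic when its weight for that topic reaches a term cutoff, and a post belongs to the topic when its combined score reaches a document cutoff. The sketch below illustrates that reading only. It assumes a simple sum of term weights as the document score, and the helper names, weights and cutoff values are hypothetical rather than taken from the node output.

```python
def topic_terms(term_weights, term_cutoff):
    """Terms whose weight for the topic reaches the term cutoff."""
    return {t: w for t, w in term_weights.items() if w >= term_cutoff}

def docs_in_topic(doc_tokens, term_weights, term_cutoff, doc_cutoff):
    """Indices of documents whose summed topic-term weights reach the document cutoff."""
    selected = topic_terms(term_weights, term_cutoff)
    hits = []
    for i, tokens in enumerate(doc_tokens):
        score = sum(selected.get(t, 0.0) for t in tokens)
        if score >= doc_cutoff:
            hits.append(i)
    return hits

# Made-up weights and tokenized posts, for illustration only.
weights = {"defense": 0.5, "tackle": 0.25, "game": 0.125}
docs = [["defense", "tackle"], ["game", "game"], ["defense"]]
hits = docs_in_topic(docs, weights, term_cutoff=0.25, doc_cutoff=0.75)
print(hits)
```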

CONCLUSION

From the analysis of the Facebook posts it appears that people are in general disappointed with the overall performance of the team. Although they feel that the defense has done better than last season, they feel that the offense has let them down.

It also appears that people prefer Jeremy Johnson over Sean White as the team’s quarterback. In addition, the majority of people seem to blame head coach Gus Malzahn for the team’s failure and think that the defensive coach Will Muschamp has done a good job.

APPENDIX
