Text mining on Twitter information based on R platform


Qiaoyang ZHANG∗

Computer and information science system

Macau University of Science and Technology

[email protected]

Fayan TAO†

Computer and information science system

Macau University of Science and Technology

[email protected]

Junyi LU‡

Computer and information science system

Macau University of Science and Technology

[email protected]

ABSTRACT

Twitter is one of the most popular social networks, and it plays a vital role in this new era. Exploring the information diffusion on Twitter is attractive and useful.

In this report, we apply R to do text mining and analysis on the topic "#prayforparis" on Twitter. We first do the data preprocessing, such as data cleaning and stemming words. Then we show the tweet frequencies and associations. We find that the word "prayforparis" ranks highest in frequency, and most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show layouts of the whole tweets and of some extracted tweets. Additionally, we cluster the tweets into 10 groups to see the connections between different topics. Since tweets can indicate users' attitudes and emotions well, we further do the sentiment analysis. We find that most people expressed their sadness and anger about the Paris attack by ISIS and pray for Paris. Besides, the majority hold positive attitudes toward this attack.

Keywords

text mining; Twitter; R; "#prayforparis"; sentiment analysis

1. INTRODUCTION AND MOTIVATION

As data mining and big data become hot research topics in this new era, much more is demanded of data analysis techniques as well. It is difficult to store and analyze large data sets using traditional database methodologies. So we employ the powerful statistics platform R to do big data mining and analysis, because R provides many kinds of statistical models and data analysis methods, such as classic statistical tests, time-series analysis, classification and clustering.

∗We rank the authors' names by the inverse alphabetical order of the first letter of the authors' last names. Stu ID: 1509853G-II20-0033
†Stu ID: 1509853F-II20-0019
‡Stu ID: 1509853G-II20-0061

ACM ISBN 978-1-4503-2138-9.

DOI: 10.1145/1235

In this project, we try to analyze a large social network data set, which mainly focuses on Twitter users and their expressions about the latest news. The analysis is executed to discover some characteristics those tweets have. By analyzing the large amount of social network data, we can gain better knowledge of users' preferences and habits, which will be helpful for people who are interested in such data. For example, business firms/companies can provide better services after analyzing similar social network data. That is why we chose this topic.

2. RELATED WORKS

2.1 Sentiment analysis by searching Twitter and Weibo

User-level sentiment evolution can be analyzed on Weibo. ZHANG Lumin and JIA Yan et al.[16] first proposed a multidimensional sentiment model with a hierarchical structure to analyze users' complicated sentiments.

Michael Mathioudakis and Nick Koudas[11] presented "TwitterMonitor", a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. 'trends') on Twitter in real time and provides meaningful analyses that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own descriptions for each trend.

Twitter, in particular, is currently the major microblogging service, with more than 50 million subscribers. Twitter users generate short text messages, the so-called "tweets", to report their current thoughts and actions, comment on breaking news and engage in discussions.[11]

Agarwal et al.[1] mainly introduced a model based on a tree kernel to analyze the POS-specific prior polarity features of Twitter data, and used a Partial Tree (PT) kernel, first proposed by Moschitti (2006), to calculate the similarity between two trees (an example is shown in figure 1). They divided the sentiment in tweets into 3 categories: positive, negative and neutral. They marked the sentiment expressed by emoticons with an emoticon dictionary and translated acronyms (e.g. gr8, gr8t = great; lol = laughing out loud) with an acronym dictionary. Those dictionaries map emoticons or acronyms to their polarity. They also used an English stop-word dictionary, found in WordNet, to identify stop words, and a sentiment dictionary which has many positive words, negative

(This part of the related works is provided by Qiaoyang Zhang.)


Figure 1: A tree kernel for a synthesized tweet: ”@Fernando this isn’t a great day for playing the HARP! :)”

words and neutral words to map the words in tweets to their polarity.
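A dictionary lookup of this kind can be sketched in a few lines of base R. The acronym expansions and polarity labels below are toy examples, not the actual dictionaries used in [1]:

```r
# Illustrative dictionaries (toy entries, not the authors' actual ones)
acronyms <- c(gr8 = "great", gr8t = "great", lol = "laughing out loud")
polarity <- c(great = "positive", sad = "negative", day = "neutral")

# Expand acronyms in a tokenized tweet, then map each word to its polarity.
# Multi-word expansions are kept as single strings here for simplicity.
tweet <- c("gr8", "day")
expanded <- ifelse(tweet %in% names(acronyms), acronyms[tweet], tweet)
pol <- polarity[expanded]
print(unname(pol))  # "positive" "neutral"
```

A real system would also handle words absent from every dictionary, which this sketch leaves as NA.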

The accuracy of the model they used is higher than that of the unigram model by 4.02%, and its standard deviation is lower than that of the unigram model by 0.52%.

2.2 Study of information diffusion on Twitter

A number of recent papers have explored the informationdiffusion on Twitter, which is one of the most popular socialnetworks.

In 2011, Shaomei Wu et al.[14] focused on the production, flow and consumption of information in the context of Twitter. They exploited Twitter "lists" to distinguish elite users (celebrities, media, organizations, bloggers) from ordinary users, and they found strong homophily within categories, which means that each category mainly follows itself. They also re-examined the classical "two-step flow" theory[10] of communications, finding considerable support for it on Twitter. Additionally, various URLs' lifespans were demonstrated under different categories. Finally, they examined the attention paid by the different user categories to different news topics.

This paper sheds clear light on how media information is transmitted on Twitter. The presented approach of defining a limited set of predetermined user categories could be extended to automatic classification schemes. However, they only focus on one narrow cross-section of media information (URLs); it would be better if their methods were applied to other channels (TV, radio). Another weakness of this paper is the lack of linkage between information flow on Twitter and other sources of outcome data (e.g. users' opinions and actions).

Daniel Ramage et al.[13] studied search behaviors on Twitter, especially the information that users prefer searching for. They also compared Twitter search with web search in terms of users' queries. They found that Twitter results contain more social events and content, while web results include more facts and navigation.

Eytan Bakshy et al.[3] used a regression model to analyze Twitter data. They explored word-of-mouth marketing to study users' influence on Twitter, not only on communication but also on URLs. They found that the largest

(This part of the related works is provided by Fayan Tao.)

cascades tend to be generated by users who have been influential in the past and who have a large number of followers. They also found that URLs that were rated more interesting and/or elicited more positive feelings by workers on Mechanical Turk were more likely to spread.

As we can see, all three papers mentioned above focus on a large number of tweets, and employ different methods to analyze various characteristics of tweets from different aspects. But they are all limited mainly to Twitter data rather than extending to more social networks.

2.3 Semantic Analysis and Text Mining

Much research has been done to gain a better understanding of people's characteristics in a specific field by analyzing the semantics of social network content. This has many applications, especially for business marketing purposes.

Topic mining and sentiment analysis were done on followers' comments on a company's Facebook fan page; the authors obtained the most frequent terms in each domain (TF, TF-IDF, three sentiments) and the sentiment distributions throughout one year and their relation to "Likes", respectively [5]. This can help marketing staff be aware of the sentiment trend as well as the main sentiment, so as to adjust their marketing techniques. A Support Vector Machine (SVM) classification model is used in their analysis method. Before classification, word segmentation and feature extraction are done. Feature extraction is based on a semantic dictionary and some additional rules. They found that the sentiment distribution of the comments can be a contributing factor to the distribution of "Likes".
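The TF and TF-IDF weights mentioned above can be computed directly in base R; the three toy documents below are illustrative only:

```r
# Toy corpus: each document is a vector of tokens
docs <- list(c("price", "good", "good"),
             c("service", "good"),
             c("price", "bad"))
vocab <- sort(unique(unlist(docs)))

# Term frequency matrix: rows = terms, columns = documents
tf <- sapply(docs, function(d) table(factor(d, levels = vocab)))

# Document frequency and inverse document frequency (standard log(N/df))
docfreq <- rowSums(tf > 0)
idf <- log(length(docs) / docfreq)

# TF-IDF weight per term and document (row-wise scaling by idf)
tfidf <- tf * idf
```

Here "good" appears in 2 of the 3 documents, so its idf is log(3/2); a term occurring in every document gets idf 0 and is down-weighted to nothing.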

Hsin-Ying Wu et al. [15] presented a method of analyzing Facebook posts that serves as a marketing tool to help young entrepreneurs identify existing competitors in the market, as well as their success factors and features, during the decision-making process. The overall mining process consists of three stages:

1 Extracting Facebook posts;

2 Text data preprocessing;

3 Key phrases and terms filtering and extraction.

In detail, they applied word segmentation to the original comments based on the lexicons and on morphological rules for quantifier words and reduplicated words. The words and phrases

(This part of the related works is provided by Junyi Lu.)


are extracted from the text files and transformed into a key phrase matrix based on frequencies. Next, a k-means clustering algorithm based on the phrase frequency matrix and their similarity is used to identify the most important phrases (i.e., the features and factors of each shop). Various tools are utilized in their study: CKIP for Chinese word segmentation, PERL for extracting text files and WEKA for key phrase clustering.

Social network mining has also been done in the educational field. Chen et al.[4] conducted an initial study on mining tweets to understand students' learning experiences. They first used Radian6, a commercial social monitoring tool, to acquire students' posts with the hashtag #engineeringProblems, collecting 19,799 unique tweets. Due to the ambiguity and complexity of natural language, they conducted inductive content analysis and categorized the tweets into 5 prominent themes and one group called "others". The main hashtag, non-letter symbols, repeating letters and stopwords are removed in the preprocessing stage. A multi-label naive Bayesian classifier is used because a tweet can reflect several problems. They then obtained another data set using the geocode of Purdue University with a radius of 1.3 miles to demonstrate the effectiveness of the classifier and to try to detect students' problems. They also demonstrated that the multi-label naive Bayesian classifier performs better than other state-of-the-art classifiers (SVM and M3L) according to 4 measures (Accuracy, Precision, Recall, F1). But there is a main defect in their method, since they assume the categories are independent when they transform the problem into single-label classification problems.

Most text mining processes are much the same. Generally, text preprocessing is conducted at the beginning (removal of stopwords, punctuation and weird symbols and characters; segmentation). (Some studies, like sentiment analysis, need part-of-speech tagging.) Then a term frequency matrix is built from the data set to calculate the term frequencies. Finally, classification and clustering are mostly used to analyze the data and generate knowledge.
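This generic process can be sketched end to end in base R. The two tweets and the tiny stopword list below are invented for illustration; a real study would use a full stopword list such as the one shipped with the tm package:

```r
# Toy input and a tiny illustrative stopword list
tweets <- c("PrayForParis, we stand with Paris!",
            "Paris attack: the world prays for Paris")
stops  <- c("we", "with", "the", "for")

# 1. Preprocessing: lower-case, strip punctuation, tokenize, drop stopwords
tokens <- lapply(tweets, function(t) {
  t <- tolower(gsub("[^[:alpha:][:space:]]", "", t))
  w <- strsplit(t, "[[:space:]]+")[[1]]
  w[nchar(w) > 0 & !(w %in% stops)]
})

# 2. Term frequencies over the whole (tiny) corpus
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
print(freq)  # "paris" is the most frequent term here
```

Classification or clustering would then operate on the resulting frequency matrix, as in the sections that follow.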

3. TEXT MINING UNDER R PLATFORM

3.1 About R

R[18] is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S.

R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. R also provides an Open Source route to participation in statistical research. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

3.2 The idea

Text mining[2][17] is the discovery of interesting knowledge in text documents. It is a challenging issue to find accurate knowledge in unstructured text documents to help users find what they want. It can be defined as the art of extracting data from large amounts of text. It allows one to structure and categorize text contents which are initially non-organized and heterogeneous. Text mining is an important data mining technique which includes some of the most successful techniques for extracting effective patterns.

This report presents examples of text mining with R. Twitter text ("prayforparis") is used as the data to analyze. It starts with extracting text from Twitter. The extracted text is then transformed to build a document-term matrix. After that, frequent words and associations are found from the matrix. Next, words and tweets are clustered to find groups of words and topics of tweets. Finally, a sentiment analysis of the tweets is explored, and a word cloud is used to present important words in the documents.

In this report, "tweet" and "document" will be used interchangeably, as are "word" and "term". There are three important packages used in the examples: twitteR, tm and wordcloud. Package twitteR[8] provides access to Twitter data, tm[6] provides functions for text mining, and wordcloud[7] visualizes the result with a word cloud.

4. IMPLEMENTATIONS

4.1 Data Preprocessing

We first mine 3200 tweets from Twitter by searching for the main topic "prayforparis" during the period 13 November 2015 to 13 December 2015. Then we do some data preprocessing.

4.1.1 Data Cleaning

The tweets are first converted to a data frame and then to a corpus, which is a collection of text documents. After that, the corpus needs a couple of transformations, including changing letters to lower case, adding "pray" and "for" as extra stop words, and removing URLs, punctuation, numbers, extra whitespace and stop words.
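Each of these transformations amounts to a simple substitution. A base-R sketch on one invented sample tweet shows the idea (tm applies the same transformations corpus-wide via tm_map; stop-word removal is a further token-level step omitted here):

```r
x <- "#PrayForParis 2015: thoughts with Paris http://t.co/abc123"  # invented sample

x <- tolower(x)                            # change letters to lower case
x <- gsub("http[^[:space:]]*", "", x)      # remove URLs
x <- gsub("[^[:alpha:][:space:]]", "", x)  # remove punctuation and numbers
x <- gsub("[[:space:]]+", " ", trimws(x))  # remove extra whitespace
print(x)  # "prayforparis thoughts with paris"
```

The order matters: URLs must be removed before punctuation stripping, or "http", "t" and "co" would survive as spurious tokens.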

Next, we keep a copy of the corpus to use later as a dictionary for stem completion.

4.1.2 Stemming Words

Stemming[19] is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. A stemmer for English, for example, should identify the strings "stems", "stemmer", "stemming" and "stemmed" as based on "stem". Stemming makes such words look uniform. Completing the stems afterwards can be achieved with the function "stemCompletion()" in R.

In the following steps, we use "stemCompletion()" to complete the stems, with the unstemmed corpus "myCorpusCopy" as a dictionary. With the default setting, it takes the most frequent match in the dictionary as the completion.
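The idea of stemming can be illustrated with a deliberately naive suffix stripper in base R. This is a toy: a real stemmer, such as the Porter stemmer used by tm via the SnowballC package, handles far more cases:

```r
# Naive suffix stripper: drops a few inflectional endings (illustrative only;
# the alternation is hand-picked to work for exactly this example)
naiveStem <- function(w) gsub("(mer|ming|med|s)$", "", w)

words <- c("stems", "stemmer", "stemming", "stemmed")
print(naiveStem(words))  # "stem" "stem" "stem" "stem"
```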

4.1.3 Building a Term-Document Matrix

A term-document matrix indicates the relationship between terms and documents, where each row stands for a term, each column for a document, and each entry is the number of occurrences of the term in the document.

(All of our implementation code is attached at the end of this report.)


terms: 3621, documents: 3200
Non-/sparse entries: 27543/11559657
Sparsity: 100%
Maximal term length: 38
Weighting: term frequency (tf)

Table 1: TermDocumentMatrix

Figure 2: layout of whole tweets

Alternatively, one can also build a document-term matrix by swapping rows and columns. In this report, we build a term-document matrix from the above processed corpus with the function "TermDocumentMatrix()".

As table 1 shows, there are in total 3621 terms and 3200 documents in the "TermDocumentMatrix". We can see that it is very sparse, with nearly 100% of the entries being zero, which means that most terms do not appear in most documents.
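The same construction, and the sparsity figure, can be reproduced on a toy corpus in base R:

```r
# Three toy "tweets", already tokenized
docs <- list(c("paris", "attack"), c("prayforparis"), c("paris", "victim"))
vocab <- sort(unique(unlist(docs)))

# Term-document matrix: rows = terms, columns = documents
tdm <- sapply(docs, function(d) table(factor(d, levels = vocab)))

# Sparsity: the fraction of zero entries
sparsity <- sum(tdm == 0) / length(tdm)
print(dim(tdm))   # 4 terms x 3 documents
print(sparsity)   # 7/12 of the entries are zero
```

On the real data the effect is far stronger: 27543 nonzero entries out of 3621 x 3200 = 11587200 cells, i.e. about 99.8% zeros.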

We can also see the layout of the whole tweets in figure 2; they are mainly located in two parts. Given the large amount of data, we cannot tell the words apart clearly. Therefore, we select some terms from the total data and show their distributions, as figures 3 and 4 show. We can see that most terms are connected within a bounded zone, which means that they have associations to some degree.

5. FREQUENT TERMS AND ASSOCIATIONS

Based on the above data processing, we now show the frequent words. Note that there are 3200 tweets in total.

We first choose those words which appear more than 100 times; the results are shown in table 2. We can see that, for example, the counts of "parisattack", "pour" and "victim" are all more than 100, which means they have high frequency in this topic, "prayforparis".

In the further process, we show the counts of all words that appear at least 100 times. The result is shown in figure 5. As figure 5 contains so many terms that we cannot tell the count of each term, we only choose 70 terms, and show the counts of all words that appear at least 100

Figure 3: layout-1 of some parts selected from wholetweets

Figure 4: layout-2 of some parts selected from wholetweets

Figure 5: Total words that appear at least 100 times

Figure 6: Selecting some Words that appear at least100 times

lose: 0.56, over: 0.56, papajackadvic: 0.56, struggl: 0.56, trust: 0.56, worri: 0.56, prayfor: 0.40, think: 0.40, hope: 0.32, simoncowel: 0.29, scare: 0.28, stay: 0.25

Table 3: words associated with "pray" with correlation no less than 0.25


"ld'" "attentat" "aux" "ça" "de" "deja" "et" "everyon" "fait" "franc" "go" "il" "jamai" "jour" "la" "les" "louistomlinson" "moi" "ne" "noubliera" "novembr" "pari" "parisattack" "pas" "pensé" "pour" "prayforpari" "que" "rt" "simoncr" "thought" "un" "victim" "vous" "y" "ytbclara"

Table 2: words that appear more than 100 times

times. As figure 6 shows, it is not surprising that the count of "prayforparis" is the highest, at more than 3000. The second is "pari", followed by "parisattack". This result indicates that most people care about the Paris attack and pray for Paris.

To find the associations among words, we take "pray" as an example, to see which words are associated with "pray" with a correlation no less than 0.25.

From table 3, we can see that 12 terms, including "lose", "struggl", "trust" and "hope", have a connection with "pray". Six terms, such as "lose", "papajackadvic" and "trust", are associated with "pray" with a correlation of 0.56, while "prayfor" and "hope" have correlations of 0.40 and 0.32 with "pray", respectively.
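Associations of this kind are Pearson correlations of the terms' occurrence vectors across documents, which is what tm's findAssocs() reports. A base-R equivalent on toy counts:

```r
# Toy term-document matrix: rows = terms, columns = documents
tdm <- rbind(pray  = c(1, 1, 0, 0, 1),
             hope  = c(1, 1, 0, 0, 0),
             paris = c(1, 0, 1, 1, 1))

# Correlation of "pray" with every other term across the documents
assoc <- apply(tdm[-1, , drop = FALSE], 1, function(v) cor(tdm["pray", ], v))
print(round(assoc, 2))
```

Here "hope" co-occurs with "pray" and gets a correlation of 2/3, while "paris" tends to appear in the documents where "pray" does not and correlates negatively.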

6. CLUSTERING WORDS

We then try to find clusters of words with hierarchical clustering. Sparse terms are removed, so that the plot of the clustering will not be crowded with words. We cut the related data into 10 clusters. The agglomeration method is set to ward, which at each step merges the two clusters whose merger gives the smallest increase in variance.
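A minimal sketch of this clustering step on a toy term matrix, using base R's hclust() with the ward agglomeration method (the terms and frequencies are invented):

```r
# Toy matrix: 6 terms described by their frequencies in 4 documents
m <- rbind(paris  = c(5, 4, 5, 4), attack = c(4, 5, 4, 5),
           hope   = c(0, 1, 0, 1), peace  = c(1, 0, 1, 0),
           deja   = c(9, 0, 0, 0), moi    = c(0, 0, 0, 9))

# Hierarchical clustering with the ward agglomeration method
fit <- hclust(dist(m), method = "ward.D")

# Cut the dendrogram into 3 clusters (the report uses 10 on the real data)
groups <- cutree(fit, k = 3)
print(groups)
```

Terms with similar occurrence patterns, such as "paris" and "attack" above, end up in the same cluster.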

In figure 7, we can see different topics related to "prayforparis" in the tweets. The words "les", "parisattack", "fait" and some others are clustered into one group, because there are a number of tweets on the Paris attack. Another group contains "everyon" and "thought", because everyone focuses on this event. We can also see that "moi", "deja" and "prayforpari" are each in a single group, which means they have few relationships with the other terms.

Figure 7: cluster (10 groups)

7. EXPERIMENTS ABOUT SENTIMENTS

Figure 8: Emotion categories of #prayforparis

Figure 9: Classification by polarity of #prayforparis

Figure 10: A wordcloud of #prayforparis


Stages & Individuals — Works
Stage 1: Literature survey; determine project topic
Stage 2: R programming and text mining learning
Stage 3: Implementations
Stage 4: Presentation and final report
Qiaoyang Zhang: Mainly read references [1], [9], [11], [12] and [16]; sentiment analysis implementation
Fayan Tao: Mainly read [2], [3], [10], [13] and [14]; data preprocessing and data analysis
Junyi Lu: Mainly read [4], [5] and [15]; analyzed data associations and clustered words
Remark: All of us read [6], [7], [8], [17], [18] and [19]

Table 4: Timetable and working plan

We also did an experiment on sentiment in R with the method mentioned in the related works. We loaded a package named "sentiment" in R and analyzed the sentiment of tweets with the hashtag "#prayforparis" on Twitter. We used the "sentiment" package to mine more than 6800 tweets on Twitter and established a corpus[12] in R, mainly to analyze the related parts of speech, frequencies and correlations. Figure 8 shows the emotion categories of "#prayforparis" based on an emotion dictionary. In this figure, we can see that nearly 1000 people felt sad or angry about the terrorist attacks in Paris (angry about the terrorist attack by ISIS), and a small number of people felt afraid or surprised.

In figure 9, we can see that nearly 5000 people used positive words and more than 1500 people used negative words in their tweets. In addition, fewer than 500 people used words with no polarity in the hashtag "#prayforparis".
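A minimal sketch of dictionary-based polarity scoring in base R. The lexicon is a toy; the actual "sentiment" package classifies with a naive Bayes model over a much larger lexicon:

```r
# Toy polarity lexicon (illustrative entries only)
pos <- c("hope", "love", "peace", "solidarity")
neg <- c("sad", "terror", "angry", "fear")

# Score a tokenized tweet: positive minus negative word counts
polarity <- function(tokens) {
  score <- sum(tokens %in% pos) - sum(tokens %in% neg)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

tweets <- list(c("hope", "peace", "paris"),
               c("sad", "angry", "terror"),
               c("paris", "nous"))
print(table(sapply(tweets, polarity)))
```

Tweets matching no lexicon entry fall through to "neutral", which mirrors the small no-polarity bar in figure 9.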

From the word cloud[17] in figure 10, we can intuitively see the most frequently used words about "#prayforparis" on Twitter (the larger the font, the more often the word is used in tweets). Most of the polarities were concentrated in the types of sadness, anger and disgust.

From these experimental data, we can draw the conclusion that the general attitude of people around the world toward the terrorist attack is sadness and anger. Most people feel sorry for the victims and pray for the victims in Paris. They are also strongly against terrorism.

8. WORKING PLAN

To finish this project, we made a timetable and working plan, as table 4 shows.

9. CONCLUSION AND FUTURE WORKS

In this report, we apply R to do text mining and analysis about "#prayforparis" on Twitter. We first do the data preprocessing, such as data cleaning and stemming words.

Then we show the tweet frequencies and associations. We find that "prayforparis" ranks highest in frequency, and most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show the layout of the whole tweets and of some extracted tweets. Additionally, we cluster the tweet topics into 10 groups to see the connections between terms. Since tweets can indicate users' attitudes and emotions well, we further do the sentiment analysis. We find that most people expressed their sadness and anger about the Paris attack by ISIS and prayed for Paris. As the results show, the majority hold positive attitudes toward this attack, mainly because of hope for a good future for Paris and the whole world as well.

The data we mined are limited to one topic and are not very large, which may result in data incompleteness. Additionally, there are some problems in the data preprocessing; for example, the "termdocmatrix" is very sparse, which is likely to have a bad influence on the subsequent analysis and evaluations. In future work, we plan to develop a better model or algorithm which can be used to mine and analyze different kinds of social network data with R. We will also focus on improving the data preprocessing, so as to make the results more precise.

10. ACKNOWLEDGMENT

We wish to thank Dr. Hong-Ning DAI for his patient guidance and vital suggestions on this report.

11. REFERENCES

[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. Sentiment analysis of Twitter data. Proceedings of the Workshop on Languages in Social Media, 39(4):620–622, 2011.

[2] V. Aswini and S. K. Lavanya. Pattern Discovery for Text Mining. Computation of Power, Energy, Information and Communication (ICCPEIC), 2014 International Conference on, IEEE, pp. 412-416, 2014.

[3] E. Bakshy, J.M. Hofman, W. A. Mason, and D. J.Watts. Everyone’s an influencer: quantifying influenceon twitter. In *Proceedings of the fourth ACMinternational conference on Web search and datamining* (WSDM ’11). ACM, New York, NY, USA,pp. 65-74. 2011.DOI=http://dx.doi.org/10.1145/1935826.1935845

[4] X. Chen, M. Vorvoreanu, and K. P. C. Madhavan. Mining Social Media Data for Understanding Students' Learning Experiences. IEEE Trans. Learn. Technol., vol. 7, no. 3, pp. 246–259, 2014.

[5] Kuan-Cheng Lin et al., Mining the user clusters onFacebook fan pages based on topic and sentimentanalysis. Information Reuse and Integration (IRI),2014 IEEE 15th International Conference on , vol.,no., pp.627-632, 13-15 Aug. 2014

[6] I.Feinerer, tm: Text Mining Package. R packageversion 0.5-7.1. 2012.

[7] I.Fellows, wordcloud: Word Clouds. R package version2.0. 2012.

[8] J. Gentry, twitteR: R based Twitter client. R packageversion 0.99.19. 2012.

[9] I.Guellil and K.Boukhalfa. Social big data mining: Asurvey focused on opinion mining and sentimentsanalysis. In Programming and Systems (ISPS), 2015


12th International Symposium on, pp. 1–10, April2015.

[10] E. Katz. The two-step flow of communication: An up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61–78, 1957.

[11] M. Mathioudakis and N. Koudas. TwitterMonitor :Trend Detection over the Twitter Stream. Proceeding:SIGMOD ’10 Proceedings of the 2010 ACM SIGMODInternational Conference on Management of data.ACM New York, NY. pp. 1155–1157. 2010.

[12] A. Pak and P. Paroubek. Twitter as a corpus forsentiment analysis and opinion mining. In SeventhConference on International Language ResourcesEvaluation, 2010.

[13] J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: a comparison of microblog search and web search. In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM '11). ACM, New York, NY, USA, pp. 35-44. 2011. DOI=http://dx.doi.org/10.1145/1935826.1935842

[14] S.M. Wu, J. M. Hofman, W. A. Mason, and D.J.Watts. Who says what to whom on twitter. In*Proceedings of the 20th international conference onWorld wide web* (WWW ’11). ACM, New York, NY,USA, pp.705-714. 2011.DOI=http://dx.doi.org/10.1145/1963405.1963504

[15] Hsin-Ying Wu; Kuan-Liang Liu; C. Trappey,Understanding customers using Facebook Pages: Datamining users feedback using text analysis. ComputerSupported Cooperative Work in Design (CSCWD),Proceedings of the 2014 IEEE 18th InternationalConference on , vol., no., pp.346-350, 21-23 May 2014

[16] L. M. Zhang, Y. Jia, X. Zhu, B. Zhou and Y. Han. User-Level Sentiment Evolution Analysis in Microblog. China Communications, vol. 11, no. 12, pp. 152–163, 2011.

[17] Y.C. Zhao, R and Data Mining: Examples and CaseStudies. Published by Elsevier. 2012.

[18] More details about R:https://www.r-project.org/about.html

[19] More information about stemming:https://en.wikipedia.org/wiki/Stemming

APPENDIX

A. CODE FOR TEXT MINING


library(ROAuth)
library(bitops)
library(RCurl)
library(twitteR)
library(NLP)
library(tm)
library(RColorBrewer)
library(wordcloud)
library(XML)

# Set Twitter auth URLs
reqTokenURL <- "https://api.twitter.com/oauth/request_token"
accessTokenURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
# Set Twitter keys
consumerKey <- "PXoumpl5ndvroikd1DPeGkcqE"
consumerSecret <- "raDtyWXPYBS5zAH0WVjUGKoiObIAEpHroWJ8G6UjlVn5DBdzbv"
accessToken <- "3954258018-HALNbJ0Jo0pPVK844ZvNBnz5yRCXcdyTPKNE4rq"
accessSecret <- "K45pUUUpWjqwSM0VgQZWDzx7D7F7RN74fB7gDg1EAh05B"
setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessSecret)

tweets <- searchTwitter("PrayforParis", since = "2015-11-13",
                        until = "2015-12-14", n = 3200)
(nDocs <- length(tweets))
# [1] 3200

# Convert tweets to a data frame
tweets.df <- twListToDF(tweets)
dim(tweets.df)
# [1] 3200 16

# Text cleaning
# Build a corpus, and specify the source to be character vectors
myCorpus <- Corpus(VectorSource(tweets.df$text))
# Convert to lower case
# tm v0.6
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# tm v0.5-10
# myCorpus <- tm_map(myCorpus, tolower)
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
# tm v0.6
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# tm v0.5-10
# myCorpus <- tm_map(myCorpus, removeURL)
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
# Remove punctuation
# myCorpus <- tm_map(myCorpus, removePunctuation)
# Remove numbers
# myCorpus <- tm_map(myCorpus, removeNumbers)
# Add two extra stop words: "pray" and "for"
myStopwords <- c(stopwords("english"), "pray", "for")
# Remove "ISIS" and "Paris" from stopwords
myStopwords <- setdiff(myStopwords, c("ISIS", "Paris"))
# Remove stopwords from corpus
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# Remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# Keep a copy of the corpus to use later as a dictionary
# for stem completion
myCorpusCopy <- myCorpus
# Stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# Inspect the first 5 documents (tweets)
# inspect(myCorpus[1:5])
# The code below wraps the text to fit the paper width
for (i in c(1:2, 320)) {
  cat(paste0("[", i, "] "))


  writeLines(strwrap(as.character(myCorpus[[i]]), 60))
}
# [1] RT BahutConfess PrayForPari
# [2] FCBayern dontbombsyria isi PrayForUmmah israil spdbpt bbc
#     PrayforPari Merkel franc BVBPAOK saudi
# [320] RT RodrigueDLG Rip aux victim du bataclan AMAs PrayForPari

# tm v0.5-10
# myCorpus <- tm_map(myCorpus, stemCompletion)
# tm v0.6
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  # Unexpectedly, stemCompletion completes an empty string to
  # a word in the dictionary. Remove empty strings to avoid this issue.
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
# Count frequency of "ISIS"
ISISCases <- lapply(myCorpusCopy,
                    function(x) { grep(as.character(x), pattern = "\\<ISIS") })
sum(unlist(ISISCases))
## [1] 8
# Count frequency of "pray"
prayCases <- lapply(myCorpusCopy,
                    function(x) { grep(as.character(x), pattern = "\\<pray") })
sum(unlist(prayCases))
## [1] 1136
# Replace "Islam" with "ISIS"
myCorpus <- tm_map(myCorpus, content_transformer(gsub),
                   pattern = "Islam", replacement = "ISIS")
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
tdm
# <<TermDocumentMatrix (terms: 3621, documents: 3200)>>
# Non-/sparse entries: 27543/11559657
# Sparsity           : 100%
# Maximal term length: 38
# Weighting          : term frequency (tf)

# Frequent words and associations
idx <- which(dimnames(tdm)$Terms == "pray")
inspect(tdm[idx + (0:5), 10:16])
# <<TermDocumentMatrix (terms: 6, documents: 7)>>
# Non-/sparse entries: 2/40
# Sparsity           : 95%
# Maximal term length: 14
# Weighting          : term frequency (tf)
#                 Docs
# Terms            10 11 12 13 14 15 16
#   pray            0  1  0  0  0  0  0
#   prayed          0  0  0  0  0  0  0
#   prayer          0  0  0  0  1  0  0
#   prayersburundi  0  0  0  0  0  0  0
#   prayersforfr    0  0  0  0  0  0  0
#   prayersforpari  0  0  0  0  0  0  0

# Inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 100))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 100)
df <- data.frame(term = names(term.freq), freq = term.freq)


library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# Select some terms
ggplot(df[30:60, ], aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# Which words are associated with "pray"?
findAssocs(tdm, "pray", 0.25)

# Clustering words
# Remove sparse terms
tdm2 <- removeSparseTerms(tdm, sparse = 0.95)
m2 <- as.matrix(tdm2)
# Cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")
# Other methods: complete, average, centroid
plot(fit)
# Cut tree into 10 clusters
rect.hclust(fit, k = 10)
(groups <- cutree(fit, k = 10))
# ld'attentat ?a dlej ld'et everyon
#  1  2  2  3  1  4
# fait il jamai les moi noubliera
#  2  5  2  2  6  2
# pari parisattack pour prayforpari rt simoncr
#  7  2  1  8  9  2
# thought un victim y ytbclara
#  4 10  1  5  1

# Change tdm to a Boolean matrix
termDocMatrix <- as.matrix(tdm)
# termDocMatrix <- as.matrix(tdm[40:240, 40:240])
# Remove "pray", "paris" and "shoot"
idx <- which(dimnames(termDocMatrix)$Terms %in% c("pray", "paris", "shoot"))
M <- termDocMatrix[-idx, ]
# Build a tweet-tweet adjacency matrix
tweetMatrix <- t(M) %*% M
library(igraph)
g <- graph.adjacency(tweetMatrix, weighted = TRUE, mode = "undirected")
V(g)$degree <- degree(g)
g <- simplify(g)
# Set labels of vertices to tweet IDs
V(g)$label <- V(g)$name
V(g)$label.cex <- 1
V(g)$label.color <- rgb(.4, 0, 0, .7)
V(g)$size <- 2
V(g)$frame.color <- NA
barplot(table(V(g)$degree))
tdm <- tdm[1:200, 1:200]
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# Load Twitter text
# library(twitteR); load(file = "data/rdmTweets.RData")
# Convert tweets to a data frame
df <- do.call("rbind", lapply(tdm, as.data.frame))
# Set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
# termDocMatrix <- as.matrix(tdm[40:100, 140:200])
dim(termDocMatrix)
# [1] 3642 3200


termDocMatrix[termDocMatrix >= 1] <- 1
# Transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
dim(termMatrix)
# [1] 3642 3642
# Inspect terms numbered 5 to 10
termMatrix[5:10, 5:10]
# Terms             abrahammateomus abzzni accept account acontecem across
#   abrahammateomus               1      0      0       0         0      0
#   abzzni                        0      1      0       0         0      0
#   accept                        0      0      2       0         0      0
#   account                       0      0      0       1         0      0
#   acontecem                     0      0      0       0         2      0
#   across                        0      0      0       0         0      2

library(igraph)
# Build a graph from the above matrix
g <- graph.adjacency(termMatrix, weighted = TRUE, mode = "undirected")
# Remove loops
g <- simplify(g)
# Set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# Set seed to make the layout reproducible
set.seed(30)
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout = layout1)
set.seed(3000)  # 3152
layout2 <- layout.fruchterman.reingold(g)
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)

termMatrix <- termMatrix[1500:2000, 1500:2000]
# Create a graph
# g <- graph.incidence(termDocMatrix, mode = c("all"))
g <- graph.incidence(termMatrix, mode = c("all"))
# Get indices of term vertices and tweet vertices
nTerms <- nrow(M)
nDocs <- ncol(M)
idx.terms <- 1:nTerms
idx.docs <- (nTerms + 1):(nTerms + nDocs)
# Set colors and sizes of vertices
V(g)$degree <- degree(g)
V(g)$color[idx.terms] <- rgb(0, 1, 0, .5)
V(g)$size[idx.terms] <- 6
V(g)$color[idx.docs] <- rgb(1, 0, 0, .4)
V(g)$size[idx.docs] <- 4
V(g)$frame.color <- NA
# Set vertex labels and their colors and sizes
V(g)$label <- V(g)$name
V(g)$label.color <- rgb(0, 0, 0, 0.5)
V(g)$label.cex <- 1.4 * V(g)$degree / max(V(g)$degree) + 1
# Set edge width and color
E(g)$width <- .3
E(g)$color <- rgb(.5, .5, 0, .3)
set.seed(1500)
plot(g, layout = layout.fruchterman.reingold)
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)


# Convert tweets to a data frame
df <- do.call("rbind", lapply(termMatrix, as.data.frame))
# Set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)

############### Sentiment analysis ###############
library(sentiment)  # provides classify_emotion() and classify_polarity()
# Harvest some tweets
some_tweets <- searchTwitter("#prayforparis", n = 10000, lang = "en")
# Get the text
some_txt <- sapply(some_tweets, function(x) x$getText())
# Remove retweet entities
some_txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# Remove @people
some_txt <- gsub("@\\w+", "", some_txt)
# Remove punctuation
some_txt <- gsub("[[:punct:]]", "", some_txt)
# Remove numbers
some_txt <- gsub("[[:digit:]]", "", some_txt)
# Remove html links
some_txt <- gsub("http\\w+", "", some_txt)
# Remove unnecessary spaces
some_txt <- gsub("[ \t]{2,}", "", some_txt)
some_txt <- gsub("^\\s+|\\s+$", "", some_txt)
# Define a "tolower with error handling" function
try.error <- function(x) {
  # Create missing value
  y <- NA
  # tryCatch error
  try_error <- tryCatch(tolower(x), error = function(e) e)
  # If not an error
  if (!inherits(try_error, "error"))
    y <- tolower(x)
  return(y)
}
# Lower case using try.error with sapply
some_txt <- sapply(some_txt, try.error)
# Remove NAs in some_txt
some_txt <- some_txt[!is.na(some_txt)]
names(some_txt) <- NULL
# Classify emotion
class_emo <- classify_emotion(some_txt, algorithm = "bayes", prior = 1.0)
# Get emotion best fit
emotion <- class_emo[, 7]
# Substitute NAs by "unknown"
emotion[is.na(emotion)] <- "unknown"
# Classify polarity
class_pol <- classify_polarity(some_txt, algorithm = "bayes")
# Get polarity best fit
polarity <- class_pol[, 4]
# Data frame with results
sent_df <- data.frame(text = some_txt, emotion = emotion,
                      polarity = polarity, stringsAsFactors = FALSE)
# Sort data frame
sent_df <- within(sent_df,
  emotion <- factor(emotion,
                    levels = names(sort(table(emotion), decreasing = TRUE))))


# Plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by emotion)") +
  theme(plot.title = element_text(size = 12))

# Plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by polarity)") +
  theme(plot.title = element_text(size = 12))

# Separate the text by emotion
emos <- levels(factor(sent_df$emotion))
nemo <- length(emos)
emo.docs <- rep("", nemo)
for (i in 1:nemo) {
  tmp <- some_txt[emotion == emos[i]]
  emo.docs[i] <- paste(tmp, collapse = " ")
}
# Remove stopwords
emo.docs <- removeWords(emo.docs, stopwords("english"))
# Create corpus
corpus <- Corpus(VectorSource(emo.docs))
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)
colnames(tdm) <- emos
# Comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3, .5), random.order = FALSE, title.size = 1.5)