Social Data Science - Text as Data · resources...
social data science
Text as Data
Sebastian Barfort
August 18, 2016
University of Copenhagen
Department of Economics
text all over the place
The widespread use of the Internet has led to an astronomical amount of digitized textual data accumulating every second through email, websites, and social media. The analysis of blog sites and social media posts can give new insights into human behaviors and opinions. At the same time, large-scale efforts to digitize previously published articles, books, and government documents have been underway, providing exciting opportunities for social scientists.
Imai (2016).
We need to learn how to think about and work with these kinds of new data
resources
Names (selective): Will Lowe, Justin Grimmer, Kenneth Benoit, Margaret E. Roberts, Sven-Oliver Proksch, Suresh Naidu
R packages: tm, quanteda, stm, stringr, tidytext
is your favorite professor biased?
Jelveh, Zubin, Bruce Kogut, and Suresh Naidu. “Detecting Latent Ideology in Expert Text: Evidence From Academic Papers in Economics”. EMNLP. 2014.
Previous work on extracting ideology from text has focused on domains where expression of political views is expected, but it’s unclear if current technology can work in domains where displays of ideology are considered inappropriate. We present a supervised ensemble n-gram model for ideology extraction with topic adjustments and apply it to one such domain: research papers written by academic economists. We show economists’ political leanings can be correctly predicted, that our predictions generalize to new domains, and that they correlate with public policy-relevant research findings. We also present evidence that unsupervised models can underperform in domains where ideological expression is discouraged.
validation
We emphasize that the complexity of language implies that automated content analysis methods will never replace careful and close reading of texts. Rather, the methods that we profile here are best thought of as amplifying and augmenting careful reading and thoughtful analysis. Further, automated content methods are incorrect models of language. This means that the performance of any one method on a new data set cannot be guaranteed, and therefore validation is essential when applying automated content methods.
Grimmer and Stewart (2013).
bag of words
Bag of Words: The ordering and grammar of words do not inform the analysis.
It is easy to construct sample sentences where word order fundamentally changes the nature of the sentence, but for most common tasks like measuring sentiment, topic modeling, etc. word order does not seem to matter (Grimmer and Stewart 2013)
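A minimal sketch of what the bag-of-words assumption throws away, using base R only. The two sentences and the helper name `bag_of_words` are made up for illustration and are not part of the original slides.

```r
# Hypothetical helper: turn a sentence into a table of word counts
bag_of_words = function(sentence) {
  words = strsplit(tolower(sentence), "[^a-z]+")[[1]]
  words = words[words != ""]
  table(words)
}

a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")

# The sentences mean different things, but their word counts are the same
same_words  = identical(names(a), names(b))
same_counts = all(as.integer(a) == as.integer(b))
c(same_words = same_words, same_counts = same_counts)
```

Any model that only sees these counts cannot tell the two sentences apart, which is exactly the information the assumption gives up.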
pre-processing
Stemming: Dimensionality reduction. Removes the ends of words to reduce the total number of unique words in the data.
Ex: family, families, families’, etc. all become famili.
Stop words: Words that do not convey meaning but primarily serve grammatical purposes.
Uncommon Words: Typically, words that appear very often or very rarely are excluded.
Also typically discard punctuation (although not always!), capitalization, etc.
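The pre-processing steps above can be sketched in base R as follows. This is an illustration under stated assumptions: the example documents, the three-word `stop_list`, and the one-line "stemmer" are all made up. In practice a proper stemmer (for example `SnowballC::wordStem`) and a full stop word list would be used.

```r
docs = c("My family loves the families' picnic!",
         "The picnic was, frankly, the BEST.")

# Assumption: a tiny hand-written stop word list, for illustration only
stop_list = c("my", "the", "was")

preprocess = function(doc) {
  doc = tolower(doc)                             # discard capitalization
  doc = gsub("[^a-z ]", "", doc)                 # discard punctuation and digits
  words = strsplit(doc, " +")[[1]]
  words = words[!words %in% c(stop_list, "")]    # drop stop words
  # Toy stemmer: map the endings "ies" and "y" to "i" (not a real stemming algorithm)
  sub("(ies|y)$", "i", words)
}

tokens = lapply(docs, preprocess)
tokens
```

Both "family" and "families'" end up as the same token, so the vocabulary shrinks without losing the word's meaning.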
Classifying Documents into Known Categories
introduction
Inferring and assigning text to categories is perhaps the most common use of content analysis in the social sciences
Ex: Classifying ads as positive or negative, deciding whether a piece of legislation is about the environment, etc.
Two broad approaches:
Dictionary Methods: Use relative frequency of key words to measure presence of category in a given text
Supervised Learning: Build on and extend familiar manual coding tasks using algorithms
dictionary methods
Perhaps the simplest and most intuitive automated text classification method
Use the rate at which key words appear in a text to classify documents into categories or to measure the extent to which documents belong to a particular category
Dictionary: a list of words that classify a particular collection of words
Note: For dictionary methods to work well, the scores attached to each word must closely align with how the words are used in a particular context
Dictionaries are rarely validated
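A minimal sketch of a dictionary method in base R. The dictionary, its scores, and the two example documents are invented for illustration. A real application would use an established lexicon and check it against the context at hand.

```r
# Assumption: a toy dictionary with hand-picked scores, for illustration only
dictionary = c(great = 1, friendly = 1, delicious = 1,
               rude = -1, terrible = -1, stale = -1)

docs = c(review_a = "great food and friendly staff",
         review_b = "rude staff and stale bread, terrible")

score_document = function(doc) {
  words = strsplit(tolower(doc), "[^a-z]+")[[1]]
  hits = dictionary[words[words %in% names(dictionary)]]
  # Average score of the dictionary words found; NA if none occur
  if (length(hits) == 0) NA_real_ else mean(hits)
}

scores = sapply(docs, score_document)
labels = ifelse(scores > 0, "positive", "negative")
labels
```

Words outside the dictionary ("food", "staff", "bread") contribute nothing, which is why the match between dictionary and context matters so much.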
supervised learning methods
Dictionary methods require that we are able to identify a priori the words that separate classes
This can be wrong and/or inefficient
Supervised learning models are designed to automate the hand coding of documents
Supervised learning models: Human coders categorize a set of documents by hand. The algorithm then “learns” how to sort the documents into categories using these training data and applies its predictions to new unlabeled texts
approach
1. Construct a training set
2. Apply the supervised learning method using cross-validation
3. Decide on “best” model and classify the remaining documents
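The three steps above can be sketched in base R. The slides do not name a particular classifier, so this sketch swaps in a small hand-written multinomial naive Bayes model as a stand-in. The training texts, labels, candidate smoothing values, and helper names (`fit_nb`, `predict_nb`, `cv_accuracy`) are all made up for illustration.

```r
# Step 1: a hand-coded training set (made up for illustration)
train = data.frame(
  text  = c("great food friendly staff", "delicious fresh menu",
            "love this place great service", "friendly and delicious",
            "rude staff terrible food", "stale bread dirty table",
            "worst service rude manager", "terrible and dirty"),
  label = rep(c("pos", "neg"), each = 4),
  stringsAsFactors = FALSE
)
unlabeled = c("fresh food great staff", "dirty place rude service")

tokenize = function(x) strsplit(tolower(x), "[^a-z]+")

# Multinomial naive Bayes with add-alpha smoothing
fit_nb = function(data, alpha) {
  vocab = unique(unlist(tokenize(data$text)))
  counts = sapply(split(data$text, data$label), function(txt)
    table(factor(unlist(tokenize(txt)), levels = vocab)))
  list(vocab = vocab,
       loglik = log(t(t(counts + alpha) / colSums(counts + alpha))),
       logprior = log(sapply(split(data$label, data$label), length) / nrow(data)))
}

predict_nb = function(model, texts) {
  sapply(tokenize(texts), function(words) {
    words = words[words %in% model$vocab]   # ignore words never seen in training
    score = model$logprior + colSums(model$loglik[words, , drop = FALSE])
    names(which.max(score))
  })
}

# Step 2: leave-one-out cross-validation for each candidate smoothing value
cv_accuracy = function(alpha) {
  hits = sapply(seq_len(nrow(train)), function(i) {
    model = fit_nb(train[-i, ], alpha)
    predict_nb(model, train$text[i]) == train$label[i]
  })
  mean(hits)
}
alphas = c(0.1, 1, 10)
accuracy = sapply(alphas, cv_accuracy)

# Step 3: refit the best candidate on all training data, classify the rest
best_alpha = alphas[which.max(accuracy)]
final_model = fit_nb(train, best_alpha)
predicted = predict_nb(final_model, unlabeled)
predicted
```

With eight documents the cross-validated accuracies are very noisy. The point is only the order of operations: label, cross-validate, pick, then predict.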
Classification with Unknown Categories
introduction
Supervised and dictionary methods assume a well-defined set of categories
Often, this set of categories is difficult to derive beforehand
It must be discovered from the text itself
Unsupervised Learning: Try to learn underlying features of text without explicitly imposing categories of interest
1. Estimate set of categories
2. Assign documents (or parts of documents) to those categories
Often: topic models
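A minimal sketch of the two steps in base R. Topic models need additional packages (for example `stm` from the resources slide), so this sketch swaps in k-means clustering on word counts as a simpler stand-in. The four documents and the choice of two clusters are assumptions made for illustration.

```r
docs = c("tax cut budget tax", "budget deficit tax",
         "school teacher student", "student school exam school")

# Document-term matrix of raw counts (rows: documents, columns: words)
tokens = strsplit(docs, " ")
vocab = sort(unique(unlist(tokens)))
dtm = t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(dtm) = paste0("doc", seq_along(docs))
storage.mode(dtm) = "double"

# Step 1: estimate categories (here: 2 clusters; the number is our choice)
set.seed(1)
fit = kmeans(dtm, centers = 2, nstart = 10)

# Step 2: assign documents to the estimated categories
clusters = fit$cluster
clusters

# Words with high average counts characterize each cluster
round(fit$centers, 1)
```

The cluster numbers themselves are arbitrary. Interpreting a cluster as "fiscal policy" or "education" is up to the analyst, which is why validation matters here too.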
Measuring Latent Features in Texts
introduction
Can we locate actors (politicians, newspapers, researchers) in an ideological space using text data?
Assumption: Ideological dominance. Actors’ ideological preferences determine what they discuss in texts.
Wordscores: Supervised learning approach. Special case of dictionary method.
Wordfish: Unsupervised learning approach. Discover words that distinguish locations on a policy scale.
wordscore
1. Select reference texts that define the position in the policy space (e.g. a conservative and liberal politician)
2. Use training data to determine relative frequency of words. Creates a measure of how well various words separate the categories
3. Use these word scores to scale remaining texts.
Disadvantage: Conflates policy dominance with stylistic differences
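The three steps can be sketched in base R, following the usual description of Wordscores (word score = expected reference position given the word; text score = frequency-weighted mean of its word scores). The reference texts, their positions on a scale from -1 to +1, and the text to be scaled are all made up. The rescaling of text scores used in full implementations is left out.

```r
# Step 1: made-up reference texts with assumed positions (-1 = left, +1 = right)
reference = c(left  = "welfare welfare equality tax public",
              right = "market market freedom tax private")
positions = c(left = -1, right = 1)
new_text = "market freedom tax welfare market"

tokenize = function(x) strsplit(x, " ")[[1]]
vocab = sort(unique(unlist(lapply(reference, tokenize))))

# Step 2: relative frequency of each word within each reference text
rel_freq = sapply(reference, function(txt) {
  counts = table(factor(tokenize(txt), levels = vocab))
  counts / sum(counts)
})

# Probability of each reference text given the word, then the word score
p_text_given_word = rel_freq / rowSums(rel_freq)
word_scores = drop(p_text_given_word %*% positions)

# Step 3: scale the new text using only the words that received a score
new_words = tokenize(new_text)
new_words = new_words[new_words %in% vocab]
text_score = mean(word_scores[new_words])
text_score
```

A word used equally by both reference texts ("tax") gets a score of zero and pulls the text toward the middle, while words used by only one side pull it toward that side.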
Predicting Yelp Reviews
introduction
David Robinson: Does sentiment analysis work? A tidy analysis of Yelp reviews
Sentiment analysis is often used by companies to quantify general social media opinion (for example, using tweets about several brands to compare customer satisfaction).
One of the simplest and most common sentiment analysis methods is to classify words as “positive” or “negative”, then to average the values of each word to categorize the entire document.
Can we use this approach to predict Yelp reviews?
data
Can be downloaded from here
library("readr")
library("dplyr")

# Read the first 50,000 lines; each line holds one review as a JSON object
infile = "../nopub/yelp_academic_dataset_review.json"
review_lines = read_lines(infile,
                          n_max = 50000,
                          progress = FALSE)
from json to data frame
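library("stringr")
library("jsonlite")

# Join the one-object-per-line records into a single JSON array
reviews_combined = str_c("[",
                         str_c(review_lines, collapse = ", "),
                         "]")

# Parse the array, flatten nested columns, and convert to a tibble
reviews = fromJSON(reviews_combined) %>%
  flatten() %>%
  tbl_df()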
tidy text data
Right now, there is one row for each review.
Remember bag of words assumption: predictors are at the word, not sentence level
-> We need to tidy the data
library("tidytext")

review_words = reviews %>%
  select(review_id, business_id, stars, text) %>%
  unnest_tokens(word, text)

review_words %>% dim
## [1] 5930037 4
remove stop words
review_words = review_words %>%
  filter(!word %in% stop_words$word) %>%
  filter(str_detect(word, "^[a-z']+$"))
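To see what the second filter keeps, the same pattern can be tried on a handful of made-up tokens. This sketch uses base R's `grepl` rather than `str_detect`; for a pattern this simple the two should agree.

```r
# Made-up tokens, for illustration only
words = c("pizza", "don't", "5", "2pm", "wi-fi", "yum")

# Keep tokens made up solely of lowercase letters a-z and apostrophes
keep = grepl("^[a-z']+$", words)
kept_words = words[keep]
kept_words
```

Numbers, tokens containing digits, and hyphenated tokens are dropped, while contractions such as "don't" survive.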
review_id               business_id             stars  word
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4      hoagie
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4      institution
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4      walking
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4      throwback
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4      ago
afinn
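# Written against the 2016 tidytext API; newer releases may expose
# this lexicon through get_sentiments("afinn") instead
AFINN = sentiments %>%
  filter(lexicon == "AFINN") %>%
  select(word, afinn_score = score)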
word       afinn_score
abandon    -2
abandoned  -2
abandons   -2
abducted   -2
abduction  -2
# Average AFINN score of the scored words in each review
reviews_sentiment = review_words %>%
  inner_join(AFINN, by = "word") %>%
  group_by(review_id, stars) %>%
  summarize(sentiment = mean(afinn_score))
review_id               stars  sentiment
__-r0eC3hZlaejvuliC8zQ  5       4.0000000
__77nP3Nf1wsGz5HPs2hdw  5       1.6000000
__DK9Vsmyoo0zJQhIl5cbg  1      -2.1000000
__ELCJ0wzDM2QNRfVUq26Q  5       3.5000000
__esH_kgJZeS8k3i6HaG7Q  5       0.2142857
df.review = reviews_sentiment %>%
  group_by(stars) %>%
  summarise(m.sentiment = mean(sentiment))
[Figure: mean sentiment score (x-axis) by number of stars on Yelp (y-axis, 1 to 5)]
word frequency
review_words_counted = review_words %>%
  count(review_id, business_id, stars, word) %>%
  ungroup()
review_id               business_id             stars  word       n
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5      amazing    1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5      attentive  1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5      breakfast  1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5      cheese     1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5      chili      1
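# Per word: number of businesses and reviews it appears in,
# total uses, and the average star rating of those reviews
word_summaries = review_words_counted %>%
  group_by(word) %>%
  summarize(businesses = n_distinct(business_id),
            reviews = n(),
            uses = sum(n),
            average_stars = mean(stars)) %>%
  ungroup() %>%
  arrange(reviews)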
[Figure: histogram of words by the number of reviews they appear in (x-axis: reviews; y-axis: count)]
word_summaries_filtered = word_summaries %>%
  filter(reviews >= 200, businesses >= 10)
word     businesses  reviews  uses  average_stars
lives    162         200      205   3.785000
regret   162         200      206   3.915000
bloody   80          201      246   3.621891
courses  84          201      242   3.800995
crowds   130         201      208   3.766169
[Figure: histogram of the filtered words by the number of reviews they appear in (x-axis: reviews; y-axis: count)]
positive words
df.0 = word_summaries_filtered %>%
  arrange(-average_stars)
word        businesses  reviews  uses  average_stars
gem         272         504      509   4.482143
superb      171         250      253   4.460000
incredible  268         519      554   4.458574
amazing     927         3696     4240  4.391775
highly      736         1660     1729  4.388554
negative words
df.1 = word_summaries_filtered %>%
  arrange(average_stars)
word        businesses  reviews  uses  average_stars
refused     178         205      226   1.604878
worst       667         1215     1321  1.650206
disgusting  252         321      347   1.735202
rude        576         1005     1162  1.833831
horrible    582         938      1043  1.835821
[Figure: words plotted by # of reviews (x-axis) against average stars (y-axis). Labeled high-star words: gem, superb, amazing, highly, knowledgeable, gardens, perfect, favorites, personable, delicious, beautiful. Labeled low-star words: refused, worst, disgusting, rude, awful, terrible, tasteless, poorly, waste, poor, charged, stale, dirty, manager, attitude, excuse. Labeled frequent words: food, service, nice, friendly, staff, love, people, restaurant, menu, delicious, bad, fresh, amazing]
statistical analysis
[Figure: stars (y-axis, 1 to 5) against sentiment (x-axis, −2.5 to 5.0)]
cross validation
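library("purrr")
library("modelr")

# For a polynomial of degree pol: draw 200 random train/test splits,
# fit stars on sentiment in each, and record train and test RMSE
gen_crossv = function(pol, data = reviews_sentiment){
  data %>%
    crossv_mc(200) %>%
    mutate(mod = map(train, ~ lm(stars ~ poly(sentiment, pol), data = .)),
           rmse.test = map2_dbl(mod, test, rmse),
           rmse.train = map2_dbl(mod, train, rmse))
}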
set.seed(3000)
df.cv = 1:8 %>%
  map_df(gen_crossv, .id = "degree")
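What the Monte Carlo cross-validation does can be sketched in base R without `modelr` or `purrr`. This is an illustration only: the data frame `sim` is simulated rather than the Yelp data, and the helper names `rmse` and `mc_cv`, the 80/20 split, and the 50 repetitions are choices made for this sketch.

```r
set.seed(3000)

# Simulated stand-in for reviews_sentiment (made-up data, not the Yelp reviews)
n = 500
sentiment = runif(n, -3, 4)
stars = pmin(5, pmax(1, round(3 + 0.5 * sentiment + rnorm(n))))
sim = data.frame(sentiment, stars)

# Root mean squared error of a fitted model on a given data set
rmse = function(model, data) {
  sqrt(mean((data$stars - predict(model, newdata = data))^2))
}

# Monte Carlo cross-validation: repeated random train/test splits
mc_cv = function(degree, data, n_splits = 50, test_share = 0.2) {
  test_rmse = replicate(n_splits, {
    test_rows = sample(nrow(data), size = floor(test_share * nrow(data)))
    model = lm(stars ~ poly(sentiment, degree), data = data[-test_rows, ])
    rmse(model, data[test_rows, ])
  })
  mean(test_rmse)   # average out-of-sample error for this degree
}

cv_results = sapply(1:4, mc_cv, data = sim)
names(cv_results) = paste0("degree_", 1:4)
cv_results
```

The degree with the lowest average test RMSE would be the natural candidate for the "best" model. Comparing it with the training RMSE shows how much of a more flexible model's apparent fit is overfitting.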
Text Analysis of Donald Trump’s Tweets
file = paste0("http://varianceexplained.org/",
              "files/",
              "trump_tweets_df.rda")
load(url(file))
question
David Robinson: Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half
Who writes Donald Trump’s tweets?
They are written from two different devices: an iPhone and an Android
Can we examine quantitatively whether a tweet is written by Donald Trump himself or by someone on his staff?
library("tidyr")

# Pull the device name out of statusSource and keep iPhone and Android tweets
tweets = trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
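How the capture group picks out the device can be tried in base R. The three `status_source` strings are illustrative values in the assumed format of the `statusSource` column (an HTML link naming the client), not rows from the actual data.

```r
# Illustrative statusSource values (assumed format, made up for this sketch)
status_source = c(
  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"
)

# Capture the text between "Twitter for " and the next "<"
pattern = "Twitter for (.*?)<"
m = regmatches(status_source, regexec(pattern, status_source, perl = TRUE))

# Element 1 of each match is the full match, element 2 the captured group;
# strings without a match get NA
device = sapply(m, function(x) if (length(x) == 2) x[2] else NA_character_)
device
```

Clients that do not match the pattern end up as `NA`, so they are dropped by a filter that keeps only "iPhone" and "Android".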
id                  source   text  created
762669882571980801  Android  My economic policy speech will be carried live at 12:15 P.M. Enjoy!  2016-08-08 15:20:44
762641595439190016  iPhone   Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8  2016-08-08 13:28:20
762439658911338496  iPhone   #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA  2016-08-08 00:05:54
762425371874557952  Android  Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky!  2016-08-07 23:09:08
762400869858115588  Android  The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest!  2016-08-07 21:31:46
library("tidytext")
library("stringr")

# Split on anything that is not a letter, digit, #, @ or word-internal apostrophe
reg = "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words = tweets %>%
  filter(!str_detect(text, '^"')) %>%      # drop tweets that start with a quote
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
id                  source  created              word
676494179216805888  iPhone  2015-12-14 20:09:15  record
676494179216805888  iPhone  2015-12-14 20:09:15  health
676494179216805888  iPhone  2015-12-14 20:09:15  #makeamericagreatagain
676494179216805888  iPhone  2015-12-14 20:09:15  #trump2016
676509769562251264  iPhone  2015-12-14 21:11:12  accolade
[Figure: bar chart of the most common words in the tweets (x-axis: n, 0 to 150). Words shown: trump, bad, cruz, america, people, #makeamericagreatagain, clinton, crooked, #trump2016, hillary]
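android_iphone_ratios = tweet_words %>%
  count(word, source) %>%
  group_by(word) %>%                 # explicit grouping so the filter is per word
  filter(sum(n) >= 5) %>%            # keep words used at least 5 times in total
  spread(source, n, fill = 0) %>%
  ungroup() %>%
  mutate_each(funs((. + 1) / sum(. + 1)), -word) %>%   # smoothed share per device
  mutate(logratio = log2(Android / iPhone))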
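The smoothed log ratio can be worked through on a tiny table in base R. The word counts below are made up for illustration and are not taken from the tweet data.

```r
# Made-up word counts by device (illustration only)
counts = data.frame(word    = c("crooked", "join", "badly", "tomorrow"),
                    Android = c(30, 2, 12, 1),
                    iPhone  = c(4, 25, 0, 14),
                    stringsAsFactors = FALSE)

# Add one to every count, then convert to shares within each device
android_share = (counts$Android + 1) / sum(counts$Android + 1)
iphone_share  = (counts$iPhone + 1) / sum(counts$iPhone + 1)

# Positive: relatively more common on Android; negative: more common on iPhone
counts$logratio = log2(android_share / iphone_share)
counts[order(counts$logratio), ]
```

Adding one before taking shares keeps the ratio finite for words that never appear on one device ("badly" here), and the base-2 log means a value of 1 corresponds to a word being twice as common on Android.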
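[Figure: bar chart of words by Android / iPhone log ratio (x-axis ticks from −5.0 to 2.5). Words shown: #makeamericagreatagain, #trump2016, join, #americafirst, #votetrump, #imwithyou, #crookedhillary, #trumppence16, 7pm, tomorrow, ago, brexit, joke, mails, strong, talking, spent, weak, crazy, badly]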
sentiment analysis
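# Written against the 2016 tidytext API; newer releases may expose
# this lexicon through get_sentiments("nrc") instead
nrc = sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)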
word       sentiment
abacus     trust
abandon    fear
abandon    negative
abandon    sadness
abandoned  anger
[Figure: Android / iPhone log ratio (y-axis) for the words most associated with each NRC sentiment, one panel per sentiment: sadness, fear, anger, disgust, surprise, anticipation, trust, joy. Bars are marked by device (Android, iPhone)]
your turn
Continue working with these data in groups
Think of interesting patterns you can explore in the data
What about the timing of tweets? Could everything be included in one large model?