Text Mining in R
Sara Weston and Debbie Yee
03/24/2017
What is Text Mining?
- Text mining is the process by which textual data is broken down into pieces for analysis.
- Pieces can be words or phrases.
- Pieces can be analyzed as they are or as the sentiment they represent.
- Text mining can be used to test hypotheses or gain descriptive insight into data.
Necessary packages
library(tm)          # for reading in text documents
library(tidytext)    # for cleaning text and sentiments
library(topicmodels) # for topic analysis
library(janeaustenr) # for free data
library(dplyr)       # for data manipulation
library(tidyr)       # for reshaping data
library(stringr)     # for manipulating string/text data
library(ggplot2)     # for pretty graphs
library(wordcloud)   # for word clouds (duh)
Load in fun data
We’re using data from the janeaustenr package, which includes all six of Jane Austen’s novels.

- The data require some preprocessing.
- We restructure the data so that each chapter is its own “observation,” with data on which book and which chapter it comes from.
- Code is included in the Rmd file, but not shown here; a sketch of one possible approach follows.
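A rough sketch of that preprocessing, following the standard tidytext approach for janeaustenr (the chapter-detection regex and the step collapsing lines into chapters are assumptions, since the actual Rmd code isn't shown):

library(janeaustenr)
library(dplyr)
library(stringr)

# Start from one row per line of text, tag each line with a running
# chapter count, then collapse the lines so each chapter is one row.
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(chapter = cumsum(str_detect(
    text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  group_by(book, chapter) %>%
  summarize(text = paste(text, collapse = " ")) %>%
  ungroup()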
## # A tibble: 6 × 3
##                  book chapter
##                <fctr>   <int>
## 1 Sense & Sensibility       0
## 2 Sense & Sensibility       1
## 3 Sense & Sensibility       2
## 4 Sense & Sensibility       3
## 5 Sense & Sensibility       4
## 6 Sense & Sensibility       5
## # ... with 1 more variables: text <chr>
Types of information
We can analyze text data in a lot of ways. Today we will talk about three ways to measure or ‘code’ text data:

- Word frequencies
- Word sentiment
- Topics, based on word clusters

Each of these measures requires that the data be restructured in a different way.

- Make use of the dplyr and tidyr packages
Types of data structures
- Long form
  - Each word gets its own row in a data frame.
  - Sometimes each word in each document (person).
  - Columns contain information about the word (and document).
- Short form
  - Each document (person) gets its own row.
  - Columns contain information about the documents, plus there is one column for every unique word in the corpus.
Long form

It’s easy to get the data into a form where each word gets its own row.

long_austen <- original_books %>%
  unnest_tokens(output = word,
                input = text,
                token = "words")
head(long_austen)
## # A tibble: 6 × 3
##                  book chapter        word
##                <fctr>   <int>       <chr>
## 1 Sense & Sensibility       0       sense
## 2 Sense & Sensibility       0         and
## 3 Sense & Sensibility       0 sensibility
## 4 Sense & Sensibility       0          by
## 5 Sense & Sensibility       0        jane
## 6 Sense & Sensibility       0      austen
A note about stop words.
Stop words are words in the English language that connect other words but often carry little or no content of their own.

Examples:
- Conjunctions
- Articles
- Prepositions
You will likely want to remove these words before you proceed.
Remove stop words

head(stop_words)
## # A tibble: 6 × 2
##        word lexicon
##       <chr>   <chr>
## 1         a   SMART
## 2       a's   SMART
## 3      able   SMART
## 4     about   SMART
## 5     above   SMART
## 6 according   SMART
long_austen <- long_austen %>%
  anti_join(stop_words)
## Joining, by = "word"
Short form
And you can use the long-form version to create the short form.
short_austen <- long_austen %>%
  mutate(bookchap = paste(book, chapter, sep = "_")) %>%  # unique document ID
  select(-c(book, chapter)) %>%
  group_by(bookchap) %>%
  count(word) %>%                                         # word counts per chapter
  cast_dtm(document = bookchap, term = word, value = n)   # to document-term matrix
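The result is a tm document-term matrix: one row per book-chapter, one column per unique word. A quick sanity check (printing a DocumentTermMatrix shows a summary, not the full matrix):

short_austen       # summary: number of documents, terms, sparsity
dim(short_austen)  # documents x terms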
Back to information
We can use these two data structures to calculate frequencies, match sentiments, and estimate topics.
First, frequencies.
Frequencies

long_austen %>%
  count(word)
## # A tibble: 13,816 × 2
##            word     n
##           <chr> <int>
## 1           _a_     2
## 2    _accepted_     1
## 3    _accident_     1
## 4       _adair_     1
## 5    _addition_     1
## 6  _advantages_     1
## 7      _affect_     1
## 8     _against_     1
## 9   _agreeable_     1
## 10        _air_     1
## # ... with 13,806 more rows
What can you do with frequencies?
- Summarize your data
- Estimate relationships between single words and other variables
Wordcloud
austen_freq <- long_austen %>%
  count(word)

wordcloud(words = austen_freq$word, freq = austen_freq$n, # necessary arguments
          min.freq = 200,        # fun arguments
          random.order = FALSE,
          random.color = FALSE,
          colors = brewer.pal(6, "Dark2"))
Wordcloud
[Word cloud of the most frequent words (min.freq = 200) across the six novels, including miss, time, sister, lady, dear, day, emma, elinor, elizabeth, and mother.]
Estimate relationships

short_freq <- long_austen %>%
  group_by(book, chapter) %>%
  count(word) %>%
  spread(key = word, value = n)
short_freq
## Source: local data frame [275 x 13,818]
## Groups: book, chapter [275]
##
##                   book chapter `_a_` `_accepted_` `_accident_` `_adair_`
## *               <fctr>   <int> <int>        <int>        <int>     <int>
## 1  Sense & Sensibility       0    NA           NA           NA        NA
## 2  Sense & Sensibility       1    NA           NA           NA        NA
## 3  Sense & Sensibility       2    NA           NA           NA        NA
## 4  Sense & Sensibility       3    NA           NA           NA        NA
## 5  Sense & Sensibility       4    NA           NA           NA        NA
## 6  Sense & Sensibility       5    NA           NA           NA        NA
## 7  Sense & Sensibility       6    NA           NA           NA        NA
## 8  Sense & Sensibility       7    NA           NA           NA        NA
## 9  Sense & Sensibility       8    NA           NA           NA        NA
## 10 Sense & Sensibility       9    NA           NA           NA        NA
## # ... with 265 more rows, and 13812 more variables: `_addition_` <int>,
## #   `_advantages_` <int>, `_affect_` <int>, `_against_` <int>,
## #   `_agreeable_` <int>, `_air_` <int>, ...
Estimate relationships
ggplot(short_freq, aes(x = chapter, y = family, fill = book)) +
  geom_bar(stat = "identity") +
  geom_smooth(se = FALSE) +
  guides(fill = FALSE) +
  facet_wrap(~book, scales = "free_x") +
  theme_bw()
Estimate relationships
[Faceted bar plots of the frequency of a single word by chapter, one panel per novel (Sense & Sensibility, Pride & Prejudice, Mansfield Park, Emma, Northanger Abbey, Persuasion); x-axis: chapter; the y-axis label in the rendered plot reads "marriage".]
Sentiment
Words have sentimental value. There are three ways you can operationalize the sentimental value of a word:

- Positive or negative
- Numeric (-5 to 5)
- Emotion (joy, fear, trust, disgust, etc.)

Use the get_sentiments function to get the operationalization you want.
Sentiment

get_sentiments("bing")

## # A tibble: 6,788 × 2
##           word sentiment
##          <chr>     <chr>
## 1      2-faced  negative
## 2      2-faces  negative
## 3           a+  positive
## 4     abnormal  negative
## 5      abolish  negative
## 6   abominable  negative
## 7   abominably  negative
## 8    abominate  negative
## 9  abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows
Sentiment

get_sentiments("afinn")

## # A tibble: 2,476 × 2
##          word score
##         <chr> <int>
## 1     abandon    -2
## 2   abandoned    -2
## 3    abandons    -2
## 4    abducted    -2
## 5   abduction    -2
## 6  abductions    -2
## 7       abhor    -3
## 8    abhorred    -3
## 9   abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows
Sentiment

get_sentiments("nrc")

## # A tibble: 13,901 × 2
##           word sentiment
##          <chr>     <chr>
## 1       abacus     trust
## 2      abandon      fear
## 3      abandon  negative
## 4      abandon   sadness
## 5    abandoned     anger
## 6    abandoned      fear
## 7    abandoned  negative
## 8    abandoned   sadness
## 9  abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows
Attaching sentiments
To use these, you can join the sentiments data frame with your long-form word data frame.

long_austen <- long_austen %>%
  inner_join(get_sentiments("afinn"))
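Each word row now carries a numeric AFINN score. As an illustration not shown on the slides, you could also aggregate to one mean score per chapter before plotting:

# Hypothetical aggregation: average AFINN score for each chapter
chapter_sentiment <- long_austen %>%
  group_by(book, chapter) %>%
  summarize(mean_score = mean(score, na.rm = TRUE))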
Use sentiments as a new variable
ggplot(long_austen, aes(x = chapter, y = score, color = book)) +
  geom_smooth(se = FALSE) +
  guides(color = FALSE) +
  facet_wrap(~book, scales = "free_x") +
  theme_bw()
Use sentiments as a new variable
[Smoothed AFINN sentiment score by chapter, one panel per novel; x-axis: chapter, y-axis: score.]
Topics
You can try to infer what topics are coming up in your text data.
- Does require you to make some guesses.
- Probably more useful for describing data and synthesizing comments and pilot data than for inferential stats.
- Use the short-form data set.
Topics
Latent Dirichlet allocation
books_lda <- LDA(short_austen, k = 6,
                 control = list(seed = 1234))
Topics

- Extract from that the beta matrix.
- In this, each word gets one row for each topic.
- Beta is the probability of that term being generated from that topic.
book_topics <- tidy(books_lda, matrix = "beta")
head(book_topics)
## # A tibble: 6 × 3
##   topic   term          beta
##   <int>  <chr>         <dbl>
## 1     1 austen  1.185131e-04
## 2     2 austen 5.053303e-320
## 3     3 austen  3.720076e-44
## 4     4 austen  1.742293e-95
## 5     5 austen  3.114776e-05
## 6     6 austen  3.114320e-37
Find the top terms in each topic

The best way to work with these data is to find the “top terms” in each topic, to try and figure out what the topic might be.

top_terms <- book_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
head(top_terms)
## # A tibble: 6 × 3
##   topic      term        beta
##   <int>     <chr>       <dbl>
## 1     1      emma 0.017916669
## 2     1      miss 0.013341364
## 3     1   harriet 0.009551112
## 4     1    weston 0.008867266
## 5     1 knightley 0.008115030
## 6     1     elton 0.007271614
Plot top terms
ggplot(top_terms, aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
Plot top terms
[Bar plots of the top 10 terms by beta for each of the six topics; each topic's top terms map onto one novel (e.g., elinor/marianne/dashwood; fanny/edmund/crawford; anne/elliot/wentworth; catherine/tilney/morland; emma/harriet/knightley; elizabeth/darcy/bennet).]
Use in Psychology Research
- Data were generously provided by Alexa Lord.
- Study on self-affirmation: “the act of affirming an important, typically non-threatened, aspect of the self.”
- Does self-affirmation reduce rejection sensitivity?
Experimental manipulation
- Self-affirmation condition: Rank values or traits on importance. Write for five minutes about the trait or value you listed as most important.
- Control condition: Think of a T.V. character and rank values or traits on importance to that character. Write for five minutes about why that T.V. character values the number one value or trait.
Load in data
# typical data
saf <- read.csv("saf.csv")

# text data
docs <- VCorpus(DirSource("tm data/Text Files"))
docs.tidy <- tidy(docs)
docs.tidy$ID <- gsub("\\.txt", "", docs.tidy$id)
Tokenize text
docs.tidy2 <- docs.tidy %>%
  unnest_tokens(output = word,
                input = text,
                token = "words") %>%
  anti_join(stop_words) %>%
  select(ID, word)
saf <- merge(saf, docs.tidy2, by = "ID")
Frequencies
Which words are used more frequently in the SAF condition than in the control condition?
# word counts by condition
saf.freq <- saf %>%
  group_by(SAF) %>%
  count(word) %>%
  ungroup()

# adjust counts by the total number of words in each condition
saf.freq <- saf %>%
  group_by(SAF) %>%
  summarize(n.words = n()) %>%
  ungroup() %>%
  inner_join(saf.freq) %>%
  mutate(n.adj = n / n.words)
Plot word frequencies

[Faceted bar plots of adjusted word frequencies (n.adj) by condition (SAF = 0 vs. 1); frequent words include family, feel, friendly, friends, life, loving, people, person, relationships, respect, character, humor, knowledge, makes, and sense.]
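The plotting code isn't shown on the slide; here is one sketch that could produce a figure like this (the cutoff of 10 words per condition is an assumption):

saf.freq %>%
  group_by(SAF) %>%
  top_n(10, n.adj) %>%                   # top words within each condition
  ggplot(aes(x = word, y = n.adj, fill = factor(SAF))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~SAF, scales = "free") +
  coord_flip()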
Sentiment

Which sentiments are found in each condition?

saf.freq %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(SAF, sentiment) %>%
  summarize(m.sent = mean(n.adj, na.rm = TRUE))
## Joining, by = "word"
## Source: local data frame [20 x 3]
## Groups: SAF [?]
##
##      SAF    sentiment       m.sent
##    <int>        <chr>        <dbl>
## 1      0        anger 0.0003602342
## 2      0 anticipation 0.0006829609
## 3      0      disgust 0.0004092731
## 4      0         fear 0.0004422031
## 5      0          joy 0.0007660650
## 6      0     negative 0.0003877042
## 7      0     positive 0.0007319573
## 8      0      sadness 0.0004619886
## 9      0     surprise 0.0004357961
## 10     0        trust 0.0007097314
## 11     1        anger 0.0004942628
## 12     1 anticipation 0.0013592024
## 13     1      disgust 0.0004737870
## 14     1         fear 0.0007217309
## 15     1          joy 0.0014757498
## 16     1     negative 0.0004641179
## 17     1     positive 0.0009689168
## 18     1      sadness 0.0005495285
## 19     1     surprise 0.0006640335
## 20     1        trust 0.0011505550
Plot sentiments
[Bar plot of mean adjusted frequency (m.sent) for each NRC sentiment category (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust), compared across conditions; x-axis: m.sent, y-axis: sentiment.]
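Again, the plotting code is omitted on the slide; a sketch under the assumption that the summarized data are saved first (saf.sent is a name introduced here):

saf.sent <- saf.freq %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(SAF, sentiment) %>%
  summarize(m.sent = mean(n.adj, na.rm = TRUE))

ggplot(saf.sent, aes(x = sentiment, y = m.sent, fill = factor(SAF))) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_bw()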
Topics
What are people talking about?
short_saf <- saf.freq %>%
  filter(!grepl("[0-9]", word)) %>%
  select(SAF, word, n) %>%
  cast_dtm(document = SAF, term = word, value = n)

saf_lda <- LDA(short_saf, k = 2,
               control = list(seed = 1234))
Extract topics
saf_topics <- tidy(saf_lda, matrix = "beta")
head(saf_topics)
## # A tibble: 6 × 3
##   topic      term         beta
##   <int>     <chr>        <dbl>
## 1     1 abilities 4.014336e-04
## 2     2 abilities 7.568858e-04
## 3     1   ability 2.452691e-03
## 4     2   ability 1.609864e-03
## 5     1     abuse 2.138209e-04
## 6     2     abuse 7.667266e-05
Plot topics

[Bar plots of the top terms by beta in each of the two topics; top terms include family, feel, friends, life, makes, people, person, relationships, respect, sense, care, friendly, humor, knowledge, and loving.]
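The plotting code isn't shown here either; a sketch that mirrors the earlier Austen top-terms plot (saf_top_terms is a name introduced here):

# Top 10 terms per topic, as with the Austen novels
saf_top_terms <- saf_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

ggplot(saf_top_terms, aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()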