Text Mining in R
Sara Weston and Debbie Yee
03/24/2017
What is Text Mining?
- Text mining is the process by which textual data is broken down into pieces for analysis.
- Pieces can be words or phrases.
- Pieces can be analyzed as they are or as the sentiment they represent.
- Text mining can be used to test hypotheses or gain descriptive insight into data.
Necessary packages
library(tm)          # for reading in text documents
library(tidytext)    # for cleaning text and sentiments
library(topicmodels) # for topic analysis
library(janeaustenr) # for free data
library(dplyr)       # for data manipulation
library(tidyr)       # for reshaping data
library(stringr)     # for manipulating string/text data
library(ggplot2)     # for pretty graphs
library(wordcloud)   # for word clouds (duh)
Load in fun data
We’re using data from the janeaustenr package, which includes all six of Jane Austen’s novels.

- The data require some preprocessing.
- We restructure the data so that each chapter is its own “observation,” with data on which book and which chapter it comes from.
- Code is included in the Rmd file, but not shown here; a sketch of one possible approach follows.
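A rough sketch of that preprocessing, following the standard tidytext approach for janeaustenr (the chapter-detection regex and the step collapsing lines into chapters are assumptions, since the actual Rmd code isn't shown):

library(janeaustenr)
library(dplyr)
library(stringr)

# Start from one row per line of text, tag each line with a running
# chapter count, then collapse the lines so each chapter is one row.
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(chapter = cumsum(str_detect(
    text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  group_by(book, chapter) %>%
  summarize(text = paste(text, collapse = " ")) %>%
  ungroup()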
## # A tibble: 6 × 3
##                  book chapter
##                <fctr>   <int>
## 1 Sense & Sensibility       0
## 2 Sense & Sensibility       1
## 3 Sense & Sensibility       2
## 4 Sense & Sensibility       3
## 5 Sense & Sensibility       4
## 6 Sense & Sensibility       5
## # ... with 1 more variables: text <chr>
Types of information
We can analyze text data in a lot of ways. Today we will talk about three ways to measure or ‘code’ text data:

- Word frequencies
- Word sentiment
- Topics, based on word clusters

Each of these measures requires that the data be restructured in a different way.

- Make use of the dplyr and tidyr packages
Types of data structures
- Long form
  - Each word gets its own row in a data frame.
  - Sometimes each word in each document (person).
  - Columns contain information about the word (and document).
- Short form
  - Each document (person) gets its own row.
  - Columns contain information about the documents, plus there is one column for every unique word in the corpus.
Long form

It’s easy to get the data into a form where each word gets its own row.

long_austen <- original_books %>%
  unnest_tokens(output = word,
                input = text,
                token = "words")
head(long_austen)
## # A tibble: 6 × 3
##                  book chapter        word
##                <fctr>   <int>       <chr>
## 1 Sense & Sensibility       0       sense
## 2 Sense & Sensibility       0         and
## 3 Sense & Sensibility       0 sensibility
## 4 Sense & Sensibility       0          by
## 5 Sense & Sensibility       0        jane
## 6 Sense & Sensibility       0      austen
A note about stop words.
Stop words are words in the English language that connect other words but often carry little or no content of their own.

Examples:
- Conjunctions
- Articles
- Prepositions
You will likely want to remove these words before you proceed.
Remove stop words

head(stop_words)
## # A tibble: 6 × 2
##        word lexicon
##       <chr>   <chr>
## 1         a   SMART
## 2       a's   SMART
## 3      able   SMART
## 4     about   SMART
## 5     above   SMART
## 6 according   SMART
long_austen <- long_austen %>%
  anti_join(stop_words)
## Joining, by = "word"
Short form
And you can use the long-form version to create the short form.
short_austen <- long_austen %>%
  mutate(bookchap = paste(book, chapter, sep = "_")) %>%  # unique document ID
  select(-c(book, chapter)) %>%
  group_by(bookchap) %>%
  count(word) %>%                                         # word counts per chapter
  cast_dtm(document = bookchap, term = word, value = n)   # to document-term matrix
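The result is a tm document-term matrix: one row per book-chapter, one column per unique word. A quick sanity check (printing a DocumentTermMatrix shows a summary, not the full matrix):

short_austen       # summary: number of documents, terms, sparsity
dim(short_austen)  # documents x terms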
Back to information
We can use these two data structures to calculate frequencies, match sentiments, and estimate topics.
First, frequencies.
Frequencies

long_austen %>%
  count(word)
## # A tibble: 13,816 × 2
##            word     n
##           <chr> <int>
## 1           _a_     2
## 2    _accepted_     1
## 3    _accident_     1
## 4       _adair_     1
## 5    _addition_     1
## 6  _advantages_     1
## 7      _affect_     1
## 8     _against_     1
## 9   _agreeable_     1
## 10        _air_     1
## # ... with 13,806 more rows
What can you do with frequencies?
- Summarize your data
- Estimate relationships between single words and other variables
Wordcloud
austen_freq <- long_austen %>%
  count(word)

wordcloud(words = austen_freq$word, freq = austen_freq$n, # necessary arguments
          min.freq = 200,        # fun arguments
          random.order = FALSE,
          random.color = FALSE,
          colors = brewer.pal(6, "Dark2"))
Wordcloud
[Word cloud of the most frequent words (min.freq = 200) across the six novels, including miss, time, sister, lady, dear, day, emma, elinor, elizabeth, and mother.]
Estimate relationships

short_freq <- long_austen %>%
  group_by(book, chapter) %>%
  count(word) %>%
  spread(key = word, value = n)
short_freq
## Source: local data frame [275 x 13,818]
## Groups: book, chapter [275]
##
##                   book chapter `_a_` `_accepted_` `_accident_` `_adair_`
## *               <fctr>   <int> <int>        <int>        <int>     <int>
## 1  Sense & Sensibility       0    NA           NA           NA        NA
## 2  Sense & Sensibility       1    NA           NA           NA        NA
## 3  Sense & Sensibility       2    NA           NA           NA        NA
## 4  Sense & Sensibility       3    NA           NA           NA        NA
## 5  Sense & Sensibility       4    NA           NA           NA        NA
## 6  Sense & Sensibility       5    NA           NA           NA        NA
## 7  Sense & Sensibility       6    NA           NA           NA        NA
## 8  Sense & Sensibility       7    NA           NA           NA        NA
## 9  Sense & Sensibility       8    NA           NA           NA        NA
## 10 Sense & Sensibility       9    NA           NA           NA        NA
## # ... with 265 more rows, and 13812 more variables: `_addition_` <int>,
## #   `_advantages_` <int>, `_affect_` <int>, `_against_` <int>,
## #   `_agreeable_` <int>, `_air_` <int>, ...
Estimate relationships
ggplot(short_freq, aes(x = chapter, y = family, fill = book)) +
  geom_bar(stat = "identity") +
  geom_smooth(se = FALSE) +
  guides(fill = FALSE) +
  facet_wrap(~book, scales = "free_x") +
  theme_bw()
Estimate relationships
[Faceted bar plots of the frequency of a single word by chapter, one panel per novel (Sense & Sensibility, Pride & Prejudice, Mansfield Park, Emma, Northanger Abbey, Persuasion); x-axis: chapter; the y-axis label in the rendered plot reads "marriage".]
Sentiment
Words have sentimental value. There are three ways you can operationalize the sentimental value of a word:

- Positive or negative
- Numeric (-5 to 5)
- Emotion (joy, fear, trust, disgust, etc.)

Use the get_sentiments function to get the operationalization you want.
Sentiment

get_sentiments("bing")

## # A tibble: 6,788 × 2
##           word sentiment
##          <chr>     <chr>
## 1      2-faced  negative
## 2      2-faces  negative
## 3           a+  positive
## 4     abnormal  negative
## 5      abolish  negative
## 6   abominable  negative
## 7   abominably  negative
## 8    abominate  negative
## 9  abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows
Sentiment

get_sentiments("afinn")

## # A tibble: 2,476 × 2
##          word score
##         <chr> <int>
## 1     abandon    -2
## 2   abandoned    -2
## 3    abandons    -2
## 4    abducted    -2
## 5   abduction    -2
## 6  abductions    -2
## 7       abhor    -3
## 8    abhorred    -3
## 9   abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows
Sentiment

get_sentiments("nrc")

## # A tibble: 13,901 × 2
##           word sentiment
##          <chr>     <chr>
## 1       abacus     trust
## 2      abandon      fear
## 3      abandon  negative
## 4      abandon   sadness
## 5    abandoned     anger
## 6    abandoned      fear
## 7    abandoned  negative
## 8    abandoned   sadness
## 9  abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows
Attaching sentiments
To use these, you can join the sentiments data frame with your long-form word data frame.

long_austen <- long_austen %>%
  inner_join(get_sentiments("afinn"))
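Each word row now carries a numeric AFINN score. As an illustration not shown on the slides, you could also aggregate to one mean score per chapter before plotting:

# Hypothetical aggregation: average AFINN score for each chapter
chapter_sentiment <- long_austen %>%
  group_by(book, chapter) %>%
  summarize(mean_score = mean(score, na.rm = TRUE))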
Use sentiments as a new variable
ggplot(long_austen, aes(x = chapter, y = score, color = book)) +
  geom_smooth(se = FALSE) +
  guides(color = FALSE) +
  facet_wrap(~book, scales = "free_x") +
  theme_bw()
Use sentiments as a new variable
[Smoothed AFINN sentiment score by chapter, one panel per novel; x-axis: chapter, y-axis: score.]
Topics
You can try to infer what topics are coming up in your text data.
- Does require you to make some guesses.
- Probably more useful for describing data and synthesizing comments and pilot data than for inferential stats.
- Use the short-form data set.
Topics
Latent Dirichlet allocation
books_lda <- LDA(short_austen, k = 6,
                 control = list(seed = 1234))
Topics

- Extract from that the beta matrix.
- In this, each word gets one row for each topic.
- Beta is the probability of that term being generated from that topic.
book_topics <- tidy(books_lda, matrix = "beta")
head(book_topics)
## # A tibble: 6 × 3
##   topic   term          beta
##   <int>  <chr>         <dbl>
## 1     1 austen  1.185131e-04
## 2     2 austen 5.053303e-320
## 3     3 austen  3.720076e-44
## 4     4 austen  1.742293e-95
## 5     5 austen  3.114776e-05
## 6     6 austen  3.114320e-37
Find the top terms in each topic

The best way to work with these data is to find the “top terms” in each topic, to try and figure out what the topic might be.

top_terms <- book_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
head(top_terms)
## # A tibble: 6 × 3
##   topic      term        beta
##   <int>     <chr>       <dbl>
## 1     1      emma 0.017916669
## 2     1      miss 0.013341364
## 3     1   harriet 0.009551112
## 4     1    weston 0.008867266
## 5     1 knightley 0.008115030
## 6     1     elton 0.007271614
Plot top terms
ggplot(top_terms, aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
Plot top terms
[Bar plots of the top 10 terms by beta for each of the six topics; each topic's top terms map onto one novel (e.g., elinor/marianne/dashwood; fanny/edmund/crawford; anne/elliot/wentworth; catherine/tilney/morland; emma/harriet/knightley; elizabeth/darcy/bennet).]
Use in Psychology Research
- Data were generously provided by Alexa Lord.
- Study on self-affirmation: “the act of affirming an important, typically non-threatened, aspect of the self.”
- Does self-affirmation reduce rejection sensitivity?
Experimental manipulation
- Self-affirmation condition: Rank values or traits on importance. Write for five minutes about the trait or value you listed as most important.
- Control condition: Think of a T.V. character and rank values or traits on importance to that character. Write for five minutes about why that T.V. character values the number one value or trait.
Load in data
# typical data
saf <- read.csv("saf.csv")

# text data
docs <- VCorpus(DirSource("tm data/Text Files"))
docs.tidy <- tidy(docs)
docs.tidy$ID <- gsub("\\.txt", "", docs.tidy$id)
Tokenize text
docs.tidy2 <- docs.tidy %>%
  unnest_tokens(output = word,
                input = text,
                token = "words") %>%
  anti_join(stop_words) %>%
  select(ID, word)
saf <- merge(saf, docs.tidy2, by = "ID")
Frequencies
Which words are used more frequently in the SAF condition than in the control condition?
# word counts by condition
saf.freq <- saf %>%
  group_by(SAF) %>%
  count(word) %>%
  ungroup()

# adjust counts by the total number of words in each condition
saf.freq <- saf %>%
  group_by(SAF) %>%
  summarize(n.words = n()) %>%
  ungroup() %>%
  inner_join(saf.freq) %>%
  mutate(n.adj = n / n.words)
Plot word frequencies

[Faceted bar plots of adjusted word frequencies (n.adj) by condition (SAF = 0 vs. 1); frequent words include family, feel, friendly, friends, life, loving, people, person, relationships, respect, character, humor, knowledge, makes, and sense.]
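The plotting code isn't shown on the slide; here is one sketch that could produce a figure like this (the cutoff of 10 words per condition is an assumption):

saf.freq %>%
  group_by(SAF) %>%
  top_n(10, n.adj) %>%                   # top words within each condition
  ggplot(aes(x = word, y = n.adj, fill = factor(SAF))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~SAF, scales = "free") +
  coord_flip()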
Sentiment

Which sentiments are found in each condition?

saf.freq %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(SAF, sentiment) %>%
  summarize(m.sent = mean(n.adj, na.rm = TRUE))
## Joining, by = "word"
## Source: local data frame [20 x 3]
## Groups: SAF [?]
##
##      SAF    sentiment       m.sent
##    <int>        <chr>        <dbl>
## 1      0        anger 0.0003602342
## 2      0 anticipation 0.0006829609
## 3      0      disgust 0.0004092731
## 4      0         fear 0.0004422031
## 5      0          joy 0.0007660650
## 6      0     negative 0.0003877042
## 7      0     positive 0.0007319573
## 8      0      sadness 0.0004619886
## 9      0     surprise 0.0004357961
## 10     0        trust 0.0007097314
## 11     1        anger 0.0004942628
## 12     1 anticipation 0.0013592024
## 13     1      disgust 0.0004737870
## 14     1         fear 0.0007217309
## 15     1          joy 0.0014757498
## 16     1     negative 0.0004641179
## 17     1     positive 0.0009689168
## 18     1      sadness 0.0005495285
## 19     1     surprise 0.0006640335
## 20     1        trust 0.0011505550
Plot sentiments
[Bar plot of mean adjusted frequency (m.sent) for each NRC sentiment category (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust), compared across conditions; x-axis: m.sent, y-axis: sentiment.]
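Again, the plotting code is omitted on the slide; a sketch under the assumption that the summarized data are saved first (saf.sent is a name introduced here):

saf.sent <- saf.freq %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(SAF, sentiment) %>%
  summarize(m.sent = mean(n.adj, na.rm = TRUE))

ggplot(saf.sent, aes(x = sentiment, y = m.sent, fill = factor(SAF))) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_bw()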
Topics
What are people talking about?
short_saf <- saf.freq %>%
  filter(!grepl("[0-9]", word)) %>%
  select(SAF, word, n) %>%
  cast_dtm(document = SAF, term = word, value = n)

saf_lda <- LDA(short_saf, k = 2,
               control = list(seed = 1234))
Extract topics
saf_topics <- tidy(saf_lda, matrix = "beta")
head(saf_topics)
## # A tibble: 6 × 3
##   topic      term         beta
##   <int>     <chr>        <dbl>
## 1     1 abilities 4.014336e-04
## 2     2 abilities 7.568858e-04
## 3     1   ability 2.452691e-03
## 4     2   ability 1.609864e-03
## 5     1     abuse 2.138209e-04
## 6     2     abuse 7.667266e-05
Plot topics

[Bar plots of the top terms by beta in each of the two topics; top terms include family, feel, friends, life, makes, people, person, relationships, respect, sense, care, friendly, humor, knowledge, and loving.]
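The plotting code isn't shown here either; a sketch that mirrors the earlier Austen top-terms plot (saf_top_terms is a name introduced here):

# Top 10 terms per topic, as with the Austen novels
saf_top_terms <- saf_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

ggplot(saf_top_terms, aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()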