Analysing texts with R - Adam Obeng · 12/6/2016 · Analysing texts with R (and writing a package...
Analysing texts with R (and writing a package to do so)
Adam Obeng
About me: Adam Obeng
Computational Social Scientist (i.e. Data Scientist, Research Scientist, etc.)
ABD PhD in Sociology at Columbia
Jared taught me R
adamobeng.com
Lucasarts
quanteda and readtext
Kenneth Benoit [aut, cre], Paul Nulty [aut], Kohei Watanabe [ctb], Benjamin Lauderdale [ctb], Adam Obeng [ctb], Pablo Barberá [ctb], Will Lowe [ctb]
Quantitative Text Analysis
Text as data:
● Linguistics
● Computer science
● Social sciences -> QTA
Roberts, Carl W. "A conceptual framework for quantitative text analysis." Quality and Quantity 34.3 (2000): 259-274.
QTA assumptions
● Texts reflect characteristics
● Texts represented by features
● Analysis estimates characteristics
QTA: Documents -> Document-Feature Matrix -> Analysis
Ken Benoit, The Quantitative Analysis of Textual Data (NYU Fall 2014)
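The Documents -> Document-Feature Matrix step can be sketched in base R with no packages. This is only an illustration of what such a matrix contains (one row per document, one column per feature, counts in the cells), not how quanteda builds one; the toy documents and the `build_dfm` helper are made up for the example.

```r
# Toy documents (made up for illustration)
docs <- c(d1 = "Tax cuts help the economy",
          d2 = "Welfare spending helps the poor",
          d3 = "The economy needs tax reform")

# Hypothetical helper: lowercase, split on non-letters, count tokens per document
build_dfm <- function(texts) {
  tokens <- strsplit(tolower(texts), "[^a-z]+")
  tokens <- lapply(tokens, function(tok) tok[nzchar(tok)])
  features <- sort(unique(unlist(tokens)))
  # One row of counts per document, with columns in a fixed feature order
  dfm <- do.call(rbind, lapply(tokens, function(tok) {
    as.integer(table(factor(tok, levels = features)))
  }))
  dimnames(dfm) <- list(names(texts), features)
  dfm
}

dfm <- build_dfm(docs)
dfm[, c("economy", "tax", "the")]
```

Everything downstream (descriptive statistics, scaling, classification) then works on `dfm` rather than on the raw strings.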
Outline
● Loading texts (descriptive stats)
● Extracting features
● Analysis: supervised scaling
+ Digressions about the process of writing an R package
QTA Step 1: Loading texts
Demo
Digression #1: how do we make it simple?
● v1.0 API changes to meet ROpenSci guidelines
  ○ namespace collisions
● Introducing readtext
Digression #1: readtext
readtext( file, ignoreMissingFiles = FALSE, textfield = NULL, docvarsfrom = c("metadata", "filenames"), dvsep = "_", docvarnames = NULL, encoding = NULL, ...)
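The `docvarsfrom = "filenames"` and `dvsep = "_"` arguments suggest that document variables can be parsed out of file names. A base-R sketch of that idea follows. The `docvars_from_filenames` helper and the file names are hypothetical and are not readtext's own code; they only illustrate the splitting.

```r
# Hypothetical helper: split file names on `dvsep` to get one column per field
docvars_from_filenames <- function(paths, dvsep = "_", docvarnames = NULL) {
  stems <- tools::file_path_sans_ext(basename(paths))
  parts <- strsplit(stems, dvsep, fixed = TRUE)
  # Every file name must split into the same number of fields
  stopifnot(length(unique(lengths(parts))) == 1)
  docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  if (is.null(docvarnames)) docvarnames <- paste0("docvar", seq_along(docvars))
  names(docvars) <- docvarnames
  rownames(docvars) <- basename(paths)
  docvars
}

dv <- docvars_from_filenames(
  c("texts/2010_UK_Labour.txt", "texts/2015_UK_Conservative.txt"),
  docvarnames = c("year", "country", "party")
)
dv
```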
Digression #1: readtext
● plaintext
● delimited text
● doc
● docx
● pdf
● JSON, line-delimited JSON, Twitter API output
● XML
● HTML
● .zip, .tar, and .gz archives
● remote files
● glob paths
any (possible) combination of those
“any” encoding
> readtext('path/to/whatever')
just works™
Digression #1: listMatchingFiles
From a pseudo-URI, return all matching files
Given that:
- A URI can resolve to zero or more files (e.g. '/path/to/*.csv', 'https://example.org/texts.zip')
- Globbing is platform-dependent (e.g. '/path/to/\*.tsv' escaping)
- Recursion
Digression #1 sub-digression #1
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. — jzw
Digression #1: listMatchingFiles
● If it’s a remote file, download it
● If it’s an archive, extract it, glob the contents
● If it’s a directory, glob the contents
-> Call listMatchingFiles() on the result
Termination condition: was it a glob last time? (a glob cannot resolve to a glob)
https://github.com/kbenoit/readtext/blob/98dbccc9a3ac07f387ef94bcfecab0eb5282dc5b/R/utils.R#L87-L222
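The recursion and its termination condition can be sketched in base R. This is a simplified illustration, not the code at the link above: `resolve_files` is a hypothetical name, only .zip archives are handled, and the remote branch assumes the URL ends in a plain file name.

```r
# Hypothetical sketch of the recursive resolution described on the slide
resolve_files <- function(path, last_was_glob = FALSE) {
  if (grepl("^(https?|ftp)://", path)) {
    # Remote file: download it, then resolve the local copy
    local_copy <- tempfile(fileext = paste0("_", basename(path)))
    utils::download.file(path, local_copy, mode = "wb", quiet = TRUE)
    return(resolve_files(local_copy))
  }
  if (grepl("\\.zip$", path, ignore.case = TRUE) && file.exists(path)) {
    # Archive: extract it, then resolve the extraction directory
    out_dir <- tempfile("unzipped_")
    utils::unzip(path, exdir = out_dir)
    return(resolve_files(out_dir))
  }
  if (dir.exists(path)) {
    # Directory: glob its contents
    return(resolve_files(file.path(path, "*")))
  }
  if (!last_was_glob) {
    # Treat the path as a glob and resolve each match; a match is never
    # globbed again, which is what ends the recursion
    matches <- Sys.glob(path)
    return(as.character(unlist(lapply(matches, resolve_files, last_was_glob = TRUE))))
  }
  path  # a plain file that came out of a glob
}

# Demo on a throwaway directory tree
corpus_dir <- tempfile("corpus_")
dir.create(file.path(corpus_dir, "sub"), recursive = TRUE)
writeLines("a", file.path(corpus_dir, "a.txt"))
writeLines("b", file.path(corpus_dir, "sub", "b.txt"))
found <- resolve_files(corpus_dir)
basename(found)
```

Passing `last_was_glob` down the call chain is one way to encode "a glob cannot resolve to a glob": a file whose name happens to contain `*` is returned as-is rather than expanded again.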
QTA Step 2: Extracting features
text -> dfm
● Feature creation (NLP)
  ○ tokenizing
  ○ removing stopwords
  ○ stemming
  ○ skip-ngrams
  ○ dictionaries
● Feature selection
  ○ Document frequency
  ○ Term frequency
  ○ Purposive selection
  ○ Deliberate disregard
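A few of these steps can be sketched in base R: tokenizing, stopword removal, adjacent-word bigrams, and selection by document frequency. The texts, the two-word stopword list, and the `bigrams` helper are all made up for illustration; a real analysis would use a package's tokenizer and stopword lists.

```r
texts <- c(d1 = "the tax cut and the budget",
           d2 = "the welfare budget",
           d3 = "tax and welfare and the budget")
stopwords_toy <- c("the", "and")  # made-up stopword list

# Feature creation: tokenize, then drop stopwords
tokens <- lapply(strsplit(tolower(texts), "[^a-z]+"),
                 function(tok) tok[nzchar(tok) & !(tok %in% stopwords_toy)])
names(tokens) <- names(texts)

# Hypothetical helper: adjacent-word bigrams joined with "_"
bigrams <- function(tok) {
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1), sep = "_")
}
d1_bigrams <- bigrams(tokens$d1)

# Feature selection: keep features that occur in at least 2 documents
doc_freq <- table(unlist(lapply(tokens, unique)))
keep <- names(doc_freq)[as.vector(doc_freq) >= 2]
selected <- lapply(tokens, function(tok) tok[tok %in% keep])
selected
```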
Demo: extracting features
QTA Step 3: Analysis
Supervised scaling
Goal: differentiate document characteristics
e.g. where do they (or their authors) fall on the political spectrum
QTA Step 3: Analysis
Supervised scaling
Like ML classification, but with a continuous outcome:
● Get training (reference) texts
● Generate word scores in training texts
● Score test (virgin) texts
● Evaluate performance
Wordscores
Laver, Michael, Kenneth Benoit, and John Garry. "Extracting policy positions from political texts using words as data." American Political Science Review 97.02 (2003): 311-331.
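The core Wordscores arithmetic can be worked through in a few lines of base R. This is a minimal sketch with made-up counts and reference positions: it computes raw scores only and omits the rescaling of virgin-text scores that the paper also discusses. The `score_text` helper is a name chosen for this example.

```r
# Reference (training) texts as word counts, with known positions (made-up numbers)
ref_counts <- rbind(left  = c(tax = 2, welfare = 8),
                    right = c(tax = 8, welfare = 2))
ref_positions <- c(left = -1, right = 1)

# Relative frequency of each word within each reference text
F_ref <- ref_counts / rowSums(ref_counts)
# P(reference text | word): normalise each word's column to sum to 1
P_ref <- sweep(F_ref, 2, colSums(F_ref), "/")
# Word score: expected position of the text we are reading, given the word
word_scores <- colSums(P_ref * ref_positions[rownames(P_ref)])

# Score a virgin text as the frequency-weighted mean of its scored words
score_text <- function(counts, word_scores) {
  scored <- counts[names(counts) %in% names(word_scores)]
  rel_freq <- scored / sum(scored)
  sum(rel_freq * word_scores[names(scored)])
}

virgin <- c(tax = 3, welfare = 1, football = 5)  # "football" has no score and is ignored
virgin_score <- score_text(virgin, word_scores)
virgin_score
```

With these numbers "tax" scores 0.6 and "welfare" scores -0.6, so a virgin text that uses "tax" three times as often as "welfare" lands at 0.3, on the "right" side of the scale.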
QTA Step 3: Analysis
Supervised scaling demo
Digression #2: Testing
“Do you want your results to be correct or plausible?” — Greg Wilson
True for ML and for code
Digression #2: Testing
● Use CI as the source of truth, not local tests (even with --as-cran)
  ○ (Still might not match CRAN)
● Enforce test coverage
● Test coverage is per-line
https://travis-ci.org/kbenoit/readtext
https://travis-ci.org/kbenoit/quanteda
https://codecov.io/gh/kbenoit/readtext
https://codecov.io/gh/kbenoit/quanteda
Digression #2: Testing
We discovered a lot of our own bugs
Digression #2: Testing
Sometimes it’s R’s fault
base::tempfile(): (usually) different filenames within the same session
base::tempdir(): always the same directory name within the same session
readtext::mktemp() behaves like GNU coreutils mktemp
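The difference between the two base functions is easy to check in a session. The `make_temp_dir` helper below is a hypothetical stand-in that shows the general idea of creating a fresh directory on every call; it is not the implementation of readtext::mktemp().

```r
# tempdir() is fixed for the whole session
same_dir <- identical(tempdir(), tempdir())

# tempfile() only generates a name; nothing is created on disk
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
c(same_dir = same_dir, different_names = f1 != f2, exists_yet = file.exists(f1))

# Hypothetical helper: create and return a new, uniquely named directory
make_temp_dir <- function(prefix = "tmp_") {
  path <- tempfile(pattern = prefix)
  if (!dir.create(path)) stop("could not create ", path)
  path
}

d1 <- make_temp_dir()
d2 <- make_temp_dir()
c(d1, d2)
```

Tests that unpack archives or download files can then each work in their own directory instead of sharing the session-wide `tempdir()`.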
Digression #2: Testing
Sometimes it’s R’s fault
*crickets*
If you know what’s going on: http://r.789695.n4.nabble.com/readlines-truncates-text-file-with-Codepage-437-encoding-td4721527.html
Digression #2 sub-digression #1: how to win at GitHub
Slides and code: adamobeng.com
References:
● Ken Benoit, The Quantitative Analysis of Textual Data (NYU Fall 2014)
● Ken Benoit, Quantitative Text Analysis (TCD)
Thanks!
HERE BE DRAGONS
(Additional slides)
QTA Step 3: Analysis
Unsupervised scaling
Problems with Wordscores:
1. “the positions themselves are abstract concepts that cannot be observed directly”
2. the set of words may change over time
Wordfish
Slapin, Jonathan B., and Sven‐Oliver Proksch. "A scaling model for estimating time‐series party positions from texts." American Journal of Political Science 52.3 (2008): 705-722.
QTA Step 3: Analysis
Unsupervised scaling: Wordfish
Naive Bayes with Poisson distributional assumption
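The Poisson assumption can be made concrete by simulating counts from the functional form in Slapin and Proksch (2008), where the log of the expected count of word j in document i is alpha_i + psi_j + beta_j * omega_i. The sketch below only generates data from that form with made-up parameter values; it does not estimate anything, which is the hard part that Wordfish itself solves.

```r
set.seed(1)

omega <- c(doc1 = -1,  doc2 = 0,   doc3 = 1)     # document positions
alpha <- c(doc1 = 0,   doc2 = 0.2, doc3 = -0.1)  # document fixed effects (length)
psi   <- c(tax = 1,    welfare = 1,    the = 3)  # word fixed effects (overall frequency)
beta  <- c(tax = 1.5,  welfare = -1.5, the = 0)  # how strongly each word discriminates

# log(lambda_ij) = alpha_i + psi_j + beta_j * omega_i
lambda <- exp(outer(alpha, psi, "+") + outer(omega, beta))

# Draw one Poisson count per document-word cell
counts <- matrix(rpois(length(lambda), lambda),
                 nrow = nrow(lambda), dimnames = dimnames(lambda))
round(lambda, 2)
```

A word with `beta` of 0, like "the" here, has the same expected rate regardless of position, while "tax" and "welfare" become more or less frequent as `omega` moves from -1 to 1.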
QTA Step 3: Analysis
Unsupervised scaling demo
Digression #1: non-breaking spaces
⌥ Opt+3 -> #
⌥ Opt+Space -> \xa0
Solution: pre-commit hook
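The check such a hook would run can be sketched in base R. `find_nbsp` is a hypothetical helper, and it assumes the source files are UTF-8 encoded, so a non-breaking space is matched as its UTF-8 byte sequence.

```r
# Hypothetical helper: line numbers in a UTF-8 file that contain U+00A0
find_nbsp <- function(path) {
  lines <- readLines(path, warn = FALSE)
  nbsp <- enc2utf8("\u00a0")
  # Compare bytes so the result does not depend on the session locale
  which(grepl(nbsp, lines, fixed = TRUE, useBytes = TRUE))
}

# Demo: the second line has a non-breaking space before the assignment arrow
demo_file <- tempfile(fileext = ".R")
writeLines(enc2utf8(c("x <- 1", "y\u00a0<- 2")), demo_file, useBytes = TRUE)
bad_lines <- find_nbsp(demo_file)
bad_lines
```

A hook script could call this on each staged file and exit with a non-zero status when any line numbers come back, which blocks the commit until the stray character is removed.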
Back to the demo: loading text and descriptive stats
Digression #4: Git is a literal genie
Digression #4: Git is extremely elegant
Git for Computer Scientists
But the porcelain is equally difficult to use
Digression #4: Git needs additional constraints
Don’t allow commits to master:
git-flow?
Documents
Usually texts, but also paragraphs, etc.
Features
- words
- n-grams
- skip-grams
- dictionaries
- phrases
- manual coding
- etc.
Analysis
● Descriptive stats
● Supervised scaling and classification
● Unsupervised scaling
● Clustering and topic models