High-Level Text Analysis and Techniques
![Page 1: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/1.jpg)
HIGH-LEVEL TEXT ANALYSIS AND TECHNIQUES
Angela Zoss
Data Visualization Coordinator
226 Perkins [email protected]
Duke University Libraries, Digital Scholarship
Text > Data, October 25
![Page 2: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/2.jpg)
DOCUMENTS AS CONTEXT
![Page 3: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/3.jpg)
But first,
ANGELA AS CONTEXT
![Page 4: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/4.jpg)
How I learned to love the document.
B.A. courses: Linguistics, Communication
M.S. courses: Communication, Human-Computer Interaction
Employment: arXiv.org Administrator
Ph.D. courses:
• Bibliometrics/Scientometrics
• Computer Mediated Discourse Analysis
• Latent Structure Analysis
• Natural Language Processing
![Page 5: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/5.jpg)
Now,
DOCUMENTS AS CONTEXT
![Page 6: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/6.jpg)
Text analysis from…
• documents down to words (“low-level”)
• words up to documents (“high-level”)
![Page 7: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/7.jpg)
Using documents to learn about language (or other social phenomena)
Analyzing documents as records/proxies of language, social structures, events, etc.
Linguistic studies: morphology, word counts, syntax, etc. …
• over time (e.g., Google ngram viewer)
• language across corpora (e.g., political speeches)
Underwood, T. (2012). Where to start with text mining.
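Word counts compared across corpora can be sketched in a few lines. The example below is a minimal illustration, not a method taken from the talk: the corpora are invented toy data, and the function names `tokenize` and `relative_frequency` are hypothetical. It reports how often each target word occurs per 1,000 tokens, so that corpora of different sizes can be compared.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(texts, targets):
    """Return occurrences of each target word per 1,000 tokens across texts."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        counts.update(t for t in tokens if t in targets)
    if total == 0:
        return {w: 0.0 for w in targets}
    return {w: 1000 * counts[w] / total for w in targets}

# Hypothetical toy corpora, invented for illustration only.
corpora = {
    "speeches": ["We will rebuild. We will recover.", "I ask you to join us."],
    "memoirs": ["I remember the house. I remember my mother."],
}
pronouns = {"i", "we", "you"}

# One dictionary of per-1,000-token rates for each corpus.
rates = {name: relative_frequency(texts, pronouns) for name, texts in corpora.items()}
```

Normalizing by corpus size is the main design choice here; raw counts would mostly reflect how much text each corpus contains.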
![Page 8: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/8.jpg)
Using documents to learn about language
Historical culturomics of pronoun frequencies
![Page 9: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/9.jpg)
Using documents to learn about language
Universal properties of mythological networks
![Page 10: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/10.jpg)
Using language to learn about documents
Analyzing documents as artifacts themselves, with their own properties and dynamics
Literary, documentary studies:
• Structural/rhetorical/stylistic analysis
• Document categorization, classification
• Detecting clusters of document features (topic modeling)
Underwood, T. (2012). Where to start with text mining.
![Page 11: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/11.jpg)
Using language to learn about documents
Literary Empires, Mapping Temporal and Spatial Settings in Swinburne
![Page 12: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/12.jpg)
Using language to learn about documents
Using Word Clouds for Topic Modeling Results
![Page 13: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/13.jpg)
What are documents?
For this discussion, digital versions of works of spoken or written language
Examples: books, articles, transcripts, emails, tweets…
![Page 14: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/14.jpg)
Documents as context
Documents have:
• form(at)
• style
• provenance
• entities
• intentions
![Page 15: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/15.jpg)
STUDIES OF DOCUMENTS
![Page 16: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/16.jpg)
Why study documents?
• Describe a corpus
• Compare/organize documents
• Locate relevant information/filter out irrelevant information
![Page 17: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/17.jpg)
Describing a corpus
• Finding regularities/differences across groups of documents
• Developing theories of structure, style, etc. that can then be tested or applied
• May be manual (content analysis) or computer-assisted (statistical)
![Page 18: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/18.jpg)
Example: Storylines
http://xkcd.com/657/
![Page 19: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/19.jpg)
Differences of format, genre, participants…
• Articles may have sections, but these will vary by discipline and type of article
• Books may be fiction or non-fiction (or both)
• Transcripts may refer to multiple speakers, non-text content
• …ad infinitum
![Page 20: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/20.jpg)
Example: Literature Fingerprinting
Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122). doi: 10.1109/VAST.2007.4389004
![Page 21: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/21.jpg)
Organizing documents
Detect similarity between documents and a known category (or simply among themselves)
Supports browsing, sentiment analysis, authorship detection
![Page 22: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/22.jpg)
Example: Bohemian Bookshelf
Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.
![Page 23: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/23.jpg)
Similarity based on…
• common document attributes: authorship, genre
• common language patterns: topics, phrases
• common entity references: characters, citations
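Two of these similarity bases can be illustrated with standard measures. The sketch below assumes cosine similarity over word counts as a stand-in for "common language patterns" and Jaccard overlap of entity sets as a stand-in for "common entity references". Both are generic choices rather than the measures used in the cited examples, and the documents and entity names are invented.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard(a, b):
    """Share of items common to both sets, out of all items in either set."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical documents, invented for illustration only.
doc_a = {"text": "the whale and the sea", "entities": {"Alice", "Bob"}}
doc_b = {"text": "the sea and the storm", "entities": {"Bob", "Carol"}}

text_sim = cosine_similarity(word_counts(doc_a["text"]), word_counts(doc_b["text"]))
entity_sim = jaccard(doc_a["entities"], doc_b["entities"])
```

The two scores can disagree: documents may share vocabulary without sharing characters or citations, which is one reason to keep the similarity bases separate.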
![Page 24: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/24.jpg)
Example: Quantitative Formalism
Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An experiment. Pamphlets of the Stanford Literary Lab (vol. 1).
![Page 25: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/25.jpg)
Example: Clinton’s DNC Speech
http://b.globe.com/TogUqq
![Page 26: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/26.jpg)
Example: View DHQ
http://digitalliterature.net/viewDHQ/vis3.html
![Page 27: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/27.jpg)
Classification
• assigning an object to a single class
• often supervised, using an existing classification scheme and a tagged corpus
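Supervised classification from a tagged corpus can be sketched with a small multinomial Naive Bayes classifier. This is one common technique chosen for illustration, not necessarily the one behind the examples in this talk. The tagged documents, the labels, and the function names `train` and `classify` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(tagged_docs):
    """Count words per class and documents per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in tagged_docs:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Assign the text to the single class with the highest log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior: share of training documents carrying this label.
        score = math.log(class_counts[label] / total_docs)
        # Add-one smoothing keeps unseen words from zeroing out a class.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical tagged corpus, invented for illustration only.
tagged = [
    ("the court ruled on the appeal", "law"),
    ("the judge heard the case", "law"),
    ("the team won the match", "sport"),
    ("the striker scored a goal", "sport"),
]
model = train(tagged)
predicted = classify("the judge ruled", model)
```

Each document receives exactly one label, which matches the "single class" framing above; the quality of the result depends on the classification scheme and on how well the tagged corpus represents new documents.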
![Page 28: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/28.jpg)
Example: Relative signatures
Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012 (pp. 103-112).
![Page 29: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/29.jpg)
Categorization
• assigning documents to one or more categories
• suggestive of unsupervised clustering techniques
• design choices made to fit particular tasks or goals
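Categorization without predefined labels can be sketched as a greedy, overlapping clustering. This is a deliberately simple illustration, not the procedure used for any of the maps cited in this talk. The documents, the function name `categorize`, and the threshold value are assumptions, and the threshold is exactly the kind of design choice the last bullet refers to.

```python
import re

def word_set(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of words common to both sets, out of all words in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def categorize(docs, threshold=0.2):
    """Greedy overlapping clustering; returns lists of document indices."""
    seeds = []      # word set of the document that founded each cluster
    clusters = []   # indices of the member documents of each cluster
    for index, text in enumerate(docs):
        words = word_set(text)
        matches = [c for c, seed in enumerate(seeds)
                   if jaccard(words, seed) >= threshold]
        if matches:
            # A document may join every cluster it is similar enough to.
            for c in matches:
                clusters[c].append(index)
        else:
            # No sufficiently similar cluster exists, so start a new one.
            seeds.append(words)
            clusters.append([index])
    return clusters

# Hypothetical documents, invented for illustration only.
docs = [
    "gene expression in cancer cells",
    "network analysis of citation data",
    "network analysis of gene expression data",
    "medieval poetry manuscripts",
]
clusters = categorize(docs)
```

In this toy data the third document lands in two clusters, which is the "one or more categories" behavior. Raising the threshold produces more, narrower categories; lowering it merges them.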
![Page 30: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/30.jpg)
Example: UCSD Map of Science
Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., & Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS ONE, 7(7), e39464.
![Page 31: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/31.jpg)
Example: NIH Map Viewer
https://app.nihmaps.org/nih/browser/
![Page 32: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/32.jpg)
Reference systems, infrastructure
What do we gain by adding structure?
What do we lose?
![Page 33: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/33.jpg)
SUMMARIZING DOCUMENTS
![Page 34: High-Level Text Analysis and Techniques](https://reader035.fdocuments.us/reader035/viewer/2022081514/56813a07550346895da1d310/html5/thumbnails/34.jpg)
Text is only one component of a document.
Research questions often push us to be creative with how we operationalize constructs.
The richness of language and documents is best preserved by using multiple, complementary approaches.