Text Mining
with R and the tm package
Agenda
Motivation
Preliminaries
Operations
Demo
Thoughts
System prerequisites
Resources
References
Motivation
Exciting new possibilities to deal with unstructured text data (tweets, news articles/feeds, customer complaints)
Research categories: machine learning, data mining, sentiment analysis, ...
Preliminaries
Some terminology: document, corpus, term document matrix, dissimilarity matrix
We will see some of these in the demo
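As a rough illustration of the first three terms, the sketch below builds a tiny corpus and a term document matrix. It assumes a recent tm release (0.6 or later); the three sentences are made-up toy data, not taken from the demo.

```r
library(tm)

# Three toy documents (made-up text, for illustration only)
texts <- c("Text mining with R",
           "R makes text mining approachable",
           "Customer complaints are unstructured text")

# Corpus: a collection of documents, here built from a character vector
corpus <- VCorpus(VectorSource(texts))

# Document: a single element of the corpus
doc <- corpus[[1]]
as.character(doc)

# Term document matrix: one row per term, one column per document
tdm <- TermDocumentMatrix(corpus)
dim(tdm)
inspect(tdm)
```

With default settings tm lowercases terms and drops very short ones, so the row count may be smaller than the number of distinct words in the raw text.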
Typical TM operations
Import
Preprocessing: stop words, white space, punctuation, (to) lower case, numeric removal, ..., other “mappings”
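The preprocessing steps listed above might be chained roughly as follows. This is a sketch that assumes tm 0.6 or later, where base functions such as tolower are wrapped in content_transformer(). The input sentences and the replace_cats helper are hypothetical, included only to show a custom "mapping".

```r
library(tm)

# Toy input (made-up text, for illustration only)
texts <- c("  The 3 cats   sat on the mat!  ",
           "Dogs, in 2012, chased the cats.")
corpus <- VCorpus(VectorSource(texts))

corpus <- tm_map(corpus, content_transformer(tolower))       # (to) lower case
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeNumbers)                      # numeric removal
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop words
corpus <- tm_map(corpus, stripWhitespace)                    # white space

# Other "mappings": any character function can be wrapped the same way
# (replace_cats is a hypothetical helper, not part of tm)
replace_cats <- content_transformer(
  function(x) gsub("cats", "cat", x, fixed = TRUE)
)
corpus <- tm_map(corpus, replace_cats)

# Show the cleaned text of each document
sapply(corpus, as.character)
```

The order matters: lower-casing before stop word removal lets "The" match the lower-case stop word list, and stripping white space last collapses the gaps left by removed words.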
Typical TM operations (cont’d)
Metadata management: per document, per corpus
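A minimal sketch of both metadata levels, assuming the meta() interface of tm 0.6 or later. The tag names and values are hypothetical.

```r
library(tm)

# Toy corpus (made-up text, for illustration only)
corpus <- VCorpus(VectorSource(c("First toy document",
                                 "Second toy document")))

# Per document: set and read a tag on a single document
meta(corpus[[1]], "author") <- "A. Example"   # hypothetical value
meta(corpus[[1]], "author")

# Per corpus: a tag describing the collection as a whole
meta(corpus, "source", type = "corpus") <- "toy example"   # hypothetical tag
meta(corpus, "source", type = "corpus")
```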
Term document matrix preparation
Distance/nearness calculations
Plotting
...
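These three steps could look roughly like the sketch below. It uses base R's dist() and hclust() instead of tm's or seriation's own helpers, which is a substitution made here for portability across tm versions. The documents are made-up toy data.

```r
library(tm)

# Toy documents (made-up text, for illustration only)
texts <- c("apples and oranges and apples",
           "oranges and apples",
           "trains and planes")
corpus <- VCorpus(VectorSource(texts))

# Term document matrix preparation
tdm <- TermDocumentMatrix(corpus)

# dist() compares rows, so transpose to get one row per document
m <- t(as.matrix(tdm))

# Distance/nearness: Euclidean distance between the documents' term counts
d <- dist(m)
d

# Plotting: dendrogram from hierarchical clustering of the documents
plot(hclust(d), main = "Toy documents")
```

The two fruit documents share most of their terms, so they end up closer to each other than either is to the third document.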
DEMO
Thoughts
Package documentation
Overlap/misalignment with other packages
Integration with “big data” facilities
System Prerequisites
Suggested: Weka (for lazy classifiers), GraphViz (for plot()), Snowball (for stemDocument()), Seriation (for dissplot())
Optional: Antiword (to read Word documents), pdftotext (to read PDF documents)
Resources
Antiword http://www.winfield.demon.nl/
pdftotext poppler.freedesktop.org
Rgraphviz http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html
Seriation http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=seriation:dissplot
Weka http://sourceforge.net/projects/weka/
References
Ingo Feinerer (2012). tm: Text Mining Package. R package version 0.5-7.1.
Jeff Gentry, Li Long, Robert Gentleman, Seth, Florian Hahne, Deepayan Sarkar and Kasper Hansen (). Rgraphviz: Provides plotting capabilities for R graph objects. R package version 1.32.0.
Ingo Feinerer. An introduction to text mining in R. R News, 8(2):19-22, October 2008.
Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.
Contact Information
Kent Manley
GMU STAT 763, Spring 2012