Text Mining

Text Mining with R and the tm package

Page 1: Text Mining

Text Mining

with R and the tm package

Page 2: Text Mining

Agenda

Motivation
Preliminaries
Operations
Demo
Thoughts
System prerequisites
Resources
References

Page 3: Text Mining

Motivation

Exciting new possibilities to deal with unstructured text data (tweets, news articles/feeds, customer complaints)

Research categories: machine learning, data mining, sentiment analysis, ...

Page 4: Text Mining

Preliminaries

Some terminology: document, corpus, term document matrix, dissimilarity matrix

We will see some of these in the demo
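To make the four terms concrete, here is a minimal sketch in base R only (it does not use tm). The three toy documents and all variable names are invented for illustration.

```r
# Three toy documents; together they form a (very small) corpus.
docs <- c(d1 = "text mining with r",
          d2 = "mining text data",
          d3 = "customer complaints data")

# Split each document into terms on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Term document matrix: one row per term, one column per document,
# each cell holding how often the term occurs in that document.
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

# Dissimilarity matrix: Euclidean distance between the document columns.
diss <- as.matrix(dist(t(tdm)))

print(tdm)
print(round(diss, 2))
```

Documents that share more terms end up closer together in the dissimilarity matrix, which is what the later distance and plotting steps build on.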

Page 5: Text Mining

Typical TM operations

Import
Preprocessing

Stop words, white space, punctuation, (to) lower case, numeric removal, ..., other “mappings”
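The import and preprocessing steps listed above could look roughly like the following with tm. This is a sketch under assumptions: it targets a recent tm release (0.6 or later), where base functions such as tolower must be wrapped in content_transformer(), and the two sample texts are made up.

```r
library(tm)  # assumption: tm >= 0.6 is installed

# Import: build a corpus from an in-memory character vector (invented sample texts).
texts <- c("The  3 quick brown foxes!",
           "Customers complained about the 2 late deliveries.")
corpus <- VCorpus(VectorSource(texts))

# Preprocessing: each tm_map() call applies one "mapping" to every document.
corpus <- tm_map(corpus, content_transformer(tolower))       # (to) lower case
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeNumbers)                      # numeric removal
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop words
corpus <- tm_map(corpus, stripWhitespace)                    # white space

# Pull the cleaned text back out as a plain character vector.
cleaned <- vapply(corpus, function(d) trimws(as.character(d)), character(1))
print(cleaned)
```

The order matters here: lower-casing first lets the stop word list (which is lower case) match, and stripping white space last cleans up the gaps left by the earlier removals.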

Page 6: Text Mining

Typical TM Operations (cont’d)

Metadata management: per document, per corpus

Term document matrix preparation
Distance/nearness calculations
Plotting
...
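A rough sketch of these remaining operations with tm and base R, again assuming tm 0.6 or later. The toy texts and metadata values are hypothetical, and hierarchical clustering is just one possible way to plot the distances, not necessarily what the demo used.

```r
library(tm)  # assumption: tm >= 0.6 is installed

texts <- c("text mining with r", "mining text data", "customer complaints data")
corpus <- VCorpus(VectorSource(texts))

# Metadata management: one tag on a single document, one on the whole corpus.
meta(corpus[[1]], "author") <- "example author"            # hypothetical value
meta(corpus, "source", type = "corpus") <- "toy example"   # hypothetical value

# Term document matrix preparation (by default, terms shorter than
# three characters, such as "r", are dropped).
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

# Distance/nearness calculation between documents (the columns of m).
d <- dist(t(m))

# Plotting: a dendrogram of the documents, drawn only in interactive sessions.
hc <- hclust(d)
if (interactive()) plot(hc, main = "Document clustering (toy data)")

print(m)
print(d)
```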

Page 7: Text Mining

DEMO

Page 8: Text Mining
Page 9: Text Mining
Page 10: Text Mining
Page 11: Text Mining

Thoughts

Package documentation
Overlap/misalignment with other packages
Integration with “big data” facilities

Page 12: Text Mining

System Prerequisites

Suggested: Weka (for lazy classifiers), GraphViz (for plot()), Snowball (for stemDocument()), Seriation (for dissplot())

Optional: Antiword (to read Word documents), pdftotext (to read PDF documents)
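One way to check which of these prerequisites are present is sketched below. The R package names (RWeka, Rgraphviz, SnowballC, seriation) are assumptions about how the listed tools are packaged for R today; they are not taken from the slides.

```r
# Assumed R package names for the suggested components.
pkgs <- c("tm", "RWeka", "Rgraphviz", "SnowballC", "seriation")
have_pkg <- setNames(pkgs %in% rownames(installed.packages()), pkgs)

# Optional external command-line tools, looked up on the system PATH.
tools <- c("antiword", "pdftotext")
have_tool <- setNames(nzchar(Sys.which(tools)), tools)

print(have_pkg)
print(have_tool)
```

A FALSE entry only means the component was not found; the corresponding tm feature may still be unnecessary for a given analysis.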

Page 13: Text Mining

Resources

Antiword: http://www.winfield.demon.nl/
pdftotext: poppler.freedesktop.org
Rgraphviz: http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html
Seriation: http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=seriation:dissplot
Weka: http://sourceforge.net/projects/weka/

Page 14: Text Mining

References

Ingo Feinerer (2012). tm: Text Mining Package. R package version 0.5-7.1.

Jeff Gentry, Li Long, Robert Gentleman, Seth, Florian Hahne, Deepayan Sarkar and Kasper Hansen (). Rgraphviz: Provides plotting capabilities for R graph objects. R package version 1.32.0.

Ingo Feinerer. An introduction to text mining in R. R News, 8(2):19-22, October 2008.

Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.

Page 15: Text Mining

Contact Information

Kent Manley
GMU STAT 763, Spring 2012