HANDS ON: TEXT MINING WITH R
Jahnab Kumar Deka
Introduction
• Text mining aims to learn from collections of text documents such as books, newspapers, emails, etc.
Important Terms
• Tokenization
• Tagging (noun/verb/…)
• Chunking (noun phrases)
• Stemming (stripping suffixes such as -ing/-s/-ed)
Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palettes of colours for plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.
Corpus
• A collection of text.
• Each corpus contains separate articles, stories, or volumes, each treated as a separate entity or record.
• Most file formats can be converted to plain text files for the corpus, e.g.:
• PDF to text file:
• system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done")
• Word document to text file:
• system("for f in *.doc; do antiword $f; done")
Corpus
• Consider the folder corpus/txt.
• List some of the file names (a sketch follows).
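A minimal sketch of pointing R at the corpus folder; the corpus/txt path comes from the slide above:
• cname <- file.path(".", "corpus", "txt")
• length(dir(cname)) # number of documents in the folder
• dir(cname) # list the file names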
Loading Corpus
• Using DirSource(), the source object is passed on to Corpus(), which loads the documents (a plain-text sketch follows).
• In case of PDF documents:
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF)) ** The xpdf application needs to be installed for readPDF().
• In case of Word documents:
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC("-r -s"))) ** -r requests that removed text be included in the output; -s requests that text hidden by Word be included.
Exploration of Corpus
• inspect() displays detailed information about a corpus or document (a sketch follows).
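A quick sketch of peeking at what was loaded:
• inspect(docs[1]) # metadata and content of the first document
• class(docs) # should report a corpus class such as VCorpus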
Preparing the Corpus
• Transformation types: tm comes with a set of standard transformations.
• tm_map() is used to apply one of these transformations to the whole corpus.
• Other transformations can be implemented using plain R functions wrapped within content_transformer() (a sketch follows).
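A sketch of listing the built-in transformations and wrapping an ordinary R function:
• getTransformations() # lists e.g. removeNumbers, removePunctuation, removeWords, stemDocument, stripWhitespace
• docs <- tm_map(docs, content_transformer(trimws)) # trimws is only an illustration of wrapping any string function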
Transformation Examples
• Replace "/", "@" and "\\|" with a space (see the sketch after this list).
• Alternate method: a single call with an alternation pattern.
• Conversion to lower case.
• Remove numbers.
• Remove punctuation.
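A sketch of these transformations; toSpace is a helper name chosen here, not part of tm:
• toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
• docs <- tm_map(docs, toSpace, "/")
• docs <- tm_map(docs, toSpace, "@")
• docs <- tm_map(docs, toSpace, "\\|")
• docs <- tm_map(docs, toSpace, "/|@|\\|") # alternate method: one alternation pattern
• docs <- tm_map(docs, content_transformer(tolower)) # lower case
• docs <- tm_map(docs, removeNumbers) # remove numbers
• docs <- tm_map(docs, removePunctuation) # remove punctuation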
Contd...
• Remove English stop words.
• Remove your own stop words.
• Strip whitespace.
• Specific transformations (see the sketch after this list).
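A sketch of these steps; the custom stop words and the specific substitution are illustrative values only:
• docs <- tm_map(docs, removeWords, stopwords("english")) # standard English stop words
• docs <- tm_map(docs, removeWords, c("can", "use")) # hypothetical custom stop words
• docs <- tm_map(docs, stripWhitespace) # collapse runs of whitespace
• toString <- content_transformer(function(x, from, to) gsub(from, to, x)) # helper for specific substitutions
• docs <- tm_map(docs, toString, "text mining", "tm") # hypothetical specific transformation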
Contd...
• Stemming: reduce words to their root form.
• Creating a document-term matrix: a matrix with documents as the rows, terms as the columns, and the frequency counts of words as the cells.
• Term frequency (see the sketch after this list).
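A sketch of stemming, building the matrix, and computing term frequencies:
• library(SnowballC)
• docs <- tm_map(docs, stemDocument) # reduce terms to their stems
• dtm <- DocumentTermMatrix(docs) # documents x terms matrix
• freq <- colSums(as.matrix(dtm)) # total frequency of each term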
Contd...
• Frequency order of terms:
• ord <- order(freq)
• Least frequent terms:
• freq[head(ord)]
• Most frequent terms:
• freq[tail(ord)]
• Document-term matrix to CSV:
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")
Contd...
• Removing sparse terms:
• dtms <- removeSparseTerms(dtm, 0.1) # sparse factor
• The resulting matrix contains only terms with a sparse factor less than the given threshold.
• Frequent terms and associations (see the sketch after this list).
** lowfreq = terms that occur at least 1000 times.
• Association of a word with other words, subject to a correlation limit:
• # e.g. association of "data" with other words
• # if two words always appear together, the correlation would be 1.0
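A sketch using tm's findFreqTerms() and findAssocs(); the 0.6 correlation limit is an example value:
• findFreqTerms(dtm, lowfreq=1000) # terms occurring at least 1000 times
• findAssocs(dtm, "data", corlimit=0.6) # words correlated with "data" at 0.6 or above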
Correlation
• Plot 50 of the more frequent words.
• With a minimum correlation of 0.5.
• Considering words with at least 100 occurrences.
• By default: 20 random terms, with a minimum correlation of 0.7.
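A sketch of the correlation plot via tm's plot method, which requires Rgraphviz:
• library(Rgraphviz)
• freq.terms <- findFreqTerms(dtm, lowfreq=100) # words occurring at least 100 times
• plot(dtm, terms=freq.terms[1:50], corThreshold=0.5) # 50 frequent words, correlation >= 0.5
• plot(dtm) # defaults: 20 random terms, corThreshold=0.7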
Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• # plot words that occur at least 500 times in the corpus (see the sketch below)
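A ggplot2 sketch of the frequency bar chart; the 500 threshold follows the comment above:
• library(ggplot2)
• p <- ggplot(subset(wf, freq > 500), aes(word, freq))
• p <- p + geom_bar(stat="identity")
• p <- p + theme(axis.text.x=element_text(angle=45, hjust=1)) # tilt the term labels
• p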
Word Cloud
• The size of a word reflects its frequency.
• library(wordcloud) # provides wordcloud()
• To limit the number of words:
• wordcloud(names(freq), freq, max.words=100)
• To limit by term frequency:
• wordcloud(names(freq), freq, min.freq=100)
• Adding colour:
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter than 20 characters.
• Generate frequencies and percentages (see the sketch below).
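A sketch of this step; dist_tab() from qdap tabulates frequencies and percentages:
• words <- colnames(as.matrix(dtm)) # the terms
• words <- words[nchar(words) < 20] # keep terms shorter than 20 characters
• length(words) # how many terms remain
• dist_tab(nchar(words)) # frequency and percentage of each word length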
Contd...
• Word length counts: a histogram of the distribution of word lengths (see the sketch below).
** vertical line = mean length of words
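A ggplot2 sketch of the word length histogram; the green vertical line marks the mean:
• library(ggplot2)
• p <- ggplot(data.frame(nletters=nchar(words)), aes(x=nletters))
• p <- p + geom_histogram(binwidth=1)
• p <- p + geom_vline(xintercept=mean(nchar(words)), colour="green") # mean word length
• p <- p + labs(x="Number of Letters", y="Number of Words")
• p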
Letter and Position Heatmap
• A heatmap of how often each letter occurs at each position within the corpus terms (a simplified sketch follows).
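A simplified geom_tile sketch standing in for qdap's heatmap; the construction of the letter/position table here is illustrative:
• library(ggplot2)
• ld <- data.frame(letter=unlist(strsplit(words, "")), position=unlist(lapply(nchar(words), seq_len)))
• counts <- as.data.frame(table(ld)) # letter x position frequency table
• ggplot(counts, aes(x=position, y=letter, fill=Freq)) + geom_tile()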