HANDS ON: TEXT MINING WITH R
Jahnab Kumar Deka
Introduction
• Text mining aims to learn from collections of text documents such as books, newspapers, emails, etc.
Important Terms
• Tokenization
• Tagging (noun/verb/…)
• Chunking (noun phrases)
• Stemming (stripping suffixes such as -ing/-s/-ed)
Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palettes of colours for plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.
Corpus
• A collection of text.
• Each corpus contains separate articles, stories, or volumes, each treated as a separate entity or record.
• Most file formats can be converted to plain text files for the corpus, e.g.:
• PDF to text file:
• system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done")
• Word document to text file:
• system("for f in *.doc; do antiword $f; done")
Corpus
• Consider the folder corpus/txt.
• List some of the file names (a sketch follows).
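A minimal sketch of pointing R at the corpus folder; the corpus/txt path comes from the slide above:
• cname <- file.path(".", "corpus", "txt")
• length(dir(cname)) # number of documents in the folder
• dir(cname) # list the file names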
Loading Corpus
• Using DirSource(), the source object is passed on to Corpus(), which loads the documents (a plain-text sketch follows).
• In case of PDF documents:
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF)) ** The xpdf application needs to be installed for readPDF().
• In case of Word documents:
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC("-r -s"))) ** -r requests that removed text be included in the output; -s requests that text hidden by Word be included.
Exploration of Corpus
• inspect() displays detailed information about a corpus or document (a sketch follows).
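A quick sketch of peeking at what was loaded:
• inspect(docs[1]) # metadata and content of the first document
• class(docs) # should report a corpus class such as VCorpus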
Preparing the Corpus
• Transformation types: tm comes with a set of standard transformations.
• tm_map() is used to apply one of these transformations to the whole corpus.
• Other transformations can be implemented using plain R functions wrapped within content_transformer() (a sketch follows).
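A sketch of listing the built-in transformations and wrapping an ordinary R function:
• getTransformations() # lists e.g. removeNumbers, removePunctuation, removeWords, stemDocument, stripWhitespace
• docs <- tm_map(docs, content_transformer(trimws)) # trimws is only an illustration of wrapping any string function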
Transformation Examples
• Replace "/", "@" and "\\|" with a space (see the sketch after this list).
• Alternate method: a single call with an alternation pattern.
• Conversion to lower case.
• Remove numbers.
• Remove punctuation.
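A sketch of these transformations; toSpace is a helper name chosen here, not part of tm:
• toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
• docs <- tm_map(docs, toSpace, "/")
• docs <- tm_map(docs, toSpace, "@")
• docs <- tm_map(docs, toSpace, "\\|")
• docs <- tm_map(docs, toSpace, "/|@|\\|") # alternate method: one alternation pattern
• docs <- tm_map(docs, content_transformer(tolower)) # lower case
• docs <- tm_map(docs, removeNumbers) # remove numbers
• docs <- tm_map(docs, removePunctuation) # remove punctuation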
Contd...
• Remove English stop words.
• Remove your own stop words.
• Strip whitespace.
• Specific transformations (see the sketch after this list).
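A sketch of these steps; the custom stop words and the specific substitution are illustrative values only:
• docs <- tm_map(docs, removeWords, stopwords("english")) # standard English stop words
• docs <- tm_map(docs, removeWords, c("can", "use")) # hypothetical custom stop words
• docs <- tm_map(docs, stripWhitespace) # collapse runs of whitespace
• toString <- content_transformer(function(x, from, to) gsub(from, to, x)) # helper for specific substitutions
• docs <- tm_map(docs, toString, "text mining", "tm") # hypothetical specific transformation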
Contd...
• Stemming: reduce words to their root form.
• Creating a document-term matrix: a matrix with documents as the rows, terms as the columns, and the frequency counts of words as the cells.
• Term frequency (see the sketch after this list).
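A sketch of stemming, building the matrix, and computing term frequencies:
• library(SnowballC)
• docs <- tm_map(docs, stemDocument) # reduce terms to their stems
• dtm <- DocumentTermMatrix(docs) # documents x terms matrix
• freq <- colSums(as.matrix(dtm)) # total frequency of each term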
Contd...
• Frequency order of terms:
• ord <- order(freq)
• Least frequent terms:
• freq[head(ord)]
• Most frequent terms:
• freq[tail(ord)]
• Document-term matrix to CSV:
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")
Contd...
• Removing sparse terms:
• dtms <- removeSparseTerms(dtm, 0.1) # sparse factor
• The resulting matrix contains only terms with a sparse factor less than the given threshold.
• Frequent terms and associations (see the sketch after this list).
** lowfreq = terms that occur at least 1000 times.
• Association of a word with other words, subject to a correlation limit:
• # e.g. association of "data" with other words
• # if two words always appear together, the correlation would be 1.0
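A sketch using tm's findFreqTerms() and findAssocs(); the 0.6 correlation limit is an example value:
• findFreqTerms(dtm, lowfreq=1000) # terms occurring at least 1000 times
• findAssocs(dtm, "data", corlimit=0.6) # words correlated with "data" at 0.6 or above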
Correlation
• Plot 50 of the more frequent words.
• With a minimum correlation of 0.5.
• Considering words with at least 100 occurrences.
• By default: 20 random terms, with a minimum correlation of 0.7.
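A sketch of the correlation plot via tm's plot method, which requires Rgraphviz:
• library(Rgraphviz)
• freq.terms <- findFreqTerms(dtm, lowfreq=100) # words occurring at least 100 times
• plot(dtm, terms=freq.terms[1:50], corThreshold=0.5) # 50 frequent words, correlation >= 0.5
• plot(dtm) # defaults: 20 random terms, corThreshold=0.7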
Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• # plot words that occur at least 500 times in the corpus (see the sketch below)
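A ggplot2 sketch of the frequency bar chart; the 500 threshold follows the comment above:
• library(ggplot2)
• p <- ggplot(subset(wf, freq > 500), aes(word, freq))
• p <- p + geom_bar(stat="identity")
• p <- p + theme(axis.text.x=element_text(angle=45, hjust=1)) # tilt the term labels
• p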
Word Cloud
• The size of a word reflects its frequency.
• library(wordcloud) # provides wordcloud()
• To limit the number of words:
• wordcloud(names(freq), freq, max.words=100)
• To limit by term frequency:
• wordcloud(names(freq), freq, min.freq=100)
• Adding colour:
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter than 20 characters.
• Generate frequencies and percentages (see the sketch below).
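A sketch of this step; dist_tab() from qdap tabulates frequencies and percentages:
• words <- colnames(as.matrix(dtm)) # the terms
• words <- words[nchar(words) < 20] # keep terms shorter than 20 characters
• length(words) # how many terms remain
• dist_tab(nchar(words)) # frequency and percentage of each word length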
Contd...
• Word length counts: a histogram of the distribution of word lengths (see the sketch below).
** vertical line = mean length of words
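A ggplot2 sketch of the word length histogram; the green vertical line marks the mean:
• library(ggplot2)
• p <- ggplot(data.frame(nletters=nchar(words)), aes(x=nletters))
• p <- p + geom_histogram(binwidth=1)
• p <- p + geom_vline(xintercept=mean(nchar(words)), colour="green") # mean word length
• p <- p + labs(x="Number of Letters", y="Number of Words")
• p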
Letter and Position Heatmap
• A heatmap of how often each letter occurs at each position within the corpus terms (a simplified sketch follows).
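A simplified geom_tile sketch standing in for qdap's heatmap; the construction of the letter/position table here is illustrative:
• library(ggplot2)
• ld <- data.frame(letter=unlist(strsplit(words, "")), position=unlist(lapply(nchar(words), seq_len)))
• counts <- as.data.frame(table(ld)) # letter x position frequency table
• ggplot(counts, aes(x=position, y=letter, fill=Freq)) + geom_tile()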