Revolution Confidential
Introduction to R for Data Mining
2012 Spring Webinar Series
Joseph B. Rickert, Revolution Analytics
June 5, 2012
Goals for Today’s Webinar
To convince you that:
R is a serious platform for data mining
It is not difficult to learn enough R to do some serious data mining
Revolution R Enterprise is the platform for serious data mining
Data Mining
Applications
Credit Scoring
Fraud Detection
Ad Optimization
Targeted Marketing
Gene Detection
Recommendation systems
Social Networks
Actions
Acquire Data
Prepare
Classify
Predict
Visualize
Optimize
Interpret
Algorithms
CART
Random Forests
SVM
KMeans
Hierarchical clustering
Ensemble Techniques
Recent KDnuggets poll suggests so are a lot of other serious data miners
What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters]
Software (voters)              % users in 2012   % users in 2011
R (245)                        30.7%             23.3%
Excel (238)                    29.8%             21.8%
Rapid-I RapidMiner (213)       26.7%             27.7%
KNIME (174)                    21.8%             12.1%
Weka / Pentaho (118)           14.8%             11.8%
StatSoft Statistica (112)      14.0%             8.5%
SAS (101)                      12.7%             13.6%
Rapid-I RapidAnalytics (83)    10.4%             Not asked in 2011
MATLAB (80)                    10.0%             7.2%
IBM SPSS Statistics (62)       7.8%              7.2%
IBM SPSS Modeler (54)          6.8%              8.3%
SAS Enterprise Miner (46)      5.8%              7.1%
WHAT DOES IT MEAN TO LEARN R?
Learning R
What does it mean to learn French?
To read a Menu
To get around Paris on the Metro
To carry on a conversation
Learning R
Levels of R Skill
R developer
R contributor
R programmer
R user
R aware
Hours of use: 10 to 10,000 (The Malcolm Gladwell “Outlier” Scale)
Use a GUI
Use R Functions
Write functions
Write an R package
Write production level code
THE STRUCTURE OF R FACILITATES LEARNING
Productive from the get-go!
R is set up to compute functions on data
lm <- function(x,y){. . . }
lm.model
lm.model$assign
lm.model$coefficients
lm.model$df.residual
lm.model$effects
lm.model$fitted.values
.
.
.
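As a rough illustration of the idea on this slide (a function applied to data returns an object whose components can be inspected), the sketch below fits a linear model. The built-in `cars` data set is an assumption chosen for the example; the slide does not say which data were used.

```r
# Fit a simple linear model: stopping distance as a function of speed.
# `cars` ships with base R; it is a stand-in, not the webinar's data.
lm.model <- lm(dist ~ speed, data = cars)

# The fitted model is a list-like object; list its components.
names(lm.model)

# Pull out individual components with `$`.
lm.model$coefficients         # intercept and slope
lm.model$df.residual          # residual degrees of freedom
head(lm.model$fitted.values)  # first few fitted values
```

The components named on the slide (`assign`, `coefficients`, `df.residual`, `effects`, `fitted.values`) all appear in the output of `names(lm.model)`.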
A little knowledge goes a long way in R
R’s functional design facilitates performing small tasks
For the most part, the output of a function depends only on the values of its arguments
Calling a function multiple times with the same values of its arguments will produce the same result each time
Minimal side effects mean it is much easier to understand and predict the behavior of a program
The trick is knowing which functions to call
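A minimal sketch of what "output depends only on the arguments" looks like in practice. The function `standardize` and the vector `x` are hypothetical names made up for this example.

```r
# A small function whose result depends only on its argument.
# `standardize` is a hypothetical helper written for this example.
standardize <- function(v) {
  (v - mean(v)) / sd(v)
}

x <- c(2, 4, 6, 8)

first  <- standardize(x)
second <- standardize(x)

# Same arguments, same result.
identical(first, second)

# The function worked on its own copy: the caller's x is unchanged.
x
```

Because the function does not modify `x` or any other outside state, each call can be reasoned about on its own.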
Basic Machine Learning Functions
              Function       Library        Description
Cluster       hclust         stats          Hierarchical cluster analysis
              kmeans         stats          K-means clustering
Classifiers   glm            stats          Logistic regression
              rpart          rpart          Recursive partitioning and regression trees
              ksvm           kernlab        Support Vector Machine
Ensemble      ada            ada            Stochastic boosting
              randomForest   randomForest   Random Forests classification and regression
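The three functions from the `stats` package in the table can be tried without installing anything. The sketch below is only an illustration: the built-in `iris` data, the choice of 3 clusters, and the variable names are assumptions for the example, not part of the webinar.

```r
# Work on a copy of the built-in iris data (a stand-in data set).
flowers <- iris
measurements <- flowers[, 1:4]

# K-means clustering with 3 clusters; set a seed because starts are random.
set.seed(1)
km <- kmeans(measurements, centers = 3, nstart = 10)
table(km$cluster, flowers$Species)

# Hierarchical clustering on Euclidean distances, cut into 3 groups.
hc <- hclust(dist(measurements), method = "complete")
hc_groups <- cutree(hc, k = 3)
table(hc_groups, flowers$Species)

# Logistic regression with glm: is a flower of species virginica?
flowers$is_virginica <- as.integer(flowers$Species == "virginica")
logit_fit <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
                 data = flowers, family = binomial)
coef(logit_fit)
```

Each call follows the same pattern: pass data to a function, get back an object, and inspect its components.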
Noteworthy Data Mining Packages
Package   Comment
rattle    A very intuitive GUI for data mining that produces useful R code
caret     Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
TIME TO RUN SOME CODE
Doing a lot with a little R
Scripts to run
Script                               Some key functions
0 Setup                              Load libraries
1 Explore weather data               read.csv, plot
2 Run clustering algorithms          kmeans, hclust
3 Basic decision tree                rpart
4 Boosted Tree                       ada
5 Random Forest                      randomForest
6 Support Vector Machine             randomForest, varImpPlot
7 Big Data Mortgage Default model    rxLogit, rxKmeans
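The webinar scripts themselves are not reproduced here. As a hedged sketch of what a "basic decision tree" step might look like, the example below uses `rpart` on the built-in `iris` data, which is an assumed stand-in for the weather data used in the scripts.

```r
library(rpart)  # recommended package bundled with standard R installations

# Classification tree predicting species from the four measurements.
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# Print the splits chosen by the algorithm.
print(tree_fit)

# Predict on the training data and tabulate against the true labels.
tree_pred <- predict(tree_fit, newdata = iris, type = "class")
confusion <- table(predicted = tree_pred, actual = iris$Species)
confusion
```

Evaluating on the training data, as done here for brevity, overstates accuracy; a held-out test set would give a more honest estimate.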
Big Data and R
There are some challenges:
All of your data and model code must fit into memory
Big data sets as well as big models (lots of variables) can run out of memory
Parallel computation might be necessary for models to run in a reasonable time
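To make the memory constraint concrete, here is a back-of-the-envelope sketch. The table dimensions (100 million rows, 50 numeric columns) are made-up numbers for illustration only.

```r
# Measure how much memory one million doubles actually take in R.
x <- numeric(1e6)
bytes_per_value <- as.numeric(object.size(x)) / length(x)
bytes_per_value  # close to 8 bytes per value

# Estimate for a hypothetical table of 100 million rows and
# 50 numeric columns (dimensions chosen purely for illustration).
n_rows <- 1e8
n_cols <- 50
estimated_gb <- n_rows * n_cols * 8 / 1024^3
estimated_gb  # roughly 37 GB, before any copies made during modeling
```

An estimate like this, made before loading the data, indicates whether in-memory R is feasible or whether the data need to be aggregated or processed in chunks.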
RevoScaleR in Revolution R Enterprise
Can help in a number of ways:
Manipulate large data sets, perhaps aggregating data so that it will fit in memory
For example, boiling down time-stamped data like a web log to form a time series that will fit in memory
Run RevoScaleR functions directly on big data sets
Run R functions in parallel
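The "boil down a web log to a time series" idea can be sketched on a small scale in base R. This example deliberately does not use RevoScaleR; the simulated timestamps and all variable names are assumptions made for illustration.

```r
# Simulate 10,000 time-stamped web hits spread over one day.
# (Made-up data; a real log would be read from disk.)
set.seed(42)
start <- as.POSIXct("2012-06-05 00:00:00", tz = "UTC")
hits <- start + sort(runif(10000, min = 0, max = 24 * 3600))

# Reduce each timestamp to its hour of day, then count hits per hour.
hour_bin <- format(hits, "%H", tz = "UTC")
hourly_counts <- table(hour_bin)

# Turn the counts into a compact time series object.
hits_ts <- ts(as.vector(hourly_counts), frequency = 24)
hits_ts
```

Ten thousand raw records collapse to at most 24 hourly counts; the same kind of reduction is what makes a large log small enough to model in memory.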
Top RevoScaleR Functions for Data Mining
parallel external memory algorithms
Task                          RevoScaleR function
Data processing               rxDataStep
Descriptive statistics        rxSummary
Tables and cubes              rxCube, rxCrossTabs
Correlations / covariance     rxCovCor, rxCor, rxCov, rxSSCP
Linear models                 rxLinMod
Logistic regressions          rxLogit
Generalized linear models     rxGlm
K-means clustering            rxKmeans
Predictions (scoring)         rxPredict
WHERE TO GO FROM HERE?
More than code, R is a community
Finding your way around the R world
Machine Learning, Data Mining, Visualization
Finding Packages: Task Views, crantastic.org
Blogs: Revolutions, R-Bloggers, Quick-R
Getting Help: StackOverflow, @RLangTip, Inside-R, www.rseek.org
Finding R People: User Groups worldwide, #rstats
Word Cloud for @inside_R
Look at some more sophisticated examples
Thomson Nguyen on the Heritage Health Prize
Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)
Revolution Analytics Training
http://www.revolutionanalytics.com/products/training/