Why do Data Science with R? · 2014. 4. 26. · Data Mining, Rattle, and R Definition Data Mining...
Transcript of Why do Data Science with R? · 2014. 4. 26. · Data Mining, Rattle, and R Definition Data Mining...
A Data Mining Workshop
Excavating Knowledge from DataIntroducing Data Mining using R
Data ScientistAustralian Taxation Office
Adjunct Professor, Australian National UniversityAdjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia
http://datamining.togaware.com
Visit: http://onepager.togaware.com for Workshop Noteshttp: // togaware. com Copyright© 2014, [email protected] 1/174
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 2/174
R: A Language for Data Mining
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 3/174
R: A Language for Data Mining Installing R
Installing R and Rattle
First task is to install R
As free/libre open source software (FLOSS or FOSS), R andRattle are available to all, with no limitations on our freedom touse and share the software, except to share and share alike.
Visit CRAN at http://cran.rstudio.com
Visit Rattle at http://rattle.togaware.com
Linux: Install packages (Ubuntu is recommended)$ wajig install r-recommended r-cran-rattle
Windows: Download and install from CRAN
MacOSX: Download and install from CRAN
http: // togaware. com Copyright© 2014, [email protected] 4/174
R: A Language for Data Mining Data Science with R?
Why do Data Science with R?
Most widely used Data Mining and Machine Learning Package
Machine LearningStatisticsSoftware Engineering and Programming with DataBut not the nicest of languages for a Computer Scientist!
Free (Libre) Open Source Statistical Software
. . . all modern statistical approaches
. . . many/most machine learning algorithms
. . . opportunity to readily add new algorithms
That is important for us in the research communityGet our algorithms out there and being used—impact!!!
http: // togaware. com Copyright© 2014, [email protected] 5/174
R: A Language for Data Mining Data Science with R?
Why do Data Science with R?
Most widely used Data Mining and Machine Learning Package
Machine LearningStatisticsSoftware Engineering and Programming with DataBut not the nicest of languages for a Computer Scientist!
Free (Libre) Open Source Statistical Software
. . . all modern statistical approaches
. . . many/most machine learning algorithms
. . . opportunity to readily add new algorithms
That is important for us in the research communityGet our algorithms out there and being used—impact!!!
http: // togaware. com Copyright© 2014, [email protected] 5/174
R: A Language for Data Mining Data Science with R?
Why do Data Science with R?
Most widely used Data Mining and Machine Learning Package
Machine LearningStatisticsSoftware Engineering and Programming with DataBut not the nicest of languages for a Computer Scientist!
Free (Libre) Open Source Statistical Software
. . . all modern statistical approaches
. . . many/most machine learning algorithms
. . . opportunity to readily add new algorithms
That is important for us in the research communityGet our algorithms out there and being used—impact!!!
http: // togaware. com Copyright© 2014, [email protected] 5/174
R: A Language for Data Mining Data Science with R?
Why do Data Science with R?
Most widely used Data Mining and Machine Learning Package
Machine LearningStatisticsSoftware Engineering and Programming with DataBut not the nicest of languages for a Computer Scientist!
Free (Libre) Open Source Statistical Software
. . . all modern statistical approaches
. . . many/most machine learning algorithms
. . . opportunity to readily add new algorithms
That is important for us in the research communityGet our algorithms out there and being used—impact!!!
http: // togaware. com Copyright© 2014, [email protected] 5/174
R: A Language for Data Mining Popularity of R?
How Popular is R? Discussion List Traffic
Monthly email traffic on software’s main discussion list.
Source: http://r4stats.com/articles/popularity/http: // togaware. com Copyright© 2014, [email protected] 6/174
R: A Language for Data Mining Popularity of R?
How Popular is R? Discussion Topics
Number of discussions on popular QandA forums 2013.
Source: http://r4stats.com/articles/popularity/
http: // togaware. com Copyright© 2014, [email protected] 7/174
R: A Language for Data Mining Popularity of R?
How Popular is R? R versus SAS
Number of R/SAS related posts to Stack Overflow by week.
Source: http://r4stats.com/articles/popularity/
http: // togaware. com Copyright© 2014, [email protected] 8/174
R: A Language for Data Mining Popularity of R?
How Popular is R? Professional Forums
Registered for the main discussion group for each software.
Source: http://r4stats.com/articles/popularity/
http: // togaware. com Copyright© 2014, [email protected] 9/174
R: A Language for Data Mining Popularity of R?
How Popular is R? Used in Analytics
Competitions
Software used in data analysis competitions in 2011.
Source: http://r4stats.com/articles/popularity/
http: // togaware. com Copyright© 2014, [email protected] 10/174
R: A Language for Data Mining Popularity of R?
How Popular is R? User Survey
Rexer Analytics Survey 2010 results for data mining/analytic tools.
Source: http://r4stats.com/articles/popularity/
http: // togaware. com Copyright© 2014, [email protected] 11/174
R: A Language for Data Mining Popularity of R?
What is R?
R — The Video
A 90 Second Promo from Revolution Analytics
http://www.revolutionanalytics.com/what-is-open-source-r/
http: // togaware. com Copyright© 2014, [email protected] 12/174
Data Mining, Rattle, and R
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 13/174
Data Mining, Rattle, and R Definition
Data Mining
A data driven analysis to uncover otherwise unknown but usefulpatterns in large datasets, to discover new knowledge and to developpredictive models, turning data and information into knowledge and(one day perhaps) wisdom, in a timely manner.
http: // togaware. com Copyright© 2014, [email protected] 14/174
Data Mining, Rattle, and R Definition
Data Mining
Application of
Machine LearningStatisticsSoftware Engineering and Programming with DataEffective Communications and Intuition
. . . to Datasets that vary byVolume, Velocity, Variety, Value, Veracity
. . . to discover new knowledge
. . . to improve business outcomes
. . . to deliver better tailored services
http: // togaware. com Copyright© 2014, [email protected] 15/174
Data Mining, Rattle, and R Definition
Data Mining in Research
Health ResearchAdverse reactions using linked Pharmaceutical, GeneralPractitioner, Hospital, Pathology datasets.
AstronomyMicrolensing events in the Large Magellanic Cloud of severalmillion observed stars (out of 10 billion).
PsychologyInvestigation of age-of-onset for Alzheimer’s disease from 75variables for 800 people.
Social SciencesSurvey evaluation. Social network analysis - identifying keyinfluencers.
http: // togaware. com Copyright© 2014, [email protected] 16/174
Data Mining, Rattle, and R Definition
Data Mining in Government
Australian Taxation OfficeLodgment ($110M)Tax Havens ($150M)Tax Fraud ($250M)
Immigration and Border ControlCheck passengers before boarding
Health and Human ServicesDoctor shoppersOver servicing
http: // togaware. com Copyright© 2014, [email protected] 17/174
Data Mining, Rattle, and R Definition
The Business of Data Mining
SAS has annual revenues of $3B (2013)
IBM bought SPSS for $1.2B (2009)
Analytics is >$100B business and >$320B by 2020
Amazon, eBay/PayPal, Google, Facebook, LinkedIn, . . .
Shortage of 180,000 data scientists in US in 2018 (McKinsey) . . .
http: // togaware. com Copyright© 2014, [email protected] 18/174
Data Mining, Rattle, and R Algorithms
Basic Tools: Data Mining Algorithms
Cluster Analysis (kmeans, wskm)
Association Analysis (arules)
Linear Discriminant Analysis (lda)
Logistic Regression (glm)
Decision Trees (rpart, wsrpart)
Random Forests (randomForest, wsrf)
Boosted Stumps (ada)
Neural Networks (nnet)
Support Vector Machines (kernlab)
. . .
That’s a lot of tools to learn in R!Many with different interfaces and options.
http: // togaware. com Copyright© 2014, [email protected] 19/174
Data Mining, Rattle, and R A GUI for Data Mining
Why a GUI?
Statistics can be complex and traps await
So many tools in R to deliver insights
Effective analyses should be scripted
Scripting also required for repeatability
R is a language for programming with data
How to remember how to do all of this in R?How to skill up 150 data analysts with Data Mining?
http: // togaware. com Copyright© 2014, [email protected] 20/174
Data Mining, Rattle, and R A GUI for Data Mining
Users of Rattle
Today, Rattle is used world wide in many industries
Health analytics
Customer segmentation and marketing
Fraud detection
Government
It is used by
Universities to teach Data Mining
Within research projects for basic analyses
Consultants and Analytics Teams across business
It is and will remain freely available.
CRAN and http://rattle.togaware.com
http: // togaware. com Copyright© 2014, [email protected] 21/174
Data Mining, Rattle, and R Setting Things Up
Installation
Rattle is built using R
Need to download and install R from cran.r-project.org
Recommend also install RStudio from www.rstudio.org
Then start up RStudio and install Rattle:
install.packages("rattle")
Then we can start up Rattle:
rattle()
Required packages are loaded as needed.
http: // togaware. com Copyright© 2014, [email protected] 22/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Startup
http: // togaware. com Copyright© 2014, [email protected] 23/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Loading Data
http: // togaware. com Copyright© 2014, [email protected] 24/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Explore Distribution
http: // togaware. com Copyright© 2014, [email protected] 25/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Explore Correlations
http: // togaware. com Copyright© 2014, [email protected] 26/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Hierarchical Cluster
http: // togaware. com Copyright© 2014, [email protected] 27/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Decision Tree
http: // togaware. com Copyright© 2014, [email protected] 28/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Decision Tree Plot
http: // togaware. com Copyright© 2014, [email protected] 29/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Random Forest
http: // togaware. com Copyright© 2014, [email protected] 30/174
Data Mining, Rattle, and R Tour
A Tour Thru Rattle: Risk Chart
0.10.20.30.40.50.60.70.80.9
Risk Scores
22%
1
2
3
4
LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift
0
20
40
60
80
100
0 20 40 60 80 100Caseload (%)
Per
form
ance
(%
)
RainTomorrow (92%)
Rain in MM (97%)
Precision
Risk Chart Random Forest weather.csv [test] RainTomorrow
http: // togaware. com Copyright© 2014, [email protected] 31/174
Data Mining, Rattle, and R Programming with Data
Data Scientists are Programmers of Data
But. . .
Data scientists are programmers of data
A GUI can only do so much
R is a powerful statistical language
Data Scientists Desire. . .
Scripting
Transparency
Repeatability
Sharing
http: // togaware. com Copyright© 2014, [email protected] 32/174
Data Mining, Rattle, and R Programming with Data
From GUI to CLI — Rattle’s Log Tab
http: // togaware. com Copyright© 2014, [email protected] 33/174
Data Mining, Rattle, and R Programming with Data
From GUI to CLI — Rattle’s Log Tab
http: // togaware. com Copyright© 2014, [email protected] 34/174
Data Mining, Rattle, and R Programming with Data
Interface Notes
Work through the tabs from left to right
After setting up a tab we need to Execute it
Projects save the current Rattle state
Projects can be restored at a later time
http: // togaware. com Copyright© 2014, [email protected] 35/174
Data Mining, Rattle, and R The Power of Free/Libre and Open Source Software
Tools
Ubuntu GNU/Linux operating system
Feature rich toolkit, up-to-date, easy to install, FLOSS
RStudio
Easy to use integrated development environment, FLOSSPowerful alternative is Emacs (Speaks Statistics), FLOSS
R Statistical Software Language
Extensive, powerful, thousands of contributors, FLOSS
KnitR and LATEX
Produce beautiful documents, easily reproducible, FLOSS
http: // togaware. com Copyright© 2014, [email protected] 36/174
Data Mining, Rattle, and R Ubuntu
Using Ubuntu
Desktop Operating System (GNU/Linux)
Replacing Windows and OSX
The GNU Tool Suite based on Unix — significant heritage
Multiple specialised single task tools, working well together
Compared to single application trying to do it all
Powerful data processing from the command line:grep, awk, head, tail, wc, sed, perl, python, most, diff, make,paste, join, patch, . . .
For interacting with R — start up RStudio from the Dash
http: // togaware. com Copyright© 2014, [email protected] 37/174
Data Mining, Rattle, and R RStudio Interface
RStudio—The Default Three Panels
http: // togaware. com Copyright© 2014, [email protected] 38/174
Data Mining, Rattle, and R RStudio Interface
RStudio—With R Script File—Editor Panel
http: // togaware. com Copyright© 2014, [email protected] 39/174
Data Mining, Rattle, and R Simple Plots
Scatterplot—R Code
Our first little bit of R code:
Load a couple of packages into the R library
library(rattle) # Provides the weather dataset
library(ggplot2) # Provides the qplot() function
Then produce a quick plot using qplot()
ds <- weather
qplot(MinTemp, MaxTemp, data=ds)
http: // togaware. com Copyright© 2014, [email protected] 40/174
Data Mining, Rattle, and R Simple Plots
Scatterplot—Plot
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
● ●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
● ●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●●
● ●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
10
20
30
0 10 20MinTemp
Max
Tem
p
http: // togaware. com Copyright© 2014, [email protected] 41/174
Data Mining, Rattle, and R Simple Plots
Scatterplot—RStudio
http: // togaware. com Copyright© 2014, [email protected] 42/174
Data Mining, Rattle, and R Installing Packages
Missing Packages–Tools→Install Packages. . .
http: // togaware. com Copyright© 2014, [email protected] 43/174
Data Mining, Rattle, and R Installing Packages
RStudio—Installing ggplot2
http: // togaware. com Copyright© 2014, [email protected] 44/174
Data Mining, Rattle, and R RStudio Shortcuts
RStudio—Keyboard Shortcuts
These will become very useful!
Editor:
Ctrl-Enter will send the line of code to the R consoleCtrl-2 will move the cursor to the Console
Console:
UpArrow will cycle through previous commandsCtrl-UpArrow will search previous commandsTab will complete function names and list the argumentsCtrl-1 will move the cursor to the Editor
http: // togaware. com Copyright© 2014, [email protected] 45/174
Data Mining, Rattle, and R Basic R Commands
Basic R
library(rattle) # Load the weather dataset.
head(weather) # First 6 observations of the dataset.
## Date Location MinTemp MaxTemp Rainfall Evapora...
## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...
## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...
## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...
....
str(weather) # Struncture of the variables in the dataset.
## 'data.frame': 366 obs. of 24 variables:
## $ Date : Date, format: "2007-11-01" "2007-11-...
## $ Location : Factor w/ 46 levels "Adelaide","Alba...
## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
....
http: // togaware. com Copyright© 2014, [email protected] 46/174
Data Mining, Rattle, and R Basic R Commands
Basic R
summary(weather) # Univariate summary of the variables.
## Date Location MinTemp ...
## Min. :2007-11-01 Canberra :366 Min. :-5.30 ...
## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.30 ...
## Median :2008-05-01 Albany : 0 Median : 7.45 ...
## Mean :2008-05-01 Albury : 0 Mean : 7.27 ...
## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.50 ...
## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.90 ...
## (Other) : 0 ...
## Rainfall Evaporation Sunshine WindGust...
## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW : ...
## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW : ...
## Median : 0.00 Median : 4.20 Median : 8.60 E : ...
## Mean : 1.43 Mean : 4.52 Mean : 7.91 WNW : ...
## 3rd Qu.: 0.20 3rd Qu.: 6.40 3rd Qu.:10.50 ENE : ...
....
http: // togaware. com Copyright© 2014, [email protected] 47/174
Data Mining, Rattle, and R Visualising Data
Visual Summaries—Add A Little Colour
qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)
●
●
●●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
995
1005
1015
1025
1035
25 50 75 100Humidity3pm
Pres
sure
3pm
RainTomorrow
●
●
No
Yes
http: // togaware. com Copyright© 2014, [email protected] 48/174
Data Mining, Rattle, and R Visualising Data
Visual Summaries—Careful with Categorics
qplot(WindGustDir, Pressure3pm, data=ds)
●
●
●●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
995
1005
1015
1025
1035
N NNE NE ENE E ESE SE SSE S SSW SW WSW W WNW NW NNW NAWindGustDir
Pres
sure
3pm
http: // togaware. com Copyright© 2014, [email protected] 49/174
Data Mining, Rattle, and R Visualising Data
Visual Summaries—Add A Little Jitter
qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
995
1005
1015
1025
1035
N NNE NE ENE E ESE SE SSE S SSW SW WSW W WNW NW NNW NAWindGustDir
Pres
sure
3pm
http: // togaware. com Copyright© 2014, [email protected] 50/174
Data Mining, Rattle, and R Visualising Data
Visual Summaries—And Some Colour
qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")
●
●
●●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
1005
1015
1025
1035
N NNE NE ENE E ESE SE SSE S SSW SW WSW W WNW NW NNW NAWindGustDir
Pres
sure
3pm
http: // togaware. com Copyright© 2014, [email protected] 51/174
Data Mining, Rattle, and R Help
Getting Help—Precede Command with ?
http: // togaware. com Copyright© 2014, [email protected] 52/174
Loading, Cleaning, Exploring Data in Rattle
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 53/174
Loading, Cleaning, Exploring Data in Rattle
Loading Data
http: // togaware. com Copyright© 2014, [email protected] 54/174
Loading, Cleaning, Exploring Data in Rattle
Exploring Data
http: // togaware. com Copyright© 2014, [email protected] 55/174
Loading, Cleaning, Exploring Data in Rattle
Test Data
http: // togaware. com Copyright© 2014, [email protected] 56/174
Loading, Cleaning, Exploring Data in Rattle
Transform Data
http: // togaware. com Copyright© 2014, [email protected] 57/174
Descriptive Data Mining
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 58/174
Descriptive Data Mining Cluster Analysis
What is Cluster Analysis?
Cluster: a collection of observations
Similar to one another within the same clusterDissimilar to the observations in other clusters
Cluster analysisGrouping a set of data observations into classes
Clustering is unsupervised classification: no predefinedclasses—descriptive data mining.
Typical applications
As a stand-alone tool to get insight into data distributionAs a preprocessing step for other algorithms
http: // togaware. com Copyright© 2014, [email protected] 59/174
Descriptive Data Mining Cluster Analysis
What Is Good Clustering?
High Quality:
high intra-class similaritylow inter-class similarity
The Quality depends on:
similarity measurealgorithm for searching
Depends on the opinion of the user, and the algorithm’sability to discover hidden patterns that are of interest to theuser.
http: // togaware. com Copyright© 2014, [email protected] 60/174
Descriptive Data Mining Cluster Analysis
Major Clustering Approaches
Partitioning algorithms (kmeans, pam, clara, fanny): Constructvarious partitions and then evaluate them by some criterion. Afixed number of clusters, k, is generated. Start with an initial(perhaps random) cluster.
Hierarchical algorithms: (hclust, agnes, diana) Create ahierarchical decomposition of the set of observations using somecriterion
Density-based algorithms: based on connectivity and densityfunctions
Grid-based algorithms: based on a multiple-level granularitystructure
Model-based algorithms: (mclust for mixture of Gaussians) Amodel is hypothesized for each of the clusters and the idea is tofind the best fit of that model
http: // togaware. com Copyright© 2014, [email protected] 61/174
Descriptive Data Mining Cluster Analysis
KMeans Clustering
http: // togaware. com Copyright© 2014, [email protected] 62/174
Descriptive Data Mining Association Rules
Association Rule Mining
An unsupervised learning algorithm—descriptive data mining.
Identify items (patterns) that occur frequently together in agiven set of data.
Patterns = associations, correlations, causal structures (Rules).
Data = sets of items in . . .
transactional databaserelational databasecomplex information repositories
Rule: Body → Head [support, confidence]
http: // togaware. com Copyright© 2014, [email protected] 63/174
Descriptive Data Mining Association Rules
Typical Applications
Link analysis
Market basket analysis
Cross marketing
Customers who purchase . . .
http: // togaware. com Copyright© 2014, [email protected] 64/174
Descriptive Data Mining Association Rules
Examples
Friday ∩ Nappies → Beer[0.5%, 60%]
Age ∈ [20, 30] ∩ Income ∈ [20K , 30K ]→ MP3Player[2%, 60%]
Maths ∩ CS → HDinCS[1%, 75%]
Gladiator ∩ Patriot → Sixth Sense[0.1%, 90%]
Statins ∩ Peritonitis → Chronic Renal Failure[0.1%, 32%]
http: // togaware. com Copyright© 2014, [email protected] 65/174
Descriptive Data Mining Health Insurance Commission
Health Insurance Commision
Associations on episode database for pathology services
6.8 million records X 120 attributes (3.5GB)15 months preprocessing then 2 weeks data mining
Goal: find associations between tests
cmin = 50% and smin = 1%, 0.5%, 0.25%(1% of 6.8 million = 68,000)Unexpected/unnecessary combination of services
Refuse cover saves $550,000 per year
http: // togaware. com Copyright© 2014, [email protected] 66/174
Descriptive Data Mining Health Insurance Commission
Association Rules
http: // togaware. com Copyright© 2014, [email protected] 67/174
Predictive Data Mining: Decision Trees
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 68/174
Predictive Data Mining: Decision Trees Basics
Predictive Modelling: Classification
Goal of classification is to build models (sentences) in aknowledge representation (language) from examples of pastdecisions.
The model is to be used on unseen cases to make decisions.
Often referred to as supervised learning.
Common approaches: decision trees; neural networks; logisticregression; support vector machines.
http: // togaware. com Copyright© 2014, [email protected] 69/174
Predictive Data Mining: Decision Trees Basics
Language: Decision Trees
Knowledge representation: A flow-chart-like tree structure
Internal nodes denotes a test on a variable
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Y
Y N
Age
Gender
Male Female
< 42 > 42
http: // togaware. com Copyright© 2014, [email protected] 70/174
Predictive Data Mining: Decision Trees Basics
Tree Construction: Divide and Conquer
Decision tree induction is an example of a recursive partitioningalgorithm: divide and conquer.
At start, all the training examples are at the root
Partition examples recursively based on selected variables
+ +
+
+
+
+
−
−
+
+
++
++
+ −
_
−
−
+
−−
−
−
+
<42
>42
+
+
−
Males
Females
+
+
+−
−+ +
++
+
−
−
−
−−
+
+
−+
+
http: // togaware. com Copyright© 2014, [email protected] 71/174
Predictive Data Mining: Decision Trees Algorithm
Algorithm for Decision Tree Induction
A greedy algorithm: takes the best immediate (local) decisionwhile building the overall model
Tree constructed top-down, recursive, divide-and-conquer
Begin with all training examples at the root
Data is partitioned recursively based on selected variables
Select variables on basis of a measure
Stop partitioning when?
All samples for a given node belong to the same classThere are no remaining variables for further partitioning –majority voting is employed for classifying the leafThere are no samples left
http: // togaware. com Copyright© 2014, [email protected] 72/174
Predictive Data Mining: Decision Trees Algorithm
Basic Motivation: Entropy
We are trying to predict output Y (e.g., Yes/No) from input X .
A random data set may have high entropy:
Y is from a uniform distributiona frequency distribution would be flat!a sample will include uniformly random values of Y
A data set with low entropy:
Y ’s distribution will be very skeweda frequency distribution will have a single peaka sample will predominately contain just Yes or just No
Work towards reducing the amount of entropy in the data!
http: // togaware. com Copyright© 2014, [email protected] 73/174
Predictive Data Mining: Decision Trees Algorithm
Variable Selection Measure: Entropy
Information gain (ID3/C4.5)
Select the variable with the highest information gain
Assume there are two classes: P and N
Let the data S contain p elements of class P and n elements ofclass N
The amount of information, needed to decide if an arbitraryexample in S belongs to P or N is defined as
IE (p, n) = − p
p + nlog2
p
p + n− n
p + nlog2
n
p + n
http: // togaware. com Copyright© 2014, [email protected] 74/174
Predictive Data Mining: Decision Trees Algorithm
Variable Selection Measure: Gini
Gini index of impurity – traditional statistical measure – CART
Measure how often a randomly chosen observation is incorrectlyclassified if it were randomly classified in proportion to the actualclasses.
Calculated as the sum of the probability of each observationbeing chosen times the probability of incorrect classification,equivalently:
IG (p, n) = 1− (p2 + (1− p)2)
As with Entropy, the Gini measure is maximal when the classesare equally distributed and minimal when all observations are inone class or the other.
http: // togaware. com Copyright© 2014, [email protected] 75/174
Predictive Data Mining: Decision Trees Algorithm
Variable Selection Measure
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00Proportion of Positives
Mea
sure Formula
Info
Gini
Variable Importance Measure
http: // togaware. com Copyright© 2014, [email protected] 76/174
Predictive Data Mining: Decision Trees Algorithm
Information Gain
Now use variable A to partition S into v cells: {S1,S2, . . . ,Sv}If Si contains pi examples of P and ni examples of N, theinformation now needed to classify objects in all subtrees Si is:
E (A) =v∑
i=1
pi + ni
p + nI (pi , ni )
So, the information gained by branching on A is:
Gain(A) = I (p, n)− E (A)
So choose the variable A which results in the greatest gain ininformation.
http: // togaware. com Copyright© 2014, [email protected] 77/174
Predictive Data Mining: Decision Trees Building Decision Trees
Startup Rattle
library(rattle)
rattle()
http: // togaware. com Copyright© 2014, [email protected] 78/174
Predictive Data Mining: Decision Trees Building Decision Trees
Load Example Weather Dataset
Click on the Execute button and an example dataset is offered.
Click on Yes to load the weather dataset.
http: // togaware. com Copyright© 2014, [email protected] 79/174
Predictive Data Mining: Decision Trees Building Decision Trees
Summary of the Weather Dataset
A summary of the weather dataset is displayed.
http: // togaware. com Copyright© 2014, [email protected] 80/174
Predictive Data Mining: Decision Trees Building Decision Trees
Model Tab — Decision Tree
Click on the Model tab to display the modelling options.
http: // togaware. com Copyright© 2014, [email protected] 81/174
Predictive Data Mining: Decision Trees Building Decision Trees
Build Tree to Predict RainTomorrow
Decision Tree is the default model type—simply click Execute.
http: // togaware. com Copyright© 2014, [email protected] 82/174
Predictive Data Mining: Decision Trees Building Decision Trees
Decision Tree Predicting RainTomorrow
Click the Draw button to display a tree(Settings → Advanced Graphics).
http: // togaware. com Copyright© 2014, [email protected] 83/174
Predictive Data Mining: Decision Trees Building Decision Trees
Evaluate Decision Tree
Click Evaluate tab—options to evaluate model performance.
http: // togaware. com Copyright© 2014, [email protected] 84/174
Predictive Data Mining: Decision Trees Building Decision Trees
Evaluate Decision Tree—Error Matrix
Click Execute to display simple error matrix.Identify the True/False Positives/Negatives.
http: // togaware. com Copyright© 2014, [email protected] 85/174
Predictive Data Mining: Decision Trees Building Decision Trees
Decision Tree Risk Chart
Click the Risk type and then Execute.
http: // togaware. com Copyright© 2014, [email protected] 86/174
Predictive Data Mining: Decision Trees Building Decision Trees
Decision Tree ROC Curve
Click the ROC type and then Execute.
http: // togaware. com Copyright© 2014, [email protected] 87/174
Predictive Data Mining: Decision Trees Building Decision Trees
Score a Dataset
Click the Score type to score a new dataset using model.
http: // togaware. com Copyright© 2014, [email protected] 88/174
Predictive Data Mining: Decision Trees Building Decision Trees
Log of R Commands
Click the Log tab for a history of all your interactions.Save the log contents as a script to repeat what we did.
http: // togaware. com Copyright© 2014, [email protected] 89/174
Predictive Data Mining: Decision Trees Building Decision Trees
Log of R Commands—rpart()
Here we see the call to rpart() to build the model.Click on the Export button to save the script to file.
http: // togaware. com Copyright© 2014, [email protected] 90/174
Predictive Data Mining: Decision Trees Building Decision Trees
Help → Model → Tree
Rattle provides some basic help—click Yes for R help.
http: // togaware. com Copyright© 2014, [email protected] 91/174
Predictive Data Mining: Decision Trees Doing It In R
Weather Dataset - Inputs
ds <- weather
head(ds, 4)
## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3
## 2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7
## 3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3
## 4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1
....
summary(ds[c(3:5,23)])
## MinTemp MaxTemp Rainfall RISK_MM
## Min. :-5.30 Min. : 7.6 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.30 1st Qu.:15.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 7.45 Median :19.6 Median : 0.00 Median : 0.00
## Mean : 7.27 Mean :20.6 Mean : 1.43 Mean : 1.43
....
http: // togaware. com Copyright© 2014, [email protected] 92/174
Predictive Data Mining: Decision Trees Doing It In R
Weather Dataset - Target
target <- "RainTomorrow"
summary(ds[target])
## RainTomorrow
## No :300
## Yes: 66
(form <- formula(paste(target, "~ .")))
## RainTomorrow ~ .
(vars <- names(ds)[-c(1, 2, 23)])
## [1] "MinTemp" "MaxTemp" "Rainfall" "Evaporation"
## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "WindDir9am"
## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am"
## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "Cloud9am"
## [17] "Cloud3pm" "Temp9am" "Temp3pm" "RainToday"
## [21] "RainTomorrow"
http: // togaware. com Copyright© 2014, [email protected] 93/174
Predictive Data Mining: Decision Trees Doing It In R
Simple Train/Test Paradigm
set.seed(1421)
train <- c(sample(1:nrow(ds), 0.70*nrow(ds))) # Training dataset
head(train)
## [1] 288 298 363 107 70 232
length(train)
## [1] 256
test <- setdiff(1:nrow(ds), train) # Testing dataset
length(test)
## [1] 110
http: // togaware. com Copyright© 2014, [email protected] 94/174
Predictive Data Mining: Decision Trees Doing It In R
Display the Model
model <- rpart(form, ds[train, vars])
model
## n= 256
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 256 44 No (0.82812 0.17188)
## 2) Humidity3pm< 59.5 214 21 No (0.90187 0.09813)
## 4) WindGustSpeed< 64 204 14 No (0.93137 0.06863)
## 8) Cloud3pm< 6.5 163 5 No (0.96933 0.03067) *
## 9) Cloud3pm>=6.5 41 9 No (0.78049 0.21951)
## 18) Temp3pm< 26.1 34 4 No (0.88235 0.11765) *
## 19) Temp3pm>=26.1 7 2 Yes (0.28571 0.71429) *
....
Notice the legend to help interpret the tree.
http: // togaware. com Copyright© 2014, [email protected] 95/174
Predictive Data Mining: Decision Trees Doing It In R
Performance on Test Dataset
The predict() function is used to score new data.
head(predict(model, ds[test,], type="class"))
## 2 4 6 8 11 12
## No No No No No No
## Levels: No Yes
table(predict(model, ds[test,], type="class"), ds[test, target])
##
## No Yes
## No 77 14
## Yes 11 8
http: // togaware. com Copyright© 2014, [email protected] 96/174
Predictive Data Mining: Decision Trees Doing It In R
Example DTree Plot using Rattle
yes no
1
2
4
8
9
18 19 5
3
6 7
Humidity3pm < 60
WindGustSpeed < 64
Cloud3pm < 6.5
Temp3pm < 26
Pressure3pm >= 1015
No.83 .17100%
No.90 .10
84%
No.93 .07
80%
No.97 .03
64%
No.78 .22
16%
No.88 .12
13%
Yes.29 .71
3%
Yes.30 .70
4%
Yes.45 .55
16%
No.79 .21
7%
Yes.17 .83
9%
yes no
1
2
4
8
9
18 19 5
3
6 7
Rattle 2014−Mar−01 07:25:21 gjw
http: // togaware. com Copyright© 2014, [email protected] 97/174
Predictive Data Mining: Decision Trees Doing It In R
An R Scripting Hint
Notice the use of variables ds, target, vars.
Change these variables, and the remaining script is unchanged.
Simplifies script writing and reuse of scripts.
ds <- iris
target <- "Species"
vars <- names(ds)
Then repeat the rest of the script, without change.
http: // togaware. com Copyright© 2014, [email protected] 98/174
Predictive Data Mining: Decision Trees Doing It In R
An R Scripting Hint — Unchanged Code
This code remains the same to build the decision tree.
form <- formula(paste(target, "~ ."))
train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))
test <- setdiff(1:nrow(ds), train)
model <- rpart(form, ds[train, vars])
model
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 69 setosa (0.34286 0.32381 0.33333)
## 2) Petal.Length< 2.6 36 0 setosa (1.00000 0.00000 0.00000) *
## 3) Petal.Length>=2.6 69 34 virginica (0.00000 0.49275 0.50725)
## 6) Petal.Length< 4.95 35 2 versicolor (0.00000 0.94286 0.05714) *
## 7) Petal.Length>=4.95 34 1 virginica (0.00000 0.02941 0.97059) *
http: // togaware. com Copyright© 2014, [email protected] 99/174
Predictive Data Mining: Decision Trees Doing It In R
An R Scripting Hint — Unchanged Code
Similarly for the predictions.
head(predict(model, ds[test,], type="class"))
## 3 8 9 10 11 12
## setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
table(predict(model, ds[test,], type="class"), ds[test, target])
##
## setosa versicolor virginica
## setosa 14 0 0
## versicolor 0 15 4
## virginica 0 1 11
http: // togaware. com Copyright© 2014, [email protected] 100/174
Predictive Data Mining: Decision Trees Summary
Summary
Most widely deployed machine learning algorithm.
Simple idea, powerful learner.
Available in R through the rpart package.
Related packages include party, Cubist, C50, RWeka (J48).
http: // togaware. com Copyright© 2014, [email protected] 101/174
Predictive Data Mining: Ensembles
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 102/174
Predictive Data Mining: Ensembles
Building Multiple Models
General idea developed in Multiple Inductive Learning algorithm(Williams 1987).
Ideas were developed (ACJ 1987, PhD 1990) in the context of:
observe that variable selection methods don’t discriminate;so build multiple decision trees;then combine into a single model.
Basic idea is that multiple models, like multiple experts, mayproduce better results when working together, rather than inisolation
Two approaches covered: Boosting and Random Forests.
Meta learners.
http: // togaware. com Copyright© 2014, [email protected] 103/174
Predictive Data Mining: Ensembles Boosting
Boosting Algorithms
Basic idea: boost observations that are “hard to model.”Algorithm: iteratively build weak models using a poor learner:
Build an initial model;
Identify mis-classified cases in the training dataset;
Boost (over-represent) training observations modelled incorrectly;
Build a new model on the boosted training dataset;
Repeat.
The result is an ensemble of weighted models.
Best off the shelf model builder. (Leo Brieman)
http: // togaware. com Copyright© 2014, [email protected] 104/174
Predictive Data Mining: Ensembles Boosting
Example: Ada on Weather Data
head(weather[c(1:5, 23, 24)], 3)
## Date Location MinTemp MaxTemp Rainfall RISK_MM...
## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.6...
## 2 2007-11-02 Canberra 14.0 26.9 3.6 3.6...
## 3 2007-11-03 Canberra 13.7 23.4 3.6 39.8...
....
set.seed(42)
train <- sample(1:nrow(weather), 0.7 * nrow(weather))
(m <- ada(RainTomorrow ~ ., weather[train, -c(1:2, 23)]))
## Call:
## ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)])
##
## Loss: exponential Method: discrete Iteration: 50
....
http: // togaware. com Copyright© 2014, [email protected] 105/174
Predictive Data Mining: Ensembles Boosting
Example: Error Rate
Notice error rate decreases quickly then flattens.
plot(m)
0 10 20 30 40 50
0.06
0.08
0.10
0.12
0.14
Iteration 1 to 50
Err
or
50
1
1
1
1
1
Training Error
1 Train
http: // togaware. com Copyright© 2014, [email protected] 106/174
Predictive Data Mining: Ensembles Boosting
Example: Variable Importance
Helps understand the knowledge captured.
varplot(m)
RainToday
WindDir9am
Rainfall
WindDir3pm
WindGustDir
Pressure3pm
Sunshine
WindGustSpeed
Cloud3pm
WindSpeed9am
WindSpeed3pm
Humidity9am
Cloud9am
Evaporation
MaxTemp
Temp9am
Humidity3pm
MinTemp
Pressure9am
Temp3pm
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0.04 0.05 0.06 0.07 0.08 0.09 0.10
Variable Importance Plot
Score
http: // togaware. com Copyright© 2014, [email protected] 107/174
Predictive Data Mining: Ensembles Boosting
Example: Sample Trees
There are 50 trees in all. Here’s the first 3.
fancyRpartPlot(m$model$trees[[1]])
fancyRpartPlot(m$model$trees[[2]])
fancyRpartPlot(m$model$trees[[3]])
yes no
1
2
4
5
10 11 3
Cloud3pm < 7.5
Pressure3pm >= 1012
Humidity3pm < 42
−1.96 .04100%
−1.97 .03
94%
−1.99 .01
76%
−1.87 .13
18%
−1.96 .04
9%
1.72 .28
9%
1.42 .58
6%
yes no
1
2
4
5
10 11 3
Rattle 2014−Mar−01 07:25:22 gjw
yes no
1
2
3
6 7
Pressure3pm >= 1012
Sunshine >= 11
−1.96 .04100%
−1.98 .02
80%
1.82 .18
20%
−11.00 .00
6%
1.63 .37
14%
yes no
1
2
3
6 7
Rattle 2014−Mar−01 07:25:23 gjw
yes no
1
2
3
6 7
Pressure3pm >= 1012
MaxTemp >= 27
−1.97 .03100%
−1.99 .01
82%
1.83 .17
18%
−1.97 .03
6%
1.67 .33
12%
yes no
1
2
3
6 7
Rattle 2014−Mar−01 07:25:23 gjw
http: // togaware. com Copyright© 2014, [email protected] 108/174
Predictive Data Mining: Ensembles Boosting
Example: Performance
predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)
0.10.20.30.40.50.60.70.80.9
Risk Scores
23%
1
2
3
4
LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift
0
20
40
60
80
100
0 20 40 60 80 100Caseload (%)
Perfo
rman
ce (%
)
Recall (88%)
Risk (94%)
Precision
http: // togaware. com Copyright© 2014, [email protected] 109/174
Predictive Data Mining: Ensembles Boosting
Example Applications
ATO Application: What life events affect compliance?
First application of the technology — 1995Decision Stumps: Age > NN; Change in Marital Status
Boosted Neural Networks
OCR using neural networks as base learnersDrucker, Schapire, Simard, 1993
http: // togaware. com Copyright© 2014, [email protected] 110/174
Predictive Data Mining: Ensembles Boosting
Summary
1 Boosting is implemented in R in the ada library
2 AdaBoost uses e−m; LogitBoost uses log(1 + e−m); Doom IIuses 1− tanh(m)
3 AdaBoost tends to be sensitive to noise (addressed byBrownBoost)
4 AdaBoost tends not to overfit, and as new models are added,generalisation error tends to improve.
5 Can be proved to converge to a perfect model if the learners arealways better than chance.
http: // togaware. com Copyright© 2014, [email protected] 111/174
Predictive Data Mining: Ensembles Random Forests
Random Forests
Original idea from Leo Brieman and Adele Cutler.
The name is Licensed to Salford Systems!
Hence, R package is randomForest.
Typically presented in context of decision trees.
Random Multinomial Logit uses multiple multinomial logitmodels.
http: // togaware. com Copyright© 2014, [email protected] 112/174
Predictive Data Mining: Ensembles Random Forests
Random Forests
Build many decision trees (e.g., 500).
For each tree:
Select a random subset of the training set (N);Choose different subsets of variables for each node of thedecision tree (m << M);Build the tree without pruning (i.e., overfit)
Classify a new entity using every decision tree:
Each tree “votes” for the entity.The decision with the largest number of votes wins!The proportion of votes is the resulting score.
http: // togaware. com Copyright© 2014, [email protected] 113/174
Predictive Data Mining: Ensembles Random Forests
Example: RF on Weather Data
set.seed(42)
(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
na.action=na.roughfix,
importance=TRUE))
##
## Call:
## randomForest(formula=RainTomorrow ~ ., data=weath...
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.67%
## Confusion matrix:
## No Yes class.error
## No 211 4 0.0186
## Yes 31 10 0.7561
http: // togaware. com Copyright© 2014, [email protected] 114/174
Predictive Data Mining: Ensembles Random Forests
Example: Error Rate
Error rate decreases quickly then flattens over the 500 trees.
plot(m)
0 100 200 300 400 500
0.0
0.2
0.4
0.6
0.8
m
trees
Err
or
http: // togaware. com Copyright© 2014, [email protected] 115/174
Predictive Data Mining: Ensembles Random Forests
Example: Variable Importance
Helps understand the knowledge captured.
varImpPlot(m, main="Variable Importance")
RainTodayRainfallWindDir3pmWindDir9amEvaporationWindGustDirHumidity9amWindSpeed9amWindSpeed3pmCloud9amMinTempHumidity3pmTemp9amPressure9amMaxTempWindGustSpeedTemp3pmPressure3pmCloud3pmSunshine
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0 5 10 15MeanDecreaseAccuracy
RainTodayRainfallWindDir3pmWindGustDirWindDir9amEvaporationCloud9amWindSpeed9amWindSpeed3pmHumidity9amMaxTempTemp9amTemp3pmMinTempHumidity3pmWindGustSpeedPressure9amCloud3pmSunshinePressure3pm
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0 2 4 6 8MeanDecreaseGini
Variable Importance
http: // togaware. com Copyright© 2014, [email protected] 116/174
Predictive Data Mining: Ensembles Random Forests
Example: Sample Trees
There are 500 trees in all. Here’s some rules from the first tree.
## Random Forest Model 1
##
## ------------------------------------------------------...
## Tree 1 Rule 1 Node 30 Decision No
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
## 3: Cloud3pm <= 2.5
## 4: WindDir9am IN ("NNE")
## 5: Sunshine <= 10.25
## 6: Temp3pm <= 17.55
## ------------------------------------------------------...
## Tree 1 Rule 2 Node 31 Decision Yes
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
....
http: // togaware. com Copyright© 2014, [email protected] 117/174
Predictive Data Mining: Ensembles Random Forests
Example: Performance
predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)
0.10.20.30.40.50.60.70.80.9
Risk Scores
22%
1
2
3
4
LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift
0
20
40
60
80
100
0 20 40 60 80 100Caseload (%)
Perfo
rman
ce (%
)
Recall (92%)
Risk (97%)
Precision
http: // togaware. com Copyright© 2014, [email protected] 118/174
Predictive Data Mining: Ensembles Random Forests
Features of Random Forests: By Brieman
Most accurate of current algorithms.
Runs efficiently on large data sets.
Can handle thousands of input variables.
Gives estimates of variable importance.
http: // togaware. com Copyright© 2014, [email protected] 119/174
Predictive Data Mining: Ensembles Random Forests
Random Forests
Ensemble: Multiple models working together
Often better than a single model
Variance and bias of the model are reduced
The best available models today - accurate and robust
In daily use in very many areas of application
http: // togaware. com Copyright© 2014, [email protected] 120/174
Moving into R and Scripting our Analyses
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 121/174
Moving into R and Scripting our Analyses
Data Scientists are Programmers of Data
But. . .
Data scientists are programmers of data
A GUI can only do so much
R is a powerful statistical language
Data Scientists Desire. . .
Scripting
Transparency
Repeatability
Sharing
http: // togaware. com Copyright© 2014, [email protected] 122/174
Moving into R and Scripting our Analyses
From GUI to CLI — Rattle’s Log Tab
http: // togaware. com Copyright© 2014, [email protected] 123/174
Moving into R and Scripting our Analyses
From GUI to CLI — Rattle’s Log Tab
http: // togaware. com Copyright© 2014, [email protected] 124/174
Moving into R and Scripting our Analyses
Step 1: Load the Dataset
dsname <- "weather"
ds <- get(dsname)
dim(ds)
## [1] 366 24
names(ds)
## [1] "Date" "Location" "MinTemp" "...
## [5] "Rainfall" "Evaporation" "Sunshine" "...
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "...
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "...
....
http: // togaware. com Copyright© 2014, [email protected] 125/174
Moving into R and Scripting our Analyses
Step 2: Observe the Data — Observations
head(ds)
## Date Location MinTemp MaxTemp Rainfall Evapora...
## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...
## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...
## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...
....
tail(ds)
## Date Location MinTemp MaxTemp Rainfall Evapo...
## 361 2008-10-26 Canberra 7.9 26.1 0 ...
## 362 2008-10-27 Canberra 9.0 30.7 0 ...
## 363 2008-10-28 Canberra 7.1 28.4 0 ...
....
http: // togaware. com Copyright© 2014, [email protected] 126/174
Moving into R and Scripting our Analyses
Step 2: Observe the Data — Structure
str(ds)
## 'data.frame': 366 obs. of 24 variables:
## $ Date : Date, format: "2007-11-01" "2007-11-...
## $ Location : Factor w/ 46 levels "Adelaide","Alba...
## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
## $ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 1...
## $ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16...
## $ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6...
## $ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4....
## $ WindGustDir : Ord.factor w/ 16 levels "N"<"NNE"<"N...
## $ WindGustSpeed: num 30 39 85 54 50 44 43 41 48 31 ...
## $ WindDir9am : Ord.factor w/ 16 levels "N"<"NNE"<"N...
## $ WindDir3pm : Ord.factor w/ 16 levels "N"<"NNE"<"N...
....
http: // togaware. com Copyright© 2014, [email protected] 127/174
Moving into R and Scripting our Analyses
Step 2: Observe the Data — Summary
summary(ds)
## Date Location MinTemp ...
## Min. :2007-11-01 Canberra :366 Min. :-5.3...
## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.3...
## Median :2008-05-01 Albany : 0 Median : 7.4...
## Mean :2008-05-01 Albury : 0 Mean : 7.2...
## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.5...
## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.9...
## (Other) : 0 ...
## Rainfall Evaporation Sunshine Wind...
## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW ...
## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW ...
## Median : 0.00 Median : 4.20 Median : 8.60 E ...
....
http: // togaware. com Copyright© 2014, [email protected] 128/174
Moving into R and Scripting our Analyses
Step 2: Observe the Data — Variables
id <- c("Date", "Location")
target <- "RainTomorrow"
risk <- "RISK_MM"
(ignore <- union(id, risk))
## [1] "Date" "Location" "RISK_MM"
(vars <- setdiff(names(ds), ignore))
## [1] "MinTemp" "MaxTemp" "Rainfall" "...
## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "...
## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "...
## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "...
....
http: // togaware. com Copyright© 2014, [email protected] 129/174
Moving into R and Scripting our Analyses
Step 3: Clean the Data — Remove Missing
dim(ds)
## [1] 366 24
sum(is.na(ds[vars]))
## [1] 47
ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]
http: // togaware. com Copyright© 2014, [email protected] 130/174
Moving into R and Scripting our Analyses
Step 3: Clean the Data — Remove Missing
dim(ds)
## [1] 328 24
sum(is.na(ds[vars]))
## [1] 0
http: // togaware. com Copyright© 2014, [email protected] 131/174
Moving into R and Scripting our Analyses
Step 3: Clean the Data—Target as Categoric
summary(ds[target])
## RainTomorrow
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.183
## 3rd Qu.:0.000
## Max. :1.000
....
ds[target] <- as.factor(ds[[target]])
levels(ds[target]) <- c("No", "Yes")
http: // togaware. com Copyright© 2014, [email protected] 132/174
Moving into R and Scripting our Analyses
Step 3: Clean the Data—Target as Categoric
summary(ds[target])
## RainTomorrow
## 0:268
## 1: 60
0
100
200
0 1RainTomorrow
coun
t
http: // togaware. com Copyright© 2014, [email protected] 133/174
Moving into R and Scripting our Analyses
Step 4: Prepare for Modelling
(form <- formula(paste(target, "~ .")))
## RainTomorrow ~ .
(nobs <- nrow(ds))
## [1] 328
train <- sample(nobs, 0.70*nobs)
length(train)
## [1] 229
test <- setdiff(1:nobs, train)
length(test)
## [1] 99
http: // togaware. com Copyright© 2014, [email protected] 134/174
Moving into R and Scripting our Analyses
Step 5: Build the Model—Random Forest
library(randomForest)
model <- randomForest(form, ds[train, vars], na.action=na.omit)
model
##
## Call:
## randomForest(formula=form, data=ds[train, vars], ...
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
....
http: // togaware. com Copyright© 2014, [email protected] 135/174
Moving into R and Scripting our Analyses
Step 6: Evaluate the Model—Risk Chart
pr <- predict(model, ds[test,], type="prob")[,2]
riskchart(pr, ds[test, target], ds[test, risk],
title="Random Forest - Risk Chart",
risk=risk, recall=target, thresholds=c(0.35, 0.15))
0.10.20.30.40.50.60.70.80.9Risk Scores
15%1
2
3
4
5
6
LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift
0
20
40
60
80
100
0 20 40 60 80 100Caseload (%)
Per
form
ance
(%
)
RainTomorrow (99%)
RISK_MM (100%)
Precision
Random Forest − Risk Chart
http: // togaware. com Copyright© 2014, [email protected] 136/174
Literate Data Mining in R
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 137/174
Literate Data Mining in R Motivation
Why is Reproducibility Important?
Your Research Leader or Executive drops by and asks:
“Remember that research you did last year? I’ve heard there isan update on the data that you used. Can you add the new datain and repeat the same analysis?”
“Jo Bloggs did a great analysis of the company returns data justbefore she left. Can you get someone else to analyse the newdata set using the same methods, and so produce an updatedreport that we can present to the Exec next week?”
“The fraud case you provided an analysis of last year has finallyreached the courts. We need to ensure we have a clear trail of thedata sources, the analyses performed, and the results obtained,to stand up in court. Could you document these please.”
http: // togaware. com Copyright© 2014, [email protected] 138/174
Literate Data Mining in R Motivation
Literate Data Mining Overview
One document to intermix the analysis, code, and results
Authors productive with narrative and code in one document
Sweave (Leisch 2002) and now KnitR (Yihui 2011)
Embed R code into LATEX documents for typesetting
KnitR also supports publishing to the web
http: // togaware. com Copyright© 2014, [email protected] 139/174
Literate Data Mining in R Motivation
Why Reproducible Data Mining?
Automatically regenerate documents when code, data, orassumptions change.
Eliminate errors that occur when transcribing results intodocuments.
Record the context for the analysis and decisions made about thetype of analysis to perform in the one place.
Document the processes to provide integrity for the conclusionsof the analysis.
Share approach with others for peer review and for learning fromeach other—engender a continuous learning environment.
http: // togaware. com Copyright© 2014, [email protected] 140/174
Literate Data Mining in R Motivation
Prime Objective: Trustworthy Software
Those who receive the results of modern data analysis havelimited opportunity to verify the results by direct observation.Users of the analysis have no option but to trust the analysis,and by extension the software that produced it. This places anobligation on all creators of software to program in such a waythat the computations can be understood and trusted. Thisobligation I label the Prime Directive.
John Chambers (2008)Software for Data Analysis: Programming with R
http: // togaware. com Copyright© 2014, [email protected] 141/174
Literate Data Mining in R Motivation
Beautiful Output by KnitR
KnitR combined with LATEX will
Intermix analysis and results of analysis
Automatically generate graphics and tables
Support reproducible and transparent analysis
Produce the best looking reports.
http: // togaware. com Copyright© 2014, [email protected] 142/174
Literate Data Mining in R Motivation
Beautiful Output by Default
The reader wants to read the document and easily do so!
Code highlighting is done automatically
Default theme is carefully designed
Many other themes are available
R Code is “properly” reformatted
Analyses (Graphs and Tables) automatically included.
http: // togaware. com Copyright© 2014, [email protected] 143/174
Literate Data Mining in R Using RStudio
Supporting Technology
A suite of Free and Open Source Software — FLOSS
RStudio — Creating, managing, compiling documents
LATEX — Markup language for typesetting a document
R — Statistical analysis language
KnitR — Integrator of typesetting and analysis
http: // togaware. com Copyright© 2014, [email protected] 144/174
Literate Data Mining in R Using RStudio
Using RStudio
Simplified interaction with R, LATEX, and KnitR
Executes R code one line at a time
Formats LATEX documents and provides and spell checking
A single click compile to PDF and synchronised views
Demonstrate: Startup and explore RStudio.
http: // togaware. com Copyright© 2014, [email protected] 145/174
Literate Data Mining in R Using RStudio
Introducing LATEX
A text markup language rather than a WYSIWYG.
Based on TEX from 1977 — very stable and powerful.
LATEX is easier to use macro package built on TEX.
Ensures consistent style (layout, fonts, tables, maths, etc.)
Automatic indexes, footnotes and references.
Documents are well structured and are clear text.
Has a learning curve.
http: // togaware. com Copyright© 2014, [email protected] 146/174
Literate Data Mining in R Using RStudio
Basic LATEX Usage
\documentclass{article}
\begin{document}
\end{document}
Demonstrate Create a new Sweave document in RStudio
http: // togaware. com Copyright© 2014, [email protected] 147/174
Literate Data Mining in R Using RStudio
Structures
\documentclass{article}
\begin{document}
\section{Introduction}
...
\subsection{Concepts}
...
\end{document}
http: // togaware. com Copyright© 2014, [email protected] 148/174
Literate Data Mining in R Using RStudio
Formats
\documentclass{article}
\begin{document}
\begin{itemize}
\item ABC
\item DEF
\end{itemize}
This if \textbf{bold} text or \textbf{italic} text, ...
\end{document}
http: // togaware. com Copyright© 2014, [email protected] 149/174
Literate Data Mining in R Using RStudio
RStudio Support for LATEX
RStudio provides excellent support for working with LATEX documents
Helps to avoid having to know too much abuot LATEX
Best illustrated through a demonstration
Format menu
Section commandsFont commandsList commandsVerbatim/Block commands
Spell Checker
Compile PDF
Demonstrate: Start a new document, add contents, format to PDF.
http: // togaware. com Copyright© 2014, [email protected] 150/174
Literate Data Mining in R Incorporating R Code
Incorporating R Code
We insert R code in a Chunk starting with << >>=
We terminate the Chunk with @
Save LATEX with extension Rnw
This Chunk<<simple_example>>=
x <- sum(1:10)
x
@
Produces
x <- sum(1:10)
x
## [1] 55
Demonstrate: Do this in RStudio
http: // togaware. com Copyright© 2014, [email protected] 151/174
Literate Data Mining in R Incorporating R Code
Making You Look Good
<<format_example>>=
for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}
@
for(i in 1:5)
{j <- cos(sin(i)*i^2)+3
print(j-5)
}
## [1] -1.334
## [1] -2.88
## [1] -1.704
## [1] -1.103
....
http: // togaware. com Copyright© 2014, [email protected] 152/174
Literate Data Mining in R Incorporating R Code
R Within the Text
Include information about data within the narrative.
We can do that with \Sexpr{...}.
Our dataset has \Sexpr{nrow(ds)} observations of\Sexpr{ncol(ds)} variables.
Becomes
Our dataset has 328 observations of 24 variables.
Better Still: \Sexpr{format(nrow(ds), big.mark=",")}
Our dataset has 328 observations of 24 variables.
http: // togaware. com Copyright© 2014, [email protected] 153/174
Literate Data Mining in R Incorporating R Code
A Simple Table
library(xtable)
obs <- sample(1:nrow(weatherAUS), 8)
vars <- 2:6
xtable(weatherAUS[obs, vars])
Location MinTemp MaxTemp Rainfall Evaporation
79696 AliceSprings 5.90 21.70 0.00 4.0032428 Ballarat 2.00 10.60 0.2058707 MountGambier 8.50 19.70 2.20 1.6022420 WaggaWagga 1.30 17.40 0.00 3.6050382 Cairns 24.00 31.80 0.00 4.60
1441 Albury 11.20 32.20 0.0078992 AliceSprings 13.80 30.90 0.00 5.8066576 PearceRAAF 5.60 22.90 0.00
http: // togaware. com Copyright© 2014, [email protected] 154/174
Literate Data Mining in R Incorporating R Code
Table: Exclude Row Names
print(xtable(weatherAUS[obs, vars]),
include.rownames=FALSE)
Location MinTemp MaxTemp Rainfall Evaporation
AliceSprings 5.90 21.70 0.00 4.00Ballarat 2.00 10.60 0.20MountGambier 8.50 19.70 2.20 1.60WaggaWagga 1.30 17.40 0.00 3.60Cairns 24.00 31.80 0.00 4.60Albury 11.20 32.20 0.00AliceSprings 13.80 30.90 0.00 5.80PearceRAAF 5.60 22.90 0.00
http: // togaware. com Copyright© 2014, [email protected] 155/174
Literate Data Mining in R Incorporating R Code
Table: Limit Number of Digits
print(xtable(weatherAUS[obs, vars],
digits=1),
include.rownames=FALSE)
Location MinTemp MaxTemp Rainfall Evaporation
AliceSprings 5.9 21.7 0.0 4.0Ballarat 2.0 10.6 0.2MountGambier 8.5 19.7 2.2 1.6WaggaWagga 1.3 17.4 0.0 3.6Cairns 24.0 31.8 0.0 4.6Albury 11.2 32.2 0.0AliceSprings 13.8 30.9 0.0 5.8PearceRAAF 5.6 22.9 0.0
http: // togaware. com Copyright© 2014, [email protected] 156/174
Literate Data Mining in R Incorporating R Code
Table: Tiny Font
vars <- 2:8
print(xtable(weatherAUS[obs, vars],
digits=0),
size="tiny",
include.rownames=FALSE)
Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
AliceSprings 6 22 0 4 11 SEBallarat 2 11 0 WMountGambier 8 20 2 2 3 SWWaggaWagga 1 17 0 4 11 SSWCairns 24 32 0 5 9 NEAlbury 11 32 0 NNEAliceSprings 14 31 0 6 11 NPearceRAAF 6 23 0 11 S
http: // togaware. com Copyright© 2014, [email protected] 157/174
Literate Data Mining in R Incorporating R Code
Table: Column Alignment
vars <- 2:8
print(xtable(weatherAUS[obs, vars],
digits=0,
align="rlrrrrrr"),
size="tiny")
Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
79696 AliceSprings 6 22 0 4 11 SE32428 Ballarat 2 11 0 W58707 MountGambier 8 20 2 2 3 SW22420 WaggaWagga 1 17 0 4 11 SSW50382 Cairns 24 32 0 5 9 NE
1441 Albury 11 32 0 NNE78992 AliceSprings 14 31 0 6 11 N66576 PearceRAAF 6 23 0 11 S
http: // togaware. com Copyright© 2014, [email protected] 158/174
Literate Data Mining in R Incorporating R Code
Table: Caption
print(xtable(weatherAUS[obs, vars],
digits=1,
caption="This is the table caption."),
size="tiny")
Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
79696 AliceSprings 5.9 21.7 0.0 4.0 10.8 SE32428 Ballarat 2.0 10.6 0.2 W58707 MountGambier 8.5 19.7 2.2 1.6 2.8 SW22420 WaggaWagga 1.3 17.4 0.0 3.6 10.8 SSW50382 Cairns 24.0 31.8 0.0 4.6 8.8 NE
1441 Albury 11.2 32.2 0.0 NNE78992 AliceSprings 13.8 30.9 0.0 5.8 11.2 N66576 PearceRAAF 5.6 22.9 0.0 11.0 S
Table : This is the table caption.
http: // togaware. com Copyright© 2014, [email protected] 159/174
Literate Data Mining in R Incorporating R Code
Plots
library(ggplot2)
cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))
g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))
g <- g + geom_density(alpha = 0.55)
print(g)
0.00
0.05
0.10
0.15
0.20
10 20 30 40Temp3pm
dens
ity
Location
Canberra
Darwin
Melbourne
Sydney
http: // togaware. com Copyright© 2014, [email protected] 160/174
Literate Data Mining in R Our First KnitR Document
Create a KnitR Document: New→R Sweave
http: // togaware. com Copyright© 2014, [email protected] 161/174
Literate Data Mining in R Our First KnitR Document
Setup KnitR
We wish to use KnitR rather than the older Sweave processor
In RStudio we can configure the options to use knitr:
Select Tools→Options
Choose the Sweave group
Choose knitr for Weave Rnw files using:
The remaining defaults should be okay
Click Apply and then OK
http: // togaware. com Copyright© 2014, [email protected] 162/174
Literate Data Mining in R Our First KnitR Document
Simple KnitR Document
Insert the following into your new KnitR document:
\title{Sample KnitR Document}
\author{Graham Williams}
\maketitle
\section*{My First Section}
This is some text that is automatically typeset
by the LaTeX processor to produce well formatted
quality output as PDF.
http: // togaware. com Copyright© 2014, [email protected] 163/174
Literate Data Mining in R Our First KnitR Document
Simple KnitR Document
http: // togaware. com Copyright© 2014, [email protected] 164/174
Literate Data Mining in R Our First KnitR Document
Simple KnitR Document—Resulting PDF
Result of Compile PDF
http: // togaware. com Copyright© 2014, [email protected] 165/174
Literate Data Mining in R Including R Commands in KnitR
KnitR: Add R Commands
R code can be used to generate results into the document:
<<echo=FALSE, message=FALSE>>=
library(rattle) # Provides the weather dataset
library(ggplot2) # Provides the qplot() function
ds <- weather
qplot(MinTemp, MaxTemp, data=ds)
@
http: // togaware. com Copyright© 2014, [email protected] 166/174
Literate Data Mining in R Including R Commands in KnitR
KnitR Document With R Code
http: // togaware. com Copyright© 2014, [email protected] 167/174
Literate Data Mining in R Including R Commands in KnitR
Simple KnitR Document—PDF with Plot
Result of Compile PDF
http: // togaware. com Copyright© 2014, [email protected] 168/174
Literate Data Mining in R Basics Cheat Sheet
LaTeX Basics
\subsection*{...} % Introduce a Sub Section
\subsubsection*{...} % Introduce a Sub Sub Section
\textbf{...} % Bold font
\textit{...} % Italic font
\begin{itemize} % A bullet list
\item ...
\item ...
\end{itemize}
Plus an extensive collection of other markup and capabilities.
http: // togaware. com Copyright© 2014, [email protected] 169/174
Literate Data Mining in R Basics Cheat Sheet
KnitR Basics
echo=FALSE # Do not display the R code
eval=TRUE # Evaluate the R code
results="hide" # Hide the results of the R commands
fig.width=10 # Extend figure width from 7 to 10 inches
fig.height=8 # Extend figure height from 7 to 8 inches
out.width="0.8\\textwidth" # Fit figure 80% page width
out.height="0.5\\textheight" # Fit figure 50% page height
Plus an extensive collection of other options.
http: // togaware. com Copyright© 2014, [email protected] 170/174
Literate Data Mining in R Basics Cheat Sheet
Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
http: // togaware. com Copyright© 2014, [email protected] 171/174
Literate Data Mining in R Resources
Resources: One Page R
A new series of guides to using R.
http://onepager.togaware.com
Data Preparation template:http://onepager.togaware.com/DataO.pdf
Model Building template:http://onepager.togaware.com/ModelO.pdf
Plots using GGPlot2:http://onepager.togaware.com/GGPlot2O.pdf
Test Mining in R:http://onepager.togaware.com/TextMiningO.pdf
http: // togaware. com Copyright© 2014, [email protected] 172/174
Literate Data Mining in R Resources
Resources and References
OnePageR: http://onepager.togaware.com – Tutorial Notes
Rattle: http://rattle.togaware.com
Guides: http://datamining.togaware.com
Practise: http://analystfirst.com
Book: Data Mining using Rattle/R
Chapter: Rattle and Other Tales
Paper: A Data Mining GUI for R — R Journal, Volume 1(2)
http: // togaware. com Copyright© 2014, [email protected] 173/174
That Is All Folks Time For Questions
Thank You
This document, sourced from CUNYWorkshop.Rnw revision 315, was processed by KnitR version 1.5 of 2013-09-28 and took 11.4 seconds toprocess. It was generated by gjw on theano running Ubuntu 13.10 with Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz having 4 cores and 3.9GB
of RAM. It completed the processing 2014-03-01 07:25:28.
http: // togaware. com Copyright© 2014, [email protected] 174/174