
A Data Mining Workshop

Excavating Knowledge from Data: Introducing Data Mining using R

Graham.Williams@togaware.com

Data Scientist, Australian Taxation Office

Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra

Fellow, Institute of Analytics Professionals of Australia

Graham.Williams@togaware.com

http://datamining.togaware.com

Visit http://onepager.togaware.com for Workshop Notes
http://togaware.com Copyright © 2014, Graham.Williams@togaware.com 1/174

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 2/174

R: A Language for Data Mining

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 3/174

R: A Language for Data Mining Installing R

Installing R and Rattle

First task is to install R

As free/libre open source software (FLOSS or FOSS), R and Rattle are available to all, with no limitations on our freedom to use and share the software, except to share and share alike.

Visit CRAN at http://cran.rstudio.com

Visit Rattle at http://rattle.togaware.com

Linux: Install packages (Ubuntu is recommended)
$ wajig install r-recommended r-cran-rattle

Windows: Download and install from CRAN

MacOSX: Download and install from CRAN

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 4/174

R: A Language for Data Mining Data Science with R?

Why do Data Science with R?

Most widely used Data Mining and Machine Learning Package

Machine Learning
Statistics
Software Engineering and Programming with Data
But not the nicest of languages for a Computer Scientist!

Free (Libre) Open Source Statistical Software

. . . all modern statistical approaches

. . . many/most machine learning algorithms

. . . opportunity to readily add new algorithms

That is important for us in the research community
Get our algorithms out there and being used—impact!!!

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 5/174


R: A Language for Data Mining Popularity of R?

How Popular is R? Discussion List Traffic

Monthly email traffic on software’s main discussion list.

Source: http://r4stats.com/articles/popularity/
http://togaware.com Copyright © 2014, Graham.Williams@togaware.com 6/174

R: A Language for Data Mining Popularity of R?

How Popular is R? Discussion Topics

Number of discussions on popular Q&A forums in 2013.

Source: http://r4stats.com/articles/popularity/

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 7/174

R: A Language for Data Mining Popularity of R?

How Popular is R? R versus SAS

Number of R/SAS related posts to Stack Overflow by week.

Source: http://r4stats.com/articles/popularity/

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 8/174

R: A Language for Data Mining Popularity of R?

How Popular is R? Professional Forums

Registered for the main discussion group for each software.

Source: http://r4stats.com/articles/popularity/

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 9/174

R: A Language for Data Mining Popularity of R?

How Popular is R? Used in Analytics Competitions

Software used in data analysis competitions in 2011.

Source: http://r4stats.com/articles/popularity/

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 10/174

R: A Language for Data Mining Popularity of R?

How Popular is R? User Survey

Rexer Analytics Survey 2010 results for data mining/analytic tools.

Source: http://r4stats.com/articles/popularity/

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 11/174

R: A Language for Data Mining Popularity of R?

What is R?

R — The Video

A 90 Second Promo from Revolution Analytics

http://www.revolutionanalytics.com/what-is-open-source-r/

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 12/174

Data Mining, Rattle, and R

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 13/174

Data Mining, Rattle, and R Definition

Data Mining

A data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 14/174

Data Mining, Rattle, and R Definition

Data Mining

Application of

Machine Learning
Statistics
Software Engineering and Programming with Data
Effective Communications and Intuition

. . . to Datasets that vary by Volume, Velocity, Variety, Value, Veracity

. . . to discover new knowledge

. . . to improve business outcomes

. . . to deliver better tailored services

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 15/174

Data Mining, Rattle, and R Definition

Data Mining in Research

Health Research: Adverse reactions using linked Pharmaceutical, General Practitioner, Hospital, Pathology datasets.

Astronomy: Microlensing events in the Large Magellanic Cloud of several million observed stars (out of 10 billion).

Psychology: Investigation of age-of-onset for Alzheimer's disease from 75 variables for 800 people.

Social Sciences: Survey evaluation. Social network analysis - identifying key influencers.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 16/174

Data Mining, Rattle, and R Definition

Data Mining in Government

Australian Taxation Office: Lodgment ($110M), Tax Havens ($150M), Tax Fraud ($250M)

Immigration and Border Control: Check passengers before boarding

Health and Human Services: Doctor shoppers, over-servicing

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 17/174

Data Mining, Rattle, and R Definition

The Business of Data Mining

SAS has annual revenues of $3B (2013)

IBM bought SPSS for $1.2B (2009)

Analytics is a >$100B business, expected to exceed $320B by 2020

Amazon, eBay/PayPal, Google, Facebook, LinkedIn, . . .

Shortage of 180,000 data scientists in the US by 2018 (McKinsey) . . .

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 18/174

Data Mining, Rattle, and R Algorithms

Basic Tools: Data Mining Algorithms

Cluster Analysis (kmeans, wskm)

Association Analysis (arules)

Linear Discriminant Analysis (lda)

Logistic Regression (glm)

Decision Trees (rpart, wsrpart)

Random Forests (randomForest, wsrf)

Boosted Stumps (ada)

Neural Networks (nnet)

Support Vector Machines (kernlab)

. . .

That's a lot of tools to learn in R!
Many with different interfaces and options.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 19/174

Data Mining, Rattle, and R A GUI for Data Mining

Why a GUI?

Statistics can be complex and traps await

So many tools in R to deliver insights

Effective analyses should be scripted

Scripting also required for repeatability

R is a language for programming with data

How to remember how to do all of this in R?
How to skill up 150 data analysts with Data Mining?

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 20/174

Data Mining, Rattle, and R A GUI for Data Mining

Users of Rattle

Today, Rattle is used world wide in many industries

Health analytics

Customer segmentation and marketing

Fraud detection

Government

It is used by

Universities to teach Data Mining

Within research projects for basic analyses

Consultants and Analytics Teams across business

It is and will remain freely available.

CRAN and http://rattle.togaware.com

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 21/174

Data Mining, Rattle, and R Setting Things Up

Installation

Rattle is built using R

Need to download and install R from cran.r-project.org

Recommended: also install RStudio from www.rstudio.org

Then start up RStudio and install Rattle:

install.packages("rattle")

Then we can start up Rattle:

rattle()

Required packages are loaded as needed.
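If the install complains about missing supporting packages, a fuller install pulls in the suggested dependencies as well (a minimal sketch; the exact dependency set, and the GTK2 requirement for the GUI, vary with the Rattle version):

# Install rattle along with the packages it declares as suggested.
install.packages("rattle", dependencies=c("Depends", "Imports", "Suggests"))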

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 22/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Startup

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 23/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Loading Data

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 24/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Explore Distribution

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 25/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Explore Correlations

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 26/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Hierarchical Cluster

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 27/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Decision Tree

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 28/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Decision Tree Plot

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 29/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Random Forest

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 30/174

Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Risk Chart

[Risk chart: Random Forest, weather.csv [test], target RainTomorrow. Performance (%) against Caseload (%), showing the RainTomorrow recall curve (92%), the Rain in MM risk curve (97%), Precision, and Lift.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 31/174

Data Mining, Rattle, and R Programming with Data

Data Scientists are Programmers of Data

But. . .

Data scientists are programmers of data

A GUI can only do so much

R is a powerful statistical language

Data Scientists Desire. . .

Scripting

Transparency

Repeatability

Sharing

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 32/174

Data Mining, Rattle, and R Programming with Data

From GUI to CLI — Rattle’s Log Tab

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 33/174

Data Mining, Rattle, and R Programming with Data

From GUI to CLI — Rattle’s Log Tab

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 34/174

Data Mining, Rattle, and R Programming with Data

Interface Notes

Work through the tabs from left to right

After setting up a tab we need to Execute it

Projects save the current Rattle state

Projects can be restored at a later time

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 35/174

Data Mining, Rattle, and R The Power of Free/Libre and Open Source Software

Tools

Ubuntu GNU/Linux operating system

Feature rich toolkit, up-to-date, easy to install, FLOSS

RStudio

Easy to use integrated development environment, FLOSS
A powerful alternative is Emacs (Speaks Statistics), FLOSS

R Statistical Software Language

Extensive, powerful, thousands of contributors, FLOSS

KnitR and LaTeX

Produce beautiful documents, easily reproducible, FLOSS

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 36/174

Data Mining, Rattle, and R Ubuntu

Using Ubuntu

Desktop Operating System (GNU/Linux)

Replacing Windows and OSX

The GNU Tool Suite based on Unix — significant heritage

Multiple specialised single task tools, working well together

Compared to single application trying to do it all

Powerful data processing from the command line: grep, awk, head, tail, wc, sed, perl, python, most, diff, make, paste, join, patch, . . .

For interacting with R — start up RStudio from the Dash

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 37/174

Data Mining, Rattle, and R RStudio Interface

RStudio—The Default Three Panels

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 38/174

Data Mining, Rattle, and R RStudio Interface

RStudio—With R Script File—Editor Panel

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 39/174

Data Mining, Rattle, and R Simple Plots

Scatterplot—R Code

Our first little bit of R code:

Load a couple of packages into the R library

library(rattle) # Provides the weather dataset

library(ggplot2) # Provides the qplot() function

Then produce a quick plot using qplot()

ds <- weather

qplot(MinTemp, MaxTemp, data=ds)
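The same plot can also be written with the fuller ggplot() interface (a sketch, assuming ggplot2 is loaded as above):

ggplot(ds, aes(x=MinTemp, y=MaxTemp)) + geom_point()  # Equivalent scatterplot.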

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 40/174

Data Mining, Rattle, and R Simple Plots

Scatterplot—Plot

[Scatterplot: MaxTemp against MinTemp for the weather dataset.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 41/174

Data Mining, Rattle, and R Simple Plots

Scatterplot—RStudio

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 42/174

Data Mining, Rattle, and R Installing Packages

Missing Packages–Tools→Install Packages. . .

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 43/174

Data Mining, Rattle, and R Installing Packages

RStudio—Installing ggplot2

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 44/174

Data Mining, Rattle, and R RStudio Shortcuts

RStudio—Keyboard Shortcuts

These will become very useful!

Editor:

Ctrl-Enter will send the line of code to the R console
Ctrl-2 will move the cursor to the Console

Console:

UpArrow will cycle through previous commands
Ctrl-UpArrow will search previous commands
Tab will complete function names and list the arguments
Ctrl-1 will move the cursor to the Editor

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 45/174

Data Mining, Rattle, and R Basic R Commands

Basic R

library(rattle) # Load the weather dataset.

head(weather) # First 6 observations of the dataset.

## Date Location MinTemp MaxTemp Rainfall Evapora...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...

....

str(weather) # Structure of the variables in the dataset.

## 'data.frame': 366 obs. of 24 variables:

## $ Date : Date, format: "2007-11-01" "2007-11-...

## $ Location : Factor w/ 46 levels "Adelaide","Alba...

## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 46/174

Data Mining, Rattle, and R Basic R Commands

Basic R

summary(weather) # Univariate summary of the variables.

## Date Location MinTemp ...

## Min. :2007-11-01 Canberra :366 Min. :-5.30 ...

## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.30 ...

## Median :2008-05-01 Albany : 0 Median : 7.45 ...

## Mean :2008-05-01 Albury : 0 Mean : 7.27 ...

## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.50 ...

## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.90 ...

## (Other) : 0 ...

## Rainfall Evaporation Sunshine WindGust...

## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW : ...

## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW : ...

## Median : 0.00 Median : 4.20 Median : 8.60 E : ...

## Mean : 1.43 Mean : 4.52 Mean : 7.91 WNW : ...

## 3rd Qu.: 0.20 3rd Qu.: 6.40 3rd Qu.:10.50 ENE : ...

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 47/174

Data Mining, Rattle, and R Visualising Data

Visual Summaries—Add A Little Colour

qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)

[Scatterplot: Pressure3pm against Humidity3pm, coloured by RainTomorrow (No/Yes).]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 48/174

Data Mining, Rattle, and R Visualising Data

Visual Summaries—Careful with Categorics

qplot(WindGustDir, Pressure3pm, data=ds)

[Scatterplot: Pressure3pm by WindGustDir category, with points overplotted along each category.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 49/174

Data Mining, Rattle, and R Visualising Data

Visual Summaries—Add A Little Jitter

qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")

[Scatterplot: Pressure3pm by WindGustDir, with points jittered.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 50/174

Data Mining, Rattle, and R Visualising Data

Visual Summaries—And Some Colour

qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")

[Scatterplot: Pressure3pm by WindGustDir, jittered and coloured by WindGustDir.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 51/174

Data Mining, Rattle, and R Help

Getting Help—Precede Command with ?

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 52/174

Loading, Cleaning, Exploring Data in Rattle

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 53/174

Loading, Cleaning, Exploring Data in Rattle

Loading Data

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 54/174

Loading, Cleaning, Exploring Data in Rattle

Exploring Data

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 55/174

Loading, Cleaning, Exploring Data in Rattle

Test Data

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 56/174

Loading, Cleaning, Exploring Data in Rattle

Transform Data

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 57/174

Descriptive Data Mining

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 58/174

Descriptive Data Mining Cluster Analysis

What is Cluster Analysis?

Cluster: a collection of observations

Similar to one another within the same cluster
Dissimilar to the observations in other clusters

Cluster analysis: Grouping a set of data observations into classes

Clustering is unsupervised classification: no predefined classes—descriptive data mining.

Typical applications

As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 59/174

Descriptive Data Mining Cluster Analysis

What Is Good Clustering?

High Quality:

high intra-class similarity
low inter-class similarity

The Quality depends on:

similarity measure
algorithm for searching

Depends on the opinion of the user, and the algorithm's ability to discover hidden patterns that are of interest to the user.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 60/174

Descriptive Data Mining Cluster Analysis

Major Clustering Approaches

Partitioning algorithms (kmeans, pam, clara, fanny): Construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated. Start with an initial (perhaps random) clustering.

Hierarchical algorithms (hclust, agnes, diana): Create a hierarchical decomposition of the set of observations using some criterion

Density-based algorithms: based on connectivity and density functions

Grid-based algorithms: based on a multiple-level granularity structure

Model-based algorithms (mclust for a mixture of Gaussians): A model is hypothesized for each of the clusters and the idea is to find the best fit of that model
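A minimal sketch of the first two approaches in R, using a few numeric columns of the rattle weather dataset (the column choice and k=3 are illustrative only, not from the workshop):

library(rattle)                       # Provides the weather dataset.
num <- na.omit(weather[c("MinTemp", "MaxTemp", "Humidity3pm", "Pressure3pm")])
km  <- kmeans(scale(num), centers=3)  # Partitioning: three clusters on scaled data.
table(km$cluster)                     # Cluster sizes.
hc  <- hclust(dist(scale(num)))       # Hierarchical: agglomerative clustering.
plot(hc, labels=FALSE)                # Dendrogram of the hierarchy.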

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 61/174

Descriptive Data Mining Cluster Analysis

KMeans Clustering

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 62/174

Descriptive Data Mining Association Rules

Association Rule Mining

An unsupervised learning algorithm—descriptive data mining.

Identify items (patterns) that occur frequently together in a given set of data.

Patterns = associations, correlations, causal structures (Rules).

Data = sets of items in . . .

transactional database
relational database
complex information repositories

Rule: Body → Head [support, confidence]
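A minimal sketch with the arules package, using its bundled Groceries transactions (this dataset and these thresholds are illustrative, not the workshop's):

library(arules)                 # Association rule mining via apriori().
data(Groceries)                 # Example transactional dataset shipped with arules.
rules <- apriori(Groceries, parameter=list(supp=0.01, conf=0.5))
inspect(head(sort(rules, by="confidence"), 3))  # Body => Head [support, confidence].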

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 63/174

Descriptive Data Mining Association Rules

Typical Applications

Link analysis

Market basket analysis

Cross marketing

Customers who purchase . . .

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 64/174

Descriptive Data Mining Association Rules

Examples

Friday ∩ Nappies → Beer [0.5%, 60%]

Age ∈ [20, 30] ∩ Income ∈ [20K, 30K] → MP3Player [2%, 60%]

Maths ∩ CS → HDinCS [1%, 75%]

Gladiator ∩ Patriot → Sixth Sense [0.1%, 90%]

Statins ∩ Peritonitis → Chronic Renal Failure [0.1%, 32%]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 65/174

Descriptive Data Mining Health Insurance Commission

Health Insurance Commission

Associations on episode database for pathology services

6.8 million records × 120 attributes (3.5GB)
15 months preprocessing, then 2 weeks data mining

Goal: find associations between tests

cmin = 50% and smin = 1%, 0.5%, 0.25% (1% of 6.8 million = 68,000)
Unexpected/unnecessary combinations of services

Refuse cover saves $550,000 per year

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 66/174

Descriptive Data Mining Health Insurance Commission

Association Rules

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 67/174

Predictive Data Mining: Decision Trees

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 68/174

Predictive Data Mining: Decision Trees Basics

Predictive Modelling: Classification

Goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.

The model is to be used on unseen cases to make decisions.

Often referred to as supervised learning.

Common approaches: decision trees; neural networks; logistic regression; support vector machines.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 69/174

Predictive Data Mining: Decision Trees Basics

Language: Decision Trees

Knowledge representation: A flow-chart-like tree structure

Internal nodes denote a test on a variable

Each branch represents an outcome of the test

Leaf nodes represent class labels or class distributions

[Example tree: a test on Age (< 42 / > 42), a test on Gender (Male / Female), and leaves labelled Y or N.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 70/174

Predictive Data Mining: Decision Trees Basics

Tree Construction: Divide and Conquer

Decision tree induction is an example of a recursive partitioning algorithm: divide and conquer.

At start, all the training examples are at the root

Partition examples recursively based on selected variables

[Illustration: the training examples (+ and −) are split first on Age (< 42 / > 42) and then on Gender (Males / Females), each resulting region becoming purer.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 71/174

Predictive Data Mining: Decision Trees Algorithm

Algorithm for Decision Tree Induction

A greedy algorithm: takes the best immediate (local) decision while building the overall model

Tree constructed top-down, recursive, divide-and-conquer

Begin with all training examples at the root

Data is partitioned recursively based on selected variables

Select variables on basis of a measure

Stop partitioning when?

All samples for a given node belong to the same class
There are no remaining variables for further partitioning – majority voting is employed for classifying the leaf
There are no samples left

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 72/174

Predictive Data Mining: Decision Trees Algorithm

Basic Motivation: Entropy

We are trying to predict output Y (e.g., Yes/No) from input X .

A random data set may have high entropy:

Y is from a uniform distribution
a frequency distribution would be flat!
a sample will include uniformly random values of Y

A data set with low entropy:

Y's distribution will be very skewed
a frequency distribution will have a single peak
a sample will predominantly contain just Yes or just No

Work towards reducing the amount of entropy in the data!

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 73/174

Predictive Data Mining: Decision Trees Algorithm

Variable Selection Measure: Entropy

Information gain (ID3/C4.5)

Select the variable with the highest information gain

Assume there are two classes: P and N

Let the data S contain p elements of class P and n elements of class N

The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

I_E(p, n) = − (p / (p + n)) log2(p / (p + n)) − (n / (p + n)) log2(n / (p + n))
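A small R sketch of this measure (the guard on log2(0) is an implementation detail, so a pure node scores zero):

info <- function(p, n)
{
  x <- c(p, n) / (p + n)                 # Class proportions.
  -sum(ifelse(x == 0, 0, x * log2(x)))   # I_E(p, n) in bits.
}
info(9, 5)   # A 9/5 split has entropy of about 0.94.
info(14, 0)  # A pure node has zero entropy.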

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 74/174

Predictive Data Mining: Decision Trees Algorithm

Variable Selection Measure: Gini

Gini index of impurity – traditional statistical measure – CART

Measures how often a randomly chosen observation would be incorrectly classified if it were randomly classified in proportion to the actual classes.

Calculated as the sum of the probability of each observation being chosen times the probability of incorrect classification; equivalently, with p taken as the proportion of observations in class P:

I_G(p, n) = 1 − (p^2 + (1 − p)^2)

As with Entropy, the Gini measure is maximal when the classes are equally distributed and minimal when all observations are in one class or the other.
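The corresponding R sketch, in the same proportion-of-positives form:

gini <- function(p, n)
{
  prop <- p / (p + n)             # Proportion of observations in class P.
  1 - (prop^2 + (1 - prop)^2)     # I_G(p, n)
}
gini(9, 5)   # About 0.46; the maximum of 0.5 occurs at a 50/50 split.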

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 75/174

Predictive Data Mining: Decision Trees Algorithm

Variable Selection Measure

[Plot (Variable Selection Measure): the Info and Gini formulas as a function of the proportion of positives; both are maximal at 0.5 and zero at 0 and 1.]

Variable Importance Measure

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 76/174

Predictive Data Mining: Decision Trees Algorithm

Information Gain

Now use variable A to partition S into v cells: {S_1, S_2, . . . , S_v}. If S_i contains p_i examples of P and n_i examples of N, the information now needed to classify objects in all subtrees S_i is:

E(A) = Σ_{i=1..v} ((p_i + n_i) / (p + n)) I(p_i, n_i)

So, the information gained by branching on A is:

Gain(A) = I(p, n) − E(A)

So choose the variable A which results in the greatest gain in information.
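Using the info() function sketched earlier, the gain for one candidate split works out as follows (a toy illustration, not the workshop's data):

# Parent node: 9 positives and 5 negatives. Variable A splits it into
# two cells containing (6, 2) and (3, 3) observations respectively.
cells <- list(c(6, 2), c(3, 3))
EA    <- sum(sapply(cells, function(s) sum(s) / 14 * info(s[1], s[2])))
gain  <- info(9, 5) - EA
gain  # Around 0.05 bits of information gained by splitting on A.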

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 77/174

Predictive Data Mining: Decision Trees Building Decision Trees

Startup Rattle

library(rattle)

rattle()

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 78/174

Predictive Data Mining: Decision Trees Building Decision Trees

Load Example Weather Dataset

Click on the Execute button and an example dataset is offered.

Click on Yes to load the weather dataset.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 79/174

Predictive Data Mining: Decision Trees Building Decision Trees

Summary of the Weather Dataset

A summary of the weather dataset is displayed.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 80/174

Predictive Data Mining: Decision Trees Building Decision Trees

Model Tab — Decision Tree

Click on the Model tab to display the modelling options.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 81/174

Predictive Data Mining: Decision Trees Building Decision Trees

Build Tree to Predict RainTomorrow

Decision Tree is the default model type—simply click Execute.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 82/174

Predictive Data Mining: Decision Trees Building Decision Trees

Decision Tree Predicting RainTomorrow

Click the Draw button to display a tree (Settings → Advanced Graphics).

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 83/174

Predictive Data Mining: Decision Trees Building Decision Trees

Evaluate Decision Tree

Click Evaluate tab—options to evaluate model performance.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 84/174

Predictive Data Mining: Decision Trees Building Decision Trees

Evaluate Decision Tree—Error Matrix

Click Execute to display a simple error matrix.
Identify the True/False Positives/Negatives.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 85/174

Predictive Data Mining: Decision Trees Building Decision Trees

Decision Tree Risk Chart

Click the Risk type and then Execute.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 86/174

Predictive Data Mining: Decision Trees Building Decision Trees

Decision Tree ROC Curve

Click the ROC type and then Execute.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 87/174

Predictive Data Mining: Decision Trees Building Decision Trees

Score a Dataset

Click the Score type to score a new dataset using model.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 88/174

Predictive Data Mining: Decision Trees Building Decision Trees

Log of R Commands

Click the Log tab for a history of all your interactions.
Save the log contents as a script to repeat what we did.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 89/174

Predictive Data Mining: Decision Trees Building Decision Trees

Log of R Commands—rpart()

Here we see the call to rpart() to build the model.
Click on the Export button to save the script to file.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 90/174

Predictive Data Mining: Decision Trees Building Decision Trees

Help → Model → Tree

Rattle provides some basic help—click Yes for R help.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 91/174

Predictive Data Mining: Decision Trees Doing It In R

Weather Dataset - Inputs

ds <- weather

head(ds, 4)

## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine

## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3

## 2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7

## 3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3

## 4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1

....

summary(ds[c(3:5,23)])

## MinTemp MaxTemp Rainfall RISK_MM

## Min. :-5.30 Min. : 7.6 Min. : 0.00 Min. : 0.00

## 1st Qu.: 2.30 1st Qu.:15.0 1st Qu.: 0.00 1st Qu.: 0.00

## Median : 7.45 Median :19.6 Median : 0.00 Median : 0.00

## Mean : 7.27 Mean :20.6 Mean : 1.43 Mean : 1.43

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 92/174

Predictive Data Mining: Decision Trees Doing It In R

Weather Dataset - Target

target <- "RainTomorrow"

summary(ds[target])

## RainTomorrow

## No :300

## Yes: 66

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(vars <- names(ds)[-c(1, 2, 23)])

## [1] "MinTemp" "MaxTemp" "Rainfall" "Evaporation"

## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "WindDir9am"

## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am"

## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "Cloud9am"

## [17] "Cloud3pm" "Temp9am" "Temp3pm" "RainToday"

## [21] "RainTomorrow"

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 93/174

Predictive Data Mining: Decision Trees Doing It In R

Simple Train/Test Paradigm

set.seed(1421)

train <- c(sample(1:nrow(ds), 0.70*nrow(ds))) # Training dataset

head(train)

## [1] 288 298 363 107 70 232

length(train)

## [1] 256

test <- setdiff(1:nrow(ds), train) # Testing dataset

length(test)

## [1] 110

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 94/174

Predictive Data Mining: Decision Trees Doing It In R

Display the Model

model <- rpart(form, ds[train, vars])

model

## n= 256

##

## node), split, n, loss, yval, (yprob)

## * denotes terminal node

##

## 1) root 256 44 No (0.82812 0.17188)

## 2) Humidity3pm< 59.5 214 21 No (0.90187 0.09813)

## 4) WindGustSpeed< 64 204 14 No (0.93137 0.06863)

## 8) Cloud3pm< 6.5 163 5 No (0.96933 0.03067) *

## 9) Cloud3pm>=6.5 41 9 No (0.78049 0.21951)

## 18) Temp3pm< 26.1 34 4 No (0.88235 0.11765) *

## 19) Temp3pm>=26.1 7 2 Yes (0.28571 0.71429) *

....

Notice the legend to help interpret the tree.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 95/174

Predictive Data Mining: Decision Trees Doing It In R

Performance on Test Dataset

The predict() function is used to score new data.

head(predict(model, ds[test,], type="class"))

## 2 4 6 8 11 12

## No No No No No No

## Levels: No Yes

table(predict(model, ds[test,], type="class"), ds[test, target])

##

## No Yes

## No 77 14

## Yes 11 8
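The overall accuracy and error rate follow directly from this error matrix (a small sketch using the same objects as above):

cm <- table(predict(model, ds[test,], type="class"), ds[test, target])
sum(diag(cm)) / sum(cm)      # Accuracy: (77 + 8) / 110, about 0.77.
1 - sum(diag(cm)) / sum(cm)  # Overall error rate, about 0.23.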

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 96/174

Predictive Data Mining: Decision Trees Doing It In R

Example DTree Plot using Rattle

[Decision tree plot (Rattle): the root splits on Humidity3pm < 60, followed by splits on WindGustSpeed < 64, Cloud3pm < 6.5, Temp3pm < 26 and Pressure3pm >= 1015, with each node showing its majority class (No/Yes), class probabilities and percentage of the data.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 97/174

Predictive Data Mining: Decision Trees Doing It In R

An R Scripting Hint

Notice the use of variables ds, target, vars.

Change these variables, and the remaining script is unchanged.

Simplifies script writing and reuse of scripts.

ds <- iris

target <- "Species"

vars <- names(ds)

Then repeat the rest of the script, without change.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 98/174

Predictive Data Mining: Decision Trees Doing It In R

An R Scripting Hint — Unchanged Code

This code remains the same to build the decision tree.

form <- formula(paste(target, "~ ."))

train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))

test <- setdiff(1:nrow(ds), train)

model <- rpart(form, ds[train, vars])

model

## n= 105

##

## node), split, n, loss, yval, (yprob)

## * denotes terminal node

##

## 1) root 105 69 setosa (0.34286 0.32381 0.33333)

## 2) Petal.Length< 2.6 36 0 setosa (1.00000 0.00000 0.00000) *

## 3) Petal.Length>=2.6 69 34 virginica (0.00000 0.49275 0.50725)

## 6) Petal.Length< 4.95 35 2 versicolor (0.00000 0.94286 0.05714) *

## 7) Petal.Length>=4.95 34 1 virginica (0.00000 0.02941 0.97059) *

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 99/174

Predictive Data Mining: Decision Trees Doing It In R

An R Scripting Hint — Unchanged Code

Similarly for the predictions.

head(predict(model, ds[test,], type="class"))

## 3 8 9 10 11 12

## setosa setosa setosa setosa setosa setosa

## Levels: setosa versicolor virginica

table(predict(model, ds[test,], type="class"), ds[test, target])

##

## setosa versicolor virginica

## setosa 14 0 0

## versicolor 0 15 4

## virginica 0 1 11

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 100/174

Predictive Data Mining: Decision Trees Summary

Summary

Most widely deployed machine learning algorithm.

Simple idea, powerful learner.

Available in R through the rpart package.

Related packages include party, Cubist, C50, RWeka (J48).

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 101/174

Predictive Data Mining: Ensembles

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 102/174

Predictive Data Mining: Ensembles

Building Multiple Models

General idea developed in the Multiple Inductive Learning algorithm (Williams 1987).

Ideas were developed (ACJ 1987, PhD 1990) in the context of:

observe that variable selection methods don't discriminate;
so build multiple decision trees;
then combine into a single model.

Basic idea is that multiple models, like multiple experts, may produce better results when working together rather than in isolation

Two approaches covered: Boosting and Random Forests.

Meta learners.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 103/174

Predictive Data Mining: Ensembles Boosting

Boosting Algorithms

Basic idea: boost observations that are “hard to model.”
Algorithm: iteratively build weak models using a poor learner:

Build an initial model;

Identify mis-classified cases in the training dataset;

Boost (over-represent) training observations modelled incorrectly;

Build a new model on the boosted training dataset;

Repeat.

The result is an ensemble of weighted models.

“Best off-the-shelf model builder.” (Leo Breiman)

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 104/174

Predictive Data Mining: Ensembles Boosting

Example: Ada on Weather Data

head(weather[c(1:5, 23, 24)], 3)

## Date Location MinTemp MaxTemp Rainfall RISK_MM...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.6...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 3.6...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 39.8...

....

set.seed(42)

train <- sample(1:nrow(weather), 0.7 * nrow(weather))

(m <- ada(RainTomorrow ~ ., weather[train, -c(1:2, 23)]))

## Call:

## ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)])

##

## Loss: exponential Method: discrete Iteration: 50

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 105/174

Predictive Data Mining: Ensembles Boosting

Example: Error Rate

Notice error rate decreases quickly then flattens.

plot(m)

[Plot: training error over boosting iterations 1 to 50.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 106/174

Predictive Data Mining: Ensembles Boosting

Example: Variable Importance

Helps understand the knowledge captured.

varplot(m)

[Variable importance plot (varplot): a score for each input variable, ranging from roughly 0.04 to 0.10.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 107/174

Predictive Data Mining: Ensembles Boosting

Example: Sample Trees

There are 50 trees in all. Here are the first three.

fancyRpartPlot(m$model$trees[[1]])

fancyRpartPlot(m$model$trees[[2]])

fancyRpartPlot(m$model$trees[[3]])

[Plots of the first three trees: tree 1 splits on Cloud3pm < 7.5, Pressure3pm >= 1012 and Humidity3pm < 42; tree 2 on Pressure3pm >= 1012 and Sunshine >= 11; tree 3 on Pressure3pm >= 1012 and MaxTemp >= 27.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 108/174

Predictive Data Mining: Ensembles Boosting

Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]

actual <- weather[-train,]$RainTomorrow

risks <- weather[-train,]$RISK_MM

riskchart(predicted, actual, risks)

[Risk chart for the boosted model: Performance (%) against Caseload (%), with Recall (88%), Risk (94%), Precision and Lift curves.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 109/174

Predictive Data Mining: Ensembles Boosting

Example Applications

ATO Application: What life events affect compliance?

First application of the technology — 1995
Decision Stumps: Age > NN; Change in Marital Status

Boosted Neural Networks

OCR using neural networks as base learners
Drucker, Schapire, Simard, 1993

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 110/174

Predictive Data Mining: Ensembles Boosting

Summary

1 Boosting is implemented in R in the ada library

2 AdaBoost uses the loss e^(−m); LogitBoost uses log(1 + e^(−m)); Doom II uses 1 − tanh(m), where m is the margin

3 AdaBoost tends to be sensitive to noise (addressed by BrownBoost)

4 AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve.

5 Can be proved to converge to a perfect model if the learners are always better than chance.
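The three loss functions in point 2 are easy to compare visually (a sketch in base R graphics):

m <- seq(-2, 2, 0.01)                    # Margin of an observation.
plot(m, exp(-m), type="l", ylab="Loss")  # AdaBoost: exponential loss.
lines(m, log(1 + exp(-m)), lty=2)        # LogitBoost: logistic loss.
lines(m, 1 - tanh(m), lty=3)             # Doom II.
legend("topright", c("AdaBoost", "LogitBoost", "Doom II"), lty=1:3)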

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 111/174

Predictive Data Mining: Ensembles Random Forests

Random Forests

Original idea from Leo Breiman and Adele Cutler.

The name is licensed to Salford Systems!

Hence, the R package is randomForest.

Typically presented in the context of decision trees.

Random Multinomial Logit uses multiple multinomial logit models.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 112/174

Predictive Data Mining: Ensembles Random Forests

Random Forests

Build many decision trees (e.g., 500).

For each tree:

Select a random subset of the training set (N);
Choose different subsets of variables for each node of the decision tree (m << M);
Build the tree without pruning (i.e., overfit)

Classify a new entity using every decision tree:

Each tree “votes” for the entity.
The decision with the largest number of votes wins!
The proportion of votes is the resulting score.
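These choices map directly onto arguments of randomForest() (a sketch; the values shown are illustrative rather than the workshop's settings):

library(randomForest)
m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
                  ntree=500,       # Number of trees to build.
                  mtry=4,          # Variables tried at each split (m << M).
                  sampsize=200,    # Observations sampled for each tree (N).
                  replace=TRUE,    # Bootstrap sampling of the training set.
                  na.action=na.roughfix)
predict(m, weather[-train,], type="prob")[,2]  # Proportion of votes as the score.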

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 113/174

Predictive Data Mining: Ensembles Random Forests

Example: RF on Weather Data

set.seed(42)

(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],

na.action=na.roughfix,

importance=TRUE))

##

## Call:

## randomForest(formula=RainTomorrow ~ ., data=weath...

## Type of random forest: classification

## Number of trees: 500

## No. of variables tried at each split: 4

##

## OOB estimate of error rate: 13.67%

## Confusion matrix:

## No Yes class.error

## No 211 4 0.0186

## Yes 31 10 0.7561

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 114/174

Predictive Data Mining: Ensembles Random Forests

Example: Error Rate

Error rate decreases quickly then flattens over the 500 trees.

plot(m)

[Plot: error rate against the number of trees (1 to 500).]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 115/174

Predictive Data Mining: Ensembles Random Forests

Example: Variable Importance

Helps understand the knowledge captured.

varImpPlot(m, main="Variable Importance")

[Variable importance plot: MeanDecreaseAccuracy and MeanDecreaseGini panels, with Sunshine, Cloud3pm and Pressure3pm among the most important variables on both measures.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 116/174

Predictive Data Mining: Ensembles Random Forests

Example: Sample Trees

There are 500 trees in all. Here are some rules from the first tree.

## Random Forest Model 1

##

## ------------------------------------------------------...

## Tree 1 Rule 1 Node 30 Decision No

##

## 1: Evaporation <= 9

## 2: Humidity3pm <= 71

## 3: Cloud3pm <= 2.5

## 4: WindDir9am IN ("NNE")

## 5: Sunshine <= 10.25

## 6: Temp3pm <= 17.55

## ------------------------------------------------------...

## Tree 1 Rule 2 Node 31 Decision Yes

##

## 1: Evaporation <= 9

## 2: Humidity3pm <= 71

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 117/174

Predictive Data Mining: Ensembles Random Forests

Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]

actual <- weather[-train,]$RainTomorrow

risks <- weather[-train,]$RISK_MM

riskchart(predicted, actual, risks)

[Risk chart for the random forest: Performance (%) against Caseload (%), with Recall (92%), Risk (97%), Precision and Lift curves.]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 118/174

Predictive Data Mining: Ensembles Random Forests

Features of Random Forests: By Breiman

Most accurate of current algorithms.

Runs efficiently on large data sets.

Can handle thousands of input variables.

Gives estimates of variable importance.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 119/174

Predictive Data Mining: Ensembles Random Forests

Random Forests

Ensemble: Multiple models working together

Often better than a single model

Variance and bias of the model are reduced

The best available models today - accurate and robust

In daily use in very many areas of application

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 120/174

Moving into R and Scripting our Analyses

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 121/174

Moving into R and Scripting our Analyses

Data Scientists are Programmers of Data

But. . .

Data scientists are programmers of data

A GUI can only do so much

R is a powerful statistical language

Data Scientists Desire. . .

Scripting

Transparency

Repeatability

Sharing

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 122/174

Moving into R and Scripting our Analyses

From GUI to CLI — Rattle’s Log Tab

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 123/174

Moving into R and Scripting our Analyses

From GUI to CLI — Rattle’s Log Tab

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 124/174

Moving into R and Scripting our Analyses

Step 1: Load the Dataset

dsname <- "weather"

ds <- get(dsname)

dim(ds)

## [1] 366 24

names(ds)

## [1] "Date" "Location" "MinTemp" "...

## [5] "Rainfall" "Evaporation" "Sunshine" "...

## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "...

## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "...

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 125/174

Moving into R and Scripting our Analyses

Step 2: Observe the Data — Observations

head(ds)

## Date Location MinTemp MaxTemp Rainfall Evapora...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...

....

tail(ds)

## Date Location MinTemp MaxTemp Rainfall Evapo...

## 361 2008-10-26 Canberra 7.9 26.1 0 ...

## 362 2008-10-27 Canberra 9.0 30.7 0 ...

## 363 2008-10-28 Canberra 7.1 28.4 0 ...

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 126/174

Moving into R and Scripting our Analyses

Step 2: Observe the Data — Structure

str(ds)

## 'data.frame': 366 obs. of 24 variables:

## $ Date : Date, format: "2007-11-01" "2007-11-...

## $ Location : Factor w/ 46 levels "Adelaide","Alba...

## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...

## $ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 1...

## $ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16...

## $ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6...

## $ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4....

## $ WindGustDir : Ord.factor w/ 16 levels "N"<"NNE"<"N...

## $ WindGustSpeed: num 30 39 85 54 50 44 43 41 48 31 ...

## $ WindDir9am : Ord.factor w/ 16 levels "N"<"NNE"<"N...

## $ WindDir3pm : Ord.factor w/ 16 levels "N"<"NNE"<"N...

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 127/174

Moving into R and Scripting our Analyses

Step 2: Observe the Data — Summary

summary(ds)

## Date Location MinTemp ...

## Min. :2007-11-01 Canberra :366 Min. :-5.3...

## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.3...

## Median :2008-05-01 Albany : 0 Median : 7.4...

## Mean :2008-05-01 Albury : 0 Mean : 7.2...

## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.5...

## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.9...

## (Other) : 0 ...

## Rainfall Evaporation Sunshine Wind...

## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW ...

## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW ...

## Median : 0.00 Median : 4.20 Median : 8.60 E ...

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 128/174

Moving into R and Scripting our Analyses

Step 2: Observe the Data — Variables

id <- c("Date", "Location")

target <- "RainTomorrow"

risk <- "RISK_MM"

(ignore <- union(id, risk))

## [1] "Date" "Location" "RISK_MM"

(vars <- setdiff(names(ds), ignore))

## [1] "MinTemp" "MaxTemp" "Rainfall" "...

## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "...

## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "...

## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "...

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 129/174

Moving into R and Scripting our Analyses

Step 3: Clean the Data — Remove Missing

dim(ds)

## [1] 366 24

sum(is.na(ds[vars]))

## [1] 47

ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]
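An equivalent and perhaps clearer idiom for the same step uses complete.cases() (a sketch):

ds <- ds[complete.cases(ds[vars]), ]  # Keep only the rows with no missing values in vars.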

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 130/174

Moving into R and Scripting our Analyses

Step 3: Clean the Data — Remove Missing

dim(ds)

## [1] 328 24

sum(is.na(ds[vars]))

## [1] 0

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 131/174

Moving into R and Scripting our Analyses

Step 3: Clean the Data—Target as Categoric

summary(ds[target])

## RainTomorrow

## Min. :0.000

## 1st Qu.:0.000

## Median :0.000

## Mean :0.183

## 3rd Qu.:0.000

## Max. :1.000

....

ds[target] <- as.factor(ds[[target]])

levels(ds[[target]]) <- c("No", "Yes")

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 132/174

Moving into R and Scripting our Analyses

Step 3: Clean the Data—Target as Categoric

summary(ds[target])

## RainTomorrow

## 0:268

## 1: 60

[Bar chart: counts of RainTomorrow (0: 268, 1: 60).]

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 133/174

Moving into R and Scripting our Analyses

Step 4: Prepare for Modelling

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(nobs <- nrow(ds))

## [1] 328

train <- sample(nobs, 0.70*nobs)

length(train)

## [1] 229

test <- setdiff(1:nobs, train)

length(test)

## [1] 99

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 134/174

Moving into R and Scripting our Analyses

Step 5: Build the Model—Random Forest

library(randomForest)

model <- randomForest(form, ds[train, vars], na.action=na.omit)

model

##

## Call:

## randomForest(formula=form, data=ds[train, vars], ...

## Type of random forest: classification

## Number of trees: 500

## No. of variables tried at each split: 4

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 135/174

Moving into R and Scripting our Analyses

Step 6: Evaluate the Model—Risk Chart

pr <- predict(model, ds[test,], type="prob")[,2]

riskchart(pr, ds[test, target], ds[test, risk],

title="Random Forest - Risk Chart",

risk=risk, recall=target, thresholds=c(0.35, 0.15))

[Random Forest Risk Chart: Performance (%) against Caseload (%), showing the Lift, Recall (RainTomorrow, 99%), Risk (RISK_MM, 100%) and Precision curves, with risk score thresholds marked.]
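As a complementary check, not shown in the slides, a simple confusion matrix on the test data can be tabulated (assuming the model and data objects above):

cl <- predict(model, ds[test, vars])            # predicted classes on the test set
table(actual=ds[[target]][test], predicted=cl)  # actual versus predicted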

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 136/174

Literate Data Mining in R

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 137/174

Literate Data Mining in R Motivation

Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

“Remember that research you did last year? I’ve heard there is an update on the data that you used. Can you add the new data in and repeat the same analysis?”

“Jo Bloggs did a great analysis of the company returns data just before she left. Can you get someone else to analyse the new data set using the same methods, and so produce an updated report that we can present to the Exec next week?”

“The fraud case you provided an analysis of last year has finally reached the courts. We need to ensure we have a clear trail of the data sources, the analyses performed, and the results obtained, to stand up in court. Could you document these please.”

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 138/174

Literate Data Mining in R Motivation

Literate Data Mining Overview

One document to intermix the analysis, code, and results

Authors productive with narrative and code in one document

Sweave (Leisch 2002) and now KnitR (Xie 2011)

Embed R code into LaTeX documents for typesetting

KnitR also supports publishing to the web

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 139/174

Literate Data Mining in R Motivation

Why Reproducible Data Mining?

Automatically regenerate documents when code, data, or assumptions change.

Eliminate errors that occur when transcribing results into documents.

Record the context for the analysis and the decisions made about the type of analysis to perform, all in the one place.

Document the processes to provide integrity for the conclusions of the analysis.

Share the approach with others for peer review and for learning from each other, engendering a continuous learning environment.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 140/174

Literate Data Mining in R Motivation

Prime Objective: Trustworthy Software

Those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis, and by extension the software that produced it. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive.

John Chambers (2008), Software for Data Analysis: Programming with R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 141/174

Literate Data Mining in R Motivation

Beautiful Output by KnitR

KnitR combined with LaTeX will

Intermix analysis and results of analysis

Automatically generate graphics and tables

Support reproducible and transparent analysis

Produce the best looking reports.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 142/174

Literate Data Mining in R Motivation

Beautiful Output by Default

The reader wants to read the document, and to do so easily!

Code highlighting is done automatically

Default theme is carefully designed

Many other themes are available

R Code is “properly” reformatted

Analyses (Graphs and Tables) automatically included.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 143/174

Literate Data Mining in R Using RStudio

Supporting Technology

A suite of Free and Open Source Software — FLOSS

RStudio — Creating, managing, compiling documents

LaTeX — Markup language for typesetting a document

R — Statistical analysis language

KnitR — Integrator of typesetting and analysis

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 144/174

Literate Data Mining in R Using RStudio

Using RStudio

Simplified interaction with R, LaTeX, and KnitR

Executes R code one line at a time

Formats LaTeX documents and provides spell checking

A single click compile to PDF and synchronised views

Demonstrate: Start up and explore RStudio.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 145/174

Literate Data Mining in R Using RStudio

Introducing LaTeX

A text markup language rather than a WYSIWYG editor.

Based on TeX from 1977 — very stable and powerful.

LaTeX is an easier-to-use macro package built on TeX.

Ensures consistent style (layout, fonts, tables, maths, etc.)

Automatic indexes, footnotes and references.

Documents are well structured and are clear text.

Has a learning curve.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 146/174

Literate Data Mining in R Using RStudio

Basic LaTeX Usage

\documentclass{article}

\begin{document}

\end{document}

Demonstrate: Create a new Sweave document in RStudio

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 147/174

Literate Data Mining in R Using RStudio

Structures

\documentclass{article}

\begin{document}

\section{Introduction}

...

\subsection{Concepts}

...

\end{document}

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 148/174

Literate Data Mining in R Using RStudio

Formats

\documentclass{article}

\begin{document}

\begin{itemize}

\item ABC

\item DEF

\end{itemize}

This is \textbf{bold} text or \textit{italic} text, ...

\end{document}

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 149/174

Literate Data Mining in R Using RStudio

RStudio Support for LaTeX

RStudio provides excellent support for working with LaTeX documents

Helps to avoid having to know too much about LaTeX

Best illustrated through a demonstration

Format menu

Section commands
Font commands
List commands
Verbatim/Block commands

Spell Checker

Compile PDF

Demonstrate: Start a new document, add contents, format to PDF.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 150/174

Literate Data Mining in R Incorporating R Code

Incorporating R Code

We insert R code in a Chunk starting with << >>=

We terminate the Chunk with @

Save the LaTeX file with extension .Rnw

This Chunk:
<<simple_example>>=

x <- sum(1:10)

x

@

Produces

x <- sum(1:10)

x

## [1] 55

Demonstrate: Do this in RStudio
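Putting the pieces together, a complete minimal .Rnw file might look as follows (a sketch combining the earlier LaTeX skeleton with the chunk above):

\documentclass{article}
\begin{document}

\section{Introduction}

The sum of the first ten integers is computed below.

<<simple_example>>=
x <- sum(1:10)
x
@

\end{document}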

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 151/174

Literate Data Mining in R Incorporating R Code

Making You Look Good

<<format_example>>=

for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}

@

for (i in 1:5) {
    j <- cos(sin(i) * i^2) + 3
    print(j - 5)
}

## [1] -1.334

## [1] -2.88

## [1] -1.704

## [1] -1.103

....

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 152/174

Literate Data Mining in R Incorporating R Code

R Within the Text

Include information about data within the narrative.

We can do that with \Sexpr{...}.

Our dataset has \Sexpr{nrow(ds)} observations of \Sexpr{ncol(ds)} variables.

Becomes

Our dataset has 328 observations of 24 variables.

Better Still: \Sexpr{format(nrow(ds), big.mark=",")}

Our dataset has 328 observations of 24 variables.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 153/174

Literate Data Mining in R Incorporating R Code

A Simple Table

library(xtable)

obs <- sample(1:nrow(weatherAUS), 8)

vars <- 2:6

xtable(weatherAUS[obs, vars])

        Location      MinTemp  MaxTemp  Rainfall  Evaporation
79696   AliceSprings     5.90    21.70      0.00         4.00
32428   Ballarat         2.00    10.60      0.20
58707   MountGambier     8.50    19.70      2.20         1.60
22420   WaggaWagga       1.30    17.40      0.00         3.60
50382   Cairns          24.00    31.80      0.00         4.60
1441    Albury          11.20    32.20      0.00
78992   AliceSprings    13.80    30.90      0.00         5.80
66576   PearceRAAF       5.60    22.90      0.00
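Within a KnitR document we would normally place such code in a chunk with the results="asis" option, so that the LaTeX emitted by xtable() passes straight through to the typesetter. A sketch (the chunk label weather_table is our own):

<<weather_table, results="asis", echo=FALSE>>=
library(xtable)
print(xtable(weatherAUS[obs, vars]), include.rownames=FALSE)
@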

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 154/174

Literate Data Mining in R Incorporating R Code

Table: Exclude Row Names

print(xtable(weatherAUS[obs, vars]),

include.rownames=FALSE)

Location      MinTemp  MaxTemp  Rainfall  Evaporation
AliceSprings     5.90    21.70      0.00         4.00
Ballarat         2.00    10.60      0.20
MountGambier     8.50    19.70      2.20         1.60
WaggaWagga       1.30    17.40      0.00         3.60
Cairns          24.00    31.80      0.00         4.60
Albury          11.20    32.20      0.00
AliceSprings    13.80    30.90      0.00         5.80
PearceRAAF       5.60    22.90      0.00

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 155/174

Literate Data Mining in R Incorporating R Code

Table: Limit Number of Digits

print(xtable(weatherAUS[obs, vars],

digits=1),

include.rownames=FALSE)

Location      MinTemp  MaxTemp  Rainfall  Evaporation
AliceSprings      5.9     21.7       0.0          4.0
Ballarat          2.0     10.6       0.2
MountGambier      8.5     19.7       2.2          1.6
WaggaWagga        1.3     17.4       0.0          3.6
Cairns           24.0     31.8       0.0          4.6
Albury           11.2     32.2       0.0
AliceSprings     13.8     30.9       0.0          5.8
PearceRAAF        5.6     22.9       0.0

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 156/174

Literate Data Mining in R Incorporating R Code

Table: Tiny Font

vars <- 2:8

print(xtable(weatherAUS[obs, vars],

digits=0),

size="tiny",

include.rownames=FALSE)

Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir
AliceSprings        6       22         0            4        11           SE
Ballarat            2       11         0                                   W
MountGambier        8       20         2            2         3           SW
WaggaWagga          1       17         0            4        11          SSW
Cairns             24       32         0            5         9           NE
Albury             11       32         0                                 NNE
AliceSprings       14       31         0            6        11            N
PearceRAAF          6       23         0                     11            S

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 157/174

Literate Data Mining in R Incorporating R Code

Table: Column Alignment

vars <- 2:8

print(xtable(weatherAUS[obs, vars],

digits=0,

align="rlrrrrrr"),

size="tiny")

        Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir
79696   AliceSprings        6       22         0            4        11           SE
32428   Ballarat            2       11         0                                   W
58707   MountGambier        8       20         2            2         3           SW
22420   WaggaWagga          1       17         0            4        11          SSW
50382   Cairns             24       32         0            5         9           NE
1441    Albury             11       32         0                                 NNE
78992   AliceSprings       14       31         0            6        11            N
66576   PearceRAAF          6       23         0                     11            S

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 158/174

Literate Data Mining in R Incorporating R Code

Table: Caption

print(xtable(weatherAUS[obs, vars],

digits=1,

caption="This is the table caption."),

size="tiny")

        Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir
79696   AliceSprings      5.9     21.7       0.0          4.0      10.8           SE
32428   Ballarat          2.0     10.6       0.2                                   W
58707   MountGambier      8.5     19.7       2.2          1.6       2.8           SW
22420   WaggaWagga        1.3     17.4       0.0          3.6      10.8          SSW
50382   Cairns           24.0     31.8       0.0          4.6       8.8           NE
1441    Albury           11.2     32.2       0.0                                 NNE
78992   AliceSprings     13.8     30.9       0.0          5.8      11.2            N
66576   PearceRAAF        5.6     22.9       0.0                   11.0            S

Table: This is the table caption.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 159/174

Literate Data Mining in R Incorporating R Code

Plots

library(ggplot2)

cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")

ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))

g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))

g <- g + geom_density(alpha = 0.55)

print(g)

[Density plot of Temp3pm by Location: Canberra, Darwin, Melbourne, Sydney.]
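To keep a standalone copy of the figure, ggsave() could be used. A sketch (the filename and dimensions are arbitrary):

ggsave("temp3pm_density.pdf", plot=g, width=8, height=5)  # save the density plot as a PDF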

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 160/174

Literate Data Mining in R Our First KnitR Document

Create a KnitR Document: New→R Sweave

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 161/174

Literate Data Mining in R Our First KnitR Document

Setup KnitR

We wish to use KnitR rather than the older Sweave processor

In RStudio we can configure the options to use knitr:

Select Tools→Options

Choose the Sweave group

Choose knitr for Weave Rnw files using:

The remaining defaults should be okay

Click Apply and then OK

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 162/174

Literate Data Mining in R Our First KnitR Document

Simple KnitR Document

Insert the following into your new KnitR document:

\title{Sample KnitR Document}

\author{Graham Williams}

\maketitle

\section*{My First Section}

This is some text that is automatically typeset

by the LaTeX processor to produce well formatted

quality output as PDF.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 163/174

Literate Data Mining in R Our First KnitR Document

Simple KnitR Document

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 164/174

Literate Data Mining in R Our First KnitR Document

Simple KnitR Document—Resulting PDF

Result of Compile PDF

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 165/174

Literate Data Mining in R Including R Commands in KnitR

KnitR: Add R Commands

R code can be used to generate results into the document:

<<echo=FALSE, message=FALSE>>=

library(rattle) # Provides the weather dataset

library(ggplot2) # Provides the qplot() function

ds <- weather

qplot(MinTemp, MaxTemp, data=ds)

@

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 166/174

Literate Data Mining in R Including R Commands in KnitR

KnitR Document With R Code

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 167/174

Literate Data Mining in R Including R Commands in KnitR

Simple KnitR Document—PDF with Plot

Result of Compile PDF

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 168/174

Literate Data Mining in R Basics Cheat Sheet

LaTeX Basics

\subsection*{...} % Introduce a Sub Section

\subsubsection*{...} % Introduce a Sub Sub Section

\textbf{...} % Bold font

\textit{...} % Italic font

\begin{itemize} % A bullet list

\item ...

\item ...

\end{itemize}

Plus an extensive collection of other markup and capabilities.
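For example, a short fragment combining these commands might read (a sketch with illustrative content):

\subsection*{Findings}

The model performs \textbf{well} on \textit{unseen} data:

\begin{itemize}
  \item Most days are classified correctly.
  \item Errors are spread across locations.
\end{itemize}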

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 169/174

Literate Data Mining in R Basics Cheat Sheet

KnitR Basics

echo=FALSE # Do not display the R code

eval=TRUE # Evaluate the R code

results="hide" # Hide the results of the R commands

fig.width=10 # Extend figure width from 7 to 10 inches

fig.height=8 # Extend figure height from 7 to 8 inches

out.width="0.8\\textwidth" # Fit figure 80% page width

out.height="0.5\\textheight" # Fit figure 50% page height

Plus an extensive collection of other options.
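These options are combined in the chunk header. For example (a sketch; the chunk label myplot is our own, and the plot assumes the weather dataset loaded as in the earlier chunk):

<<myplot, echo=FALSE, fig.width=10, fig.height=8, out.width="0.8\\textwidth">>=
qplot(MinTemp, MaxTemp, data=ds)
@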

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 170/174

Literate Data Mining in R Basics Cheat Sheet

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 171/174

Literate Data Mining in R Resources

Resources: One Page R

A new series of guides to using R.

http://onepager.togaware.com

Data Preparation template: http://onepager.togaware.com/DataO.pdf

Model Building template: http://onepager.togaware.com/ModelO.pdf

Plots using GGPlot2: http://onepager.togaware.com/GGPlot2O.pdf

Text Mining in R: http://onepager.togaware.com/TextMiningO.pdf

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 172/174

Literate Data Mining in R Resources

Resources and References

OnePageR: http://onepager.togaware.com – Tutorial Notes

Rattle: http://rattle.togaware.com

Guides: http://datamining.togaware.com

Practise: http://analystfirst.com

Book: Data Mining using Rattle/R

Chapter: Rattle and Other Tales

Paper: A Data Mining GUI for R — R Journal, Volume 1(2)

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 173/174

That Is All Folks: Time For Questions

Thank You

This document, sourced from CUNYWorkshop.Rnw revision 315, was processed by KnitR version 1.5 of 2013-09-28 and took 11.4 seconds to process. It was generated by gjw on theano running Ubuntu 13.10 with Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz having 4 cores and 3.9GB of RAM. It completed the processing 2014-03-01 07:25:28.

http: // togaware. com Copyright© 2014, Graham.Williams@togaware.com 174/174