Why do Data Science with R? · 2014. 4. 26. · Data Mining, Rattle, and R Definition Data Mining...

A Data Mining Workshop

Excavating Knowledge from DataIntroducing Data Mining using R

[email protected]

Data ScientistAustralian Taxation Office

Adjunct Professor, Australian National UniversityAdjunct Professor, University of Canberra

Fellow, Institute of Analytics Professionals of Australia

[email protected]

http://datamining.togaware.com

Visit: http://onepager.togaware.com for Workshop Noteshttp: // togaware. com Copyright© 2014, [email protected] 1/174

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

http: // togaware. com Copyright© 2014, [email protected] 2/174

R: A Language for Data Mining

Workshop Overview










R: A Language for Data Mining Installing R

Installing R and Rattle

First task is to install R

As free/libre open source software (FLOSS or FOSS), R andRattle are available to all, with no limitations on our freedom touse and share the software, except to share and share alike.

Visit CRAN at http://cran.rstudio.com

Visit Rattle at http://rattle.togaware.com

Linux: Install packages (Ubuntu is recommended)$ wajig install r-recommended r-cran-rattle

Windows: Download and install from CRAN

MacOSX: Download and install from CRAN


R: A Language for Data Mining Data Science with R?

Why do Data Science with R?

Most widely used Data Mining and Machine Learning Package

Machine LearningStatisticsSoftware Engineering and Programming with DataBut not the nicest of languages for a Computer Scientist!

Free (Libre) Open Source Statistical Software

. . . all modern statistical approaches

. . . many/most machine learning algorithms

. . . opportunity to readily add new algorithms

That is important for us in the research communityGet our algorithms out there and being used—impact!!!
































R: A Language for Data Mining Popularity of R?

How Popular is R? Discussion List Traffic

Monthly email traffic on software’s main discussion list.

Source: http://r4stats.com/articles/popularity/http: // togaware. com Copyright© 2014, [email protected] 6/174


How Popular is R? Discussion Topics

Number of discussions on popular QandA forums 2013.

Source: http://r4stats.com/articles/popularity/



How Popular is R? R versus SAS

Number of R/SAS related posts to Stack Overflow by week.




How Popular is R? Professional Forums

Registered for the main discussion group for each software.




How Popular is R? Used in Analytics

Competitions

Software used in data analysis competitions in 2011.




How Popular is R? User Survey

Rexer Analytics Survey 2010 results for data mining/analytic tools.




What is R?

R — The Video

A 90 Second Promo from Revolution Analytics

http://www.revolutionanalytics.com/what-is-open-source-r/


Data Mining, Rattle, and R

Workshop Overview










Data Mining, Rattle, and R Definition

Data Mining

A data driven analysis to uncover otherwise unknown but usefulpatterns in large datasets, to discover new knowledge and to developpredictive models, turning data and information into knowledge and(one day perhaps) wisdom, in a timely manner.



Data Mining

Application of

Machine LearningStatisticsSoftware Engineering and Programming with DataEffective Communications and Intuition

. . . to Datasets that vary byVolume, Velocity, Variety, Value, Veracity

. . . to discover new knowledge

. . . to improve business outcomes

. . . to deliver better tailored services



Data Mining in Research

Health ResearchAdverse reactions using linked Pharmaceutical, GeneralPractitioner, Hospital, Pathology datasets.

AstronomyMicrolensing events in the Large Magellanic Cloud of severalmillion observed stars (out of 10 billion).

PsychologyInvestigation of age-of-onset for Alzheimer’s disease from 75variables for 800 people.

Social SciencesSurvey evaluation. Social network analysis - identifying keyinfluencers.



Data Mining in Government

Australian Taxation OfficeLodgment ($110M)Tax Havens ($150M)Tax Fraud ($250M)

Immigration and Border ControlCheck passengers before boarding

Health and Human ServicesDoctor shoppersOver servicing



The Business of Data Mining

SAS has annual revenues of $3B (2013)

IBM bought SPSS for $1.2B (2009)

Analytics is >$100B business and >$320B by 2020

Amazon, eBay/PayPal, Google, Facebook, LinkedIn, . . .

Shortage of 180,000 data scientists in US in 2018 (McKinsey) . . .


Data Mining, Rattle, and R Algorithms

Basic Tools: Data Mining Algorithms

Cluster Analysis (kmeans, wskm)

Association Analysis (arules)

Linear Discriminant Analysis (lda)

Logistic Regression (glm)

Decision Trees (rpart, wsrpart)

Random Forests (randomForest, wsrf)

Boosted Stumps (ada)

Neural Networks (nnet)

Support Vector Machines (kernlab)

. . .

That’s a lot of tools to learn in R!Many with different interfaces and options.


Data Mining, Rattle, and R A GUI for Data Mining

Why a GUI?

Statistics can be complex and traps await

So many tools in R to deliver insights

Effective analyses should be scripted

Scripting also required for repeatability

R is a language for programming with data

How to remember how to do all of this in R?How to skill up 150 data analysts with Data Mining?


Data Mining, Rattle, and R A GUI for Data Mining

Users of Rattle

Today, Rattle is used world wide in many industries

Health analytics

Customer segmentation and marketing

Fraud detection

Government

It is used by

Universities to teach Data Mining

Within research projects for basic analyses

Consultants and Analytics Teams across business

It is and will remain freely available.

CRAN and http://rattle.togaware.com


Data Mining, Rattle, and R Setting Things Up

Installation

Rattle is built using R

Need to download and install R from cran.r-project.org

Recommend also install RStudio from www.rstudio.org

Then start up RStudio and install Rattle:

install.packages("rattle")

Then we can start up Rattle:

rattle()

Required packages are loaded as needed.


Data Mining, Rattle, and R Tour

A Tour Thru Rattle: Startup



A Tour Thru Rattle: Loading Data



A Tour Thru Rattle: Explore Distribution



A Tour Thru Rattle: Explore Correlations



A Tour Thru Rattle: Hierarchical Cluster



A Tour Thru Rattle: Decision Tree



A Tour Thru Rattle: Decision Tree Plot



A Tour Thru Rattle: Random Forest



A Tour Thru Rattle: Risk Chart

0.10.20.30.40.50.60.70.80.9

Risk Scores

22%

1

2

3

4

LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift

0

20

40

60

80

100

0 20 40 60 80 100Caseload (%)

Per

form

ance

(%

)

RainTomorrow (92%)

Rain in MM (97%)

Precision

Risk Chart Random Forest weather.csv [test] RainTomorrow


Data Mining, Rattle, and R Programming with Data

Data Scientists are Programmers of Data

But. . .

Data scientists are programmers of data

A GUI can only do so much

R is a powerful statistical language

Data Scientists Desire. . .

Scripting

Transparency

Repeatability

Sharing



From GUI to CLI — Rattle’s Log Tab






Interface Notes

Work through the tabs from left to right

After setting up a tab we need to Execute it

Projects save the current Rattle state

Projects can be restored at a later time


Data Mining, Rattle, and R The Power of Free/Libre and Open Source Software

Tools

Ubuntu GNU/Linux operating system

Feature rich toolkit, up-to-date, easy to install, FLOSS

RStudio

Easy to use integrated development environment, FLOSSPowerful alternative is Emacs (Speaks Statistics), FLOSS

R Statistical Software Language

Extensive, powerful, thousands of contributors, FLOSS

KnitR and LATEX

Produce beautiful documents, easily reproducible, FLOSS


Data Mining, Rattle, and R Ubuntu

Using Ubuntu

Desktop Operating System (GNU/Linux)

Replacing Windows and OSX

The GNU Tool Suite based on Unix — significant heritage

Multiple specialised single task tools, working well together

Compared to single application trying to do it all

Powerful data processing from the command line:grep, awk, head, tail, wc, sed, perl, python, most, diff, make,paste, join, patch, . . .

For interacting with R — start up RStudio from the Dash


Data Mining, Rattle, and R RStudio Interface

RStudio—The Default Three Panels


Data Mining, Rattle, and R RStudio Interface

RStudio—With R Script File—Editor Panel


Data Mining, Rattle, and R Simple Plots

Scatterplot—R Code

Our first little bit of R code:

Load a couple of packages into the R library

library(rattle) # Provides the weather dataset

library(ggplot2) # Provides the qplot() function

Then produce a quick plot using qplot()

ds <- weather

qplot(MinTemp, MaxTemp, data=ds)



Scatterplot—Plot

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●●

● ●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

● ●

●

●●

●

●

●

●

●

●

●●

●●●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

● ●

●●●

●

●

●

●

●

●

●

●

●

●●

●

●

● ●●

● ●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●●

●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

10

20

30

0 10 20MinTemp

Max

Tem

p



Scatterplot—RStudio


Data Mining, Rattle, and R Installing Packages

Missing Packages–Tools→Install Packages. . .


Data Mining, Rattle, and R Installing Packages

RStudio—Installing ggplot2


Data Mining, Rattle, and R RStudio Shortcuts

RStudio—Keyboard Shortcuts

These will become very useful!

Editor:

Ctrl-Enter will send the line of code to the R consoleCtrl-2 will move the cursor to the Console

Console:

UpArrow will cycle through previous commandsCtrl-UpArrow will search previous commandsTab will complete function names and list the argumentsCtrl-1 will move the cursor to the Editor


Data Mining, Rattle, and R Basic R Commands

Basic R

library(rattle) # Load the weather dataset.

head(weather) # First 6 observations of the dataset.

## Date Location MinTemp MaxTemp Rainfall Evapora...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...

....

str(weather) # Struncture of the variables in the dataset.

## 'data.frame': 366 obs. of 24 variables:

## $ Date : Date, format: "2007-11-01" "2007-11-...

## $ Location : Factor w/ 46 levels "Adelaide","Alba...

## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...

....


Data Mining, Rattle, and R Basic R Commands

Basic R

summary(weather) # Univariate summary of the variables.

## Date Location MinTemp ...

## Min. :2007-11-01 Canberra :366 Min. :-5.30 ...

## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.30 ...

## Median :2008-05-01 Albany : 0 Median : 7.45 ...

## Mean :2008-05-01 Albury : 0 Mean : 7.27 ...

## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.50 ...

## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.90 ...

## (Other) : 0 ...

## Rainfall Evaporation Sunshine WindGust...

## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW : ...

## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW : ...

## Median : 0.00 Median : 4.20 Median : 8.60 E : ...

## Mean : 1.43 Mean : 4.52 Mean : 7.91 WNW : ...

## 3rd Qu.: 0.20 3rd Qu.: 6.40 3rd Qu.:10.50 ENE : ...

....


Data Mining, Rattle, and R Visualising Data

Visual Summaries—Add A Little Colour

qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)

●

●

●●

●

●●

●

●

●●

●

●

●●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

● ●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

● ●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

995

1005

1015

1025

1035

25 50 75 100Humidity3pm

Pres

sure

3pm

RainTomorrow

●

●

No

Yes



Visual Summaries—Careful with Categorics

qplot(WindGustDir, Pressure3pm, data=ds)

●

●

●●

●

●●

●

●

●●

●

●

●●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

●●

● ●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

995

1005

1015

1025

1035

N NNE NE ENE E ESE SE SSE S SSW SW WSW W WNW NW NNW NAWindGustDir

Pres

sure

3pm



Visual Summaries—Add A Little Jitter

qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

995

1005

1015

1025

1035


Pres

sure

3pm



Visual Summaries—And Some Colour

qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")

●

●

●●

●

●●

●

●

●●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

1005

1015

1025

1035


Pres

sure

3pm


Data Mining, Rattle, and R Help

Getting Help—Precede Command with ?


Loading, Cleaning, Exploring Data in Rattle

Workshop Overview











Loading Data



Exploring Data



Test Data



Transform Data


Descriptive Data Mining

Workshop Overview










Descriptive Data Mining Cluster Analysis

What is Cluster Analysis?

Cluster: a collection of observations

Similar to one another within the same clusterDissimilar to the observations in other clusters

Cluster analysisGrouping a set of data observations into classes

Clustering is unsupervised classification: no predefinedclasses—descriptive data mining.

Typical applications

As a stand-alone tool to get insight into data distributionAs a preprocessing step for other algorithms



What Is Good Clustering?

High Quality:

high intra-class similaritylow inter-class similarity

The Quality depends on:

similarity measurealgorithm for searching

Depends on the opinion of the user, and the algorithm’sability to discover hidden patterns that are of interest to theuser.



Major Clustering Approaches

Partitioning algorithms (kmeans, pam, clara, fanny): Constructvarious partitions and then evaluate them by some criterion. Afixed number of clusters, k, is generated. Start with an initial(perhaps random) cluster.

Hierarchical algorithms: (hclust, agnes, diana) Create ahierarchical decomposition of the set of observations using somecriterion

Density-based algorithms: based on connectivity and densityfunctions

Grid-based algorithms: based on a multiple-level granularitystructure

Model-based algorithms: (mclust for mixture of Gaussians) Amodel is hypothesized for each of the clusters and the idea is tofind the best fit of that model



KMeans Clustering


Descriptive Data Mining Association Rules

Association Rule Mining

An unsupervised learning algorithm—descriptive data mining.

Identify items (patterns) that occur frequently together in agiven set of data.

Patterns = associations, correlations, causal structures (Rules).

Data = sets of items in . . .

transactional databaserelational databasecomplex information repositories

Rule: Body → Head [support, confidence]



Typical Applications

Link analysis

Market basket analysis

Cross marketing

Customers who purchase . . .



Examples

Friday ∩ Nappies → Beer[0.5%, 60%]

Age ∈ [20, 30] ∩ Income ∈ [20K , 30K ]→ MP3Player[2%, 60%]

Maths ∩ CS → HDinCS[1%, 75%]

Gladiator ∩ Patriot → Sixth Sense[0.1%, 90%]

Statins ∩ Peritonitis → Chronic Renal Failure[0.1%, 32%]


Descriptive Data Mining Health Insurance Commission

Health Insurance Commision

Associations on episode database for pathology services

6.8 million records X 120 attributes (3.5GB)15 months preprocessing then 2 weeks data mining

Goal: find associations between tests

cmin = 50% and smin = 1%, 0.5%, 0.25%(1% of 6.8 million = 68,000)Unexpected/unnecessary combination of services

Refuse cover saves $550,000 per year


Descriptive Data Mining Health Insurance Commission

Association Rules


Predictive Data Mining: Decision Trees

Workshop Overview










Predictive Data Mining: Decision Trees Basics

Predictive Modelling: Classification

Goal of classification is to build models (sentences) in aknowledge representation (language) from examples of pastdecisions.

The model is to be used on unseen cases to make decisions.

Often referred to as supervised learning.

Common approaches: decision trees; neural networks; logisticregression; support vector machines.



Language: Decision Trees

Knowledge representation: A flow-chart-like tree structure

Internal nodes denotes a test on a variable

Branch represents an outcome of the test

Leaf nodes represent class labels or class distribution

Y

Y N

Age

Gender

Male Female

< 42 > 42



Tree Construction: Divide and Conquer

Decision tree induction is an example of a recursive partitioningalgorithm: divide and conquer.

At start, all the training examples are at the root

Partition examples recursively based on selected variables

+ +

+

+

+

+

−

−

+

+

++

++

+ −

_

−

−

+

−−

−

−

+

<42

>42

+

+

−

Males

Females

+

+

+−

−+ +

++

+

−

−

−

−−

+

+

−+

+


Predictive Data Mining: Decision Trees Algorithm

Algorithm for Decision Tree Induction

A greedy algorithm: takes the best immediate (local) decisionwhile building the overall model

Tree constructed top-down, recursive, divide-and-conquer

Begin with all training examples at the root

Data is partitioned recursively based on selected variables

Select variables on basis of a measure

Stop partitioning when?

All samples for a given node belong to the same classThere are no remaining variables for further partitioning –majority voting is employed for classifying the leafThere are no samples left



Basic Motivation: Entropy

We are trying to predict output Y (e.g., Yes/No) from input X .

A random data set may have high entropy:

Y is from a uniform distributiona frequency distribution would be flat!a sample will include uniformly random values of Y

A data set with low entropy:

Y ’s distribution will be very skeweda frequency distribution will have a single peaka sample will predominately contain just Yes or just No

Work towards reducing the amount of entropy in the data!



Variable Selection Measure: Entropy

Information gain (ID3/C4.5)

Select the variable with the highest information gain

Assume there are two classes: P and N

Let the data S contain p elements of class P and n elements ofclass N

The amount of information, needed to decide if an arbitraryexample in S belongs to P or N is defined as

IE (p, n) = − p

p + nlog2

p

p + n− n

p + nlog2

n

p + n



Variable Selection Measure: Gini

Gini index of impurity – traditional statistical measure – CART

Measure how often a randomly chosen observation is incorrectlyclassified if it were randomly classified in proportion to the actualclasses.

Calculated as the sum of the probability of each observationbeing chosen times the probability of incorrect classification,equivalently:

IG (p, n) = 1− (p2 + (1− p)2)

As with Entropy, the Gini measure is maximal when the classesare equally distributed and minimal when all observations are inone class or the other.



Variable Selection Measure

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00Proportion of Positives

Mea

sure Formula

Info

Gini

Variable Importance Measure



Information Gain

Now use variable A to partition S into v cells: {S1,S2, . . . ,Sv}If Si contains pi examples of P and ni examples of N, theinformation now needed to classify objects in all subtrees Si is:

E (A) =v∑

i=1

pi + ni

p + nI (pi , ni )

So, the information gained by branching on A is:

Gain(A) = I (p, n)− E (A)

So choose the variable A which results in the greatest gain ininformation.


Predictive Data Mining: Decision Trees Building Decision Trees

Startup Rattle

library(rattle)

rattle()



Load Example Weather Dataset

Click on the Execute button and an example dataset is offered.

Click on Yes to load the weather dataset.



Summary of the Weather Dataset

A summary of the weather dataset is displayed.



Model Tab — Decision Tree

Click on the Model tab to display the modelling options.



Build Tree to Predict RainTomorrow

Decision Tree is the default model type—simply click Execute.



Decision Tree Predicting RainTomorrow

Click the Draw button to display a tree(Settings → Advanced Graphics).



Evaluate Decision Tree

Click Evaluate tab—options to evaluate model performance.



Evaluate Decision Tree—Error Matrix

Click Execute to display simple error matrix.Identify the True/False Positives/Negatives.



Decision Tree Risk Chart

Click the Risk type and then Execute.



Decision Tree ROC Curve

Click the ROC type and then Execute.



Score a Dataset

Click the Score type to score a new dataset using model.



Log of R Commands

Click the Log tab for a history of all your interactions.Save the log contents as a script to repeat what we did.



Log of R Commands—rpart()

Here we see the call to rpart() to build the model.Click on the Export button to save the script to file.



Help → Model → Tree

Rattle provides some basic help—click Yes for R help.


Predictive Data Mining: Decision Trees Doing It In R

Weather Dataset - Inputs

ds <- weather

head(ds, 4)

## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine

## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3

## 2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7

## 3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3

## 4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1

....

summary(ds[c(3:5,23)])

## MinTemp MaxTemp Rainfall RISK_MM

## Min. :-5.30 Min. : 7.6 Min. : 0.00 Min. : 0.00

## 1st Qu.: 2.30 1st Qu.:15.0 1st Qu.: 0.00 1st Qu.: 0.00

## Median : 7.45 Median :19.6 Median : 0.00 Median : 0.00

## Mean : 7.27 Mean :20.6 Mean : 1.43 Mean : 1.43

....



Weather Dataset - Target

target <- "RainTomorrow"

summary(ds[target])

## RainTomorrow

## No :300

## Yes: 66

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(vars <- names(ds)[-c(1, 2, 23)])

## [1] "MinTemp" "MaxTemp" "Rainfall" "Evaporation"

## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "WindDir9am"

## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am"

## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "Cloud9am"

## [17] "Cloud3pm" "Temp9am" "Temp3pm" "RainToday"

## [21] "RainTomorrow"



Simple Train/Test Paradigm

set.seed(1421)

train <- c(sample(1:nrow(ds), 0.70*nrow(ds))) # Training dataset

head(train)

## [1] 288 298 363 107 70 232

length(train)

## [1] 256

test <- setdiff(1:nrow(ds), train) # Testing dataset

length(test)

## [1] 110



Display the Model

model <- rpart(form, ds[train, vars])

model

## n= 256

##

## node), split, n, loss, yval, (yprob)

## * denotes terminal node

##

## 1) root 256 44 No (0.82812 0.17188)

## 2) Humidity3pm< 59.5 214 21 No (0.90187 0.09813)

## 4) WindGustSpeed< 64 204 14 No (0.93137 0.06863)

## 8) Cloud3pm< 6.5 163 5 No (0.96933 0.03067) *

## 9) Cloud3pm>=6.5 41 9 No (0.78049 0.21951)

## 18) Temp3pm< 26.1 34 4 No (0.88235 0.11765) *

## 19) Temp3pm>=26.1 7 2 Yes (0.28571 0.71429) *

....

Notice the legend to help interpret the tree.



Performance on Test Dataset

The predict() function is used to score new data.

head(predict(model, ds[test,], type="class"))

## 2 4 6 8 11 12

## No No No No No No

## Levels: No Yes

table(predict(model, ds[test,], type="class"), ds[test, target])

##

## No Yes

## No 77 14

## Yes 11 8



Example DTree Plot using Rattle

yes no

1

2

4

8

9

18 19 5

3

6 7

Humidity3pm < 60

WindGustSpeed < 64

Cloud3pm < 6.5

Temp3pm < 26

Pressure3pm >= 1015

No.83 .17100%

No.90 .10

84%

No.93 .07

80%

No.97 .03

64%

No.78 .22

16%

No.88 .12

13%

Yes.29 .71

3%

Yes.30 .70

4%

Yes.45 .55

16%

No.79 .21

7%

Yes.17 .83

9%

yes no

1

2

4

8

9

18 19 5

3

6 7

Rattle 2014−Mar−01 07:25:21 gjw



An R Scripting Hint

Notice the use of variables ds, target, vars.

Change these variables, and the remaining script is unchanged.

Simplifies script writing and reuse of scripts.

ds <- iris

target <- "Species"

vars <- names(ds)

Then repeat the rest of the script, without change.



An R Scripting Hint — Unchanged Code

This code remains the same to build the decision tree.

form <- formula(paste(target, "~ ."))

train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))

test <- setdiff(1:nrow(ds), train)

model <- rpart(form, ds[train, vars])

model

## n= 105

##

## node), split, n, loss, yval, (yprob)

## * denotes terminal node

##

## 1) root 105 69 setosa (0.34286 0.32381 0.33333)

## 2) Petal.Length< 2.6 36 0 setosa (1.00000 0.00000 0.00000) *

## 3) Petal.Length>=2.6 69 34 virginica (0.00000 0.49275 0.50725)

## 6) Petal.Length< 4.95 35 2 versicolor (0.00000 0.94286 0.05714) *

## 7) Petal.Length>=4.95 34 1 virginica (0.00000 0.02941 0.97059) *



An R Scripting Hint — Unchanged Code

Similarly for the predictions.

head(predict(model, ds[test,], type="class"))

## 3 8 9 10 11 12

## setosa setosa setosa setosa setosa setosa

## Levels: setosa versicolor virginica

table(predict(model, ds[test,], type="class"), ds[test, target])

##

## setosa versicolor virginica

## setosa 14 0 0

## versicolor 0 15 4

## virginica 0 1 11


Predictive Data Mining: Decision Trees Summary

Summary

Most widely deployed machine learning algorithm.

Simple idea, powerful learner.

Available in R through the rpart package.

Related packages include party, Cubist, C50, RWeka (J48).


Predictive Data Mining: Ensembles

Workshop Overview










Predictive Data Mining: Ensembles

Building Multiple Models

General idea developed in Multiple Inductive Learning algorithm(Williams 1987).

Ideas were developed (ACJ 1987, PhD 1990) in the context of:

observe that variable selection methods don’t discriminate;so build multiple decision trees;then combine into a single model.

Basic idea is that multiple models, like multiple experts, mayproduce better results when working together, rather than inisolation

Two approaches covered: Boosting and Random Forests.

Meta learners.


Predictive Data Mining: Ensembles Boosting

Boosting Algorithms

Basic idea: boost observations that are “hard to model.”Algorithm: iteratively build weak models using a poor learner:

Build an initial model;

Identify mis-classified cases in the training dataset;

Boost (over-represent) training observations modelled incorrectly;

Build a new model on the boosted training dataset;

Repeat.

The result is an ensemble of weighted models.

Best off the shelf model builder. (Leo Brieman)



Example: Ada on Weather Data

head(weather[c(1:5, 23, 24)], 3)

## Date Location MinTemp MaxTemp Rainfall RISK_MM...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.6...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 3.6...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 39.8...

....

set.seed(42)

train <- sample(1:nrow(weather), 0.7 * nrow(weather))

(m <- ada(RainTomorrow ~ ., weather[train, -c(1:2, 23)]))

## Call:

## ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)])

##

## Loss: exponential Method: discrete Iteration: 50

....



Example: Error Rate

Notice error rate decreases quickly then flattens.

plot(m)

0 10 20 30 40 50

0.06

0.08

0.10

0.12

0.14

Iteration 1 to 50

Err

or

50

1

1

1

1

1

Training Error

1 Train



Example: Variable Importance

Helps understand the knowledge captured.

varplot(m)

RainToday

WindDir9am

Rainfall

WindDir3pm

WindGustDir

Pressure3pm

Sunshine

WindGustSpeed

Cloud3pm

WindSpeed9am

WindSpeed3pm

Humidity9am

Cloud9am

Evaporation

MaxTemp

Temp9am

Humidity3pm

MinTemp

Pressure9am

Temp3pm

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0.04 0.05 0.06 0.07 0.08 0.09 0.10

Variable Importance Plot

Score



Example: Sample Trees

There are 50 trees in all. Here’s the first 3.

fancyRpartPlot(m$model$trees[[1]])



yes no

1

2

4

5

10 11 3

Cloud3pm < 7.5

Pressure3pm >= 1012

Humidity3pm < 42

−1.96 .04100%

−1.97 .03

94%

−1.99 .01

76%

−1.87 .13

18%

−1.96 .04

9%

1.72 .28

9%

1.42 .58

6%

yes no

1

2

4

5

10 11 3

Rattle 2014−Mar−01 07:25:22 gjw

yes no

1

2

3

6 7

Pressure3pm >= 1012

Sunshine >= 11

−1.96 .04100%

−1.98 .02

80%

1.82 .18

20%

−11.00 .00

6%

1.63 .37

14%

yes no

1

2

3

6 7

Rattle 2014−Mar−01 07:25:23 gjw

yes no

1

2

3

6 7

Pressure3pm >= 1012

MaxTemp >= 27

−1.97 .03100%

−1.99 .01

82%

1.83 .17

18%

−1.97 .03

6%

1.67 .33

12%

yes no

1

2

3

6 7

Rattle 2014−Mar−01 07:25:23 gjw



Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]

actual <- weather[-train,]$RainTomorrow

risks <- weather[-train,]$RISK_MM

riskchart(predicted, actual, risks)

0.10.20.30.40.50.60.70.80.9

Risk Scores

23%

1

2

3

4

LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift

0

20

40

60

80

100

0 20 40 60 80 100Caseload (%)

Perfo

rman

ce (%

)

Recall (88%)

Risk (94%)

Precision



Example Applications

ATO Application: What life events affect compliance?

First application of the technology — 1995Decision Stumps: Age > NN; Change in Marital Status

Boosted Neural Networks

OCR using neural networks as base learnersDrucker, Schapire, Simard, 1993



Summary

1 Boosting is implemented in R in the ada library

2 AdaBoost uses e−m; LogitBoost uses log(1 + e−m); Doom IIuses 1− tanh(m)

3 AdaBoost tends to be sensitive to noise (addressed byBrownBoost)

4 AdaBoost tends not to overfit, and as new models are added,generalisation error tends to improve.

5 Can be proved to converge to a perfect model if the learners arealways better than chance.


Predictive Data Mining: Ensembles Random Forests

Random Forests

Original idea from Leo Brieman and Adele Cutler.

The name is Licensed to Salford Systems!

Hence, R package is randomForest.

Typically presented in context of decision trees.

Random Multinomial Logit uses multiple multinomial logitmodels.



Random Forests

Build many decision trees (e.g., 500).

For each tree:

Select a random subset of the training set (N);Choose different subsets of variables for each node of thedecision tree (m << M);Build the tree without pruning (i.e., overfit)

Classify a new entity using every decision tree:

Each tree “votes” for the entity.The decision with the largest number of votes wins!The proportion of votes is the resulting score.



Example: RF on Weather Data

set.seed(42)

(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],

na.action=na.roughfix,

importance=TRUE))

##

## Call:

## randomForest(formula=RainTomorrow ~ ., data=weath...

## Type of random forest: classification

## Number of trees: 500

## No. of variables tried at each split: 4

##

## OOB estimate of error rate: 13.67%

## Confusion matrix:

## No Yes class.error

## No 211 4 0.0186

## Yes 31 10 0.7561



Example: Error Rate

Error rate decreases quickly then flattens over the 500 trees.

plot(m)

0 100 200 300 400 500

0.0

0.2

0.4

0.6

0.8

m

trees

Err

or



Example: Variable Importance

Helps understand the knowledge captured.

varImpPlot(m, main="Variable Importance")

RainTodayRainfallWindDir3pmWindDir9amEvaporationWindGustDirHumidity9amWindSpeed9amWindSpeed3pmCloud9amMinTempHumidity3pmTemp9amPressure9amMaxTempWindGustSpeedTemp3pmPressure3pmCloud3pmSunshine

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0 5 10 15MeanDecreaseAccuracy

RainTodayRainfallWindDir3pmWindGustDirWindDir9amEvaporationCloud9amWindSpeed9amWindSpeed3pmHumidity9amMaxTempTemp9amTemp3pmMinTempHumidity3pmWindGustSpeedPressure9amCloud3pmSunshinePressure3pm

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0 2 4 6 8MeanDecreaseGini

Variable Importance



Example: Sample Trees

There are 500 trees in all. Here’s some rules from the first tree.

## Random Forest Model 1

##

## ------------------------------------------------------...

## Tree 1 Rule 1 Node 30 Decision No

##

## 1: Evaporation <= 9

## 2: Humidity3pm <= 71

## 3: Cloud3pm <= 2.5

## 4: WindDir9am IN ("NNE")

## 5: Sunshine <= 10.25

## 6: Temp3pm <= 17.55

## ------------------------------------------------------...

## Tree 1 Rule 2 Node 31 Decision Yes

##

## 1: Evaporation <= 9

## 2: Humidity3pm <= 71

....



Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]

actual <- weather[-train,]$RainTomorrow

risks <- weather[-train,]$RISK_MM

riskchart(predicted, actual, risks)

0.10.20.30.40.50.60.70.80.9

Risk Scores

22%

1

2

3

4

LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift

0

20

40

60

80

100

0 20 40 60 80 100Caseload (%)

Perfo

rman

ce (%

)

Recall (92%)

Risk (97%)

Precision



Features of Random Forests: By Brieman

Most accurate of current algorithms.

Runs efficiently on large data sets.

Can handle thousands of input variables.

Gives estimates of variable importance.



Random Forests

Ensemble: Multiple models working together

Often better than a single model

Variance and bias of the model are reduced

The best available models today - accurate and robust

In daily use in very many areas of application


Moving into R and Scripting our Analyses

Workshop Overview











Data Scientists are Programmers of Data

But. . .

Data scientists are programmers of data

A GUI can only do so much

R is a powerful statistical language

Data Scientists Desire. . .

Scripting

Transparency

Repeatability

Sharing









Step 1: Load the Dataset

dsname <- "weather"

ds <- get(dsname)

dim(ds)

## [1] 366 24

names(ds)

## [1] "Date" "Location" "MinTemp" "...

## [5] "Rainfall" "Evaporation" "Sunshine" "...

## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "...

## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "...

....



Step 2: Observe the Data — Observations

head(ds)

## Date Location MinTemp MaxTemp Rainfall Evapora...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...

....

tail(ds)

## Date Location MinTemp MaxTemp Rainfall Evapo...

## 361 2008-10-26 Canberra 7.9 26.1 0 ...

## 362 2008-10-27 Canberra 9.0 30.7 0 ...

## 363 2008-10-28 Canberra 7.1 28.4 0 ...

....



Step 2: Observe the Data — Structure

str(ds)

## 'data.frame': 366 obs. of 24 variables:

## $ Date : Date, format: "2007-11-01" "2007-11-...

## $ Location : Factor w/ 46 levels "Adelaide","Alba...

## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...

## $ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 1...

## $ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16...

## $ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6...

## $ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4....

## $ WindGustDir : Ord.factor w/ 16 levels "N"<"NNE"<"N...

## $ WindGustSpeed: num 30 39 85 54 50 44 43 41 48 31 ...

## $ WindDir9am : Ord.factor w/ 16 levels "N"<"NNE"<"N...

## $ WindDir3pm : Ord.factor w/ 16 levels "N"<"NNE"<"N...

....



Step 2: Observe the Data — Summary

summary(ds)

## Date Location MinTemp ...

## Min. :2007-11-01 Canberra :366 Min. :-5.3...

## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.3...

## Median :2008-05-01 Albany : 0 Median : 7.4...

## Mean :2008-05-01 Albury : 0 Mean : 7.2...

## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.5...

## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.9...

## (Other) : 0 ...

## Rainfall Evaporation Sunshine Wind...

## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW ...

## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW ...

## Median : 0.00 Median : 4.20 Median : 8.60 E ...

....



Step 2: Observe the Data — Variables

id <- c("Date", "Location")

target <- "RainTomorrow"

risk <- "RISK_MM"

(ignore <- union(id, risk))

## [1] "Date" "Location" "RISK_MM"

(vars <- setdiff(names(ds), ignore))

## [1] "MinTemp" "MaxTemp" "Rainfall" "...

## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "...

## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "...

## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "...

....



Step 3: Clean the Data — Remove Missing

dim(ds)

## [1] 366 24

sum(is.na(ds[vars]))

## [1] 47

ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]



Step 3: Clean the Data — Remove Missing

dim(ds)

## [1] 328 24

sum(is.na(ds[vars]))

## [1] 0



Step 3: Clean the Data—Target as Categoric

summary(ds[target])

## RainTomorrow

## Min. :0.000

## 1st Qu.:0.000

## Median :0.000

## Mean :0.183

## 3rd Qu.:0.000

## Max. :1.000

....

ds[target] <- as.factor(ds[[target]])

levels(ds[target]) <- c("No", "Yes")



Step 3: Clean the Data—Target as Categoric

summary(ds[target])

## RainTomorrow

## 0:268

## 1: 60

0

100

200

0 1RainTomorrow

coun

t



Step 4: Prepare for Modelling

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(nobs <- nrow(ds))

## [1] 328

train <- sample(nobs, 0.70*nobs)

length(train)

## [1] 229

test <- setdiff(1:nobs, train)

length(test)

## [1] 99



Step 5: Build the Model—Random Forest

library(randomForest)

model <- randomForest(form, ds[train, vars], na.action=na.omit)

model

##

## Call:

## randomForest(formula=form, data=ds[train, vars], ...

## Type of random forest: classification

## Number of trees: 500

## No. of variables tried at each split: 4

....



Step 6: Evaluate the Model—Risk Chart

pr <- predict(model, ds[test,], type="prob")[,2]

riskchart(pr, ds[test, target], ds[test, risk],

title="Random Forest - Risk Chart",

risk=risk, recall=target, thresholds=c(0.35, 0.15))

0.10.20.30.40.50.60.70.80.9Risk Scores

15%1

2

3

4

5

6

LiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLiftLift

0

20

40

60

80

100

0 20 40 60 80 100Caseload (%)

Per

form

ance

(%

)

RainTomorrow (99%)

RISK_MM (100%)

Precision

Random Forest − Risk Chart


Literate Data Mining in R

Workshop Overview










Literate Data Mining in R Motivation

Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

“Remember that research you did last year? I’ve heard there isan update on the data that you used. Can you add the new datain and repeat the same analysis?”

“Jo Bloggs did a great analysis of the company returns data justbefore she left. Can you get someone else to analyse the newdata set using the same methods, and so produce an updatedreport that we can present to the Exec next week?”

“The fraud case you provided an analysis of last year has finallyreached the courts. We need to ensure we have a clear trail of thedata sources, the analyses performed, and the results obtained,to stand up in court. Could you document these please.”



Literate Data Mining Overview

One document to intermix the analysis, code, and results

Authors productive with narrative and code in one document

Sweave (Leisch 2002) and now KnitR (Yihui 2011)

Embed R code into LATEX documents for typesetting

KnitR also supports publishing to the web



Why Reproducible Data Mining?

Automatically regenerate documents when code, data, orassumptions change.

Eliminate errors that occur when transcribing results intodocuments.

Record the context for the analysis and decisions made about thetype of analysis to perform in the one place.

Document the processes to provide integrity for the conclusionsof the analysis.

Share approach with others for peer review and for learning fromeach other—engender a continuous learning environment.



Prime Objective: Trustworthy Software

Those who receive the results of modern data analysis havelimited opportunity to verify the results by direct observation.Users of the analysis have no option but to trust the analysis,and by extension the software that produced it. This places anobligation on all creators of software to program in such a waythat the computations can be understood and trusted. Thisobligation I label the Prime Directive.

John Chambers (2008)Software for Data Analysis: Programming with R



Beautiful Output by KnitR

KnitR combined with LATEX will

Intermix analysis and results of analysis

Automatically generate graphics and tables

Support reproducible and transparent analysis

Produce the best looking reports.



Beautiful Output by Default

The reader wants to read the document and easily do so!

Code highlighting is done automatically

Default theme is carefully designed

Many other themes are available

R Code is “properly” reformatted

Analyses (Graphs and Tables) automatically included.


Literate Data Mining in R Using RStudio

Supporting Technology

A suite of Free and Open Source Software — FLOSS

RStudio — Creating, managing, compiling documents

LATEX — Markup language for typesetting a document

R — Statistical analysis language

KnitR — Integrator of typesetting and analysis



Using RStudio

Simplified interaction with R, LATEX, and KnitR

Executes R code one line at a time

Formats LATEX documents and provides and spell checking

A single click compile to PDF and synchronised views

Demonstrate: Startup and explore RStudio.



Introducing LATEX

A text markup language rather than a WYSIWYG.

Based on TEX from 1977 — very stable and powerful.

LATEX is easier to use macro package built on TEX.

Ensures consistent style (layout, fonts, tables, maths, etc.)

Automatic indexes, footnotes and references.

Documents are well structured and are clear text.

Has a learning curve.



Basic LATEX Usage

\documentclass{article}

\begin{document}

\end{document}

Demonstrate Create a new Sweave document in RStudio



Structures


\begin{document}

\section{Introduction}

...

\subsection{Concepts}

...

\end{document}



Formats


\begin{document}

\begin{itemize}

\item ABC

\item DEF

\end{itemize}

This if \textbf{bold} text or \textbf{italic} text, ...

\end{document}



RStudio Support for LATEX

RStudio provides excellent support for working with LATEX documents

Helps to avoid having to know too much abuot LATEX

Best illustrated through a demonstration

Format menu

Section commandsFont commandsList commandsVerbatim/Block commands

Spell Checker

Compile PDF

Demonstrate: Start a new document, add contents, format to PDF.


Literate Data Mining in R Incorporating R Code

Incorporating R Code

We insert R code in a Chunk starting with << >>=

We terminate the Chunk with @

Save LATEX with extension Rnw

This Chunk<<simple_example>>=

x <- sum(1:10)

x

@

Produces

x <- sum(1:10)

x

## [1] 55

Demonstrate: Do this in RStudio



Making You Look Good

<<format_example>>=

for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}

@

for(i in 1:5)

{j <- cos(sin(i)*i^2)+3

print(j-5)

}

## [1] -1.334

## [1] -2.88

## [1] -1.704

## [1] -1.103

....



R Within the Text

Include information about data within the narrative.

We can do that with \Sexpr{...}.

Our dataset has \Sexpr{nrow(ds)} observations of\Sexpr{ncol(ds)} variables.

Becomes

Our dataset has 328 observations of 24 variables.

Better Still: \Sexpr{format(nrow(ds), big.mark=",")}

Our dataset has 328 observations of 24 variables.



A Simple Table

library(xtable)

obs <- sample(1:nrow(weatherAUS), 8)

vars <- 2:6

xtable(weatherAUS[obs, vars])

Location MinTemp MaxTemp Rainfall Evaporation

79696 AliceSprings 5.90 21.70 0.00 4.0032428 Ballarat 2.00 10.60 0.2058707 MountGambier 8.50 19.70 2.20 1.6022420 WaggaWagga 1.30 17.40 0.00 3.6050382 Cairns 24.00 31.80 0.00 4.60

1441 Albury 11.20 32.20 0.0078992 AliceSprings 13.80 30.90 0.00 5.8066576 PearceRAAF 5.60 22.90 0.00



Table: Exclude Row Names

print(xtable(weatherAUS[obs, vars]),

include.rownames=FALSE)


AliceSprings 5.90 21.70 0.00 4.00Ballarat 2.00 10.60 0.20MountGambier 8.50 19.70 2.20 1.60WaggaWagga 1.30 17.40 0.00 3.60Cairns 24.00 31.80 0.00 4.60Albury 11.20 32.20 0.00AliceSprings 13.80 30.90 0.00 5.80PearceRAAF 5.60 22.90 0.00



Table: Limit Number of Digits

print(xtable(weatherAUS[obs, vars],

digits=1),



AliceSprings 5.9 21.7 0.0 4.0Ballarat 2.0 10.6 0.2MountGambier 8.5 19.7 2.2 1.6WaggaWagga 1.3 17.4 0.0 3.6Cairns 24.0 31.8 0.0 4.6Albury 11.2 32.2 0.0AliceSprings 13.8 30.9 0.0 5.8PearceRAAF 5.6 22.9 0.0



Table: Tiny Font

vars <- 2:8


digits=0),

size="tiny",


Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir

AliceSprings 6 22 0 4 11 SEBallarat 2 11 0 WMountGambier 8 20 2 2 3 SWWaggaWagga 1 17 0 4 11 SSWCairns 24 32 0 5 9 NEAlbury 11 32 0 NNEAliceSprings 14 31 0 6 11 NPearceRAAF 6 23 0 11 S



Table: Column Alignment

vars <- 2:8


digits=0,

align="rlrrrrrr"),

size="tiny")


79696 AliceSprings 6 22 0 4 11 SE32428 Ballarat 2 11 0 W58707 MountGambier 8 20 2 2 3 SW22420 WaggaWagga 1 17 0 4 11 SSW50382 Cairns 24 32 0 5 9 NE

1441 Albury 11 32 0 NNE78992 AliceSprings 14 31 0 6 11 N66576 PearceRAAF 6 23 0 11 S



Table: Caption


digits=1,

caption="This is the table caption."),

size="tiny")


79696 AliceSprings 5.9 21.7 0.0 4.0 10.8 SE32428 Ballarat 2.0 10.6 0.2 W58707 MountGambier 8.5 19.7 2.2 1.6 2.8 SW22420 WaggaWagga 1.3 17.4 0.0 3.6 10.8 SSW50382 Cairns 24.0 31.8 0.0 4.6 8.8 NE

1441 Albury 11.2 32.2 0.0 NNE78992 AliceSprings 13.8 30.9 0.0 5.8 11.2 N66576 PearceRAAF 5.6 22.9 0.0 11.0 S

Table : This is the table caption.



Plots

library(ggplot2)

cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")

ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))

g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))

g <- g + geom_density(alpha = 0.55)

print(g)

0.00

0.05

0.10

0.15

0.20

10 20 30 40Temp3pm

dens

ity

Location

Canberra

Darwin

Melbourne

Sydney


Literate Data Mining in R Our First KnitR Document

Create a KnitR Document: New→R Sweave



Setup KnitR

We wish to use KnitR rather than the older Sweave processor

In RStudio we can configure the options to use knitr:

Select Tools→Options

Choose the Sweave group

Choose knitr for Weave Rnw files using:

The remaining defaults should be okay

Click Apply and then OK



Simple KnitR Document

Insert the following into your new KnitR document:

\title{Sample KnitR Document}

\author{Graham Williams}

\maketitle

\section*{My First Section}

This is some text that is automatically typeset

by the LaTeX processor to produce well formatted

quality output as PDF.



Simple KnitR Document



Simple KnitR Document—Resulting PDF

Result of Compile PDF


Literate Data Mining in R Including R Commands in KnitR

KnitR: Add R Commands

R code can be used to generate results into the document:

<<echo=FALSE, message=FALSE>>=

library(rattle) # Provides the weather dataset

library(ggplot2) # Provides the qplot() function

ds <- weather

qplot(MinTemp, MaxTemp, data=ds)

@



KnitR Document With R Code



Simple KnitR Document—PDF with Plot

Result of Compile PDF


Literate Data Mining in R Basics Cheat Sheet

LaTeX Basics

\subsection*{...} % Introduce a Sub Section

\subsubsection*{...} % Introduce a Sub Sub Section

\textbf{...} % Bold font

\textit{...} % Italic font

\begin{itemize} % A bullet list

\item ...

\item ...

\end{itemize}

Plus an extensive collection of other markup and capabilities.



KnitR Basics

echo=FALSE # Do not display the R code

eval=TRUE # Evaluate the R code

results="hide" # Hide the results of the R commands

fig.width=10 # Extend figure width from 7 to 10 inches

fig.height=8 # Extend figure height from 7 to 8 inches

out.width="0.8\\textwidth" # Fit figure 80% page width

out.height="0.5\\textheight" # Fit figure 50% page height

Plus an extensive collection of other options.



Workshop Overview










Literate Data Mining in R Resources

Resources: One Page R

A new series of guides to using R.

http://onepager.togaware.com

Data Preparation template:http://onepager.togaware.com/DataO.pdf

Model Building template:http://onepager.togaware.com/ModelO.pdf

Plots using GGPlot2:http://onepager.togaware.com/GGPlot2O.pdf

Test Mining in R:http://onepager.togaware.com/TextMiningO.pdf


Literate Data Mining in R Resources

Resources and References

OnePageR: http://onepager.togaware.com – Tutorial Notes

Rattle: http://rattle.togaware.com

Guides: http://datamining.togaware.com

Practise: http://analystfirst.com

Book: Data Mining using Rattle/R

Chapter: Rattle and Other Tales

Paper: A Data Mining GUI for R — R Journal, Volume 1(2)


That Is All Folks Time For Questions

Thank You

This document, sourced from CUNYWorkshop.Rnw revision 315, was processed by KnitR version 1.5 of 2013-09-28 and took 11.4 seconds toprocess. It was generated by gjw on theano running Ubuntu 13.10 with Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz having 4 cores and 3.9GB

of RAM. It completed the processing 2014-03-01 07:25:28.


Why do Data Science with R? · 2014. 4. 26. · Data Mining, Rattle, and R Definition Data Mining...

Documents

Transcript of Why do Data Science with R? · 2014. 4. 26. · Data Mining, Rattle, and R Definition Data Mining...