Random Forests: The Vanilla of Machine Learning - Anna Quach

Welcome to my talk! I’m currently a PhD student at Utah State University working under Dr. Adele Cutler.

Transcript of Random Forests: The Vanilla of Machine Learning - Anna Quach

Page 1: Random Forests: The Vanilla of Machine Learning - Anna Quach

Page 2

Do we need hundreds of classifiers to solve real-world classification problems?

The authors evaluated 179 classifiers on 121 data sets from the University of California, Irvine (UCI) database (excluding large-scale problems), plus their own data. Overall, Random Forests performed the best in terms of accuracy!

See the paper here: http://www.jmlr.org/papers/v15/delgado14a.html

Page 4

Random Forests: A seminal paper!

https://scholar.google.com/citations?user=mXSv_1UAAAAJ&hl=en&oi=ao

Page 5

Random Forests

The (very theoretical) paper can be found here:

https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf

Page 6

The inventors of Classification and Regression Trees (CART)

(a) Leo Breiman

(b) Jerome Friedman

(c) Charles J. Stone

(d) Richard A. Olshen

Page 7

CART is actually published as a book.

https://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418

Page 8

Building a Classification Tree - Predicting Fake Likers

[Figure: "Predict Fake Facebook Likes" - a scatter plot of Category Entropy against Average Verified Page Likes, with points colored by class (Real vs. Fake).]

Download the data here: http://digital.cs.usu.edu/~kyumin/data/likers.html

Page 9

First split

Category Entropy: −∑_{i=1}^{k} (n_i / N) log(n_i / N), where n_i is the number of liked pages under category i, and N is the total number of pages liked by user u.

Average Verified Page Likes: the average proportion of verified pages liked out of the total number of pages liked by a user.
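The Category Entropy above can be computed in a few lines. Here is a minimal base-R sketch; the vector of per-category like counts is a hypothetical example, not data from the talk:

```r
# Category entropy: -sum over i of (n_i/N) * log(n_i/N),
# where n_i is the number of liked pages in category i and N = sum(n_i).
category_entropy <- function(n) {
  p <- n / sum(n)    # proportion of likes falling in each category
  -sum(p * log(p))   # Shannon entropy (natural log)
}

# A user whose likes are spread evenly over 4 categories
# has entropy log(4); likes concentrated in one category
# give entropy close to 0.
category_entropy(c(10, 10, 10, 10))  # log(4)
category_entropy(c(37, 1, 1, 1))     # much lower
```

Low entropy (likes concentrated in few categories) is the kind of behavior that helps separate fake likers from real users.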

Page 10

Second split

Page 11

Third split

Page 12

Code to build a classification tree

library(rpart)
library(rpart.plot)

data = read.csv("FakeLiker-dataset.csv")
colnames(data)[9] = "Entropy"
colnames(data)[10] = "Average"

levels(data$Class) = c("Real", "Fake")
cols = c("lightblue", "orange")[data$Class]

ctree = rpart(Class ~ Entropy + Average, data)
prp(ctree,
    extra = 1,
    box.palette = c("lightblue", "orange"))

Page 13

References on recursive partitioning (rpart) and rpart.plot

Some good references for understanding how CART works, with example code for the rpart R package, can be found here:

https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

and a guide with plenty of examples on plotting nice trees can be found here:

http://www.milbo.org/rpart-plot/prp.pdf

Page 14

A visual introduction to a decision tree

ONE OF THE 10 AWARD-WINNING SCIENCE VISUALIZATIONS FROM THE 2016 VIZZIES

Learn how a classification tree is built interactively here: http://www.popsci.com/how-machine-learning-works-interactive

Page 15

Bagging (Bootstrap Aggregating)

Fit each tree to bootstrap samples (random samples with replacement) from the data and combine by voting (classification) or averaging (regression).

http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
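The bagging procedure above can be sketched by hand. This is an illustrative version (not the randomForest implementation) that bags rpart trees on the built-in iris data:

```r
library(rpart)

set.seed(1)
data(iris)
n_trees <- 25
n <- nrow(iris)

# Fit each tree to a bootstrap sample (random sample with replacement)
trees <- lapply(seq_len(n_trees), function(b) {
  boot <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot, ])
})

# Combine by voting: each tree predicts a class, majority wins
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

# Training accuracy of the bagged ensemble (optimistic; in practice
# use the out-of-bag cases or a holdout set instead)
mean(bagged == iris$Species)
```

For regression, the voting step would be replaced by averaging the trees' numeric predictions.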

Page 16

The powers of Random Forests!

Random Forests is applicable to a wide variety of problems. Here are some of the features of Random Forests:

- Classification and regression
- Ranking important features (most widely used)
- Imputing missing values
- Local variable importance (underused)
- Handles unbalanced classes
- Naturally fits interactions
- Does not overfit as you add more trees
- Detects patterns using proximities (underused)
- Requires little tuning! Has two possible parameters to tune: mtry, and depth for regression

Page 17

Original Implementation of Random Forests is in Fortran

(a) Leo Breiman (b) Adele Cutler

Good documentation of the capabilities of Random Forests, and the Fortran code, can be found here: https://www.stat.berkeley.edu/~breiman/RandomForests/

Page 18

Random Forests is a trademark

The commercial version of Random Forests, as well as videos about Random Forests, can be found here: https://www.salford-systems.com/products/randomforests

Salford Systems provides a user guide on how to use Random Forests in their software, Salford Predictive Modeler (SPM). Find the user guide here: http://media.salford-systems.com/pdf/spm7/RandomForestsModelBasics.pdf

Page 19

randomForest - the first Random Forests package in R

https://cran.r-project.org/web/packages/randomForest/

Page 20

Variable Importance

[Figure: "Rank of Important Features" - the two variable importance plots produced by varImpPlot. The left panel ranks features by MeanDecreaseAccuracy (roughly 0.00 to 0.12) and the right panel by MeanDecreaseGini (roughly 0 to 150); Entropy and Years_Active rank highest on both measures.]

Page 21

Variable Importance Definition

Random Forests computes two measures of variable importance:

1. Permutation Importance (Mean Decrease in Accuracy) is permutation based.

For each tree, randomly permute the values of a variable among the out-of-bag cases and pass the permuted data down the tree. The permutation importance for each variable is the average of (error rate with the variable permuted) - (error rate with no permutation) over all the trees.

2. Gini Importance (Mean Decrease in Gini) is Gini based, for classification.

The average of (Gini impurity of the parent node) - (Gini impurity of the child nodes) over all trees in the forest, for each variable.
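The permutation idea can be illustrated outside the forest machinery. As a minimal sketch, the following measures the permutation importance of single variables for one rpart tree, with a hypothetical held-out set standing in for the out-of-bag cases (this is the idea, not the randomForest internals):

```r
library(rpart)

set.seed(2)
data(iris)
holdout <- sample(nrow(iris), 50)  # stands in for the out-of-bag cases
train <- iris[-holdout, ]
test <- iris[holdout, ]

tree <- rpart(Species ~ ., data = train)
base_err <- mean(predict(tree, test, type = "class") != test$Species)

# Permutation importance of one variable: permute its values,
# re-predict, and record the increase in error rate.
perm_importance <- function(var) {
  permuted <- test
  permuted[[var]] <- sample(permuted[[var]])
  mean(predict(tree, permuted, type = "class") != permuted$Species) - base_err
}

sapply(c("Petal.Width", "Sepal.Width"), perm_importance)
```

A variable the tree relies on shows a large increase in error when permuted; a variable it never splits on shows little or none.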

Page 22

randomForest code

The Random Forest can be built, and the important variables displayed, using the following code in R:

library(randomForest)

rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  ntree = 500)

varImpPlot(rf,
           scale = FALSE,
           main = "Rank of Important Features")

Page 23

Determining how many trees to use

plot(rf,
     main = "",
     ylim = c(0.05, 0.25),
     col = c("black", "lightblue", "orange"))

Page 24

Local Variable Importance

For each tree, randomly permute the values of a variable among the out-of-bag cases. Pass the permuted data down the tree. The local variable importance for each case i and variable j is the average of (error rate with the variable permuted) - (error rate with no permutation) over all the trees.

Page 25

Local Variable Importance example on detecting fake Facebook likes

[Figure: parallel coordinate plot of local variable importance for the top five variables (Entropy, Years_Active, About_Count, Average, Post_Frequency_per_day); each axis runs roughly from -0.45 to 0.47, and cases are colored by class.]

Page 26

Code to extract local variable importance

library(MASS)

rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  localImp = TRUE,
                  proximity = TRUE,
                  ntree = 500)

impv = names(sort(rf$importance[, 3],
                  decreasing = TRUE))[1:5]

parcoord(t(rf$localImportance)[, impv],
         col = cols,
         var.label = TRUE)

Page 27

Proximities

The proximity of two observations in Random Forests is defined as the proportion of the time the two observations (both in the out-of-bag sample) end up in the same terminal node. The proximity measures can be visualized using Multidimensional Scaling (MDS) plots. Using the MDS plot we can learn more about our data:

- identify characteristics of unusual points
- find clusters within classes
- see which classes are overlapping
- see which classes differ
- see which variables are locally important

Page 28

Visualizing the Proximities

[Figure: pairwise MDS plots of the proximities (MDS 1 vs. MDS 2, MDS 1 vs. MDS 3, and MDS 2 vs. MDS 3), with cases colored by class.]

Page 29

Code to extract the proximities

scalerf = cmdscale(1 - rf$prox, eig = TRUE, k = 3)$points

plot(scalerf[, 1], scalerf[, 2],
     col = cols,
     xlab = "MDS 1", ylab = "MDS 2",
     xlim = c(-0.5, 0.5),
     ylim = c(-0.5, 0.5),
     xaxt = "n",
     yaxt = "n")

Page 30

Local Variable Importance in interactive plots

We can find interesting patterns using an interactive plot.

Read more about irfplot (interactive random forests plots) here: http://digitalcommons.usu.edu/gradreports/134/

Page 31

Brushing in interactive plots

Page 32

randomForest

A short paper on the randomForest package can be found here: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf

Page 33

Random Forests presentations by Dr. Adele Cutler

A more comprehensive set of notes on Random Forests by Dr. Adele Cutler can be found here:
http://www.math.usu.edu/adele/RandomForests/UofU2013.pdf
http://www.math.usu.edu/adele/RandomForests/Ovronnaz.pdf

Page 35

Current Research - Improving the interpretation of Random Forests

https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314849

Page 36

Remembering Leo Breiman

1928 – 2005

Read more about Leo Breiman’s life’s work from the article written by Dr. Adele Cutler: https://arxiv.org/pdf/1101.0917.pdf

Page 37

Contact Information

Additional questions regarding Random Forests can be emailed to me at [email protected].