Online Random Forest in 10 Minutes

Transcript of "Online Random Forest in 10 Minutes"

Page 1: Online random forests in 10 minutes

Online Random Forest in 10 Minutes

Page 2: Online random forests in 10 minutes

Traditional Supervised Learning Algorithms

● Regression
● Random Forest
● Support Vector Machines
● Classification and Regression Tree (CART)
● etc.

Page 3: Online random forests in 10 minutes

Inputs

● Data Matrix (Regression)

Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4
.56        | Red         | .456        | Male        | .589
.78        | Green       | .654        | Female      | .6654
.987       | Blue        | .678        | Female      | .789
.123       | Blue        | .999        | Male        | .543

Page 4: Online random forests in 10 minutes

Inputs

● Data Matrix (Binary Classification)

Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4
Yes        | Red         | .456        | Male        | .589
No         | Green       | .654        | Female      | .6654
Yes        | Blue        | .678        | Female      | .789
No         | Blue        | .999        | Male        | .543

Page 5: Online random forests in 10 minutes

Inputs To Streaming Classification

● Observations now have an explicit arrival order.

Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4 | Time
Yes        | Red         | .456        | Male        | .589        | Jan 1st 2011
No         | Green       | .654        | Female      | .6654       | Feb 4th 2012
Yes        | Blue        | .678        | Female      | .789        | Feb 5th 2013
No         | Blue        | .999        | Male        | .543        | July 4th 2013

Page 6: Online random forests in 10 minutes

Inputs To Streaming Classification

● New observations can arrive at any time (a sketch of such an observation follows the table)

Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4 | Time
Yes        | Red         | .456        | Male        | .589        | Jan 1st 2011
No         | Green       | .654        | Female      | .6654       | Feb 4th 2012
Yes        | Blue        | .678        | Female      | .789        | Feb 5th 2013
No         | Blue        | .999        | Male        | .543        | July 4th 2013
Yes        | Red         | .456        | Male        | .456        | NOW
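As a rough Scala sketch (not code from the deck), a streaming observation like the rows above can be represented as a labeled record with an explicit arrival time; all type and field names here are made up for illustration.

```scala
import java.time.Instant

// Illustrative sketch only (assumed names, not the deck's code): one labeled
// observation from the stream, with its arrival time made explicit.
sealed trait Feature
final case class Categorical(value: String) extends Feature
final case class Numeric(value: Double)     extends Feature

final case class Observation(
  label: Boolean,              // the predictand (Yes / No)
  predictors: Vector[Feature], // Predictor 1 .. Predictor 4
  arrivedAt: Instant           // explicit arrival order
)

object StreamingInputs {
  // The "NOW" row from the table above.
  val latest: Observation = Observation(
    label      = true,
    predictors = Vector(Categorical("Red"), Numeric(0.456), Categorical("Male"), Numeric(0.456)),
    arrivedAt  = Instant.now()
  )
}
```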

Page 7: Online random forests in 10 minutes

Problems

● Do the important predictors change over time and when does this change occur?

● How far back is data relevant to today’s problem?

● What happens when our predictors change again in the future?

● What if this is all happening rapidly… will it scale?

Page 8: Online random forests in 10 minutes

Enter Online Random Forest

● Input is a single new observation
● Trees learn incrementally on this new data
● Trees are dropped from the forest based on performance and replaced with a new "ungrown" tree (see the sketch below)
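A minimal Scala sketch of this loop, assuming a generic `OnlineTree` interface and a simple accuracy threshold for dropping (the prune rule the deck actually describes, which also considers tree age, appears on page 11); all names are illustrative, not the deck's code.

```scala
// Minimal sketch of the update cycle described above.
trait OnlineTree[Obs] {
  def learn(obs: Obs): Unit     // incrementally update the tree with one observation
  def accuracy: Double          // running accuracy on recent test cases
  def predict(obs: Obs): Boolean
}

final class OnlineForest[Obs](newTree: () => OnlineTree[Obs], size: Int, minAccuracy: Double) {
  private var trees: Vector[OnlineTree[Obs]] = Vector.fill(size)(newTree())

  // One step of the online algorithm: every tree learns from the new observation,
  // then any tree whose performance is too poor is replaced by a fresh "ungrown" tree.
  def update(obs: Obs): Unit = {
    trees.foreach(_.learn(obs))
    trees = trees.map(t => if (t.accuracy < minAccuracy) newTree() else t)
  }

  // Majority vote over the current trees.
  def predict(obs: Obs): Boolean =
    trees.count(_.predict(obs)) * 2 > trees.size
}
```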

Page 9: Online random forests in 10 minutes

Visualization of a single tree

[Tree diagram: two leaves with class counts (5, 6) and (0, 70)]

Accuracy on test cases: 75%

The (0, 70) leaf holds pure data, so it stops splitting.
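The "pure data: stop splitting" rule can be made concrete with class counts like those in the diagram. This is an illustrative Scala sketch (assumed names; Gini impurity is shown only as one common purity measure), not the deck's code.

```scala
// Leaf class counts like (0, 70) or (5, 6) from the diagram above.
final case class LeafCounts(negatives: Int, positives: Int) {
  def total: Int = negatives + positives

  // A leaf containing only one class is pure: there is nothing left to split on.
  def isPure: Boolean = negatives == 0 || positives == 0

  // Gini impurity of the leaf: 0.0 when pure, 0.5 when perfectly mixed.
  def gini: Double = {
    val p = positives.toDouble / total
    2.0 * p * (1.0 - p)
  }
}

object PurityCheck {
  def main(args: Array[String]): Unit = {
    println(LeafCounts(0, 70).isPure) // true  -> stop splitting this leaf
    println(LeafCounts(5, 6).isPure)  // false -> this leaf can still be split later
  }
}
```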

Page 10: Online random forests in 10 minutes

Visualization of a single tree

[Tree diagram: the left branch now splits into leaves with class counts (2, 25) and (20, 3); the pure right leaf stays at (0, 70)]

50 new observations have arrived, and we create another split off the parent node's left branch.

Accuracy on test cases: 55%

Page 11: Online random forests in 10 minutes

Tree gets pruned

[Tree diagram: leaves with class counts (2, 25), (20, 3), and (0, 70)]

Accuracy on test cases: 55%. Compare this to a random variable that incorporates the age of the tree. The accuracy is too bad, so the tree is pruned.
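The deck compares a tree's accuracy to "a random variable" that incorporates the tree's age but does not give the formula. Below is one possible reading, sketched in Scala with assumed names and thresholds: after a grace period, a tree must beat a majority-class baseline or it is pruned.

```scala
// One possible reading of the prune rule (assumed names and numbers, not the
// deck's actual formula): a freshly planted tree gets a grace period; after
// that it must beat the accuracy of always guessing the majority class.
final case class TreeStats(accuracy: Double, age: Int) // age = observations seen so far

object PruneRule {
  val gracePeriod = 100 // observations a new tree may see before being judged

  def shouldPrune(stats: TreeStats, majorityClassRate: Double): Boolean =
    stats.age > gracePeriod && stats.accuracy < majorityClassRate

  def main(args: Array[String]): Unit = {
    // Illustrative numbers for the tree on this slide: 55% accuracy, well past
    // its grace period, in a stream where one class dominates -> prune it.
    println(shouldPrune(TreeStats(accuracy = 0.55, age = 500), majorityClassRate = 0.75)) // true
  }
}
```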

Page 12: Online random forests in 10 minutes

New Tree

It's a stump that hasn't yet split any data. If asked for a classification, it votes the prior probability calculated from the last 100 observations that the old, pruned tree saw.
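A minimal Scala sketch of such a replacement stump (assumed names, not the deck's code): with no splits yet, it simply votes the class prior over the most recent window of labels.

```scala
// The replacement tree starts as a stump: no splits, just a prior vote.
final class Stump(recentLabels: Vector[Boolean]) {
  require(recentLabels.nonEmpty)

  // Prior probability of the positive class over the most recent window of labels.
  val prior: Double = recentLabels.count(identity).toDouble / recentLabels.size

  // Until the stump makes its first split, every prediction is just the prior vote.
  def predict(): Boolean = prior >= 0.5
}

object NewTreeExample {
  def main(args: Array[String]): Unit = {
    // Illustrative window: the last 100 labels the pruned tree saw.
    val lastHundred = Vector.fill(70)(true) ++ Vector.fill(30)(false)
    val stump = new Stump(lastHundred.takeRight(100))
    println(stump.prior)     // 0.7
    println(stump.predict()) // true (votes "Yes")
  }
}
```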

Page 13: Online random forests in 10 minutes

Online Random Forest

● By dropping trees that predict poorly, we can adapt to changes in the important predictors

● If previous data is relevant to today's problem, trees have already learned from it in the past. If it is no longer relevant, that will be reflected in the accuracy and the tree will get pruned

Page 14: Online random forests in 10 minutes

Online Random Forest

● This process of incremental learning and dropping is always occurring, so we can continuously adapt to a changing signal

● We built our Online Random Forest with Scala's actor framework

● We distribute each tree's computation (and physical location), so we can handle high-volume input data streams (see the sketch below)
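The deck only says "Scala's actor framework", so the library used below (Akka classic actors) and all names are assumptions; the sketch just shows the shape of the design: one actor per tree, with new observations fanned out so each tree learns concurrently.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Sketch only: one actor per tree, so each tree's incremental learning runs
// concurrently. Message and class names are assumptions, not the deck's code.
final case class Learn(label: Boolean, predictors: Vector[Double])
final case class Classify(predictors: Vector[Double])

class TreeActor extends Actor {
  // Stand-in for a real online tree: here it only tracks the running positive rate.
  private var seen = 0
  private var positives = 0

  def receive: Receive = {
    case Learn(label, _) =>
      seen += 1
      if (label) positives += 1
    case Classify(_) =>
      // Each tree sends its own vote back to the original requester
      // (vote aggregation is omitted in this sketch).
      sender() ! (seen > 0 && positives * 2 > seen)
  }
}

class ForestActor(numTrees: Int) extends Actor {
  private val trees: Vector[ActorRef] =
    Vector.tabulate(numTrees)(i => context.actorOf(Props[TreeActor](), s"tree-$i"))

  def receive: Receive = {
    case learn: Learn       => trees.foreach(_ ! learn)          // fan out the new observation
    case classify: Classify => trees.foreach(_ forward classify) // trees reply to the asker
  }
}

object ForestApp {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("online-random-forest")
    val forest = system.actorOf(Props(new ForestActor(numTrees = 10)), "forest")
    forest ! Learn(label = true, predictors = Vector(0.456, 0.589))
  }
}
```

Because each tree is an independent actor, trees could be placed on different machines (for example with Akka remoting) without changing the message flow.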

Page 15: Online random forests in 10 minutes

Example Stream

Pages 16-20: Online random forests in 10 minutes

Changing Feature Importance

[Chart slides; no further text was captured in the transcript.]