
Page 1

How to Grow Distributed Random Forests

Jan Vitek
Purdue University, on sabbatical at 0xdata

Photo credit: http://foundwalls.com/winter-snow-forest-pine/

Page 2

Overview

• Not a data scientist…

• I implement programming languages for a living…

• Leading the FastR project, a next-generation R implementation…

• Today I’ll tell you how to grow a distributed random forest in 2 KLOC

Page 3

PART I
Why so random?

Introducing: Random Forest
Bagging
Out-of-bag error estimate
Confusion matrix

Photo credit: http://foundwalls.com/winter-snow-forest-pine/

Leo Breiman. Random Forests. Machine Learning, 2001.

Page 4

Classification Trees

• Consider a supervised learning problem on a simple data set with two classes and two features, x in [1,4] and y in [5,8]

• We can build a classification tree to predict the classes of new observations

[Figure: scatter plot of the two-class data over x (1–3) and y (5–7)]

Page 5

Classification Trees

• Consider a supervised learning problem on a simple data set with two classes and two features, x in [1,4] and y in [5,8]

• We can build a classification tree to predict the classes of new observations

[Figure: the data partitioned by axis-parallel splits (such as x > 2, x > 3, y > 6, y > 7) and the resulting classification tree]

Page 6

Classification Trees

• Classification trees overfit the data

[Figure: the overfitted classification tree from the previous slide]

Page 7

Random Forest

• Avoid overfitting by building many randomized, partial trees and voting to determine the class of new observations

[Figure: three random subsets of the training data, one per tree]

Page 8

Random Forest

• Each tree sees part of the training set and captures part of the information it contains

[Figure: the three data subsets and the three partial trees grown from them]

Page 9

Bagging

• First rule of RF: each tree sees a different random selection (without replacement) of the training set.

[Figure: three different random samples of the training data]
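As a concrete sketch of this rule (an illustration, not the H2O code): draw each tree’s sample as a fixed-size random subset of the row indices, without replacement.

import java.util.Arrays;
import java.util.Random;

// Sketch: bagging without replacement via a partial Fisher-Yates shuffle.
class BaggingSketch {
  static int[] sampleRows(int numRows, int sampleSize, long seed) {
    Random rand = new Random(seed);
    int[] rows = new int[numRows];
    for (int i = 0; i < numRows; i++) rows[i] = i;
    for (int i = 0; i < sampleSize; i++) {
      int j = i + rand.nextInt(numRows - i);   // pick from the unshuffled tail
      int tmp = rows[i]; rows[i] = rows[j]; rows[j] = tmp;
    }
    return Arrays.copyOf(rows, sampleSize);    // the first sampleSize entries are the bag
  }
}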

Page 10

Split selection

• Second rule of RF: splits are selected to maximize gain over a random subset of the features. Each split sees a new random subset.

[Figure: which split is better, y > 6 or y > 7? Candidate criteria: Gini impurity, information gain]
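For reference, both criteria are the standard impurity measures; for class proportions \(p_k\) at a node:

\[ G = 1 - \sum_k p_k^2 \quad \text{(Gini impurity)}, \qquad H = -\sum_k p_k \log_2 p_k \quad \text{(entropy)} \]

A split of node \(n\) into children \(L\) and \(R\) is scored by the impurity reduction \(I(n) - \tfrac{n_L}{n} I(L) - \tfrac{n_R}{n} I(R)\), maximized over the sampled feature subset.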

Page 11


OOBE

• One can use the training data to get an error estimate (“out-of-bag error”, or OOBE)

• Validate each tree on the complement of its training sample

[Figure: each tree is validated on the rows left out of its bagged sample]
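A sketch of how the OOBE can be computed (illustrative, with invented names): for each row, only the trees that did not train on it vote, and the OOBE is the error rate of those votes.

// Sketch: out-of-bag error, given per-tree bags and per-tree predictions.
static double oobe(boolean[][] inBag, int[][] vote, int[] label, int nClasses) {
  // inBag[t][r]: row r was in tree t's training sample
  // vote[t][r]: class predicted by tree t for row r
  int wrong = 0, counted = 0;
  for (int r = 0; r < label.length; r++) {
    int[] tally = new int[nClasses];
    for (int t = 0; t < inBag.length; t++)
      if (!inBag[t][r]) tally[vote[t][r]]++;   // only out-of-bag trees vote
    int best = -1, bestVotes = -1;
    for (int k = 0; k < nClasses; k++)
      if (tally[k] > bestVotes) { bestVotes = tally[k]; best = k; }
    if (bestVotes <= 0) continue;              // row happened to be in every bag
    counted++;
    if (best != label[r]) wrong++;
  }
  return counted == 0 ? 0 : (double) wrong / counted;
}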

Page 12

Validation

• Validation can be done using the OOBE (which is often convenient as it does not require preprocessing) or with a separate validation data set.

• A confusion matrix summarizes the class assignments performed during validation and gives an overview of the classification errors

assigned / actual    Red   Green   error
Red                   15       5     33%
Green                  1      10     10%

Page 13

PART II
Demo

Running RF on Iris

Photo credit: http://foundwalls.com/winter-snow-forest-pine/

Page 14

Iris RF results

Page 15

Sample tree

// Column constants
int COLSEPALLEN = 0;
int COLSEPALWID = 1;
int COLPETALLEN = 2;
int COLPETALWID = 3;

String classify(float[] fs) {
  if (fs[COLPETALWID] <= 0.800000) return "Iris-setosa";
  else if (fs[COLPETALLEN] <= 4.750000)
    if (fs[COLPETALWID] <= 1.650000) return "Iris-versicolor";
    else return "Iris-virginica";
  else if (fs[COLSEPALLEN] <= 6.050000)
    if (fs[COLPETALWID] <= 1.650000) return "Iris-versicolor";
    else return "Iris-virginica";
  else return "Iris-virginica";
}
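For example, with a hypothetical feature vector, classify(new float[]{ 5.1f, 3.5f, 1.4f, 0.2f }) returns "Iris-setosa": the petal width 0.2 falls below the 0.8 threshold at the root.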

Page 16

Comparing accuracy

• We compared several implementations and found that we are OK

Dataset    Features  Predictor  Instances (train/test)  Imbalanced  Missing observations
Iris              4  3 classes  100/50                   NO                     0
Vehicle          18  4 classes  564/282                  NO                     0
Stego           163  3 classes  3,000/4,500              NO                     0
Spam             57  2 classes  3,067/1,534              YES                    0
Credit           10  2 classes  100,000/50,000           YES               29,731
Intrusion        41  2 classes  125,973/22,544           NO                     0
Covtype          54  7 classes  387,342/193,672          YES                    0

Tab. 1: Overview of the seven datasets. The first four datasets are micro benchmarks that we use for calibration. The last three datasets are medium sized problems. Credit is the only dataset with missing observations. There are several imbalanced datasets.

2.1 Iris

The “Iris” dataset is a classical dataset (http://archive.ics.uci.edu/ml/datasets/Iris). The data set contains 3 classes of 50 instances each, where each class refers to a type of plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other. The dataset contains no missing data.

2.2 Vehicle

The “Vehicle” dataset is published by the UCI machine learning repository.[1] It contains data extracted from silhouettes of four kinds of vehicles. The purpose of the dataset is to classify a vehicle according to the extracted features. The whole dataset includes 18 continuous integer features and 946 instances. 564 instances were used for training, the remaining 282 served for validation. The predictor contains four classes representing vehicle kinds. The dataset contains no missing data.

2.3 Stego

The “Stego” dataset collects information about images to predict hidden information.[2] The dataset is defined by 163 continuous features and a balanced categorical predictor having three classes. The whole dataset contains 7,500 instances which are divided into 3,000 training and 4,500 test instances. There are no missing values.

2.4 Spam

The “Spam” dataset contains data extracted from regular and spam emails.[3] A model is used to identify suspicious and unsolicited commercial emails. The dataset contains 57 continuous numeric (integer/real) features representing frequencies of common words. The binary predictor identifies spam emails. The whole dataset contains 4,601 instances – 3,067 are used for training and 1,534 for validation. The predictor is slightly unbalanced since it contains 1,813 “YES” answers and 2,788 “NO” answers. There are no missing values.

2.5 Credit

The “Credit” dataset (http://www.kaggle.com/c/GiveMeSomeCredit) has been used to predict the probability of a person’s financial distress in the next two years. Our dataset is based on the training data provided by Kaggle. It has 150,000 instances with 10 features and a binary predictor “SeriousDlqin2yrs” (with values “YES”, “NO”) expressing potential risk of financial distress. The dataset is imbalanced in its predictor since the ratio of the answers is 139,974 to 10,026 with dominance of the “NO” answer. The dataset also includes missing values in the 6th and 11th columns, marked by an “NA” value. In total, there are 29,731 records containing missing values. We created two subsets by randomly selecting rows without replacement: 100,000 for the training set and 50,000 for validation.

2.6 Intrusion

The “Intrusion” dataset collects data about running software systems and identifies suspicious situations which could be attacks. There are 41 features recording OS system state and a binary predictor expressing “normal” or “abnormal” system state. The dataset contains 148,517 instances which are split into 125,973 train and 22,544 test instances. There are no missing values.[4]

2.7 Covtype

The data set “Covtype” comes from the UCI Machine Learning Repository (www.ics.uci.edu/~mlearn/MLRepository.html). It has 581,012 instances with 54 features (the 55th dimension records the class information of objects). There are 7 classes, each of which represents one type of tree. We created two subsets by randomly selecting rows without replacement: a 387,342-instance training set and a 193,672-instance validation set. The data has no missing observations. The data set was released as a part of a Kaggle competition.

3 Overview of the Comparison

Table 2 shows the accuracy of each of the tools. The table reports the cumulative error rate over all classes. As each tool supports different options, we chose to report the best numbers we were able to obtain. For small datasets most tools are reasonably accurate and give similar results. It is noteworthy that wiseRF, which usually runs the fastest, does not give as accurate results as R and Weka (this is the case for the Vehicle, Stego and Spam datasets). H2O consistently gives the best results for these small datasets. In the larger case studies, the tools are practically tied on Credit (with Weka and wiseRF being 0.2% better). For Intrusion H2O is 0.8% less accurate than wiseRF, and for the largest dataset, Covtype, H2O is markedly more accurate than wiseRF (over 10%).

Dataset      H2O     R      Weka   wiseRF
Iris         2.0%    2.0%   2.0%    2.0%
Vehicle     21.3%   21.3%  22.0%   22.0%
Stego       13.6%   13.9%  14.0%   14.9%
Spam         4.2%    4.2%   4.4%    5.2%
Credit       6.7%    6.7%   6.5%    6.5%
Intrusion   21.2%   19.0%  19.5%   20.4%
Covtype      3.6%   22.9%   –      14.8%

Tab. 2: The best overall classification errors for individual tools and datasets. H2O is generally the most accurate with the exception of Intrusion where it is slightly less precise than R. R is the second best, with the exception of larger datasets like Covtype where internal restrictions seem to result in significant loss of accuracy.

[1] http://archive.ics.uci.edu/ml/datasets/Statlog(VehicleSilhouettes), provided by the Turing Institute.
[2] http://dde.binghamton.edu/kodovsky/svm/index.php?content=Lectures
[3] The Spam dataset is accessible from the UCI machine learning repository – http://archive.ics.uci.edu/ml/datasets/Spambase
[4] The dataset is available at http://nsl.cs.unb.ca/NSL-KDD/. This dataset is used as a benchmark of Mahout at https://cwiki.apache.org/MAHOUT/partial-implementation.html

Page 17

PART III
Writing a DRF algorithm in Java with H2O

Design choices, implementation techniques, pitfalls.

Photo credit: http://foundwalls.com/winter-snow-forest-pine/

Page 18

Distributing and Parallelizing RF

• When the data does not fit in RAM, what impact does that have on random forest:

• How do we sample?

• How do we select splits?

• How do we estimate OOBE?

Page 19

Insights

• RF building parallelizes extremely well when the random data sample fits in memory

• Trees can be built in parallel trivially

• Tree size increases with data volume

• Validation requires trees to be co-located with the data

Page 20

Strategy

• Start with a randomized partition of the data on nodes

• Build trees in parallel on subsets of each node’s data

• Exchange trees for validation (see the sketch below)
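A minimal local sketch of this strategy, with invented names and threads standing in for cluster nodes (an illustration, not the H2O code):

import java.util.*;
import java.util.concurrent.*;

class StrategySketch {
  interface Tree { }                                        // stand-in for a built tree
  static Tree buildTree(List<double[]> sample) { return new Tree() { }; } // stub

  public static void main(String[] args) throws Exception {
    List<double[]> rows = new ArrayList<>();                // pretend this is the dataset
    for (int i = 0; i < 1000; i++) rows.add(new double[] { i % 4, i % 3 });
    Collections.shuffle(rows, new Random(42));              // randomized partition
    int nNodes = 4, per = rows.size() / nNodes;
    ExecutorService pool = Executors.newFixedThreadPool(nNodes);
    List<Future<Tree>> pending = new ArrayList<>();
    for (int n = 0; n < nNodes; n++) {                      // one "node" per thread
      final List<double[]> shard = rows.subList(n * per, (n + 1) * per);
      pending.add(pool.submit(() -> buildTree(shard)));     // trees built in parallel
    }
    List<Tree> forest = new ArrayList<>();
    for (Future<Tree> f : pending) forest.add(f.get());     // "exchange" the trees
    pool.shutdown();
  }
}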

Page 21

Reading and Parsing Data

• H2O does that for us and returns a ValueArray, a row-order distributed table

class ValueArray extends Iced implements Cloneable {
  long numRows() { … }
  int numCols() { … }
  long length() { … }
  double datad(long rownum, int colnum) { … }
}

• Each 4MB chunk of the VA is stored on a (possibly) different node and identified by a unique key
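A short usage sketch of this API (assuming the accessors shown above; dataKey stands for whatever key the parsed dataset was stored under):

// Walk every cell of the distributed table through the row/column API.
ValueArray va = DKV.get(dataKey).get();   // fetch the table by its key
double sum = 0;
for (long r = 0; r < va.numRows(); r++)
  for (int c = 0; c < va.numCols(); c++)
    sum += va.datad(r, c);                // may reach into remote chunks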

Page 22

Extracting random subsets

• Each node holds a random set of 4MB chunks of the value array

final ValueArray ary = DKV.get(dataKey).get();
ArrayList<RecursiveAction> dataInhaleJobs = new ArrayList<RecursiveAction>();
for (final Key k : keys) {
  if (!k.home()) continue;  // skip non-local keys
  final int rows = ary.rpc(ValueArray.getChunkIndex(k));
  dataInhaleJobs.add(new RecursiveAction() {
    @Override protected void compute() {
      for (int j = 0; j < rows; ++j)        // every row of the local chunk
        for (int c = 0; c < ncolumns; ++c)  // every column
          localData.add(ary.datad(j, c));
    }
  });
}
ForkJoinTask.invokeAll(dataInhaleJobs);

Page 23

Evaluating splits

• Each feature that must be considered for a split requires processing data of the form (feature value, class)

{ (3.4, red), (3.3, green), (2, red), (5, green), (6.1, green) }

• We should sort the values before processing

{ (2, red), (3.3, green), (3.4, red), (5, green), (6.1, green) }

• But since each split is done on a different set of rows, we would have to sort the features at every split

• Trees can have 100k splits

Page 24

Evaluating splits

• Instead we discretize the values

{ (2, red), (3.3, green), (3.4, red), (5, green), (6.1, green) }

• becomes

{ (0, red), (1, green), (2, red), (3, green), (4, green) }

• and no sorting is required, as we can represent the colors by arrays (of size the cardinality of the feature)

• For efficiency we can bin multiple values together
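A sketch of this preprocessing step (under assumed names, not the H2O implementation); binLimit caps the number of distinct bins, mirroring the -binLimit option used later in the deck:

import java.util.TreeMap;

// Sketch: replace each raw feature value by the rank of its distinct value,
// merging neighbouring ranks into at most binLimit bins.
class DiscretizeSketch {
  static int[] discretize(double[] col, int binLimit) {
    TreeMap<Double, Integer> rank = new TreeMap<>();
    for (double v : col) rank.put(v, 0);            // collect distinct values, sorted
    int r = 0;
    for (Double v : rank.keySet()) rank.put(v, r++);
    int distinct = r, nbins = Math.min(distinct, binLimit);
    int[] out = new int[col.length];
    for (int i = 0; i < col.length; i++)            // rank -> bin, roughly equal size
      out[i] = (int) ((long) rank.get(col[i]) * nbins / distinct);
    return out;
  }
}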

Page 25

Evaluating splits

• The implementation of an entropy-based split is now simple

Split ltSplit(int col, Data d, int[] dist, Random rand) {
  // Class distributions to the left and right of the candidate split.
  final int[] distL = new int[d.classes()], distR = dist.clone();
  final double upperBoundReduction = upperBoundReduction(d.classes());
  double maxReduction = -1;
  int bestSplit = -1;
  for (int i = 0; i < columnDists[col].length - 1; ++i) {
    // Move bin i's per-class counts from the right side to the left.
    for (int j = 0; j < distL.length; ++j) {
      double v = columnDists[col][i][j];
      distL[j] += v;
      distR[j] -= v;
    }
    int totL = 0, totR = 0;
    for (int e : distL) totL += e;
    for (int e : distR) totR += e;
    double eL = 0, eR = 0;
    for (int e : distL) eL += gain(e, totL);
    for (int e : distR) eR += gain(e, totR);
    // Reduction relative to the weighted entropy of the two sides.
    double eReduction = upperBoundReduction - ((eL * totL + eR * totR) / (totL + totR));
    if (eReduction > maxReduction) { bestSplit = i; maxReduction = eReduction; }
  }
  return Split.split(col, bestSplit, maxReduction);
}
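The gain helper is not shown on the slide; one plausible definition (an assumption) is the per-class entropy contribution:

// Assumed helper: entropy contribution of one class count, in bits.
static double gain(int count, int total) {
  if (count == 0 || total == 0) return 0;
  double p = (double) count / total;
  return -p * (Math.log(p) / Math.log(2));
}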

Page 26

Parallelizing tree building

• Trees are built in parallel with the Fork/Join framework

Statistic left = getStatistic(0, data, seed + LTSSINIT);
Statistic rite = getStatistic(1, data, seed + RTSSINIT);
int c = split.column, s = split.split;
SplitNode nd = new SplitNode(c, s, …);
data.filter(nd, res, left, rite);            // partition rows into res[0], res[1]
FJBuild fj0 = null, fj1 = null;
Split ls = left.split(res[0], depth >= maxdepth);
Split rs = rite.split(res[1], depth >= maxdepth);
if (ls.isLeafNode()) nd.l = new LeafNode(...);
else fj0 = new FJBuild(ls, res[0], depth + 1, seed + LTSINIT);
if (rs.isLeafNode()) nd.r = new LeafNode(...);
else fj1 = new FJBuild(rs, res[1], depth + 1, seed - RTSINIT);
if (data.rows() > ROWSFORKTRESHOLD) …        // fork only for large subtrees
fj0.fork();                                  // build left subtree asynchronously
nd.r = fj1.compute();                        // build right subtree in this task
nd.l = fj0.join();

Page 27

Challenges

• Found out that Java’s Random isn’t

• Tree size does get to be a challenge

• Need more randomization

• Determinism is needed for debugging
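One common way to reconcile the last two points (a sketch of the general idea, not necessarily H2O’s scheme): derive every task’s generator deterministically from a base seed, as the seed arithmetic on the previous slide suggests.

// Sketch: reproducible randomness, one generator per (tree, node) derived
// from a single base seed so a run can be replayed exactly.
static java.util.Random rngFor(long baseSeed, int treeId, long nodeId) {
  return new java.util.Random(baseSeed ^ (31L * treeId + nodeId));
}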

Page 28

PART IV
Playing with DRF

Covtype, playing with knobs

Photo credit: http://foundwalls.com/winter-snow-forest-pine/

Page 29

Covtype

Dataset    Features  Predictor  Instances (train/test)  Imbalanced  Missing observations
Iris              4  3 classes  100/50                   NO                     0
Vehicle          18  4 classes  564/282                  NO                     0
Stego           163  3 classes  3,000/4,500              NO                     0
Spam             57  2 classes  3,067/1,534              YES                    0
Credit           10  2 classes  100,000/50,000           YES               29,731
Intrusion        41  2 classes  125,973/22,544           NO                     0
Covtype          54  7 classes  387,342/193,672          YES                    0

Tab. 1: Overview of the seven datasets. The first four datasets are micro benchmarks that we use for calibration. The last three datasets are medium sized problems. Credit is the only dataset with missing observations. There are several imbalanced datasets.


Page 30

Varying sampling rate for Covtype

• Changing the proportion of data used for each tree affects the error

• The danger is overfitting, and losing the OOBE

As the benchmark execution platform we used dedicated EC2 instances built on top of the Xen hypervisor. The hypervisor is configured with 8 single-core Intel Xeon CPUs running at 2.67GHz. Each instance has 66GB of available physical memory.

To launch the JVM-based tools – H2O and Weka – we did not use any dedicated JVM tuning parameters except setting the maximum heap size to 48GB via the -Xmx48g command line option.

4.2 Sampling Rate

H2O supports changing the proportion of the population that is randomly selected (without replacement) to build each tree. Figure 1 illustrates the impact of varying the sampling rate between 1 and 99% when building trees for Covtype.[5] The blue line tracks the OOB error. It shows clearly that after approximately 80% sampling the OOBE will not improve. The red line shows the classification error, which keeps dropping, suggesting that for Covtype more data is better.

[Figure omitted: error vs. sampling rate (1–99%); legend: OOB err, Classif err]

Fig. 1: Impact of changing the sampling rate on the overall error rate (both classification and OOB).

Recommendation: The sampling rate should be set to the level that minimizes the classification error. The OOBE is a good estimator of the classification error.

4.3 Feature selection

H2O allows users to exclude features from consideration. Figure 2 shows the impact on the classification error and OOBE of ignoring individual features for the Covtype dataset. The OOBE and classification error track one another closely. Dropping features 1, 2 and 6, 7, 8 results in improved accuracy.

Figure 3 shows the impact on the error rate of selecting different numbers of features per split in each tree (for different sampling rates as well). The data suggest that 30 features yield optimal results and that higher sampling rates also improve the error rate.

[5] Trees built with h2o options: -ntrees=150 -binLimit=10000 -statType=entropy -seed=3.


Page 31

Changing #features / split for Covtype

• Increasing the number of features can be beneficial

• The impact is not huge though

[Figure omitted: error vs. number of features per split, one curve per sampling rate (50%, 60%, 70%, 80%)]

Fig. 3: Impact of changing the number of features used to evaluate each split, with different sampling rates (50%, 60%, 70% and 80%), on the overall error rate (both classification and OOB).

Recommendation: Individual features should be evaluated for their contribution to the OOBE. Features with high arity should be evaluated first as they sometimes decrease prediction quality. The number of features per split should be varied to obtain the best accuracy.

Page 32

Ignoring features for Covtype

• Some features are best ignored

[Figure omitted: error vs. index of the ignored feature; legend: OOB err, Classif err]

Fig. 2: Impact of ignoring one feature at a time on the overall error rate (both classification and OOB).

Page 33

Conclusion

• Random forest is a powerful machine learning technique

• It’s easy to write a distributed and parallel implementation

• Different implementation choices are possible

• Scaling it up to TB data comes next…

Photo credit: www.twitsnaps.com