
Model Selection and Tuning at Scale

March 2016

About us

Owen Zhang

Chief Product Officer @ DataRobot

Former #1 ranked Data Scientist on Kaggle

Former VP, Science @ AIG

Peter Prettenhofer

Software Engineer @ DataRobot

Scikit-learn core developer

Agenda

● Introduction

● Case study: Criteo 1TB

● Conclusion / Discussion

Model Selection

● Estimating the performance of different models in order to choose the best one.

● K-Fold Cross-validation

● The devil is in the details:
○ Partitioning
○ Leakage
○ Sample size
○ Stacked models require nested layers
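A minimal scikit-learn sketch of this setup, using a synthetic dataset as a stand-in for real data: a holdout partition is set aside first and never touched during selection, and K-fold cross-validation is run on the remainder.

```python
# Minimal sketch: set aside a holdout partition, then run 5-fold CV on the rest.
# make_classification is a stand-in for real data; the estimator choice is illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)

# The holdout is never touched during model selection / tuning.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(), X_train, y_train,
                         cv=cv, scoring="neg_log_loss")
print("CV log loss: %.4f +/- %.4f" % (-scores.mean(), scores.std()))
```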

Train Validation Holdout

[Diagram: data split into 5 folds across train, validation, and holdout partitions]

Model Complexity & Overfitting

More data to the rescue?

Underfitting or Overfitting?

http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
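The linked scikit-learn example plots exactly this kind of learning curve; below is a minimal sketch (on a synthetic stand-in dataset) of computing one to tell underfitting from overfitting.

```python
# Minimal learning-curve sketch on a synthetic stand-in dataset:
# train and validation scores converging at a low level suggests underfitting;
# a persistent gap between them suggests overfitting (more data may help).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20_000, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="roc_auc")
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print("n=%6d  train AUC=%.3f  validation AUC=%.3f" % (n, tr, va))
```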

Model Tuning

● Optimizing the performance of a model

● Example: Gradient Boosted Trees

○ Nr of trees
○ Learning rate
○ Tree depth / Nr of leaf nodes
○ Min leaf size
○ Example subsampling rate
○ Feature subsampling rate
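In scikit-learn terms these knobs map roughly to the parameters below; the values are placeholders to show the mapping, not tuned recommendations.

```python
# Rough mapping of the listed GBM knobs to scikit-learn parameters
# (values are illustrative placeholders, not recommendations).
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=500,       # nr of trees
    learning_rate=0.05,     # learning rate
    max_depth=5,            # tree depth (alternatively max_leaf_nodes)
    min_samples_leaf=50,    # min leaf size
    subsample=0.8,          # example subsampling rate
    max_features=0.5,       # feature subsampling rate
)
```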

Search Space

Number of candidate values per hyperparameter (and the resulting total number of grid combinations):

Hyperparameter           GBRT (naive)   GBRT   RandomForest
Nr of trees              5              1      1
Learning rate            5              5      -
Tree depth               5              5      1
Min leaf size            3              3      3
Example subsample rate   3              1      1
Feature subsample rate   2              2      5
Total combinations       2250           150    15
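The totals are simply products of the per-hyperparameter value counts; a quick way to check this with scikit-learn (the candidate value lists below are made up, only their lengths match the table).

```python
# The grid sizes above are just products of the per-parameter value counts.
# Value lists are illustrative; only their lengths matter here.
from sklearn.model_selection import ParameterGrid

naive_gbrt_grid = {
    "n_estimators": [100, 200, 500, 1000, 2000],     # 5
    "learning_rate": [0.3, 0.1, 0.05, 0.02, 0.01],   # 5
    "max_depth": [2, 3, 5, 7, 9],                    # 5
    "min_samples_leaf": [1, 20, 100],                # 3
    "subsample": [0.5, 0.8, 1.0],                    # 3
    "max_features": [0.5, 1.0],                      # 2
}
print(len(ParameterGrid(naive_gbrt_grid)))  # 5*5*5*3*3*2 = 2250 fits per CV fold
```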

Hyperparameter Optimization

● Grid Search

● Random Search

● Bayesian optimization
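As an illustration, a random-search sketch over roughly the same GBM space with a fixed budget of candidates; the distributions and the 30-fit budget are arbitrary choices for this sketch, and Bayesian optimization would require an external library.

```python
# Random search draws a fixed budget of candidates from the space instead of
# enumerating the full grid; distributions and n_iter are arbitrary here.
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(100, 2000),
    "learning_rate": loguniform(0.01, 0.3),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 100),
    "subsample": [0.5, 0.8, 1.0],
    "max_features": [0.5, 1.0],
}
search = RandomizedSearchCV(GradientBoostingClassifier(), param_distributions,
                            n_iter=30, cv=5, scoring="neg_log_loss",
                            random_state=0)
# search.fit(X_train, y_train)   # X_train, y_train as in the earlier sketch
# search.best_params_, search.best_score_
```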

Challenges at Scale

● Why is learning with more data harder?
○ Paradox: more data would let us use more complex models, but computational constraints prevent it*
○ => we need more efficient ways of creating complex models!

● Need to account for the combined cost: model fitting + model selection / tuning
○ Smart hyperparameter tuning tries to decrease the # of model fits
○ … we can accomplish this with fewer hyperparameters too**

* Pedro Domingos, A Few Useful Things to Know about Machine Learning, 2012.
** Practitioners often favor algorithms with few hyperparameters, such as RandomForest or AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html).

A case study -- binary classification on 1TB of data

● Criteo click-through data
● Downsampled ad impression data from 24 days
● Fully anonymized dataset:
○ 1 target
○ 13 integer features
○ 26 hashed categorical features
● Experiment setup:
○ Use day 0 - day 22 data for training, day 23 data for testing

Big Data?

Data size:
● ~46GB/day
● ~180,000,000 rows/day

However, it is very imbalanced (even after downsampling non-events):
● ~3.5% event rate

Further downsampling of non-events to a balanced dataset reduces the data to ~70GB (see the sketch below):
● Fits on a single node under "optimal" conditions
● Loss of model accuracy is negligible in most situations

Assuming a 0.1% raw event (click-through) rate:

Raw Data @ 0.1%
Data: ~1TB @ 3.5%
Data: 70GB @ 50%
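A hedged sketch of the downsampling step described on this slide, assuming a pandas DataFrame `df` with a 0/1 `target` column; this is not the speakers' actual pipeline.

```python
# Illustrative downsampling of non-events to a balanced dataset.
# `df` and the "target" column name are assumptions for this sketch.
import pandas as pd

def balance_by_downsampling(df: pd.DataFrame, target: str = "target",
                            seed: int = 0) -> pd.DataFrame:
    events = df[df[target] == 1]
    non_events = df[df[target] == 0].sample(n=len(events), random_state=seed)
    # Shuffle so events and non-events are mixed.
    return pd.concat([events, non_events]).sample(frac=1.0, random_state=seed)
```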

Where to start?

● 70GB (~260,000,000 data points) is still a lot of data
● Let's take a tiny slice of that to experiment with
○ Take 0.25%, then 0.5%, then 1%, and do grid search on each (see the sketch below)
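A minimal sketch of carving off those small stratified slices to experiment on; a synthetic dataset stands in for the balanced ~260M-row set, and the fractions match the bullet above.

```python
# Carve off small stratified slices (0.25%, 0.5%, 1%) to run grid searches on.
# make_classification is a stand-in for the balanced ~260M-row dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, random_state=0)

slices = {}
for frac in (0.0025, 0.005, 0.01):
    X_slice, _, y_slice, _ = train_test_split(
        X, y, train_size=frac, stratify=y, random_state=0)
    slices[frac] = (X_slice, y_slice)
# ...then run the grid / random search from the earlier sketches on each slice.
```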

[Chart: model accuracy vs. time (seconds) for RF, ASVM, Regularized Regression, GBM (with Count), and GBM (without Count); an arrow marks the "Better" direction]

GBM is the way to go; let's go up to 10% of the data

[Chart: performance vs. # of trees for different combinations of sample size / depth of tree / time to finish]

A “Fairer” Way of Comparing Models

A better model when time is the constraint

Can We Extrapolate?


Where We (can) do better than generic Bayesian Optimization

Tree Depth vs Data Size

● A natural heuristic -- increment tree depth by 1 every time data size doubles

[Chart: optimal tree depth for 1%, 2%, 4%, and 10% data samples]

Optimal Depth = a + b * log(DataSize)
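The heuristic above ("one extra level of depth per doubling of the data") corresponds to b = 1/log(2) when using the natural logarithm; a tiny sketch with an illustrative intercept `a` (not the fitted values from the talk).

```python
# Depth heuristic: depth grows by ~1 per doubling of the data,
# i.e. depth = a + b * log(n) with b = 1 / log(2) for the natural log.
# The intercept `a` is illustrative, not a value fitted in the talk.
import math

def optimal_depth(n_rows: int, a: float = -10.0, b: float = 1 / math.log(2)) -> int:
    return max(1, round(a + b * math.log(n_rows)))

for pct, n in [(1, 2_600_000), (2, 5_200_000), (4, 10_400_000), (10, 26_000_000)]:
    print("%3d%% of ~260M rows -> depth %d" % (pct, optimal_depth(n)))
```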

What about VW?

● Highly efficient online learning algorithm
● Supports adaptive learning rates
● Inherently linear; the user needs to specify non-linear features or interactions explicitly
● 2-way and 3-way interactions can be generated on the fly

● Supports “every k” validation

● The only "tuning" REQUIRED is the specification of interactions (see the sketch below)
○ Thanks to progressive validation, bad interactions can be detected immediately, so they don't waste time

Data pipeline for VW

[Diagram: the Training data is randomly split into chunks T1 … Tm, each chunk is randomly shuffled into T1s … Tms, and the shuffled chunks are concatenated/interleaved into the final training file; the Test set is kept separate]
It takes longer to prep the data than to run the model!
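A rough Python sketch of the diagrammed pipeline, under the assumption that the data fits line-by-line in memory (a real 70GB run would stage chunks on disk instead); file names and the number of chunks are illustrative.

```python
# Rough sketch of the pipeline: randomly split the training rows into m chunks,
# shuffle each chunk, then interleave the shuffled chunks into one file for VW.
# In-memory for simplicity; a real 70GB run would stream chunks through disk.
import random
from itertools import zip_longest

def split_shuffle_interleave(src: str, dst: str, m: int = 8, seed: int = 0) -> None:
    rng = random.Random(seed)
    chunks = [[] for _ in range(m)]
    with open(src) as f:
        for line in f:                       # random split into T1 .. Tm
            chunks[rng.randrange(m)].append(line)
    for chunk in chunks:                     # random shuffle -> T1s .. Tms
        rng.shuffle(chunk)
    with open(dst, "w") as out:              # concat + interleave
        for row in zip_longest(*chunks):
            out.writelines(line for line in row if line is not None)

split_shuffle_interleave("train.vw", "train_interleaved.vw")
```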

VW Results

[Chart: VW results without vs. with Count + Count*Numeric interactions, on 1%, 10%, and 100% of the data]

Putting It All Together

[Chart: model performance vs. total runtime, with 1 Hour and 1 Day markers]

Do We Really "Tune/Select Models @ Scale"?

● What we claim we do:
○ Model tuning and selection on big data
● What we actually do:
○ Model tuning and selection on small data
○ Re-run the model and expect/hope that performance/hyperparameters extrapolate as expected

● If you start the model tuning/selection process with GBs (even 100s of MBs) of data, you are doing it wrong!

Some Interesting Observations

● At least for some datasets, it is very hard for a "pure linear" model to outperform (accuracy-wise) a non-linear model, even with much more data

● There is meaningful structure in the hyperparameter space

● When we have limited time (relative to data size), running "deeper" models on a smaller data sample may actually yield better results

● To fully exploit the data: model estimation time is usually at least proportional to n*log(n), and we need models whose # of parameters can scale with the # of data points
○ GBM can have as many parameters as we want
○ So can factorization machines

● For any data and any model, we will run into a "diminishing returns" issue as the data gets bigger and bigger

DataRobot Essentials

April 7-8 London
April 28-29 San Francisco

May 17-18 Atlanta
June 23-24 Boston

datarobot.com/training

© DataRobot, Inc. All rights reserved.

Thanks / Questions?