How to win data science competitions with Deep Learning


Description

Note: Please download the slides first, otherwise some links won't work! How to win Kaggle-style data science competitions and influence decisions with R, Deep Learning, and H2O's fast algorithms. We take a few public and Kaggle datasets and build models that compete on accuracy and scoring speed.

Transcript of How to win data science competitions with Deep Learning


How to win data science competitions with Deep Learning

Arno Candel

Silicon Valley Big Data Science Meetup, 0xdata Campus, Mountain View, Oct 9 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View

Register now at http://h2oworld2014.eventbrite.com

H2O Deep Learning Booklet

http://bit.ly/deeplearningh2o

Who am I?

PhD in Computational Physics, 2005, from ETH Zurich, Switzerland

6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree Inc - Machine Learning
10 months at 0xdata/H2O - Machine Learning

15 years in HPC/Supercomputing/Modeling

Named "2014 Big Data All-Star" by Fortune Magazine

@ArnoCandel

Outline
• Intro
• H2O Architecture
• Latest News
• Live Modeling on Kaggle Datasets
  • African Soil Property Prediction Challenge
  • Display Advertising Challenge
  • Higgs Boson Machine Learning Challenge
• Hands-On Training - see slides & booklet (H2O.ai)
• Implementation Concepts
• Tips & Tricks
• Outlook

About H2O (aka 0xdata): Java, Apache v2 Open Source

Join the www.h2o.ai community! #1 Java Machine Learning project on GitHub

Data Science Workflow

Powering Intelligent Applications

Sparkling Water: Spark + H2O

"We're thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water, and look forward to more future collaboration."

– Ion Stoica (CEO, Databricks)

Live Modeling 1

Kaggle: African Soil Properties Prediction Challenge

Any thoughts? Questions? Challenges? Approaches?

Still ongoing

Special thanks to Jo-fai Chow for pointing us to this challenge and for his awesome blog!

Let's follow this tutorial:

Jo-fai's blog, continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fai's blog, continued

5 different regression targets

3594 features

train/validation split

Train Deep Learning model

save to CSV

predict on test set

check validation error
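The steps above, sketched minimally in R with the h2o package. The file names (training.zip, sorted_test.csv) and column names (PIDN plus the five targets Ca, P, pH, SOC, Sand) are assumptions based on the Kaggle data, and the argument names follow the current h2o R API, which may differ slightly from the 2014 release:

```r
# Minimal sketch, not Jo-fai's exact script. Assumes the Kaggle files
# "training.zip" and "sorted_test.csv" are in the working directory.
library(h2o)
h2o.init(nthreads = -1)                      # multi-threading enabled, single node

train <- h2o.importFile("training.zip")      # import (zipped) data
test  <- h2o.importFile("sorted_test.csv")

targets    <- c("Ca", "P", "pH", "SOC", "Sand")              # 5 regression targets
predictors <- setdiff(colnames(train), c("PIDN", targets))   # ~3594 spectral features

splits <- h2o.splitFrame(train, ratios = 0.8, seed = 42)     # train/validation split

dl <- h2o.deeplearning(x = predictors, y = "Ca",             # one target shown here
                       training_frame   = splits[[1]],
                       validation_frame = splits[[2]],
                       hidden = c(100, 100), epochs = 20)

h2o.performance(dl, valid = TRUE)            # check validation error
preds <- h2o.predict(dl, test)               # predict on test set
write.csv(as.data.frame(preds), "submission_Ca.csv", row.names = FALSE)  # save to CSV
```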

Jo-fai made it to the very top with H2O Deep Learning! (He probably also used ensembles, stacking/blending, etc.)

BUT: Too early to tell! Leaderboard scores are computed on only ~100 rows.

My best ranking was 11th with a single H2O Deep Learning model (per target)

It's time for Live Modeling! Challenges:

• no domain knowledge -> need to find/read papers
• small data -> need to apply regularization to avoid overfitting
• need to explore data and do feature engineering if applicable

We will do the following steps:

• launch H2O from R
• import Kaggle data
• train a separate Deep Learning model for each of the 5 targets
• compute cross-validation error (N-fold)
• tune model parameters (grid search)
• do simple feature engineering (dimensionality reduction)
• build an ensemble model
• create a Kaggle submission
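A hedged sketch of the cross-validation and grid-search steps, reusing the train/predictors objects from the sketch above; the hyper-parameter values are illustrative only:

```r
# Grid search over a few Deep Learning hyper-parameters with 5-fold CV.
hyper <- list(hidden = list(c(100, 100), c(200, 200, 200)),
              l1     = c(1e-5, 1e-4),
              input_dropout_ratio = c(0, 0.1))

grid <- h2o.grid("deeplearning",
                 x = predictors, y = "Ca",
                 training_frame = train,
                 nfolds = 5,                 # N-fold cross-validation error
                 epochs = 10,
                 hyper_params = hyper)

# Pick the model with the lowest cross-validated RMSE, then continue tuning from there.
sorted_grid <- h2o.getGrid(grid@grid_id, sort_by = "rmse", decreasing = FALSE)
best <- h2o.getModel(sorted_grid@model_ids[[1]])
```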

https://github.com/0xdata/h2o

Live Modeling 2

Kaggle: Display Advertising Challenge

Real-world dirty data: missing values, categorical values with >10k factor levels, etc.

45.8M rows, 41 columns, binary classifier

H2O Deep Learning expands categorical factor levels on the fly, but still: 41 features turn into N=47,411 input values for the neural network.

With full categorical expansion, training is slow (and requires more communication).

Use the hashing trick to reduce the dimensionality N:

naive:  input[key_to_idx(key)] = 1   (one-hot encoding)

hashed: input[hash_fun(key) % M] += 1,  where M << N

E.g., M=1000 —> much faster training, rapid convergence!
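A toy R illustration of the two encodings above (plain R, not an H2O API; the digest package and the example keys are assumptions):

```r
library(digest)                               # digest2int() gives a fast string hash

# Hashing trick: map every categorical key into one of M buckets instead of
# one-hot encoding all N factor levels; collisions simply add up.
hash_features <- function(keys, M = 1000) {
  x <- numeric(M)
  for (k in keys) {
    idx <- (digest2int(k) %% M) + 1           # input[hash_fun(key) % M] += 1
    x[idx] <- x[idx] + 1
  }
  x
}

row_keys <- c("site=example.com", "device=mobile", "C14=a9f3b2")  # hypothetical keys
length(hash_features(row_keys))               # always M inputs, however large N is
```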

Note: The hashing trick is also used by the winners of the Kaggle challenge, and the 4th-ranked submission uses an ensemble of Deep Learning models with some feature engineering (drop rare categories, etc.).

Live Modeling 3

Kaggle: Higgs Boson Machine Learning Challenge

Higgs vs Background

Large Hadron Collider: largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456°F, 1 PB/day, etc. Higgs boson discovery (July '12) led to the 2013 Nobel prize.

Images courtesy CERN LHC

I missed the Kaggle deadline by 1 day! But still want to get good numbers.

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: must first add derived high-level features (physics formulae)!

Algorithm                    low-level H2O AUC    all features H2O AUC
Generalized Linear Model     0.596                0.684
Random Forest                0.764                0.840
Gradient Boosted Trees       0.753                0.839
Neural Net, 1 hidden layer   0.760                0.830

Metric: AUC = Area under the ROC curve (range 0.5…1, higher is better)

add derived features

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

Higgs: Can Deep Learning Do Better?

Let's build an H2O Deep Learning model and find out!

Algorithm                    low-level H2O AUC    all features H2O AUC
Generalized Linear Model     0.596                0.684
Random Forest                0.764                0.840
Gradient Boosted Trees       0.753                0.839
Neural Net, 1 hidden layer   0.760                0.830
Deep Learning                <Your guess goes here>

(reference paper results: baseline 0.733)

Let's build an H2O Deep Learning model and find out!

H2O Steam Scoring Platform

http://server:port/steam/index.html

Higgs Dataset Demo on 10-node cluster: Let's score all our H2O models and compare them!

Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than the LHC baseline of AUC=0.73.
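A hedged sketch of scoring the four baseline algorithms on the low-level features. This is not the live-demo script: it assumes the HIGGS CSV was imported with H2O's default column names (so C1 is the label and C2..C22 are the 21 low-level features) and that train/valid H2OFrames already exist; the parameters mirror the summary table further below.

```r
y     <- "C1"                                 # label column (assumption: default names)
x_low <- paste0("C", 2:22)                    # 21 low-level features
train[, y] <- as.factor(train[, y])           # binary classification
valid[, y] <- as.factor(valid[, y])

models <- list(
  glm = h2o.glm(x = x_low, y = y, training_frame = train, validation_frame = valid,
                family = "binomial"),
  drf = h2o.randomForest(x = x_low, y = y, training_frame = train, validation_frame = valid,
                         ntrees = 50, max_depth = 50),
  gbm = h2o.gbm(x = x_low, y = y, training_frame = train, validation_frame = valid,
                ntrees = 50, max_depth = 15),
  dl  = h2o.deeplearning(x = x_low, y = y, training_frame = train, validation_frame = valid,
                         hidden = c(300), activation = "Rectifier", epochs = 100)
)

# Validation AUC for each model (higher is better, 0.5 = random).
sapply(models, function(m) h2o.auc(h2o.performance(m, valid = TRUE)))
```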

Scoring Higgs Models in H2O Steam

Lift chart demo for a simple Higgs model

Quantile    Response rate    Lift          Cumulative lift
10.00%      93.55%           1.7683085     17.68%
20.00%      85.76%           1.6210535     33.89%
30.00%      78.39%           1.4817379     48.71%
40.00%      69.86%           1.3206078     61.92%
50.00%      60.80%           1.1492702     73.41%
60.00%      50.38%           0.9523756     82.93%
70.00%      39.52%           0.7470124     90.40%
80.00%      28.24%           0.5337477     95.74%
90.00%      16.88%           0.319122      98.93%
100.00%     5.65%            0.10676467    100.00%
Total       52.90%           1.0

Lift Chart

[Figure: cumulative Higgs events (y-axis, 0…1) vs. quantile (x-axis, 0…100% and Total)]

Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Higgs Particle Detection with H2O

Algorithm                         Paper's l-l AUC   low-level H2O AUC   all features H2O AUC   Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model          -                 0.596               0.684                  default, binomial
Random Forest                     -                 0.764               0.840                  50 trees, max depth 50
Gradient Boosted Trees            0.73              0.753               0.839                  50 trees, max depth 15
Neural Net, 1 layer               0.733             0.760               0.830                  1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers    0.836             0.850               -                      3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning, 4 hidden layers    0.868             0.869               -                      4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning, 6 hidden layers    0.880             running             -                      6x500 Rectifier, L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest?

Winning 3 submissions:

1. Bag of 70 dropout (50%) deep neural networks (600,600,600), channel-out activation function, l1=5e-6 and l2=5e-5 (for the first weight matrix only). Each input neuron is connected to only 10 input values (sampled with replacement before training).

2. A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest.

3. Ensemble of 108 rather small deep neural networks: (30,50), (30,50,25), (30,50,50,25).

Ensemble Deep Learning models are winning! Check out the h2oEnsemble R package by Erin LeDell.
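A hedged sketch of such a stacked ensemble with Erin LeDell's h2oEnsemble R package, reusing the x_low/y/train/valid objects from the Higgs sketch above; the wrapper and function names are as I recall them from that package and should be treated as assumptions:

```r
library(h2oEnsemble)                          # assumption: h2oEnsemble is installed

learners    <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
                 "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"              # GLM stacker on top of the base learners

fit  <- h2o.ensemble(x = x_low, y = y, training_frame = train,
                     family = "binomial",
                     learner = learners, metalearner = metalearner)
pred <- predict(fit, valid)                   # ensemble predictions on the validation set
```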

Deep Learning Tips & Tricks

General:
• More layers for more complex functions (exponentially more non-linearity).
• More neurons per layer to detect finer structure in data ("memorizing").
• Validate model performance on holdout data.
• Add some regularization for less overfitting (lower validation-set error).
• Ensembles typically perform better than individual models.

Specifically:
• Do a grid search to get a feel for convergence, then continue training.
• Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3.
• Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
• Input dropout is recommended for noisy, high-dimensional input.

Distributed:
• More training samples per iteration: faster, but less accuracy.
• With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
• Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
• Try balance_classes = true for datasets with large class imbalance.
• Enable force_load_balance for small datasets.
• Enable replicate_training_data if each node can hold all the data.
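A hedged example of an h2o.deeplearning call that exercises several of the knobs above (values are illustrative, not the tuned settings from the talk; x_low/y/train/valid as in the Higgs sketch):

```r
dl_tuned <- h2o.deeplearning(
  x = x_low, y = y,
  training_frame = train, validation_frame = valid,
  activation = "RectifierWithDropout",        # Rectifier + hidden dropout
  hidden = c(500, 500, 500),
  input_dropout_ratio = 0.1,                  # input dropout (noisy, high-dim input)
  hidden_dropout_ratios = c(0.5, 0.5, 0.5),   # hidden dropout up to 50%
  l1 = 1e-5, l2 = 1e-5, max_w2 = 10,          # regularization
  adaptive_rate = TRUE, rho = 0.99, epsilon = 1e-8,   # ADADELTA
  balance_classes = TRUE,                     # for large class imbalance
  force_load_balance = TRUE,                  # helps small datasets
  replicate_training_data = TRUE,             # if each node can hold all the data
  epochs = 40)
```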

You can participate!
- Image Recognition: Convolutional & Pooling Layers (PUB-644)
- Faster Training: GPGPU support (PUB-1013)
- NLP: Recurrent Neural Networks (PUB-1052)
- Unsupervised Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!

More info: http://h2o.ai | http://h2o.ai/blog | http://bit.ly/deeplearningh2o | http://www.slideshare.net/0xdata | https://www.youtube.com/user/0xdata | https://github.com/h2oai/h2o | https://github.com/h2oai/h2o-dev | @h2oai | @ArnoCandel

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View

Register now at http://h2oworld2014.eventbrite.com


Page 2: How to win data science competitions with Deep Learning

H2O Deep Learning Booklet

httpbitlydeeplearningh2o

Who am I

PhD in Computational Physics 2005from ETH Zurich Switzerland

6 years at SLAC - Accelerator Physics Modeling 2 years at Skytree Inc - Machine Learning

10 months at 0xdataH2O - Machine Learning

15 years in HPCSupercomputingModeling

Named ldquo2014 Big Data All-Starrdquo by Fortune Magazine

ArnoCandel

Outlinebull Intro

bull H2O Architecture bull Latest News

bull Live Modeling on Kaggle Datasets

bull African Soil Property Prediction Challenge bull Display Advertising Challenge bull Higgs Boson Machine Learning Challenge

bull Hands-On Training - See slides booklet H2Oai bull Implementation Concepts bull Tips amp Tricks

bull Outlook

About H20 (aka 0xdata)Java Apache v2 Open Source

Join the wwwh2oaicommunity 1 Java Machine Learning in Github

Data Science Workflow

Powering Intelligent Applications

Sparkling Water Spark + H2O

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 3: How to win data science competitions with Deep Learning

Who am I

PhD in Computational Physics 2005from ETH Zurich Switzerland

6 years at SLAC - Accelerator Physics Modeling 2 years at Skytree Inc - Machine Learning

10 months at 0xdataH2O - Machine Learning

15 years in HPCSupercomputingModeling

Named ldquo2014 Big Data All-Starrdquo by Fortune Magazine

ArnoCandel

Outlinebull Intro

bull H2O Architecture bull Latest News

bull Live Modeling on Kaggle Datasets

bull African Soil Property Prediction Challenge bull Display Advertising Challenge bull Higgs Boson Machine Learning Challenge

bull Hands-On Training - See slides booklet H2Oai bull Implementation Concepts bull Tips amp Tricks

bull Outlook

About H20 (aka 0xdata)Java Apache v2 Open Source

Join the wwwh2oaicommunity 1 Java Machine Learning in Github

Data Science Workflow

Powering Intelligent Applications

Sparkling Water Spark + H2O

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 4: How to win data science competitions with Deep Learning

Outlinebull Intro

bull H2O Architecture bull Latest News

bull Live Modeling on Kaggle Datasets

bull African Soil Property Prediction Challenge bull Display Advertising Challenge bull Higgs Boson Machine Learning Challenge

bull Hands-On Training - See slides booklet H2Oai bull Implementation Concepts bull Tips amp Tricks

bull Outlook

About H20 (aka 0xdata)Java Apache v2 Open Source

Join the wwwh2oaicommunity 1 Java Machine Learning in Github

Data Science Workflow

Powering Intelligent Applications

Sparkling Water Spark + H2O

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 5: How to win data science competitions with Deep Learning

About H20 (aka 0xdata)Java Apache v2 Open Source

Join the wwwh2oaicommunity 1 Java Machine Learning in Github

Data Science Workflow

Powering Intelligent Applications

Sparkling Water Spark + H2O

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 6: How to win data science competitions with Deep Learning

Data Science Workflow

Powering Intelligent Applications

Sparkling Water Spark + H2O

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 7: How to win data science competitions with Deep Learning

Powering Intelligent Applications

Sparkling Water Spark + H2O

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for a simple Higgs model:

Quantile | Response rate | Lift       | Cumulative lift
10.00%   | 93.55%        | 1.7683085  | 17.68%
20.00%   | 85.76%        | 1.6210535  | 33.89%
30.00%   | 78.39%        | 1.4817379  | 48.71%
40.00%   | 69.86%        | 1.3206078  | 61.92%
50.00%   | 60.80%        | 1.1492702  | 73.41%
60.00%   | 50.38%        | 0.9523756  | 82.93%
70.00%   | 39.52%        | 0.7470124  | 90.40%
80.00%   | 28.24%        | 0.319122 was not here; see below | 

[Lift Chart: Cumulative Higgs Events vs. Cumulative Events, by quantile (10.00%…100.00%, Total)]

Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Higgs Particle Detection with H2O

Algorithm                      | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model       | -               | 0.596             | 0.684                | default, binomial
Random Forest                  | -               | 0.764             | 0.840                | 50 trees, max depth 50
Gradient Boosted Trees         | 0.73            | 0.753             | 0.839                | 50 trees, max depth 15
Neural Net, 1 layer            | 0.733           | 0.760             | 0.830                | 1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers | 0.836           | 0.850             | -                    | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning, 4 hidden layers | 0.868           | 0.869             | -                    | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning, 6 hidden layers | 0.880           | running           | -                    | 6x500 Rectifier, L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest?

Winning 3 submissions:

1. Bag of 70 dropout (50%) deep neural networks (600,600,600), channel-out activation function, l1=5e-6 and l2=5e-5 (for the first weight matrix only). Each input neuron is connected to only 10 input values (sampled with replacement before training).

2. A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest.

3. Ensemble of 108 rather small deep neural networks: (30,50), (30,50,25), (30,50,50,25).

Ensembles of Deep Learning models are winning! Check out the H2O R ensemble package by Erin LeDell.

Deep Learning Tips & Tricks

General:
- More layers for more complex functions (exp. more non-linearity).
- More neurons per layer to detect finer structure in data ("memorizing").
- Validate model performance on holdout data.
- Add some regularization for less overfitting (lower validation set error).
- Ensembles typically perform better than individual models.

Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accuracy?
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.

(Several of these options are combined in the sketch below.)
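As a concrete illustration, here is a hypothetical h2o.deeplearning call in R that combines several of these options (x, y, train, and valid are assumed to be defined as in the earlier sketches; the specific values are starting points, not tuned settings):

dl <- h2o.deeplearning(x = x, y = y,
                       training_frame = train,
                       validation_frame = valid,                 # validate on holdout data
                       activation = "RectifierWithDropout",      # required to use hidden dropout
                       hidden = c(500, 500, 500),
                       input_dropout_ratio = 0.1,                # input dropout (up to ~20% for noisy inputs)
                       hidden_dropout_ratios = c(0.3, 0.3, 0.3), # hidden dropout (up to ~50%)
                       l1 = 1e-5, l2 = 1e-5,                     # weak L1/L2 regularization
                       max_w2 = 10,                              # cap on the squared incoming weights per neuron
                       epochs = 100)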

You can participate!
- Image Recognition: Convolutional & Pooling Layers (PUB-644)
- Faster Training: GPGPU support (PUB-1013)
- NLP: Recurrent Neural Networks (PUB-1052)
- Unsupervised Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!

More info:
http://h2o.ai
http://h2o.ai/blog
http://bit.ly/deeplearningh2o
http://www.slideshare.net/0xdata
https://www.youtube.com/user/0xdata
https://github.com/h2oai/h2o
https://github.com/h2oai/h2o-dev
@h2oai @ArnoCandel

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View

Register now at http://h2oworld2014.eventbrite.com

More info

Page 8: How to win data science competitions with Deep Learning

Sparkling Water Spark + H2O

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 9: How to win data science competitions with Deep Learning

ndashIon Stoica (CEO Databricks)

ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more

future collaborationrdquo

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 10: How to win data science competitions with Deep Learning

Live Modeling 1

KaggleAfrican Soil Properties Prediction Challenge

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 11: How to win data science competitions with Deep Learning

Any thoughts Questions Challenges Approaches

Still ongoing

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 12: How to win data science competitions with Deep Learning

Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog

Letrsquos follow this tutorial

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 13: How to win data science competitions with Deep Learning

Jo-fairsquos blog continued

launch H2O from R (on your laptop)

import (zipped) data

multi-threading enabled

single node mode (here)

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 14: How to win data science competitions with Deep Learning

Jo-fairsquos blog continued

5 different regression targets

3594 features

trainvalidation split

Train Deep Learning model

save to CSV

predict on test set

check validation error

Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)

BUT Too early to tell Leaderboard scores are computed on ~100 rows

My best ranking was 11th with a single H2O Deep Learning model (per target)

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839


Page 15: How to win data science competitions with Deep Learning


https://github.com/0xdata/h2o

Live Modeling 2: Kaggle Display Advertising Challenge

Real-world dirty data! Missing values, categorical values with >10k factor levels, etc.

45.8M rows, 41 columns, binary classifier

H2O Deep Learning expands categorical factor levels (on the fly, but still): 41 features turn into N = 47,411 input values for the neural network

Full categorical expansion: training is slow (also more communication)

Use the hashing trick to reduce the dimensionality N:

naive: input[key_to_idx(key)] = 1 (one-hot encoding)

hashed: input[hash_fun(key) % M] += 1, where M << N

E.g. M=1000 -> much faster training, rapid convergence!
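To make the hashing trick concrete, here is a minimal R sketch of hashed one-hot encoding (plain R, for illustration only; the bucket count M and the toy hash function are assumptions, not H2O's internal implementation):

# Hash a categorical key into one of M buckets (a naive one-hot encoding would need N >> M columns).
hash_feature <- function(key, M = 1000) {
  # Sum of character codes stands in for a real hash function (e.g. MurmurHash).
  (sum(utf8ToInt(as.character(key))) %% M) + 1
}

# Build a dense M-column design matrix for a vector of categorical values.
hash_encode <- function(values, M = 1000) {
  X <- matrix(0, nrow = length(values), ncol = M)
  for (i in seq_along(values)) {
    j <- hash_feature(values[i], M)
    X[i, j] <- X[i, j] + 1   # collisions simply add up: input[hash_fun(key) % M] += 1
  }
  X
}

# Example: three rows of a high-cardinality categorical column collapse to 1000 inputs.
X <- hash_encode(c("site_12345", "site_98765", "site_12345"), M = 1000)
dim(X)   # 3 x 1000, instead of 3 x (number of distinct sites)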

Note: the hashing trick is also used by the winners of the Kaggle challenge, and the 4th-ranked submission uses an ensemble of Deep Learning models with some feature engineering (drop rare categories, etc.)

Live Modeling 3: Kaggle Higgs Boson Machine Learning Challenge

Higgs vs. Background

Large Hadron Collider: largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456 F, 1 PB/day, etc. Higgs boson discovery (July '12) led to the 2013 Nobel prize!

Images courtesy CERN / LHC

I missed the deadline for Kaggle by 1 day! But I still want to get good numbers!

Higgs Particle Discovery

Higgs: Binary Classification Problem

Conventional methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: must first add derived high-level features (physics formulae)

Algorithm                  | low-level H2O AUC | all features H2O AUC
Generalized Linear Model   | 0.596             | 0.684
Random Forest              | 0.764             | 0.840
Gradient Boosted Trees     | 0.753             | 0.839
Neural Net, 1 hidden layer | 0.760             | 0.830

Metric: AUC = Area under the ROC curve (range 0.5…1, higher is better)

add derived features (low-level -> all features)
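Since AUC is the metric used throughout, here is a small self-contained R helper that computes it from raw scores and 0/1 labels via the rank (Mann-Whitney) statistic; this is only an illustration, as H2O reports AUC itself:

# AUC = probability that a randomly chosen signal event is scored above a randomly chosen background event.
auc <- function(scores, labels) {
  r  <- rank(scores)        # ranks of all scores (ties get the average rank)
  n1 <- sum(labels == 1)    # number of positives (signal)
  n0 <- sum(labels == 0)    # number of negatives (background)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Sanity check: random scores give ~0.5, informative scores give well above 0.5.
set.seed(1)
y <- rbinom(1000, 1, 0.5)
auc(runif(1000), y)                  # ~0.5
auc(y + rnorm(1000, sd = 0.3), y)    # close to 1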

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows

Higgs: Can Deep Learning Do Better?

Let's build an H2O Deep Learning model and find out!

Algorithm                  | low-level H2O AUC | all features H2O AUC
Generalized Linear Model   | 0.596             | 0.684
Random Forest              | 0.764             | 0.840
Gradient Boosted Trees     | 0.753             | 0.839
Neural Net, 1 hidden layer | 0.760             | 0.830
Deep Learning              | <Your guess goes here> |

(reference paper results: baseline 0.733)

Let's build an H2O Deep Learning model and find out!
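For reference, a minimal sketch of what that looks like from R, in the same spirit as the earlier soil-property example. The h2o function names (h2o.init, h2o.importFile, h2o.deeplearning, h2o.auc) are from the h2o R package, but exact argument names vary between package versions (e.g. data vs. training_frame), and the file path and column indices are placeholders:

library(h2o)
h2o.init(nthreads = -1)                          # start or connect to a local H2O instance

higgs <- h2o.importFile("higgs_train_10M.csv")   # placeholder path to the HIGGS data
higgs[, 1] <- as.factor(higgs[, 1])              # response must be categorical for binomial classification

# Response in column 1, 21 low-level features in columns 2:22.
model <- h2o.deeplearning(x = 2:22, y = 1,
                          training_frame = higgs,        # 'data =' in older h2o versions
                          hidden = c(1000, 1000, 1000),  # 3x1000 Rectifier, as in the results table
                          activation = "Rectifier",
                          l2 = 1e-5,
                          epochs = 40)

h2o.auc(model)   # compare against the reference AUCs above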

H2O Steam Scoring Platform

http://server:port/steam/index.html

Higgs Dataset Demo on 10-node cluster: let's score all our H2O models and compare them!

Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than the LHC baseline of AUC=0.73!

Scoring Higgs Models in H2O Steam

Lift chart demo for a simple Higgs model:

Quantile | Response rate | Lift       | Cumulative lift
10.00%   | 93.55%        | 1.7683085  | 17.68%
20.00%   | 85.76%        | 1.6210535  | 33.89%
30.00%   | 78.39%        | 1.4817379  | 48.71%
40.00%   | 69.86%        | 1.3206078  | 61.92%
50.00%   | 60.80%        | 1.1492702  | 73.41%
60.00%   | 50.38%        | 0.9523756  | 82.93%
70.00%   | 39.52%        | 0.7470124  | 90.40%
80.00%   | 28.24%        | 0.5337477  | 95.74%
90.00%   | 16.88%        | 0.319122   | 98.93%
100.00%  | 5.65%         | 0.10676467 | 100.00%
Total    | 52.90%        | 1.0        |
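The table above can be reproduced from any model's predictions with a few lines of R; this sketch assumes a vector of predicted probabilities p and actual 0/1 labels y (hypothetical names, not H2O output objects):

# Per-decile response rate, lift and cumulative lift from predictions p and labels y.
lift_table <- function(p, y, bins = 10) {
  ord  <- order(p, decreasing = TRUE)                          # rank events by predicted probability
  y    <- y[ord]
  q    <- ceiling(seq_along(y) / (length(y) / bins))           # decile (quantile bucket) per row
  resp <- tapply(y, factor(q, levels = seq_len(bins)), mean)   # response rate per decile
  base <- mean(y)                                              # overall response rate (52.90% above)
  data.frame(quantile = paste0(seq_len(bins) * 100 / bins, "%"),
             response = round(100 * resp, 2),
             lift     = round(resp / base, 4),
             cum_lift = round(100 * cumsum(resp) / (base * bins), 2))
}

# lift_table(p, y) reproduces the quantile / response rate / lift / cumulative lift columns.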

[Figure: Lift Chart - Cumulative Higgs Events (0 to 1) vs. quantile (0% to 100%)]

Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Higgs Particle Detection with H2O

Algorithm                      | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model       | -               | 0.596             | 0.684                | default, binomial
Random Forest                  | -               | 0.764             | 0.840                | 50 trees, max depth 50
Gradient Boosted Trees         | 0.73            | 0.753             | 0.839                | 50 trees, max depth 15
Neural Net, 1 layer            | 0.733           | 0.760             | 0.830                | 1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers | 0.836           | 0.850             | -                    | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning, 4 hidden layers | 0.868           | 0.869             | -                    | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning, 6 hidden layers | 0.880           | running           | -                    | 6x500 Rectifier, L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest?

Winning 3 submissions:

1. Bag of 70 dropout (50%) deep neural networks (600,600,600), channel-out activation function, l1=5e-6 and l2=5e-5 (for the first weight matrix only). Each input neuron is connected to only 10 input values (sampled with replacement before training).

2. A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest.

3. Ensemble of 108 rather small deep neural networks: (30,50), (30,50,25), (30,50,50,25).

Ensemble Deep Learning models are winning! Check out the h2o R ensemble package by Erin LeDell!
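As a toy illustration of why ensembles win, here is a hand-rolled probability blend in R; this is not the h2o ensemble package itself, and the prediction vectors p_dl, p_gbm, p_rf on a common validation set are assumed to exist:

# Weighted average of predicted probabilities from several models.
blend <- function(preds, weights = rep(1, length(preds))) {
  w <- weights / sum(weights)
  Reduce(`+`, Map(`*`, preds, w))
}

p_ens <- blend(list(p_dl, p_gbm, p_rf), weights = c(2, 1, 1))  # weight Deep Learning a bit higher
# auc(p_ens, y) is typically at least as good as the best single model; stacking/blending
# with a metalearner (as the Kaggle winners did) usually improves on the plain average.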

Deep Learning Tips & Tricks

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Validate model performance on holdout data.
- Add some regularization for less overfitting (lower validation set error).
- Ensembles typically perform better than individual models.

Specifically:
- Do a grid search to get a feel for convergence, then continue training (see the sketch below).
- Try Tanh/Rectifier; try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy, high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accuracy.
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.
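A minimal grid-search loop over a few of the knobs above (L1/L2, input dropout) might look as follows; as before, the h2o.deeplearning argument names differ across h2o versions, and the frames train and valid are assumed to exist:

grid <- expand.grid(l1 = c(1e-5, 1e-3),
                    l2 = c(1e-5, 1e-3),
                    input_dropout = c(0, 0.1, 0.2))

results <- lapply(seq_len(nrow(grid)), function(i) {
  g <- grid[i, ]
  m <- h2o.deeplearning(x = 2:22, y = 1,
                        training_frame = train, validation_frame = valid,
                        hidden = c(500, 500, 500, 500),   # 4x500 as in the results table
                        activation = "RectifierWithDropout",
                        input_dropout_ratio = g$input_dropout,
                        l1 = g$l1, l2 = g$l2,
                        epochs = 10)                      # short runs first, then continue training
  data.frame(g, auc = h2o.auc(m, valid = TRUE))
})
do.call(rbind, results)   # pick the best settings, then train longer with them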

You can participate!
- Image Recognition: Convolutional & Pooling Layers (PUB-644)
- Faster Training: GPGPU support (PUB-1013)
- NLP: Recurrent Neural Networks (PUB-1052)
- Unsupervised Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!

More info:

http://h2o.ai
http://h2o.ai/blog
http://bit.ly/deeplearningh2o
http://www.slideshare.net/0xdata
https://www.youtube.com/user/0xdata
https://github.com/h2oai/h2o
https://github.com/h2oai/h2o-dev
@h2oai, @ArnoCandel

Join us at H2O World 2014!

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View

Register now at http://h2oworld2014.eventbrite.com

More info

Page 16: How to win data science competitions with Deep Learning

Itrsquos time for Live Modeling Challenges

bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable

We will do the following steps

bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 17: How to win data science competitions with Deep Learning

httpsgithubcom0xdatah2o

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 18: How to win data science competitions with Deep Learning

Live Modeling 2

KaggleDisplay Advertising

Challenge

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 19: How to win data science competitions with Deep Learning

Real-world dirty data Missing values categorical values with gt10k factor levels etc

458M rows 41 columns binary classifier

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 20: How to win data science competitions with Deep Learning

H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network

fully categorical expansiontraining is slow (also more communication)

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 21: How to win data science competitions with Deep Learning

Use hashing trick to reduce dimensionality N

naive input[key_to_idx(key)] = 1 one-hot encoding

hash input[hash_fun(key) M] += 1 where M ltlt N

Eg M=1000 mdashgt much faster training rapid convergence

Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 22: How to win data science competitions with Deep Learning

Live Modeling 3

KaggleHiggs Boson Machine Learning Challenge

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Nature paper httparxivorgpdf14024735v2pdf

Higgs Particle Detection with H2O

Algorithm Paperrsquosl-l AUC

low-level H2O AUC

all features H2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

What methods led the Higgs Boson Kaggle contest

Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks

(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)

2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest

3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)

Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell

Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data

You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644

- Faster Training GPGPU support PUB-1013

- NLP Recurrent Neural Networks PUB-1052

- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

- Win Kaggle competitions with H2O

httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13

Register now at httph2oworld2014eventbritecom

More info

Page 23: How to win data science competitions with Deep Learning

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

Images courtesy CERN LHC

I missed the deadline for Kaggle by 1 day But still want to get good numbers

Higgs Particle Discovery

Higgs Binary Classification Problem

Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

Letrsquos build a H2O Deep Learning model and find out

H2O Steam Scoring Platform

httpserverportsteamindexhtml

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift

1000 9355 17683085 1768

2000 8576 16210535 3389

3000 7839 14817379 4871

4000 6986 13206078 6192

5000 6080 11492702 7341

6000 5038 09523756 8293

7000 3952 07470124 9040

8000 2824 05337477 9574

9000 1688 0319122 9893

10000 565 010676467 10000

Total 5290 10

Lift Chart

Cum

ulat

ive

Hig

gs E

vent

s

00125025

037505

0625075

08751

Cumulative Events

010

00

20

00

30

00

40

00

50

00

60

00

70

00

80

00

90

00

10

000

Tota

l


Page 24: How to win data science competitions with Deep Learning

Higgs Binary Classification Problem

Conventional methods of choice for physicists: Boosted Decision Trees and Neural Networks with 1 hidden layer. BUT: they must first be given derived high-level features (physics formulae).

Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0.596 0.684
Random Forest 0.764 0.840
Gradient Boosted Trees 0.753 0.839
Neural Net, 1 hidden layer 0.760 0.830

Metric: AUC = Area under the ROC curve (range 0.5…1, higher is better)

add derived features

HIGGS UCI dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, test: 500k rows.
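As a reference point, here is a minimal R sketch of how such conventional baselines and their AUCs can be obtained with the h2o package. The file path, frame names (higgs.train, higgs.valid) and column indices are illustrative assumptions, not taken from the slides, and argument names vary slightly across H2O versions:

library(h2o)
h2o.init(nthreads = -1)                      # start a local H2O instance with all cores

higgs <- h2o.importFile("higgs.csv")         # hypothetical path to the HIGGS CSV
higgs[, 1] <- as.factor(higgs[, 1])          # assume column 1 is the 0/1 Higgs label
response   <- 1
predictors <- 2:22                           # assume columns 2..22 are the 21 low-level features

splits      <- h2o.splitFrame(higgs, ratios = 0.8, seed = 1234)
higgs.train <- splits[[1]]
higgs.valid <- splits[[2]]

# Conventional baselines on the low-level features
glm_model <- h2o.glm(x = predictors, y = response,
                     training_frame = higgs.train, family = "binomial")
rf_model  <- h2o.randomForest(x = predictors, y = response,
                              training_frame = higgs.train,
                              ntrees = 50, max_depth = 50)

# AUC on the held-out validation frame
h2o.auc(h2o.performance(glm_model, newdata = higgs.valid))
h2o.auc(h2o.performance(rf_model,  newdata = higgs.valid))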


Page 25: How to win data science competitions with Deep Learning

Higgs: Can Deep Learning Do Better?

Let's build an H2O Deep Learning model and find out!

Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0.596 0.684
Random Forest 0.764 0.840
Gradient Boosted Trees 0.753 0.839
Neural Net, 1 hidden layer 0.760 0.830
Deep Learning <Your guess goes here>

Reference paper's baseline: 0.733

Let's build an H2O Deep Learning model and find out!
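Continuing the sketch from the previous page, a first Deep Learning attempt could look like this. The frame and column names are the illustrative placeholders introduced above; the architecture shown (3x1000 Rectifier, L2=1e-5, 40 epochs) is one of the configurations reported in the results table later in the deck:

dl_model <- h2o.deeplearning(
  x = predictors, y = response,
  training_frame   = higgs.train,
  validation_frame = higgs.valid,
  activation = "Rectifier",
  hidden     = c(1000, 1000, 1000),   # 3 hidden layers of 1000 neurons each
  l2         = 1e-5,
  epochs     = 40)

h2o.auc(h2o.performance(dl_model, newdata = higgs.valid))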


Page 26: How to win data science competitions with Deep Learning

H2O Steam Scoring Platform

http://server:port/steam/index.html

Higgs dataset demo on a 10-node cluster: let's score all our H2O models and compare them!


Page 27: How to win data science competitions with Deep Learning

Live demo on a 10-node cluster: <10 minutes runtime for all algos. Better than the LHC baseline of AUC=0.73!

Scoring Higgs Models in H2O Steam


Page 28: How to win data science competitions with Deep Learning

Lift chart demo for a simple Higgs model:

Quantile Response rate Lift Cumulative lift
10.00% 93.55% 1.7683085 17.68%
20.00% 85.76% 1.6210535 33.89%
30.00% 78.39% 1.4817379 48.71%
40.00% 69.86% 1.3206078 61.92%
50.00% 60.80% 1.1492702 73.41%
60.00% 50.38% 0.9523756 82.93%
70.00% 39.52% 0.7470124 90.40%
80.00% 28.24% 0.5337477 95.74%
90.00% 16.88% 0.3191220 98.93%
100.00% 5.65% 0.10676467 100.00%
Total 52.90% 1.0
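A short R sketch of how such a lift table can be computed from H2O predictions. The model and frame names are the illustrative ones used in the earlier sketches; the convention here matches the table above (per-decile response rate divided by the overall response rate, plus the cumulative share of Higgs events captured):

pred   <- as.data.frame(h2o.predict(dl_model, higgs.valid))$p1       # predicted P(Higgs)
actual <- as.numeric(as.character(as.data.frame(higgs.valid[, response])[[1]]))

ord    <- order(pred, decreasing = TRUE)                 # sort by predicted probability
actual <- actual[ord]
decile <- ceiling(seq_along(actual) / (length(actual) / 10))

resp_rate <- tapply(actual, decile, mean)                # response rate per decile
lift      <- resp_rate / mean(actual)                    # lift vs. overall response rate
cum_lift  <- cumsum(tapply(actual, decile, sum)) / sum(actual)   # cumulative events captured
round(data.frame(quantile = (1:10) / 10, resp_rate, lift, cum_lift), 4)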

Lift Chart: cumulative Higgs events captured (0 to 1) vs. cumulative events scored (0 to 10,000).


Page 29: How to win data science competitions with Deep Learning

Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).

HIGGS UCI dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, test: 500k rows.

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Higgs Particle Detection with H2O

Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net, 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning, 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning, 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
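The strongest completed Deep Learning row above (4x500 Rectifier, L1=L2=1e-5, 300 epochs) could be reproduced along these lines. This is a hedged sketch only: the layer sizes and regularization values come from the table, while the frame and column names remain the illustrative placeholders from the earlier sketches:

dl_4x500 <- h2o.deeplearning(
  x = predictors, y = response,
  training_frame   = higgs.train,
  validation_frame = higgs.valid,
  activation = "Rectifier",
  hidden     = c(500, 500, 500, 500),   # 4 hidden layers of 500 neurons each
  l1 = 1e-5, l2 = 1e-5,
  epochs = 300)

h2o.auc(h2o.performance(dl_4x500, newdata = higgs.valid))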


Page 30: How to win data science competitions with Deep Learning

What methods led the Higgs Boson Kaggle contest?

Winning 3 submissions:

1. Bag of 70 dropout (50%) deep neural networks (600,600,600), channel-out activation function, l1=5e-6 and l2=5e-5 (for the first weight matrix only). Each input neuron is connected to only 10 input values (sampled with replacement before training).

2. A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest.

3. Ensemble of 108 rather small deep neural networks: (30,50), (30,50,25), (30,50,50,25).

Ensembles of Deep Learning models are winning! Check out the h2oEnsemble R package by Erin LeDell.
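The winners relied on stacking and blending. As a much simpler stand-in, here is a sketch of blending H2O models by averaging their predicted probabilities (gbm_model, rf_model, dl_model and the frames are the illustrative objects from the earlier sketches; a proper stacked ensemble, as in h2oEnsemble, would train a metalearner on cross-validated predictions instead of using a plain average):

gbm_model <- h2o.gbm(x = predictors, y = response,
                     training_frame = higgs.train,
                     ntrees = 50, max_depth = 15)

p_dl  <- as.data.frame(h2o.predict(dl_model,  higgs.valid))$p1
p_gbm <- as.data.frame(h2o.predict(gbm_model, higgs.valid))$p1
p_rf  <- as.data.frame(h2o.predict(rf_model,  higgs.valid))$p1

p_blend <- (p_dl + p_gbm + p_rf) / 3     # naive equal-weight blend of predicted probabilities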


Page 31: How to win data science competitions with Deep Learning

Deep Learning Tips & Tricks

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Validate model performance on holdout data.
- Add some regularization for less overfitting (lower validation set error).
- Ensembles typically perform better than individual models.

Specifically:
- Do a grid search to get a feel for convergence, then continue training (a minimal grid-search sketch follows this list).
- Try Tanh/Rectifier; try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accuracy.
- With ADADELTA: try epsilon = 1e-4/1e-6/1e-8/1e-10, rho = 0.9/0.95/0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.
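A minimal grid-search sketch in R; the specific configurations, the short epoch count and the object names are illustrative assumptions (newer h2o versions also provide an h2o.grid helper for the same purpose):

configs <- list(
  list(hidden = c(200, 200),      l1 = 1e-5),
  list(hidden = c(500, 500, 500), l1 = 1e-5),
  list(hidden = c(500, 500, 500), l1 = 1e-3))

aucs <- sapply(configs, function(cfg) {
  m <- h2o.deeplearning(x = predictors, y = response,
                        training_frame   = higgs.train,
                        validation_frame = higgs.valid,
                        activation = "Rectifier",
                        hidden = cfg$hidden,
                        l1 = cfg$l1, l2 = 1e-5,
                        epochs = 10)                    # short runs, just to compare convergence
  h2o.auc(h2o.performance(m, newdata = higgs.valid))
})
aucs   # pick the best configuration, then continue training it with more epochs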


Page 32: How to win data science competitions with Deep Learning

You can participate:

- Image recognition: convolutional & pooling layers (PUB-644)
- Faster training: GPGPU support (PUB-1013)
- NLP: recurrent neural networks (PUB-1052)
- Unsupervised pre-training: stacked auto-encoders (PUB-1014)
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!


Page 33: How to win data science competitions with Deep Learning

More info:

http://h2o.ai | http://h2o.ai/blog | http://bit.ly/deeplearningh2o | http://www.slideshare.net/0xdata | https://www.youtube.com/user/0xdata | https://github.com/h2oai/h2o | https://github.com/h2oai/h2o-dev | @h2oai | @ArnoCandel

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View

Register now at http://h2oworld2014.eventbrite.com