How to win data science competitions with Deep Learning
Arno Candel
Silicon Valley Big Data Science Meetup, 0xdata Campus, Mountain View, Oct. 9, 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View
Register now at http://h2oworld2014.eventbrite.com
H2O Deep Learning Booklet
http://bit.ly/deeplearningh2o
Who am I?
PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc. - Machine Learning
10 months at 0xdata/H2O - Machine Learning
15 years in HPC/Supercomputing/Modeling
Named "2014 Big Data All-Star" by Fortune Magazine
@ArnoCandel
Outline
• Intro
• H2O Architecture
• Latest News
• Live Modeling on Kaggle Datasets
  • African Soil Property Prediction Challenge
  • Display Advertising Challenge
  • Higgs Boson Machine Learning Challenge
• Hands-On Training - see slides & booklet (H2O.ai)
• Implementation Concepts
• Tips & Tricks
• Outlook
About H2O (aka 0xdata): Java, Apache v2 Open Source
Join the www.h2o.ai community! #1 Java Machine Learning in GitHub
Data Science Workflow
Powering Intelligent Applications
Sparkling Water: Spark + H2O
"We're thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water, and look forward to more future collaboration."
– Ion Stoica (CEO, Databricks)
Live Modeling 1
Kaggle: African Soil Properties Prediction Challenge
Any thoughts? Questions? Challenges? Approaches?
Still ongoing!
Special thanks to Jo-fai Chow for pointing us to this challenge and for his awesome blog!
Let's follow this tutorial!
Jo-fai's blog, continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fai's blog, continued
5 different regression targets
3,594 features
train/validation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
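For concreteness, here is a minimal sketch of that workflow in the h2o R package. The functions (h2o.init, h2o.importFile, h2o.splitFrame, h2o.deeplearning, h2o.predict, h2o.exportFile) are real, but argument names follow the current API rather than the 2014 release, and the file paths and layer sizes are placeholders, not Jo-fai's actual settings. The target columns (Ca, P, pH, SOC, Sand) and the PIDN id column are from the soil challenge.

```r
library(h2o)

# Start a single-node H2O instance from R, using all cores (nthreads = -1)
h2o.init(nthreads = -1, max_mem_size = "4g")

# Import the (zipped) Kaggle data directly; H2O decompresses on the fly
train <- h2o.importFile("training.zip")      # placeholder path
test  <- h2o.importFile("sorted_test.zip")   # placeholder path

# Train/validation split (e.g. 80/20)
splits <- h2o.splitFrame(train, ratios = 0.8, seed = 42)

# One Deep Learning model per regression target (5 targets in this challenge)
target   <- "Ca"   # one of: Ca, P, pH, SOC, Sand
features <- setdiff(colnames(train), c("PIDN", "Ca", "P", "pH", "SOC", "Sand"))
model <- h2o.deeplearning(x = features, y = target,
                          training_frame   = splits[[1]],
                          validation_frame = splits[[2]],
                          hidden = c(50, 50), epochs = 20)  # illustrative sizes

# Check validation error, predict on the test set, save to CSV
h2o.rmse(model, valid = TRUE)
preds <- h2o.predict(model, test)
h2o.exportFile(preds, "submission.csv")
```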
Jo-fai made it to the very top with H2O Deep Learning! (He probably used ensembles: stacking/blending, etc.)
BUT: Too early to tell! Leaderboard scores are computed on ~100 rows.
My best ranking was 11th with a single H2O Deep Learning model (per target).
It's time for Live Modeling! Challenges:
• no domain knowledge -> need to find/read papers
• small data -> need to apply regularization to avoid overfitting
• need to explore data and do feature engineering if applicable
We will do the following steps (sketched in code after this list):
• launch H2O from R
• import Kaggle data
• train a separate Deep Learning model for each of the 5 targets
• compute cross-validation error (N-fold)
• tune model parameters (grid search)
• do simple feature engineering (dimensionality reduction)
• build an ensemble model
• create a Kaggle submission
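Continuing the sketch above, the cross-validation and grid-search steps might look like this with the current h2o R API (h2o.grid and nfolds exist today; the 2014 release spelled grids differently). The hyperparameter values are illustrative only.

```r
# Hypothetical grid over a few Deep Learning knobs, with 5-fold CV
grid <- h2o.grid("deeplearning",
                 x = features, y = target,
                 training_frame   = splits[[1]],
                 validation_frame = splits[[2]],
                 nfolds = 5,                       # N-fold cross-validation
                 hyper_params = list(
                   hidden = list(c(50, 50), c(100, 100), c(200, 200)),
                   l1     = c(0, 1e-5, 1e-4),
                   input_dropout_ratio = c(0, 0.1)))

# Rank the grid models by validation RMSE and grab the best one
sorted <- h2o.getGrid(grid@grid_id, sort_by = "rmse", decreasing = FALSE)
best   <- h2o.getModel(sorted@model_ids[[1]])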
https://github.com/0xdata/h2o
Live Modeling 2
Kaggle: Display Advertising Challenge
Real-world dirty data: missing values, categorical values with >10k factor levels, etc.
45.8M rows, 41 columns, binary classifier
H2O Deep Learning expands categorical factor levels (on the fly, but still): 41 features turn into N=47,411 input values for the neural network
full categorical expansion: training is slow (also more communication)
Use the hashing trick to reduce dimensionality N:
naive: input[key_to_idx(key)] = 1 (one-hot encoding)
hashed: input[hash_fun(key) % M] += 1, where M << N
e.g., M=1000 -> much faster training, rapid convergence!
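The pseudo-code above in runnable form: a toy base-R illustration of the hashing trick. The string hash and the feature keys are invented for self-containedness; real systems (e.g. Vowpal Wabbit) use a fast hash such as MurmurHash.

```r
# Toy string hash, purely for illustration; returns a bucket index in 1..M
hash_bucket <- function(key, M) {
  h <- 0
  for (ch in utf8ToInt(key)) h <- (h * 31 + ch) %% 2147483647
  (h %% M) + 1
}

M <- 1000                # M << N = 47,411 expanded factor levels
x <- numeric(M)          # hashed input vector for one row
for (key in c("site=example.com", "ad_id=12345", "device=mobile")) {
  i <- hash_bucket(key, M)
  x[i] <- x[i] + 1       # input[hash(key) % M] += 1; collisions are tolerated
}
```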
Note: The hashing trick was also used by the winners of this Kaggle challenge, and the 4th-ranked submission used an ensemble of Deep Learning models with some feature engineering (drop rare categories, etc.).
Live Modeling 3
Kaggle: Higgs Boson Machine Learning Challenge
Higgs vs Background
Large Hadron Collider: largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456°F, 1PB/day, etc. Higgs boson discovery (July '12) led to the 2013 Nobel prize!
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day! But still want to get good numbers!
Higgs Particle Discovery
Higgs: Binary Classification Problem
Conventional methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: Must first add derived high-level features (physics formulae)!
Algorithm                      low-level H2O AUC    all features H2O AUC
Generalized Linear Model       0.596                0.684
Random Forest                  0.764                0.840
Gradient Boosted Trees         0.753                0.839
Neural Net (1 hidden layer)    0.760                0.830
Metric: AUC = area under the ROC curve (range 0.5…1, higher is better)
(slide arrow: "add derived features", low-level -> all features)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features; Train: 10M rows, Test: 500k rows
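Since AUC is the metric throughout, it may help to see how it is computed: by the rank-sum (Wilcoxon) identity, AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative. A small self-contained R version on toy scores (not H2O output):

```r
# AUC via the rank-sum identity; ties get midranks via rank()
auc <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(c(0.9, 0.8, 0.4, 0.3), c(1, 0, 1, 0))   # 0.75: 3 of 4 pos/neg pairs ordered correctly
```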
Higgs: Can Deep Learning Do Better?
Let's build an H2O Deep Learning model and find out!
Algorithm                      low-level H2O AUC    all features H2O AUC
Generalized Linear Model       0.596                0.684
Random Forest                  0.764                0.840
Gradient Boosted Trees         0.753                0.839
Neural Net (1 hidden layer)    0.760                0.830
Deep Learning                  <Your guess goes here>
Reference paper results: baseline 0.733
Let's build an H2O Deep Learning model and find out!
H2O Steam: Scoring Platform
http://<server>:<port>/steam/index.html
Higgs Dataset Demo on 10-node cluster: Let's score all our H2O models and compare them!
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC=0.73!
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs model:
Quantile    Response rate    Lift          Cumulative lift
10.00%      93.55%           1.7683085     17.68%
20.00%      85.76%           1.6210535     33.89%
30.00%      78.39%           1.4817379     48.71%
40.00%      69.86%           1.3206078     61.92%
50.00%      60.80%           1.1492702     73.41%
60.00%      50.38%           0.9523756     82.93%
70.00%      39.52%           0.7470124     90.40%
80.00%      28.24%           0.319122 * (see note)     95.74%
90.00%      16.88%           0.319122      98.93%
100.00%     5.65%            0.10676467    100.00%
Total       52.90%           1.0
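How such a lift table is derived from model output: sort rows by predicted score, cut them into deciles, and compare each decile's response rate to the overall rate (52.90% here). A sketch in base R; the function and column names are invented:

```r
# score: predicted probabilities; actual: 0/1 labels
lift_table <- function(score, actual, bins = 10) {
  ord  <- order(score, decreasing = TRUE)               # best-scored rows first
  q    <- cut(seq_along(ord), bins, labels = FALSE)     # decile 1..bins
  rate <- tapply(actual[ord], q, mean)                  # response rate per decile
  data.frame(quantile  = seq_len(bins) / bins,
             resp_rate = rate,
             lift      = rate / mean(actual),           # e.g. 0.9355 / 0.5290 = 1.768
             cum_lift  = cumsum(rate) / (bins * mean(actual)))  # % of positives captured
}
```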
[Figure: lift chart, cumulative Higgs events (0 to 1) vs. quantile (1000 to 10000 rows, plus Total)]
Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features; Train: 10M rows, Test: 500k rows
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Higgs Particle Detection with H2O
Algorithm                          Paper's l-l AUC   low-level H2O AUC   all features H2O AUC   Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model           -                 0.596               0.684                  default, binomial
Random Forest                      -                 0.764               0.840                  50 trees, max depth 50
Gradient Boosted Trees             0.73              0.753               0.839                  50 trees, max depth 15
Neural Net (1 layer)               0.733             0.760               0.830                  1x300 Rectifier, 100 epochs
Deep Learning (3 hidden layers)    0.836             0.850               -                      3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning (4 hidden layers)    0.868             0.869               -                      4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning (6 hidden layers)    0.880             running             -                      6x500 Rectifier, L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest?
Winning 3 submissions:
1. Bag of 70 dropout (50%) deep neural networks (600,600,600), channel-out activation function, l1=5e-6 and l2=5e-5 (for the first weight matrix only). Each neuron is connected to only 10 input values (sampled with replacement before training).
2. A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest.
3. Ensemble of 108 rather small deep neural networks: (30,50), (30,50,25), (30,50,50,25).
Ensembles of Deep Learning models are winning! Check out the h2oEnsemble R package by Erin LeDell!
Deep Learning Tips & Tricks
General:
• More layers for more complex functions (exponentially more non-linearity).
• More neurons per layer to detect finer structure in data ("memorizing").
• Validate model performance on holdout data.
• Add some regularization for less overfitting (lower validation set error).
• Ensembles typically perform better than individual models.
Specifically:
• Do a grid search to get a feel for convergence, then continue training.
• Try Tanh/Rectifier; try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3.
• Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
• Input dropout is recommended for noisy high-dimensional input.
Distributed:
• More training samples per iteration: faster, but less accuracy.
• With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
• Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
• Try balance_classes = true for datasets with large class imbalance.
• Enable force_load_balance for small datasets.
• Enable replicate_training_data if each node can hold all the data.
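As a rough translation of these tips into code, here is what a heavily regularized run might look like with today's h2o.deeplearning arguments. The parameter names are real in the current R API (they have drifted slightly since 2014), but the values and frame/column names are illustrative, not the deck's exact configuration.

```r
model <- h2o.deeplearning(
  x = predictors, y = response,              # placeholders; response is binary here
  training_frame   = train_hex,
  validation_frame = valid_hex,              # validate on holdout data
  activation = "RectifierWithDropout",       # or "TanhWithDropout"
  hidden = c(500, 500, 500),                 # more layers/neurons = more capacity
  input_dropout_ratio = 0.1,                 # input dropout (up to ~20%)
  hidden_dropout_ratios = c(0.5, 0.5, 0.5),  # hidden dropout (up to ~50%)
  l1 = 1e-5, l2 = 1e-5, max_w2 = 10,         # regularization
  adaptive_rate = TRUE,                      # ADADELTA:
  rho = 0.99, epsilon = 1e-8,
  balance_classes = TRUE,                    # classification with class imbalance
  force_load_balance = TRUE,                 # helps small datasets
  replicate_training_data = TRUE,            # if each node can hold all the data
  epochs = 100)
```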
You can participate!
- Image Recognition: Convolutional & Pooling Layers (PUB-644)
- Faster Training: GPGPU support (PUB-1013)
- NLP: Recurrent Neural Networks (PUB-1052)
- Unsupervised Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!
http://h2o.ai
http://h2o.ai/blog
http://bit.ly/deeplearningh2o
http://www.slideshare.net/0xdata
https://www.youtube.com/user/0xdata
https://github.com/h2oai/h2o
https://github.com/h2oai/h2o-dev
@h2oai @ArnoCandel
Join us at H2O World 2014!
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View
Register now at http://h2oworld2014.eventbrite.com
More info
H2O Deep Learning Booklet
http://bit.ly/deeplearningh2o
Who am I
PhD in Computational Physics 2005from ETH Zurich Switzerland
6 years at SLAC - Accelerator Physics Modeling 2 years at Skytree Inc - Machine Learning
10 months at 0xdataH2O - Machine Learning
15 years in HPCSupercomputingModeling
Named ldquo2014 Big Data All-Starrdquo by Fortune Magazine
ArnoCandel
Outlinebull Intro
bull H2O Architecture bull Latest News
bull Live Modeling on Kaggle Datasets
bull African Soil Property Prediction Challenge bull Display Advertising Challenge bull Higgs Boson Machine Learning Challenge
bull Hands-On Training - See slides booklet H2Oai bull Implementation Concepts bull Tips amp Tricks
bull Outlook
About H20 (aka 0xdata)Java Apache v2 Open Source
Join the wwwh2oaicommunity 1 Java Machine Learning in Github
Data Science Workflow
Powering Intelligent Applications
Sparkling Water Spark + H2O
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Who am I
PhD in Computational Physics 2005from ETH Zurich Switzerland
6 years at SLAC - Accelerator Physics Modeling 2 years at Skytree Inc - Machine Learning
10 months at 0xdataH2O - Machine Learning
15 years in HPCSupercomputingModeling
Named ldquo2014 Big Data All-Starrdquo by Fortune Magazine
ArnoCandel
Outlinebull Intro
bull H2O Architecture bull Latest News
bull Live Modeling on Kaggle Datasets
bull African Soil Property Prediction Challenge bull Display Advertising Challenge bull Higgs Boson Machine Learning Challenge
bull Hands-On Training - See slides booklet H2Oai bull Implementation Concepts bull Tips amp Tricks
bull Outlook
About H20 (aka 0xdata)Java Apache v2 Open Source
Join the wwwh2oaicommunity 1 Java Machine Learning in Github
Data Science Workflow
Powering Intelligent Applications
Sparkling Water Spark + H2O
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Outlinebull Intro
bull H2O Architecture bull Latest News
bull Live Modeling on Kaggle Datasets
bull African Soil Property Prediction Challenge bull Display Advertising Challenge bull Higgs Boson Machine Learning Challenge
bull Hands-On Training - See slides booklet H2Oai bull Implementation Concepts bull Tips amp Tricks
bull Outlook
About H20 (aka 0xdata)Java Apache v2 Open Source
Join the wwwh2oaicommunity 1 Java Machine Learning in Github
Data Science Workflow
Powering Intelligent Applications
Sparkling Water Spark + H2O
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
About H20 (aka 0xdata)Java Apache v2 Open Source
Join the wwwh2oaicommunity 1 Java Machine Learning in Github
Data Science Workflow
Powering Intelligent Applications
Sparkling Water Spark + H2O
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Data Science Workflow
Powering Intelligent Applications
Sparkling Water Spark + H2O
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Powering Intelligent Applications
Sparkling Water Spark + H2O
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate!
- Image Recognition: Convolutional & Pooling Layers (PUB-644)
- Faster Training: GPGPU support (PUB-1013)
- NLP: Recurrent Neural Networks (PUB-1052)
- Unsupervised Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!
More info: http://h2o.ai | http://h2o.ai/blog | http://bit.ly/deeplearningh2o | http://www.slideshare.net/0xdata | https://www.youtube.com/user/0xdata | https://github.com/h2oai/h2o | https://github.com/h2oai/h2o-dev | @h2oai | @ArnoCandel
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View
Register now at http://h2oworld2014.eventbrite.com
More info
Sparkling Water Spark + H2O
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
ndashIon Stoica (CEO Databricks)
ldquoWere thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water and look forward to more
future collaborationrdquo
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Live Modeling 1
KaggleAfrican Soil Properties Prediction Challenge
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Any thoughts Questions Challenges Approaches
Still ongoing
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Special thanks toJo-fai Chow for pointing us to this challenge and for his awesome blog
Letrsquos follow this tutorial
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Jo-fairsquos blog continued
launch H2O from R (on your laptop)
import (zipped) data
multi-threading enabled
single node mode (here)
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Jo-fairsquos blog continued
5 different regression targets
3594 features
trainvalidation split
Train Deep Learning model
save to CSV
predict on test set
check validation error
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Jo-fai made it to the very top with H2O Deep Learning(He probably used ensembles stackingblending etc)
BUT Too early to tell Leaderboard scores are computed on ~100 rows
My best ranking was 11th with a single H2O Deep Learning model (per target)
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
[Figure: lift chart plotting cumulative Higgs events (y-axis, 0 to 1) against quantile (x-axis, 0 to 100%, plus Total)]
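The lift table above can be reconstructed from raw predictions. A minimal base-R sketch, assuming a vector of predicted probabilities p and 0/1 labels y (the toy data at the end is only for illustration):

# Sketch: lift table from predicted probabilities p and 0/1 labels y.
lift_table <- function(p, y, n_bins = 10) {
  ord  <- order(p, decreasing = TRUE)                # best-scored events first
  bin  <- ceiling(seq_along(ord) / (length(ord) / n_bins))
  resp <- tapply(y[ord], bin, mean)                  # response rate per quantile
  lift <- resp / mean(y)                             # ratio vs. overall rate
  cum  <- cumsum(tapply(y[ord], bin, sum)) / sum(y)  # cumulative capture
  data.frame(quantile = seq_len(n_bins) / n_bins,
             response_rate = resp, lift = lift, cumulative_lift = cum)
}

set.seed(1)
p <- runif(1e5); y <- rbinom(1e5, 1, p)   # toy data, illustration only
head(lift_table(p, y))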
Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Higgs Particle Detection with H2O
Algorithm | Paper's low-level AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
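As a sketch, the 3x1000 row from this table translates to the h2o R API roughly as follows (parameter names follow the current h2o package, the 2014-era names differed slightly; train and valid are the frames from the loading sketch above):

# Sketch: "3x1000 Rectifier, L2=1e-5, 40 epochs", low-level features only
# (columns 2-22; column 1 is the label).
dl_3x1000 <- h2o.deeplearning(
  x = 2:22, y = 1,
  training_frame = train, validation_frame = valid,
  hidden = c(1000, 1000, 1000),   # 3 hidden layers of 1000 neurons
  activation = "Rectifier",
  l2 = 1e-5,                      # L2 regularization, as in the table
  epochs = 40)
h2o.auc(h2o.performance(dl_3x1000, newdata = valid))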
What methods led the Higgs Boson Kaggle contest?
Winning 3 submissions:
1. Bag of 70 dropout (50%) deep neural networks (600,600,600), channel-out activation function, L1=5e-6 and L2=5e-5 (for the first weight matrix only). Each input neuron is connected to only 10 input values (sampled with replacement before training).
2. A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest.
3. Ensemble of 108 rather small deep neural networks (30,50), (30,50,25), (30,50,50,25).
Ensemble Deep Learning models are winning! Check out the h2oEnsemble R package by Erin LeDell.
Deep Learning Tips & Tricks

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in data ("memorizing").
- Validate model performance on holdout data.
- Add some regularization for less overfitting (lower validation set error).
- Ensembles typically perform better than individual models.

Specifically:
- Do a grid search to get a feel for convergence, then continue training (see the sketch after this list).
- Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accuracy?
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.
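A grid search over a few of these knobs might look like the following sketch (h2o.grid from the current h2o R package; the specific hyper-parameter values just mirror the ranges above, and train/valid come from the earlier loading sketch):

# Sketch: grid search over regularization and network shape (see tips above).
grid <- h2o.grid("deeplearning",
  x = 2:22, y = 1, training_frame = train, validation_frame = valid,
  activation = "RectifierWithDropout",
  epochs = 10,                     # short runs first, then continue training
  hyper_params = list(
    hidden = list(c(200, 200), c(500, 500, 500)),
    l1     = c(1e-5, 1e-3),
    l2     = c(1e-5, 1e-3),
    input_dropout_ratio = c(0, 0.2)))   # input dropout up to 20%
h2o.getGrid(grid@grid_id, sort_by = "auc", decreasing = TRUE)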
You can participate!
- Image Recognition: Convolutional & Pooling Layers (PUB-644)
- Faster Training: GPGPU support (PUB-1013)
- NLP: Recurrent Neural Networks (PUB-1052)
- Unsupervised Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!
More info:
http://h2o.ai
http://h2o.ai/blog
http://bit.ly/deeplearningh2o
http://www.slideshare.net/0xdata
https://www.youtube.com/user/0xdata
https://github.com/h2oai/h2o
https://github.com/h2oai/h2o-dev
@h2oai / @ArnoCandel

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View
Register now at http://h2oworld2014.eventbrite.com
More info
Itrsquos time for Live Modeling Challenges
bull no domain knowledge -gt need to findread papers bull small data -gt need to apply regularization to avoid overfitting bull need to explore data and do feature engineering if applicable
We will do the following steps
bull launch H2O from R bull import Kaggle data bull train separate Deep Learning model for 5 targets bull compute cross-validation error (N-fold) bull tune model parameters (grid search) bull do simple feature engineering (dimensionality reduction) bull build an ensemble model bull create a Kaggle submission
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
httpsgithubcom0xdatah2o
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Live Modeling 2
KaggleDisplay Advertising
Challenge
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Real-world dirty data Missing values categorical values with gt10k factor levels etc
458M rows 41 columns binary classifier
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
H2O Deep Learning expands categorical factor levels (on the fly but still) 41 features turn into N=47411 input values for neural network
fully categorical expansiontraining is slow (also more communication)
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Use hashing trick to reduce dimensionality N
naive input[key_to_idx(key)] = 1 one-hot encoding
hash input[hash_fun(key) M] += 1 where M ltlt N
Eg M=1000 mdashgt much faster training rapid convergence
Note Hashing trick is also used by winners of Kaggle challenge and the 4-th ranked submission uses an Ensemble of Deep Learning models with some feature engineering (drop rare categories etc)
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Live Modeling 3
KaggleHiggs Boson Machine Learning Challenge
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
Images courtesy CERN LHC
I missed the deadline for Kaggle by 1 day But still want to get good numbers
Higgs Particle Discovery
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Higgs Binary Classification Problem
Conventional methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for simple Higgs modelQuantile Response rate Lift Cumulative lift
1000 9355 17683085 1768
2000 8576 16210535 3389
3000 7839 14817379 4871
4000 6986 13206078 6192
5000 6080 11492702 7341
6000 5038 09523756 8293
7000 3952 07470124 9040
8000 2824 05337477 9574
9000 1688 0319122 9893
10000 565 010676467 10000
Total 5290 10
Lift Chart
Cum
ulat
ive
Hig
gs E
vent
s
00125025
037505
0625075
08751
Cumulative Events
010
00
20
00
30
00
40
00
50
00
60
00
70
00
80
00
90
00
10
000
Tota
l
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Nature paper httparxivorgpdf14024735v2pdf
Higgs Particle Detection with H2O
Algorithm Paperrsquosl-l AUC
low-level H2O AUC
all features H2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
What methods led the Higgs Boson Kaggle contest
Winning 3 submissions1 Bag of 70 dropout (50) deep neural networks
(600600600) channel-out activation function l1=5e-6 and l2=5e-5 (for first weight matrix only) Each input neurons is connected to only 10 input values (sampled with replacement before training)
2 A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest
3 Ensemble of 108 rather small deep neural networks (3050) (305025) (30505025)
Ensemble Deep Learning models are winning Check out h2oRensemble by Erin LeDell
Deep Learning Tips amp TricksGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Validate model performance on holdout data Add some regularization for less overfitting (lower validation set error) Ensembles typically perform better than individual models Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can hold all the data
You can participate- Image Recognition Convolutional amp Pooling Layers PUB-644
- Faster Training GPGPU support PUB-1013
- NLP Recurrent Neural Networks PUB-1052
- Unsupervised Pre-Training Stacked Auto-Encoders PUB-1014
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O
httph2oaihttph2oaibloghttpbitlydeeplearningh2ohttpwwwslidesharenet0xdata httpswwwyoutubecomuser0xdata httpsgithubcomh2oaih2ohttpsgithubcomh2oaih2o-dev h2oaiArnoCandel Join us at H2O World 2014
Join us at H2O World 2014 | November 18th and 19th | Computer History Museum Mountain View 13
Register now at httph2oworld2014eventbritecom
More info
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
Letrsquos build a H2O Deep Learning model and find out
H2O Steam Scoring Platform
httpserverportsteamindexhtml
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
Lift chart demo for a simple Higgs model:
Quantile   Response rate   Lift         Cumulative lift
10.00%     93.55%          1.7683085    17.68%
20.00%     85.76%          1.6210535    33.89%
30.00%     78.39%          1.4817379    48.71%
40.00%     69.86%          1.3206078    61.92%
50.00%     60.80%          1.1492702    73.41%
60.00%     50.38%          0.9523756    82.93%
70.00%     39.52%          0.7470124    90.40%
80.00%     28.24%          0.5337477    95.74%
90.00%     16.88%          0.319122     98.93%
100.00%    5.65%           0.10676467   100.00%
Total      52.90%          1.0
[Lift chart: cumulative Higgs events (y, 0 to 1) vs. score quantile (x, 10.00% to 100.00%, plus Total); see table above]
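How is such a lift table computed? Here is a small base-R sketch (independent of H2O) that bins scored rows into deciles; scores and labels are hypothetical vectors of predicted class-1 probabilities and true 0/1 outcomes:

  lift_table <- function(scores, labels, bins = 10) {
    ord <- order(scores, decreasing = TRUE)          # best-scored rows first
    labels <- labels[ord]
    grp <- ceiling(seq_along(labels) / (length(labels) / bins))   # decile id
    rate <- tapply(labels, grp, mean)                # response rate per decile
    data.frame(quantile        = seq_len(bins) / bins,
               response_rate   = rate,
               lift            = rate / mean(labels),  # vs. overall rate
               cumulative_lift = cumsum(tapply(labels, grp, sum)) / sum(labels))
  }

For the model above, the top decile's 93.55% response rate divided by the overall 52.90% gives the 1.768 lift shown in the table.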
Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).
HIGGS UCI dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.
Nature Communications paper: http://arxiv.org/pdf/1402.4735v2.pdf
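As a rough guide to reproducing one of the table's rows, here is a minimal R sketch for the 4x500 Rectifier configuration; it assumes the current h2o R API (the 2014-era 2.x package used data= instead of training_frame=) and a local HIGGS.csv with the response in column 1:

  library(h2o)
  h2o.init(nthreads = -1)                  # local H2O cluster on all cores

  higgs <- h2o.importFile("HIGGS.csv")     # placeholder path to the UCI file
  higgs[, 1] <- as.factor(higgs[, 1])      # binary target -> classification

  splits <- h2o.splitFrame(higgs, ratios = 0.95, seed = 42)
  train <- splits[[1]]
  valid <- splits[[2]]

  # 4x500 Rectifier, L1 = L2 = 1e-5, 300 epochs (the 4-hidden-layer row)
  dl <- h2o.deeplearning(x = 2:ncol(higgs), y = 1,
                         training_frame = train, validation_frame = valid,
                         hidden = c(500, 500, 500, 500),
                         activation = "Rectifier",
                         l1 = 1e-5, l2 = 1e-5, epochs = 300)

  h2o.auc(h2o.performance(dl, newdata = valid))   # compare with the AUCs above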
What methods led the Higgs Boson Kaggle contest?
Winning 3 submissions: 1. Bag of 70 dropout (50%) deep neural networks
(600,600,600), channel-out activation function, l1=5e-6 and l2=5e-5 (for the first weight matrix only). Each input neuron is connected to only 10 input values (sampled with replacement before training).
2. A blend of boosted decision tree ensembles constructed using Regularized Greedy Forest.
3. Ensemble of 108 rather small deep neural networks: (30,50), (30,50,25), (30,50,50,25).
Ensemble Deep Learning models are winning! Check out the h2oEnsemble R package by Erin LeDell.
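That package does proper Super Learner stacking; as a hedged illustration of the underlying idea, here is a naive two-model blend in R that simply averages class-1 probabilities (reusing train and valid from the earlier sketch):

  gbm <- h2o.gbm(x = 2:ncol(train), y = 1, training_frame = train,
                 ntrees = 50, max_depth = 15)            # diverse base model 1
  dl2 <- h2o.deeplearning(x = 2:ncol(train), y = 1, training_frame = train,
                          hidden = c(300), epochs = 10)  # diverse base model 2

  # naive blend: average the predicted probabilities of the positive class
  p_blend <- (h2o.predict(gbm, valid)$p1 + h2o.predict(dl2, valid)$p1) / 2
  head(as.data.frame(p_blend))             # pull a sample into R to inspect

A learned metalearner (as in h2oEnsemble) typically beats such equal-weight averaging.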
Deep Learning Tips & Tricks

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in data ("memorizing").
- Validate model performance on holdout data.
- Add some regularization for less overfitting (lower validation set error).
- Ensembles typically perform better than individual models.

Specifically:
- Do a grid search to get a feel for convergence, then continue training (see the sketch after this list).
- Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accuracy.
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.
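To make the grid-search tip concrete, here is a sketch with h2o.grid from the h2o 3.x R API (the 2014 package instead accepted vectors of values directly in h2o.deeplearning); the values are examples drawn from the ranges above, not recommendations:

  hyper <- list(hidden = list(c(200, 200), c(500, 500), c(500, 500, 500)),
                l1     = c(1e-5, 1e-4, 1e-3),
                input_dropout_ratio = c(0, 0.1, 0.2))

  grid <- h2o.grid("deeplearning", grid_id = "higgs_dl_grid",
                   x = 2:ncol(train), y = 1,
                   training_frame = train, validation_frame = valid,
                   activation = "RectifierWithDropout",  # required for dropout
                   epochs = 1,               # short runs: feel for convergence
                   hyper_params = hyper)

  h2o.getGrid("higgs_dl_grid", sort_by = "auc", decreasing = TRUE)

Then continue training the best model with more epochs (e.g. via checkpointing).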
You can participate:
- Image Recognition: Convolutional & Pooling Layers (PUB-644)
- Faster Training: GPGPU support (PUB-1013)
- NLP: Recurrent Neural Networks (PUB-1052)
- Unsupervised Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
- Win Kaggle competitions with H2O!
More info:
http://h2o.ai | http://h2o.ai/blog | http://bit.ly/deeplearningh2o | http://www.slideshare.net/0xdata | https://www.youtube.com/user/0xdata | https://github.com/h2oai/h2o | https://github.com/h2oai/h2o-dev | @h2oai | @ArnoCandel

Join us at H2O World 2014 | November 18th and 19th | Computer History Museum, Mountain View
Register now at http://h2oworld2014.eventbrite.com