Ben Hamner, Co-founder and CTO, Kaggle at MLconf SF - 11/13/15
@benhamner
Photo by mikebaird, www.flickr.com/photos/mikebaird
Lessons from ML Competitions
Ben Hamner [email protected]
November 13, 2015
Kaggle runs machine learning competitions
We release challenging machine learning problems to our community of 410,000 data scientists
[Chart: submissions per month, Sep-10 to Sep-15; y-axis 0 to 160,000]
Our community makes 100k submissions per month on these competitions
Examples of Machine Learning Competitions
Automatically grading student-written essays
197 entrants, 155 teams, 2,499 submissions over 80 days, $100,000 in prizes
Human-level performance
www.kaggle.com/c/asap-aes
21,000+ essays
Predicting a compound's toxicity given its molecular structure
796 entrants, 703 teams, 8,841 submissions over 91 days, $20,000 in prizes
25.6% improvement over previous accuracy benchmark
www.kaggle.com/c/BioResponse
Personalizing web search results
261 entrants, 194 teams, 3,570 submissions over 91 days, $9,000 in prizes
www.kaggle.com/c/yandex-personalized-web-search-challenge
167,000,000+ logs
Detecting diabetic retinopathy
www.kaggle.com/c/diabetic-retinopathy-detection
88,000+ retina images
854 entrants, 661 teams, 6,999 submissions over 160 days, $100,000 in prizes
85% agreement with a human rater (quadratic weighted kappa)
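The agreement figure above is reported as quadratic weighted kappa. Here is a minimal sketch of that metric in plain Python, assuming integer ratings on a known scale. The function name and arguments are illustrative, and this is not the competition's scoring code.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Chance-corrected agreement between two lists of integer ratings.

    Returns 1.0 for perfect agreement and about 0.0 for chance-level agreement.
    Disagreements are penalized by the squared distance between the ratings.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    if max_rating <= min_rating:
        raise ValueError("the rating scale needs at least two levels")

    n = len(rater_a)
    span = max_rating - min_rating
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = ((i - j) / span) ** 2
            expected = hist_a[i] * hist_b[j] / n
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0:
        # Both raters gave the same single rating to every item.
        return 1.0
    return 1.0 - numerator / denominator
```

Calling `quadratic_weighted_kappa(model_grades, human_grades, 0, 4)` would compare a model against a human rater on a five-level scale. Both list names are placeholders.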
How do machine learning competitions work?
We take a dataset with a target variable – something we’re trying to predict
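| SalePrice | SquareFeet | Type | LotAcres | Beds | Baths |
|---|---|---|---|---|---|
| $88k | 719 | HOME | 1.64 | 1 | 1 |
| $164k | 2017 | APT | | 3 | 2 |
| $72k | 697 | APT | | 1 | 1 |
| $85k | 948 | HOME | 1.02 | 2 | 3 |
| $271k | 3375 | APT | | 3 | 4 |
| $482k | 3968 | APT | | 4 | 4 |
| $88k | 790 | APT | | 1 | 2 |
| $128k | 1341 | HOME | 0.66 | 3 | 3 |
| $235k | 2379 | APT | | 3 | 3 |
| $309k | 2495 | HOME | 0.21 | 3 | 4 |
| $163k | 1356 | APT | | 1 | 1 |
| $375k | 3361 | HOME | 1.64 | 3 | 4 |
| $98k | 1060 | HOME | 0.05 | 1 | 1 |
| $50k | 582 | HOME | 0.61 | 1 | 1 |
| $145k | 1640 | APT | | 2 | 3 |
| $394k | 3546 | HOME | 0.4 | 4 | 4 |
| $82k | 903 | APT | | 2 | 2 |
| $105k | 1096 | HOME | 0.04 | 3 | 4 |
| $129k | 1280 | HOME | 0.15 | 2 | 2 |
| $106k | 1139 | APT | | 1 | 1 |

Predicting the sale price of a home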
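Split the data into two sets, a training set and a test set

| Set | SalePrice | SquareFeet | Type | LotAcres | Beds | Baths |
|---|---|---|---|---|---|---|
| Training | $88k | 719 | HOME | 1.64 | 1 | 1 |
| Training | $164k | 2017 | APT | | 3 | 2 |
| Training | $72k | 697 | APT | | 1 | 1 |
| Training | $85k | 948 | HOME | 1.02 | 2 | 3 |
| Training | $271k | 3375 | APT | | 3 | 4 |
| Training | $482k | 3968 | APT | | 4 | 4 |
| Training | $88k | 790 | APT | | 1 | 2 |
| Training | $128k | 1341 | HOME | 0.66 | 3 | 3 |
| Training | $235k | 2379 | APT | | 3 | 3 |
| Training | $309k | 2495 | HOME | 0.21 | 3 | 4 |
| Training | $163k | 1356 | APT | | 1 | 1 |
| Training | $375k | 3361 | HOME | 1.64 | 3 | 4 |
| Training | $98k | 1060 | HOME | 0.05 | 1 | 1 |
| Test | $50k | 582 | HOME | 0.61 | 1 | 1 |
| Test | $145k | 1640 | APT | | 2 | 3 |
| Test | $394k | 3546 | HOME | 0.4 | 4 | 4 |
| Test | $82k | 903 | APT | | 2 | 2 |
| Test | $105k | 1096 | HOME | 0.04 | 3 | 4 |
| Test | $129k | 1280 | HOME | 0.15 | 2 | 2 |
| Test | $106k | 1139 | APT | | 1 | 1 |

The SalePrice values of the test rows are the solution ("ground truth").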
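Our community gets everything but the solution ("ground truth") on the test set

| Set | SalePrice | SquareFeet | Type | LotAcres | Beds | Baths |
|---|---|---|---|---|---|---|
| Training | $88k | 719 | HOME | 1.64 | 1 | 1 |
| Training | $164k | 2017 | APT | | 3 | 2 |
| Training | $72k | 697 | APT | | 1 | 1 |
| Training | $85k | 948 | HOME | 1.02 | 2 | 3 |
| Training | $271k | 3375 | APT | | 3 | 4 |
| Training | $482k | 3968 | APT | | 4 | 4 |
| Training | $88k | 790 | APT | | 1 | 2 |
| Training | $128k | 1341 | HOME | 0.66 | 3 | 3 |
| Training | $235k | 2379 | APT | | 3 | 3 |
| Training | $309k | 2495 | HOME | 0.21 | 3 | 4 |
| Training | $163k | 1356 | APT | | 1 | 1 |
| Training | $375k | 3361 | HOME | 1.64 | 3 | 4 |
| Training | $98k | 1060 | HOME | 0.05 | 1 | 1 |
| Test | ??? | 582 | HOME | 0.61 | 1 | 1 |
| Test | ??? | 1640 | APT | | 2 | 3 |
| Test | ??? | 3546 | HOME | 0.4 | 4 | 4 |
| Test | ??? | 903 | APT | | 2 | 2 |
| Test | ??? | 1096 | HOME | 0.04 | 3 | 4 |
| Test | ??? | 1280 | HOME | 0.15 | 2 | 2 |
| Test | ??? | 1139 | APT | | 1 | 1 |
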
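The split described above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the random shuffle is an assumption, since the slides do not say how rows are assigned to each set.

```python
import random


def make_competition_split(rows, target, n_test, seed=0):
    """Split rows into a training set, a test set without the target, and a solution.

    rows is a list of dicts and target is the name of the column to predict.
    """
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must leave at least one row on each side")
    shuffled = list(rows)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    test_full, train = shuffled[:n_test], shuffled[n_test:]
    # Participants receive the test features with the target column removed.
    test_public = [{k: v for k, v in row.items() if k != target} for row in test_full]
    # The host keeps the held-out targets as the solution ("ground truth").
    solution = [row[target] for row in test_full]
    return train, test_public, solution


# A few rows from the housing example (prices in thousands of dollars).
houses = [
    {"SalePrice": 88, "SquareFeet": 719, "Type": "HOME", "Beds": 1, "Baths": 1},
    {"SalePrice": 164, "SquareFeet": 2017, "Type": "APT", "Beds": 3, "Baths": 2},
    {"SalePrice": 72, "SquareFeet": 697, "Type": "APT", "Beds": 1, "Baths": 1},
    {"SalePrice": 85, "SquareFeet": 948, "Type": "HOME", "Beds": 2, "Baths": 3},
    {"SalePrice": 271, "SquareFeet": 3375, "Type": "APT", "Beds": 3, "Baths": 4},
]
train, test_public, solution = make_competition_split(houses, "SalePrice", n_test=2)
```

Only `train` and `test_public` would be released. `solution` stays with the host for scoring.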
Competition participants use the training set to learn the relation between the data and the target
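Competition participants apply their models to make predictions on the test set

| Set | SalePrice | SquareFeet | Type | LotAcres | Beds | Baths | Predicted (submission) |
|---|---|---|---|---|---|---|---|
| Training | $88k | 719 | HOME | 1.64 | 1 | 1 | |
| Training | $164k | 2017 | APT | | 3 | 2 | |
| Training | $72k | 697 | APT | | 1 | 1 | |
| Training | $85k | 948 | HOME | 1.02 | 2 | 3 | |
| Training | $271k | 3375 | APT | | 3 | 4 | |
| Training | $482k | 3968 | APT | | 4 | 4 | |
| Training | $88k | 790 | APT | | 1 | 2 | |
| Training | $128k | 1341 | HOME | 0.66 | 3 | 3 | |
| Training | $235k | 2379 | APT | | 3 | 3 | |
| Training | $309k | 2495 | HOME | 0.21 | 3 | 4 | |
| Training | $163k | 1356 | APT | | 1 | 1 | |
| Training | $375k | 3361 | HOME | 1.64 | 3 | 4 | |
| Training | $98k | 1060 | HOME | 0.05 | 1 | 1 | |
| Test | ??? | 582 | HOME | 0.61 | 1 | 1 | $41k |
| Test | ??? | 1640 | APT | | 2 | 3 | $165k |
| Test | ??? | 3546 | HOME | 0.4 | 4 | 4 | $380k |
| Test | ??? | 903 | APT | | 2 | 2 | $76k |
| Test | ??? | 1096 | HOME | 0.04 | 3 | 4 | $128k |
| Test | ??? | 1280 | HOME | 0.15 | 2 | 2 | $115k |
| Test | ??? | 1139 | APT | | 1 | 1 | $94k |
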
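Kaggle compares the submission to the ground truth

| Set | SalePrice | SquareFeet | Type | LotAcres | Beds | Baths | Predicted (submission) | Delta |
|---|---|---|---|---|---|---|---|---|
| Training | $88k | 719 | HOME | 1.64 | 1 | 1 | | |
| Training | $164k | 2017 | APT | | 3 | 2 | | |
| Training | $72k | 697 | APT | | 1 | 1 | | |
| Training | $85k | 948 | HOME | 1.02 | 2 | 3 | | |
| Training | $271k | 3375 | APT | | 3 | 4 | | |
| Training | $482k | 3968 | APT | | 4 | 4 | | |
| Training | $88k | 790 | APT | | 1 | 2 | | |
| Training | $128k | 1341 | HOME | 0.66 | 3 | 3 | | |
| Training | $235k | 2379 | APT | | 3 | 3 | | |
| Training | $309k | 2495 | HOME | 0.21 | 3 | 4 | | |
| Training | $163k | 1356 | APT | | 1 | 1 | | |
| Training | $375k | 3361 | HOME | 1.64 | 3 | 4 | | |
| Training | $98k | 1060 | HOME | 0.05 | 1 | 1 | | |
| Test | $50k | 582 | HOME | 0.61 | 1 | 1 | $41k | -$9k |
| Test | $145k | 1640 | APT | | 2 | 3 | $165k | $20k |
| Test | $394k | 3546 | HOME | 0.4 | 4 | 4 | $380k | -$14k |
| Test | $82k | 903 | APT | | 2 | 2 | $76k | -$6k |
| Test | $105k | 1096 | HOME | 0.04 | 3 | 4 | $128k | $23k |
| Test | $129k | 1280 | HOME | 0.15 | 2 | 2 | $115k | -$14k |
| Test | $106k | 1139 | APT | | 1 | 1 | $94k | -$12k |
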
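Kaggle calculates two scores, one for the public leaderboard and one for the private leaderboard

| Set | SalePrice | SquareFeet | Type | LotAcres | Beds | Baths | Predicted (submission) | Delta |
|---|---|---|---|---|---|---|---|---|
| Training | $88k | 719 | HOME | 1.64 | 1 | 1 | | |
| Training | $164k | 2017 | APT | | 3 | 2 | | |
| Training | $72k | 697 | APT | | 1 | 1 | | |
| Training | $85k | 948 | HOME | 1.02 | 2 | 3 | | |
| Training | $271k | 3375 | APT | | 3 | 4 | | |
| Training | $482k | 3968 | APT | | 4 | 4 | | |
| Training | $88k | 790 | APT | | 1 | 2 | | |
| Training | $128k | 1341 | HOME | 0.66 | 3 | 3 | | |
| Training | $235k | 2379 | APT | | 3 | 3 | | |
| Training | $309k | 2495 | HOME | 0.21 | 3 | 4 | | |
| Training | $163k | 1356 | APT | | 1 | 1 | | |
| Training | $375k | 3361 | HOME | 1.64 | 3 | 4 | | |
| Training | $98k | 1060 | HOME | 0.05 | 1 | 1 | | |
| Test | $50k | 582 | HOME | 0.61 | 1 | 1 | $41k | -$9k |
| Test | $145k | 1640 | APT | | 2 | 3 | $165k | $20k |
| Test | $394k | 3546 | HOME | 0.4 | 4 | 4 | $380k | -$14k |
| Test | $82k | 903 | APT | | 2 | 2 | $76k | -$6k |
| Test | $105k | 1096 | HOME | 0.04 | 3 | 4 | $128k | $23k |
| Test | $129k | 1280 | HOME | 0.15 | 2 | 2 | $115k | -$14k |
| Test | $106k | 1139 | APT | | 1 | 1 | $94k | -$12k |

Mean error: public leaderboard $14k, private leaderboard $15k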
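A rough sketch of two-leaderboard scoring, assuming the metric is mean absolute error and that the host fixes which test rows count toward the public score. The function names and the row partition below are hypothetical. The slides do not say which rows are public, so the scores computed here need not match the slide's illustrative figures.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def leaderboard_scores(solution, submission, public_idx):
    """Score one submission separately on the public and private test rows."""
    if len(solution) != len(submission):
        raise ValueError("submission must have one prediction per test row")
    public = set(public_idx)
    rows = range(len(solution))
    pub = [i for i in rows if i in public]
    priv = [i for i in rows if i not in public]
    if not pub or not priv:
        raise ValueError("both leaderboards need at least one row")
    public_score = mean_absolute_error(
        [solution[i] for i in pub], [submission[i] for i in pub]
    )
    private_score = mean_absolute_error(
        [solution[i] for i in priv], [submission[i] for i in priv]
    )
    return public_score, private_score


# Test-set values from the housing example, in thousands of dollars.
solution = [50, 145, 394, 82, 105, 129, 106]
submission = [41, 165, 380, 76, 128, 115, 94]
# Hypothetical partition: the first three test rows feed the public leaderboard.
public_score, private_score = leaderboard_scores(solution, submission, public_idx=[0, 1, 2])
```

Participants would see only `public_score` during the competition. `private_score` decides the final ranking, which discourages tuning a model to the public rows.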
The participant immediately sees their public score on the public leaderboard
Participants explore the problem and iterate on their models to improve them
At the end, the participant with the best score on the private leaderboard wins
Competition leaderboards
The leaderboard is a powerful mechanism to drive competition
The leaderboard is objective and meritocratic
The leaderboard encourages leapfrogging
The leaderboard encourages iterative improvements over many submissions
This causes the competition to approach the frontier of what’s possible given the data
Many competitions quickly approach a frontier; the most challenging ones take longer
Some applied ML research looks like competitions running over years instead of months
www.kaggle.com/c/BioResponse/leaderboard
yann.lecun.com/exdb/mnist/
One long-running research competition is ImageNet (not hosted on Kaggle)
www.image-net.org
We see a similar progression in ImageNet performance over time as we do in Kaggle competitions
www.image-net.org
Can we do better than competition results?
Looking holistically across all the competitions
At Kaggle, we’ve run hundreds of public machine learning competitions
And over 600 in-class competitions for university students
These competitions have generated over 2,000,000 submissions from around the world
Most of the competitions we’ve run have involved supervised classification or regression
Doing well in competitions
Set up your environment to enable rapid iteration and experimentation
Extract and Select Features
Train Models
Evaluate and Visualize Results
Identify & Handle Data Oddities
Data Preprocessing
As an example, here’s a dashboard one user created to evaluate Diabetic Retinopathy models
http://jeffreydf.github.io/diabetic-retinopathy-detection/
Successful users invest time, thought, and creativity in problem structure and feature extraction
Random Forests and GBMs work very well for many common classification and regression tasks
(Verikas et al. 2011)
Deep learning has been very effective in computer vision competitions we’ve hosted
Caffe, Theano, Torch7, and Keras are four popular open-source libraries that facilitate this
XGBoost and Keras — two ML libraries with great power:effort ratios

| Competition | Type | Winning ML Algorithm |
|---|---|---|
| Liberty Mutual | Regression | XGBoost |
| Caterpillar Tubes | Regression | Keras + XGBoost + Reg. Forest |
| Diabetic Retinopathy | Image | SparseConvNet + RF |
| Avito | CTR | XGBoost |
| Taxi Trajectory 2 | Geostats | Classic neural net |
| Grasp and Lift | EEG | Keras + XGBoost + other CNN |
| Otto Group | Classification | Stacked ensemble of 35 models |
| Facebook IV | Classification | sklearn GBM |

The Boruta feature selection algorithm is robust and reliable
• Wrapper method around Random Forest and its calculated variable importance
• Iteratively trains RFs and runs statistical tests to classify each feature as important or not important
• Widely used in competition-winning models to select a small subset of features for use in training more complex models
• library(Boruta) in R
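The idea in the bullets above can be sketched in plain Python. This is a simplified stand-in and not the Boruta package: absolute Pearson correlation replaces Random Forest variable importance, the binomial test has no multiple-testing correction, and all function names and defaults are illustrative.

```python
import math
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here in place of RF importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def binomial_upper_tail(hits, trials):
    """P(X >= hits) for X ~ Binomial(trials, 0.5)."""
    return sum(math.comb(trials, k) for k in range(hits, trials + 1)) / 2 ** trials


def boruta_like_selection(features, target, n_iter=50, alpha=0.01, seed=0):
    """Label each feature 'important', 'unimportant', or 'tentative'.

    features maps a feature name to its list of values.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features are shuffled copies: same distribution as the
        # original column, but any relation to the target is destroyed.
        best_shadow = 0.0
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_correlation(shadow, target))
        # A real feature scores a hit when it beats the best shadow feature.
        for name, values in features.items():
            if abs_correlation(values, target) > best_shadow:
                hits[name] += 1

    decisions = {}
    for name, h in hits.items():
        if binomial_upper_tail(h, n_iter) < alpha:
            decisions[name] = "important"
        # By symmetry at p = 0.5, P(X <= h) equals P(X >= n_iter - h).
        elif binomial_upper_tail(n_iter - h, n_iter) < alpha:
            decisions[name] = "unimportant"
        else:
            decisions[name] = "tentative"
    return decisions


# Toy data: one feature that determines the target and one unrelated pattern.
features = {
    "signal": [float(i) for i in range(40)],
    "noise": [float((i * 7) % 5) for i in range(40)],
}
target = [3.0 * i + 1.0 for i in range(40)]
decisions = boruta_like_selection(features, target)
```

Features labeled "important" would be kept for training the more complex downstream models.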
Model ensembling usually results in marginal but significant performance gains
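A minimal sketch of the simplest form of ensembling, a weighted average of several models' predictions. The function names and the toy numbers are made up for illustration. The two hypothetical models err in opposite directions, so the gain here is much larger than the marginal gains typical in practice.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def average_ensemble(prediction_lists, weights=None):
    """Blend several models' predictions row by row with a weighted mean."""
    if not prediction_lists:
        raise ValueError("need at least one list of predictions")
    if weights is None:
        weights = [1.0] * len(prediction_lists)
    if len(weights) != len(prediction_lists):
        raise ValueError("need one weight per model")
    n_rows = len(prediction_lists[0])
    if any(len(preds) != n_rows for preds in prediction_lists):
        raise ValueError("all models must predict the same rows")
    total = sum(weights)
    return [
        sum(w * preds[i] for w, preds in zip(weights, prediction_lists)) / total
        for i in range(n_rows)
    ]


# Hypothetical ground truth and predictions from two imperfect models.
actual = [10, 20, 30, 40]
model_a = [12, 18, 33, 37]
model_b = [9, 23, 28, 44]
blend = average_ensemble([model_a, model_b])

error_a = mean_absolute_error(actual, model_a)
error_b = mean_absolute_error(actual, model_b)
error_blend = mean_absolute_error(actual, blend)
```

Averaging helps most when the models' errors are not strongly correlated, which is one reason winning ensembles often combine different model families.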
Data leakage is our (and our users’) #1 challenge
http://www.navy.mil/view_image.asp?id=12495
We’ve also seen some things that competitions aren’t effective at
Competitions don’t typically yield simple and theoretically elegant solutions
*exception – Factorization Machines in KDD Cup 2012
Competitions don’t typically yield production code
http://ora-00001.blogspot.ru/2011/07/mythbusters-stored-procedures-edition.html
Competitions don’t always yield computationally efficient solutions
• Rewards performance without computational and complexity constraints
http://iinustechtips.com/main/topic/193045-need-help-underclocking-d/
Competitions tend to be highly effective at
Optimizing a quantifiable evaluation metric by exploring an enormously broad range of approaches
Fairly and consistently evaluating a variety of approaches on the same problem
• Implementation details matter, which can make it tough to reproduce results in other settings where data and/or code is not open source
• “A quick, simple way to apply machine learning successfully? In your domain, find the stupid baseline that new methods consistently claim to beat. Implement that stupid baseline”
Identifying data quality and leakage issues
Check that ID column isn’t informative
“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.”
- “Leakage in Data Mining: formulation, detection, and avoidance” S Kaufman et al
Time series are tricky
Essay: “This essay got good marks, but as far as I can tell, it's gibberish.”
Human scores: 5/5, 4/5
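One way to check that an ID column isn't informative is to measure how strongly it correlates with the target. The sketch below assumes numeric IDs and uses a hypothetical threshold of 0.3. A linear correlation misses non-linear leaks, so treat it as a first screen only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def id_looks_informative(ids, target, threshold=0.3):
    """Flag an ID column whose value correlates with the target."""
    return abs(pearson(ids, target)) >= threshold


ids = list(range(1, 11))

# Leaky case: rows were sorted by label before IDs were assigned.
sorted_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
leaky = id_looks_informative(ids, sorted_labels)

# Cleaner case: labels alternate, so the ID carries little linear signal.
alternating_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
clean = id_looks_informative(ids, alternating_labels)
```

When the check fires, the usual remedies are to shuffle the rows before assigning IDs or to drop the column from the released data.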
Exposing a specific domain problem to many new communities around the world
Where Kaggle’s going
Kaggle’s mission is to help the world learn from data
http://data-arts.appspot.com/globe/
We’re building a public platform for collaborating on data and analytics results
People
Code
Data
An early alpha version of this is released as Kaggle Scripts
It enables users to immediately access R/Python/Julia environments with data preloaded
Everything created on Kaggle Scripts is published as soon as it’s run
www.kaggle.com/scripts
Reproducing and building on another’s work is simply a click away
We’re starting to enable users to do this on non-competition datasets
Soon, any user will be able to publish data through Kaggle for analysis
Thank you!
Head to www.kaggle.com/scripts to check out code, visualizations, and results from our community
P.S. Are you a software engineer passionate about data? I’m hiring.