Lessons Learned from Running Hundreds of Kaggle Competitions


Transcript of Lessons Learned from Running Hundreds of Kaggle Competitions

Page 1

Photo by mikebaird, www.flickr.com/photos/mikebaird

Lessons from ML Competitions
Ben Hamner, [email protected]

November 13, 2015

Page 2

Kaggle runs machine learning competitions

Page 3

We release challenging machine learning problems to our community of 410,000 data scientists

Page 4

[Chart: monthly submission volume, Sep-10 through Sep-15; y-axis 0 to 160,000 submissions per month]

Our community makes 100k submissions per month on these competitions

Page 5

Examples of Machine Learning Competitions

Page 6

Automatically grading student-written essays

197 entrants, 155 teams, 2,499 submissions over 80 days, $100,000 in prizes

Human-level performance

www.kaggle.com/c/asap-aes

21,000+ essays

Page 7

Predicting a compound’s toxicity given its molecular structure

796 entrants, 703 teams, 8,841 submissions over 91 days, $20,000 in prizes

25.6% improvement over previous accuracy benchmark

www.kaggle.com/c/BioResponse

Page 8

Personalizing web search results

261 entrants, 194 teams, 3,570 submissions over 91 days, $9,000 in prizes

www.kaggle.com/c/yandex-personalized-web-search-challenge

167,000,000+ logs

Page 9

Detecting diabetic retinopathy

www.kaggle.com/c/diabetic-retinopathy-detection

88,000+ retina images

854 entrants, 661 teams, 6,999 submissions over 160 days, $100,000 in prizes

85% agreement with a human rater (quadratic weighted kappa)
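The exact scoring code isn't shown in the deck, but quadratic weighted kappa is easy to compute; a minimal sketch, assuming a recent scikit-learn (the grades below are made up, not competition data):

```python
# Quadratic weighted kappa: chance-corrected agreement between two raters,
# where a two-grade disagreement costs four times a one-grade disagreement.
from sklearn.metrics import cohen_kappa_score

human_grades = [0, 2, 1, 4, 3, 2, 0, 1]  # hypothetical retinopathy grades, 0-4
model_grades = [0, 2, 2, 4, 3, 1, 0, 1]

kappa = cohen_kappa_score(human_grades, model_grades, weights="quadratic")
print(f"quadratic weighted kappa: {kappa:.3f}")
```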

Page 10

How do machine learning competitions work?

Page 11

We take a dataset with a target variable – something we’re trying to predict

SalePrice  SquareFeet  Type  LotAcres  Beds  Baths
$88k       719         HOME  1.64      1     1
$164k      2017        APT   -         3     2
$72k       697         APT   -         1     1
$85k       948         HOME  1.02      2     3
$271k      3375        APT   -         3     4
$482k      3968        APT   -         4     4
$88k       790         APT   -         1     2
$128k      1341        HOME  0.66      3     3
$235k      2379        APT   -         3     3
$309k      2495        HOME  0.21      3     4
$163k      1356        APT   -         1     1
$375k      3361        HOME  1.64      3     4
$98k       1060        HOME  0.05      1     1
$50k       582         HOME  0.61      1     1
$145k      1640        APT   -         2     3
$394k      3546        HOME  0.4       4     4
$82k       903         APT   -         2     2
$105k      1096        HOME  0.04      3     4
$129k      1280        HOME  0.15      2     2
$106k      1139        APT   -         1     1

Predicting the sale price of a home

Page 12

Training

Test

Split the data into two sets, a training set and a test set

Solution (“ground truth”)

(Housing table from Page 11: the first 13 rows form the training set, the last 7 the test set.)
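In code, the split looks roughly like this; a minimal sketch with pandas and scikit-learn, using a few illustrative rows rather than Kaggle's internal pipeline:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A few rows of the housing table from Page 11 (prices in $k).
homes = pd.DataFrame({
    "SalePrice":  [88, 164, 72, 85, 271, 482, 88, 128, 235, 309],
    "SquareFeet": [719, 2017, 697, 948, 3375, 3968, 790, 1341, 2379, 2495],
    "Beds":       [1, 3, 1, 2, 3, 4, 1, 3, 3, 3],
    "Baths":      [1, 2, 1, 3, 4, 4, 2, 3, 3, 4],
})

train, test = train_test_split(homes, test_size=0.35, random_state=0)
solution = test["SalePrice"]            # withheld "ground truth"
test = test.drop(columns="SalePrice")   # what participants receive
```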

Page 13

Training

Test

Our community gets everything but the solution on the test set

Solution (“ground truth”)

(Housing table from Page 11, with SalePrice shown as “???” for the 7 test rows.)

Page 14

Competition participants use the training set to learn the relation between the data and the target

Page 15

Training

Test

Competition participants apply their models to make predictions on the test set

(Housing table with the 7 test-row SalePrice values hidden, as on Page 13.)

Submission

Predicted: $41k, $165k, $380k, $76k, $128k, $115k, $94k
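Those two steps, fitting on the training rows and predicting the test rows, might look like the sketch below; the toy data and submission layout are illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy train/test frames with the slide's numeric columns (prices in $k).
train = pd.DataFrame({"SquareFeet": [719, 2017, 697, 948, 3375],
                      "Beds": [1, 3, 1, 2, 3], "Baths": [1, 2, 1, 3, 4],
                      "SalePrice": [88, 164, 72, 85, 271]})
test = pd.DataFrame({"SquareFeet": [582, 1640, 3546],
                     "Beds": [1, 2, 4], "Baths": [1, 3, 4]})

features = ["SquareFeet", "Beds", "Baths"]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[features], train["SalePrice"])    # learn the relation

submission = pd.DataFrame({"Id": test.index,
                           "SalePrice": model.predict(test[features])})
submission.to_csv("submission.csv", index=False)  # this file gets uploaded
```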

Page 16

Training

Test

Kaggle compares the submission to the ground truth

(Full housing table from Page 11, including the test-row SalePrice values.)

Submission

Predicted: $41k, $165k, $380k, $76k, $128k, $115k, $94k

Delta: -$9k, +$20k, -$14k, -$6k, +$13k, -$14k, -$12k

Page 17

Training

Test

Kaggle calculates two scores, one for the public leaderboard and one for the private leaderboard

(Full housing table from Page 11.)

Submission

Predicted: $41k, $165k, $380k, $76k, $128k, $115k, $94k

Mean error: public leaderboard $14k, private leaderboard $15k

Delta: -$9k, +$20k, -$14k, -$6k, +$13k, -$14k, -$12k
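Scoring is elementwise comparison plus an average per leaderboard. A sketch with this slide's numbers; which rows feed the public versus the private leaderboard is hidden from participants, so the 4/3 split here is an assumption for illustration:

```python
import numpy as np

actual    = np.array([50, 145, 394, 82, 105, 129, 106])  # $k, ground truth
predicted = np.array([41, 165, 380, 76, 128, 115, 94])   # $k, submission

delta = predicted - actual
public_rows, private_rows = slice(0, 4), slice(4, 7)     # hidden split

public_score  = np.abs(delta[public_rows]).mean()   # shown immediately
private_score = np.abs(delta[private_rows]).mean()  # decides the winner
print(f"public ${public_score:.0f}k, private ${private_score:.0f}k")
```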

Page 18

The participant immediately sees their public score on the public leaderboard

Page 19

Participants explore the problem and iterate on their models to improve them

Page 20

At the end, the participant with the best score on the private leaderboard wins

Page 21

Competition leaderboards

Page 22

The leaderboard is a powerful mechanism to drive competition

Page 23

The leaderboard is objective and meritocratic

Page 24

The leaderboard encourages leapfrogging

Page 25

The leaderboard encourages iterative improvements over many submissions

Page 26

This causes the competition to approach the frontier of what’s possible given the data

Page 27

Many competitions quickly approach a frontier; the most challenging ones take longer

Page 28

Some applied ML research looks like competitions running over years instead of months

www.kaggle.com/c/BioResponse/leaderboard
yann.lecun.com/exdb/mnist/

Page 29

One long-running research competition is ImageNet (not hosted on Kaggle)

www.image-net.org

Page 30

We see a similar progression in ImageNet performance over time as we do in Kaggle competitions

www.image-net.org

Page 31

Can we do better than competition results?

Page 32

Looking holistically across all the competitions

Page 33

At Kaggle, we’ve run hundreds of public machine learning competitions

Page 34

And over 600 in-class competitions for university students

Page 35

These competitions have generated over 2,000,000 submissions from around the world

Page 36

Most of the competitions we’ve run have involved supervised classification or regression

Page 37

Doing well in competitions

Page 38

Set up your environment to enable rapid iteration and experimentation

[Iteration loop: Extract and Select Features → Train Models → Evaluate and Visualize Results → Identify & Handle Data Oddities → Data Preprocessing → repeat]

Page 39

As an example, here’s a dashboard one user created to evaluate Diabetic Retinopathy models

http://jeffreydf.github.io/diabetic-retinopathy-detection/

Page 40

Successful users invest time, thought, and creativity in problem structure and feature extraction

Page 41

Random Forests and GBMs work very well for many common classification and regression tasks

(Verikas et al. 2011)
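Both come ready to use in scikit-learn with sensible defaults; a minimal sketch on synthetic data (the dataset and settings are illustrative):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)  # nonlinear toy data

for model in (RandomForestRegressor(n_estimators=300, random_state=0),
              GradientBoostingRegressor(n_estimators=300, random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()  # R^2 by default
    print(type(model).__name__, round(score, 3))
```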

Page 42

Deep learning has been very effective in computer vision competitions we’ve hosted

caffe, theano, torch7, and keras are four popular open source libraries that facilitate this
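For flavor, here is a minimal convolutional net in Keras; this uses today's tensorflow.keras API rather than the 2015-era one, and the architecture and five-class output are illustrative, not a winning model:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),         # small RGB images
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),   # e.g. five severity grades
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```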

Page 43

XGBoost and Keras — two ML libraries with great power:effort ratios

Competition           Type            Winning ML Algorithm
Liberty Mutual        Regression      XGBoost
Caterpillar Tubes     Regression      Keras + XGBoost + Reg. Forest
Diabetic Retinopathy  Image           SparseConvNet + RF
Avito                 CTR             XGBoost
Taxi Trajectory 2     Geostats        Classic neural net
Grasp and Lift        EEG             Keras + XGBoost + other CNN
Otto Group            Classification  Stacked ensemble of 35 models
Facebook IV           Classification  sklearn GBM
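Basic XGBoost usage through its scikit-learn wrapper, sketched on synthetic data; the parameters are illustrative, not any particular winner's configuration:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=30, noise=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
model.fit(X_tr, y_tr)
print("validation R^2:", round(model.score(X_val, y_val), 3))
```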

Page 44

(Slide repeats the table from Page 43.)

Page 45

The Boruta feature selection algorithm is robust and reliable

• A wrapper method around Random Forest and its calculated variable importance

• Iteratively trains RFs and runs statistical tests to flag each feature as important or unimportant

• Widely used in competition-winning models to select a small subset of features for training more complex models

• library(Boruta) in R (the core shadow-feature idea is sketched below)
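A minimal sketch of one Boruta iteration using only scikit-learn and NumPy (rng.permuted needs numpy >= 1.20); the real algorithm repeats this many times and applies statistical tests, so prefer the packaged implementations in practice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# "Shadow" features: each column shuffled independently, destroying any real
# relationship with the target while keeping each column's distribution.
rng = np.random.default_rng(0)
shadows = rng.permuted(X, axis=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

real_imp, shadow_imp = np.split(rf.feature_importances_, 2)
keep = real_imp > shadow_imp.max()  # beat the best shadow to count as important
print("tentatively important features:", np.flatnonzero(keep))
```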

Page 46

Model ensembling usually results in marginal but significant performance gains
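The simplest version is averaging predictions from diverse models, as in this sketch on synthetic data; the models and dataset are illustrative, and the blend is often, though not always, better than any single member:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=800, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [RandomForestRegressor(n_estimators=300, random_state=0),
          GradientBoostingRegressor(n_estimators=300, random_state=0)]
preds = [m.fit(X_tr, y_tr).predict(X_te) for m in models]

for m, p in zip(models, preds):
    print(type(m).__name__, round(mean_absolute_error(y_te, p), 2))
print("blend:", round(mean_absolute_error(y_te, np.mean(preds, axis=0)), 2))
```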

Page 47

Data leakage is our (and our users’) #1 challenge

http://www.navy.mil/view_image.asp?id=12495

Page 48

We’ve also seen some things that competitions aren’t effective at

Page 49

Competitions don’t typically yield simple and theoretically elegant solutions

*exception – Factorization Machines in KDD Cup 2012

Page 50

Competitions don’t typically yield production code

http://ora-00001.blogspot.ru/2011/07/mythbusters-stored-procedures-edition.html

Page 51

Competitions don’t always yield computationally efficient solutions

• They reward performance without computational or complexity constraints

http://linustechtips.com/main/topic/193045-need-help-underclocking-d/

Page 52

Competitions tend to be highly effective at

Page 53

Optimizing a quantifiable evaluation metric by exploring an enormously broad range of approaches

Page 54

Fairly and consistently evaluating a variety of approaches on the same problem

• Implementation details matter, which can make it tough to reproduce results in other settings where the data and/or code are not open source

• “A quick, simple way to apply machine learning successfully? In your domain, find the stupid baseline that new methods consistently claim to beat. Implement that stupid baseline”
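One way to implement that baseline: scikit-learn's Dummy estimators predict the majority class (or the mean, for regression) and set the floor any real model must beat; the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: roughly 90% of rows belong to class 0.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("baseline accuracy:", baseline.score(X_te, y_te))  # ~0.9 before any ML
```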

Page 55

Identifying data quality and leakage issues

Check that the ID column isn’t informative (sketched below)

“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.”

- “Leakage in Data Mining: Formulation, Detection, and Avoidance,” S. Kaufman et al.

Time series are tricky

Essay: “This essay got good marks, but as far as I can tell, it's gibberish.” Human scores: 5/5, 4/5
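The ID check above is easy to automate: train a model on the ID column alone, and any skill it shows is leakage. A sketch with deliberately leaky synthetic IDs; a real check would use your competition's actual ID column:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
# Rank of each row when sorted by target: IDs that secretly encode the label.
ids = y.argsort().argsort().reshape(-1, 1)

score = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                        ids, y, cv=5).mean()
print("accuracy from IDs alone:", round(score, 3))  # >> 0.5 signals leakage
```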

Page 56

Exposing a specific domain problem to many new communities around the world

Page 57

Where Kaggle’s going

Page 58

Kaggle’s mission is to help the world learn from data

http://data-arts.appspot.com/globe/

Page 59

We’re building a public platform for collaborating on data and analytics results

[Diagram: People, Code, Data]

Page 60

An early alpha version of this has been released as Kaggle Scripts

Page 61

It enables users to immediately access R/Python/Julia environments with data preloaded

Page 62

Everything created on Kaggle Scripts is published as soon as it’s run

www.kaggle.com/scripts

Page 63

Reproducing and building on another’s work is simply a click away

Page 64

We’re starting to enable users to do this on non-competition datasets

Page 65

Soon, any user will be able to publish data through Kaggle for analysis

Page 66

Thank you!

Head to www.kaggle.com/scripts to check out code, visualizations, and results from our community.

P.S. Are you a software engineer passionate about data? I'm hiring.