Kaggle presentation at SF Data Mining Meetup - Trulia June 23, 2015
Uploaded by gpano
Kaggle: The home of data science
GE Flight Quest 2
Optimize flight routes based on weather & traffic
$250,000, 122 teams

Hewlett Foundation: Automated Essay Scoring
Develop an automated scoring algorithm for student-written essays
$100,000, 155 teams

Allstate Purchase Prediction Challenge
Predict a purchased policy based on transaction history
$50,000, 1,570 teams

Merck Molecular Activity Challenge
Help develop safe and effective medicines by predicting molecular activity
$40,000, 236 teams

Higgs Boson Machine Learning Challenge
Use the ATLAS experiment to identify the Higgs boson
$13,000, 1,302 teams
Training Data
Age   Income    Default
58    $95,824   True
73    $20,708   False
59    $82,152   False
66    $25,334   True

Test Data
Age   Income    Default
73    $53,445
61    $36,679
47    $90,422
44    $79,040
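The training/test split above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are copied from the slide, but the nearest-neighbor rule and the helper names (column_stats, make_scaler, predict_default) are hypothetical choices made for brevity, not a method described in the presentation. The point is the shape of the task: fit on rows where Default is known, then fill in the withheld Default column for the test rows.

```python
from math import sqrt

# Training data from the slide: (age, income in dollars, defaulted?)
TRAIN = [
    (58, 95824, True),
    (73, 20708, False),
    (59, 82152, False),
    (66, 25334, True),
]

# Test data from the slide: the Default column is withheld.
TEST = [
    (73, 53445),
    (61, 36679),
    (47, 90422),
    (44, 79040),
]


def column_stats(rows, index):
    """Return (mean, standard deviation) of one feature column."""
    values = [row[index] for row in rows]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # Fall back to 1.0 so a constant column does not cause division by zero.
    return mean, sqrt(variance) or 1.0


def make_scaler(rows):
    """Build a function that standardizes (age, income) with training statistics."""
    stats = [column_stats(rows, i) for i in range(2)]

    def scale(age, income):
        return tuple((v - m) / s for v, (m, s) in zip((age, income), stats))

    return scale


def predict_default(train, test):
    """Assumed model: copy the Default label of the nearest training row."""
    scale = make_scaler(train)
    scaled_train = [(scale(age, income), label) for age, income, label in train]
    predictions = []
    for age, income in test:
        point = scale(age, income)
        _, label = min(
            scaled_train,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
        )
        predictions.append(label)
    return predictions


predictions = predict_default(TRAIN, TEST)
for (age, income), label in zip(TEST, predictions):
    print(age, income, label)
```

In a competition setting the true test labels stay hidden from participants, so predictions like these would be scored by the host rather than checked locally.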
The Kaggle Approach
Mapping Dark Matter: Competition Progress
[Chart: accuracy (lower is better) from Week 1 to the end of the competition, with marked values .0150 and .0170. Leader shown: Martin O’Leary, PhD student in Glaciology, Cambridge U]
“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
“The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”
Mapping Dark Matter: Competition Progress
[Chart: accuracy (lower is better) from Week 1 to the end of the competition, with marked values .0150 and .0170. Leaders shown:]
Martin O’Leary, PhD student in Glaciology, Cambridge U
Marius Cobzarenco, grad student in computer vision, UC London
Ali Haissaine & Eu Jin Loc, Signature Verification, Qatar U & Grad Student @ Deloitte
deepZot (David Kirkby & Daniel Margala), Particle Physicist & Cosmologist
Other
EXAMPLE ESSAY QUESTION:
We all understand the benefits of laughter. For example, someone once said, “Laughter is the shortest distance between two people.” Many other people believe that laughter is an important part of any relationship. Tell a true story in which laughter was one element or part.
We can work with difficult data
Mayo Clinic: Seizure detection from EEG readings
The winning model correctly predicted seizures 82% of the time. Until that point, researchers had struggled to develop an algorithm that did better than chance.
We’ve worked with many of the world’s largest companies
Industries: Healthcare & Pharma, Consumer Internet, Finance, Industrial, Consumer Marketing, Oil & Gas
Example clients: $50b+ Beverage Co., Global Bank, Top Credit Card Issuer, Top 5 E&P, Top 20 E&P
Community of over 320K data scientists who submit over 100K machine learning models per month
[Chart: Monthly Submissions to Kaggle Competitions, May-10 to May-15; vertical axis from 0 to 160,000]
Feature engineering matters most
Good software engineering practices and robust statistical methods are key
80% of data science is grunt work and only 20% involves deep thinking
A good pipeline makes data scientists more productive and their work higher quality and more enjoyable
Our workflow environment will be the central repository for all data science work in a company
Anthony Goldbloom [email protected] 650 283 9781