Ben Hamner, CTO, Kaggle, at MLconf NYC 2017

The Future Of Kaggle: Where we came from and where we’re going

kaggle.com/benhamner
@benhamner

Our mission is to help the world learn from data

We got started running supervised learning competitions

We’re now doing this at scale

Since 2010, we’ve run

● 240 general competitions
● 1,610 university classroom competitions

This has attracted a talented and diverse community

We’ve taught machine learning to hundreds of thousands of people

We’ve pushed the state of the art forward

We’ve learned a tremendous amount along the way

● What techniques work well
● How people win competitions
● Why our community participates
● What major pain points data scientists hit
● How we can help data scientists ameliorate these pain points

Great data scientists optimize the entire ML workflow

GBMs and deep neural networks are incredibly effective

Model ensembling almost always ekes out gains
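
A minimal sketch of why averaging helps, assuming three hypothetical models whose errors are independent. The models here are simulated as the true target plus random noise; no real competition data or Kaggle API is involved.

```python
import random


def mse(preds, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def average_ensemble(prediction_lists):
    """Average the predictions of several models, row by row."""
    return [sum(row) / len(row) for row in zip(*prediction_lists)]


rng = random.Random(0)
targets = [rng.uniform(0, 10) for _ in range(1000)]

# Assumption: each "model" is simulated as truth + independent Gaussian noise.
model_preds = [[t + rng.gauss(0, 1.0) for t in targets] for _ in range(3)]

individual_errors = [mse(preds, targets) for preds in model_preds]
ensemble_error = mse(average_ensemble(model_preds), targets)

print("individual MSEs:", individual_errors)
print("ensemble MSE:", ensemble_error)
```

In practice model errors are correlated, so the gain is usually smaller than in this idealized simulation.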

Successful participants avoid overfitting
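
One common guard against overfitting is k-fold cross-validation, where every model is scored only on rows it was not trained on. The sketch below is an illustration of that general technique, not necessarily the approach the talk had in mind; `k_fold_indices` is a hypothetical helper name.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))
```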

We’ve seen major pain points

Today’s practices are like programming in assembly

Compared with software engineering tools, ML tools feel like they came from the Stone Age

Accessing data is tough

Getting high quality data is even tougher

Cleaning data is painful

Essay: “This essay got good marks, but as far as I can tell, it's gibberish.”

Human Scores: 5/5, 4/5

Data leakage is common and subtle
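
One subtle form of leakage, shown here as an illustrative sketch with made-up numbers: fitting a preprocessing step (a standard scaler) on training and test rows together lets test-set statistics influence the training features. The helper names are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, standard deviation) used for standardization."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    """Standardize values with a previously fitted scaler."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Made-up feature values, for illustration only.
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0, 200.0]

# Leaky: the statistics include the test rows.
leaky_scaler = fit_scaler(train + test)

# Clean: fit on the training rows only, then reuse that scaler on the test rows.
clean_scaler = fit_scaler(train)
train_scaled = transform(train, clean_scaler)
test_scaled = transform(test, clean_scaler)

print("leaky mean:", leaky_scaler[0], "clean mean:", clean_scaler[0])
```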

Going from research to production can be brutal

Reproducing work takes days to months

We can do better than this

Accessing data should be seamless

You should never need to repeat work others have done

A single command should reproduce everything start-to-end

> make all
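
The same idea as `make all` can be sketched in Python: declare each step's dependencies once and let a single entry point run everything in order. The step names and the `run_all` function are hypothetical, and the steps only record that they ran.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "download": set(),
    "clean": {"download"},
    "features": {"clean"},
    "train": {"features"},
    "report": {"train"},
}


def run_all(pipeline, actions):
    """Run every step once, after all of its dependencies have run."""
    order = list(TopologicalSorter(pipeline).static_order())
    for step in order:
        actions[step]()
    return order


log = []
# Placeholder actions: a real pipeline would call scripts or functions here.
actions = {name: (lambda n=name: log.append(n)) for name in PIPELINE}

order = run_all(PIPELINE, actions)
print(" -> ".join(order))
```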

Making a successful one-line update should take seconds

Helpful metadata shouldn’t stay buried in minds or emails

Best practices should be easy defaults, not complicated custom contraptions

We’re changing this

We’ve launched two new products: Kernels and Datasets

We recently joined Google Cloud to accelerate our growth

Datasets, Kernels, and Competitions have an exciting future

The world’s data will be accessible with a common interface

That captures the important code and metadata on top of it

A central searchable hub for your organization’s data

A kernel is an atom of reproducible data science

Kernels will be your continuous integration server for data

We’ve started running code competitions

This will enable exciting new competition formats

● Backtested time series
● Live data feeds
● Reinforcement learning
● Generative modeling
● Adversarial learning
● Machine learning under computational constraints
● Sensitive datasets
