OpenML - BCS Aberdeen Branch · FRICTION-LESS ENVIRONMENT FOR MACHINE LEARNING RESEARCH Organized:...

OpenML TOWARDS NETWORKED, AUTOMATED MACHINE LEARNING Joaquin Vanschoren (TU/e) 2015

Page 1:

OpenML

TOWARDS NETWORKED, AUTOMATED MACHINE LEARNING

Joaquin Vanschoren (TU/e) 2015

Page 2:

1610

GALILEO GALILEI DISCOVERS SATURN’S RINGS

‘SMAISMRMILMEPOETALEUMIBUNENUGTTAUIRAS’

Page 3:

Research different. Royal Society: Take nobody’s word for it

Open sharing of findings and methods: scientific journal

Reputation-based economy

Page 4:

AFTER 300 YEARS, IS THE PRINTING PRESS STILL THE BEST MEDIUM FOR MACHINE LEARNING?

• Code, data too complex (published separately)

• Experiment details scant

• Results unactionable, hard to reproduce, reuse

• Papers not updatable

• Slow, limited impact tracking

• Publication bias

Page 5:

Gaps in data-driven science

Domain scientists: doubts about latest/best techniques, run simple models on complex data, small (biased) collaborations

Data scientists: don’t speak the language, don’t know how to access scientific databases, run complex models on simple data

Industry: small companies limited by lack of access to data/expertise

[Diagram labels: Data scientist, Domain scientist, Tool developer, data + code]

Page 6:

Page 7:

85% of medical research resources are wasted: associations/effects are false or exaggerated, translation into applications is inefficient

Research findings are less likely to be true if:
- Studies are smaller, with few collaborators
- Effect sizes are smaller
- There is flexibility in designs, definitions, outcomes, analytical modes
- Teams are chasing statistical significance

Increase credibility:
- Large-scale interdisciplinary collaboration
- Replication culture, reproducibility
- Registration/sharing of data
- Better statistical methods, models
- Better study design, training

Page 8:

PLASMA PHYSICS

SURFACE WAVES AND PROCESSES

[Figure labels: solar wind, bow shock, magnetosheath, magnetopause]

Nicolas Aunai

Page 9:

LABELLING (CATALOGUING) OF EVENTS STILL DONE MANUALLY

Page 10:

Research different. Polymaths: Solve math problems through massive collaboration (not competition)

Broadcast question, combine many minds to solve it

Solved hard problems in weeks

Many (joint) publications

Page 11:

Research different. SDSS: Robotic telescope, data publicly online (SkyServer)

+1 million distinct users vs. 10.000 astronomers

Broadcast data, allow many minds to ask the right questions

Thousands of papers, citations

Next: Synoptic Telescope

Page 12:

Research different. How do you label a million galaxies?

Offer right tools so that anybody can contribute, instantly

Many novel discoveries by scientists and citizens

More? Read ‘Networked science’ by M. Nielsen

Page 13:

Why? Designed serendipity
- Broadcasting data fosters spontaneous, unexpected discoveries
- What’s hard for one scientist is easy for another: connect minds

How? Remove friction
- Organized body of compatible scientific data (and tools) online
- Micro-contributions: seconds, not days
- Easy, organised communication
- Track who did what, give credit

Page 14:

OpenML

A REALTIME, WORLDWIDE LAB

Page 15:

FRICTION-LESS ENVIRONMENT FOR MACHINE LEARNING RESEARCH

Organized: Experiments connected to data, code, people. Reproducible.

Easy to use: Automated download/upload within your ML environment

Micro-contributions: Upload single dataset, algorithm, experiment

Easy communication: Online discussions per dataset, algorithm, experiment

Reputation: Auto-tracking of downloads, reuse, likes.

Real time: Share and reuse instantly, openly or in circles of trusted people

Page 16:

Page 17:

Data from various sources analysed and organised online for easy access

Scientists broadcast data by uploading or linking from existing repos. OpenML will automatically check and analyze the data, compute characteristics, annotate, version and index it for easy search

Page 18:

• Search by keywords or properties

• Through website or API

• Filters

• Tagging

Page 19:

• Search on keywords or properties

• Wiki-like descriptions

• Analysis and visualisation of features

• Auto-calculation of large range of meta-features

Page 20:

Scientific tasks that can be interpreted by tools and solved collaboratively

Tasks: containers with all data, goals, procedures. Machine-readable: tools can automatically download data, use correct procedures, and upload results. Creates realtime, collaborative data mining challenges.
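
A minimal sketch of what a machine-readable task container could look like. The class name, field names and example values are assumptions made for illustration; they are not OpenML's actual task schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task container: data, goal and procedure in one record."""
    task_id: int
    task_type: str
    dataset_name: str
    target: str
    estimation_procedure: str
    evaluation_measure: str


def to_json(task):
    """Serialise a task so that any tool can read it."""
    return json.dumps(asdict(task), sort_keys=True)


def from_json(text):
    """Rebuild the task from its serialised form."""
    return Task(**json.loads(text))


# Illustrative values only.
task = Task(
    task_id=1,
    task_type="classification",
    dataset_name="click_prediction",
    target="click",
    estimation_procedure="10-fold cross-validation",
    evaluation_measure="area under ROC curve",
)
payload = to_json(task)
restored = from_json(payload)
```

Because the procedure and the measure travel with the data reference, a tool that reads the record has what it needs to produce comparable results.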

Page 21:

• Example: Classification on click prediction dataset, using 10-fold CV and AUC

• People submit results (e.g. predictions)

• Server-side evaluation (many measures)

• All results organized online, per algorithm, parameter setting

• Online visualizations: every dot is a run plotted by score
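
The example task above (10-fold CV scored by AUC) can be sketched in plain Python. This is an illustrative sketch, not OpenML's server-side evaluation code: the function names, the pooled-prediction scoring and the toy threshold learner are all assumptions.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(xs, ys, fit, k=10, seed=0):
    """Score each example with a model that never saw it, then pool the scores.

    fit(train_x, train_y) must return a function mapping x to a score.
    """
    scores = [0.0] * len(xs)
    for fold in kfold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            scores[i] = model(xs[i])
    return auc(ys, scores)


def fit_threshold(train_x, train_y):
    """Toy learner (assumption): score is the distance above the midpoint of the class means."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x - midpoint


# Synthetic, trivially separable data used only to exercise the code.
xs = [i / 100 for i in range(100)]
ys = [1 if x > 0.5 else 0 for x in xs]
cv_auc = cross_validated_auc(xs, ys, fit_threshold)
```

Fixing the fold assignment with a seed is what makes results from different submitters comparable on the same task.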

Page 22:

• Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions

• Collaborative: all code and data available, learn from others, form teams

• Real-time: who submits first gets credit, others can improve immediately

Page 23:

Machine learning flows (code) that can solve tasks and report results.

Flows: wrappers that read tasks, return required results. Scientists upload code or link from existing repositories/libraries. Tool integrations allow automated data download, flow upload and experiment logging and sharing.

Page 24:

• WEKA/MOA plugins: automatically load tasks, export results

REST API + Java, R, Python APIs

• RapidMiner plugin: new operators to load tasks, export results and subworkflow

• R/Python interfaces: functions to down/upload data, code, results in few lines of code

Page 25:

• All results obtained with same flow organised online

• Results linked to data sets, parameter settings -> trends/comparisons

• Visualisations (dots are models, ranked by score, colored by parameters)

Page 26:

Experiments auto-uploaded, linked to data, flows and authors, and organised for easy reuse

Runs uploaded by flows, contain fully reproducible results for all tasks. OpenML evaluates and organizes all results online for discovery, comparison and reuse

Page 27:

• Detailed run info

• Author, data, flow, parameter settings, result files, …

• Evaluation details (e.g., results per sample)

Page 28:

OpenML Community

Jan-Jun 2015

Used all over the world (and still in beta)

Great open source community, on GitHub

450+ active users, many more passive ones

1000s of datasets, flows, 450000+ runs

Page 29:

Things we’re working on

Projects (e-papers)
- Online counterpart of a paper, linkable
- Merge data, code, experiments (new or old)
- Public or shared within circle

Circles
- Create collaborations with trusted researchers
- Share results within team prior to publication

Altmetrics
- Measure real impact of your work
- Reuse, downloads, likes of data, code, projects,…
- Online reputation (more sharing)

Page 30:

Things we’re working on (please join)

Algorithm selection, hyperparameter tuning
- Upload dataset, system recommends techniques
- Model-based optimisation techniques
- Continuous improvement (learns from past)

Distributed computing
- Create jobs online, run anywhere you want
- Locally, clusters, clouds

Page 31:

Things we’re working on (please join)

Algorithm/code connections
- Improved APIs (R, Java, Python, CLI,…)
- Your favourite tool integrated

Data repository connections
- Wonderful open data repos (e.g. rOpenSci)
- More data formats, data set analysis

Statistical analysis
- Proper significance testing in comparisons
- Recommend evaluation techniques (e.g. CV)

Online task creation
- Definition of scientific tasks
- Freeform tasks or server-side support

Page 32:

Towards OpenML in education

Page 33:

Towards OpenML in education

Page 34:

Bridging gaps in data-driven science

Collaboratory: bring data scientists and domain scientists together online (and their data and tools)

Easy, large-scale collab: Extract actionable datasets, key tools. Scientists share data and get help, data scientists can test techniques on many current datasets.

Real-time collab: share experiments automatically, discuss online. Automate experimentation.

[Diagram labels: Data scientist, Domain scientist, Tool developer, data + code]

Page 36:

Networked data(-driven) science

Data repositories

Code repositories

Human scientists

meta-data, models, evaluations

data, code, experiment sharing, commenting

API

Page 37:

Automating machine learning

Human scientists

meta-data, models, evaluations

API

Automated processes:
- Data cleaning
- Algorithm Selection
- Parameter Optimization
- Workflow construction
- Post processing

Page 38:

SEMI-AUTOMATED MACHINE LEARNING

[Pipeline diagram labels: Data Collection, Data Cleaning, Metric Selection, Algorithm Selection, Parameter Optimization, Post-processing, Online evaluation, Data Coding]

Thanks to R. Caruana

Page 41:

MACHINE LEARNING PIPELINE

[Pipeline diagram labels: Data Collection, Data Cleaning, Metric Selection, Algorithm Selection, Parameter Optimization, Post-processing, Online evaluation, Data Coding]

Manual heuristic search: surprisingly suboptimal

Grid search: only effective with very small number of parameters

Random search: better with larger number of parameters

Bayesian Optimization: better with very large number of parameters

Critical with modern (low bias) algorithms: boosting, deep learning,…

Thanks to R. Caruana
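
To make the contrast between grid search and random search concrete, here is a small sketch. The function names, the parameter names and the toy loss are assumptions for illustration; the loss stands in for a validation error and is not taken from the talk.

```python
import itertools
import random


def grid_search(objective, grid):
    """Try every combination of the listed values.

    The number of evaluations is the product of the list lengths, so it
    grows exponentially with the number of parameters.
    """
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = objective(config)
        if best is None or score < best[0]:
            best = (score, config)
    return best


def random_search(objective, bounds, n_trials, seed=0):
    """Sample each parameter uniformly; the budget is fixed by n_trials."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {name: rng.uniform(lo, hi)
                  for name, (lo, hi) in sorted(bounds.items())}
        score = objective(config)
        if best is None or score < best[0]:
            best = (score, config)
    return best


def toy_loss(config):
    """Made-up loss: one parameter matters far more than the other."""
    return ((config["learning_rate"] - 0.3) ** 2
            + 0.01 * (config["regularization"] - 0.5) ** 2)


grid_score, grid_config = grid_search(
    toy_loss,
    {"learning_rate": [0.0, 0.5, 1.0], "regularization": [0.0, 0.5, 1.0]},
)
rand_score, rand_config = random_search(
    toy_loss,
    {"learning_rate": (0.0, 1.0), "regularization": (0.0, 1.0)},
    n_trials=9,
)
```

With the same budget of nine evaluations, the grid tries only three distinct learning rates, while random search tries nine. That is one common explanation for why random search copes better as the number of parameters grows.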

Pages 42-48:

Bayesian Optimization

1. Initial sample

2. Posterior model

3. Exploration strategy (acquisition function)

4. Optimize it

5. Sample new data, update model

6. Repeat

Thanks to Matthew Hoffmann
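
The six steps can be sketched end to end in plain Python for a one-dimensional problem. Everything specific here is an assumption chosen to keep the example short: a zero-mean Gaussian process with a squared-exponential kernel as the posterior model, a lower confidence bound as the acquisition function, and a fixed candidate grid to optimize it. The slides do not say which choices the original figures used.

```python
import math
import random


def rbf(a, b, length=0.3):
    """Squared-exponential kernel; the length scale is an arbitrary choice."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - rest) / M[i][i]
    return x


def posterior(xs, ys, x_new, jitter=1e-6):
    """Step 2: GP posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 0.0))


def bayes_opt(objective, lo, hi, n_init=3, n_iter=12, kappa=2.0, seed=0):
    """Minimise objective on [lo, hi]; returns best x, best y and all sampled x."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_init)]   # 1. initial sample
    ys = [objective(x) for x in xs]
    grid = [lo + (hi - lo) * i / 100 for i in range(101)]
    for _ in range(n_iter):                             # 6. repeat
        # Skip points already sampled so the kernel matrix stays well behaved.
        candidates = [g for g in grid if all(abs(g - x) > 1e-9 for x in xs)]

        def lower_confidence_bound(g):                  # 3. acquisition function
            mean, std = posterior(xs, ys, g)
            return mean - kappa * std

        x_next = min(candidates, key=lower_confidence_bound)  # 4. optimize it
        xs.append(x_next)                               # 5. sample new data;
        ys.append(objective(x_next))                    #    the model is refit next loop
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


best_x, best_y, history = bayes_opt(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```

The `kappa` parameter trades exploration against exploitation: larger values favour points where the model is uncertain, smaller values favour points where the predicted value is already low.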

Page 49:

MACHINE LEARNING PIPELINE

[Pipeline diagram labels: Data Collection, Data Cleaning, Metric Selection, Algorithm Selection, Parameter Optimization, Post-processing, Online evaluation, Data Coding]

Meta-learning: Learn link between data properties & algorithm performance

Active testing: Sequentially predict algorithm, learn from evaluation score

Recommender systems: Evaluations are ‘ratings’ of datasets by algorithms

Data properties are crucial, depend on data domain (streams, graphs,…)

Best combined with parameter optimisation
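
One simple way to learn a link between data properties and algorithm performance is a nearest-neighbour vote over previously seen datasets. The sketch below assumes that approach. The meta-feature names, the algorithm names `algo_A` and `algo_B`, and every number in the table are invented for illustration and are not results from OpenML.

```python
import math

# Invented meta-dataset: (meta-features of a past dataset, best algorithm on it).
META_DATA = [
    ({"log_n_instances": 2.0, "log_n_features": 1.0}, "algo_A"),
    ({"log_n_instances": 2.2, "log_n_features": 1.1}, "algo_A"),
    ({"log_n_instances": 2.1, "log_n_features": 0.9}, "algo_B"),
    ({"log_n_instances": 5.0, "log_n_features": 3.0}, "algo_B"),
    ({"log_n_instances": 5.2, "log_n_features": 3.1}, "algo_B"),
    ({"log_n_instances": 4.9, "log_n_features": 2.8}, "algo_A"),
]


def distance(a, b):
    """Euclidean distance between two meta-feature dictionaries with equal keys."""
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in sorted(a)))


def recommend(meta_features, meta_data, k=3):
    """Return the algorithm that won most often on the k most similar datasets."""
    nearest = sorted(meta_data,
                     key=lambda record: distance(meta_features, record[0]))[:k]
    votes = {}
    for _, algorithm in nearest:
        votes[algorithm] = votes.get(algorithm, 0) + 1
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(votes), key=lambda name: votes[name])


small = recommend({"log_n_instances": 2.1, "log_n_features": 1.0}, META_DATA)
large = recommend({"log_n_instances": 5.0, "log_n_features": 3.0}, META_DATA)
```

The recommendation is only as good as the meta-features, which is the point the slide makes about data properties being crucial and domain dependent.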

Page 50:

OpenML in drug discovery

ChEMBL database

SMILES

Molecular properties (e.g. MW, LogP)

Fingerprints (e.g. FP2, FP3, FP4, MACCS)

MW       LogP    TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9
377.435   3.883   77.85  1   1   0   0   0   0   0   0   0
341.361   3.411   74.73  1   1   0   1   0   0   0   0   0   …
197.188  -2.089  103.78  1   1   0   1   0   0   0   1   0
346.813   4.705   50.70  1   0   0   1   0   0   0   0   0
…

10.000+ regression datasets

Page 51:

OpenML in drug discovery

Predict best algorithm with meta-models

[Figure: algorithms compared: cforest, ctree, earth, fnn, gbm, glmnet, lasso, lm, nnet, pcr, plsr, rforest, ridge, rtree. Fingerprints with listed values: ECFP4_1024* 0.697*, FCFP4_1024* 0.427*, ECFP6_1024+ 0.725+, FCFP6_1024* 0.627*. Meta-model decision tree with splits msdfeat < 0.1729, mutualinfo < 0.7182, skewresp < 0.3315, nentropyfeat < 1.926, netcharge1 >= 0.9999, netcharge3 < 13.85, hydphob21 < −0.09666 and leaves glmnet, fnn, pcr, rforest, ridge]

I. Olier et al. MetaSel@ECMLPKDD 2015

Page 52:

Fast algorithm selection

[Plot: accuracy loss (0 to 0.1) against time in seconds (1 to 65536); curves: Best on Sample, Average Rank, PCC, PCC (A3R, r = 1)]

By evaluating algorithms on smaller data samples, multi-objective evaluation

J. van Rijn et al. IDA 2015

Page 53:

MACHINE LEARNING PIPELINE

[Pipeline diagram labels: Data Collection, Data Cleaning, Metric Selection, Algorithm Selection, Parameter Optimization, Post-processing, Online evaluation, Data Coding]

Detect variable types, missing values, anomalies, …

Auto-coding: Different learning algorithms require different coding

Detect changes in data (that affect model): sensors break, human error,…

Feedback loop: implementing a model changes the data it was trained on

Leakage: model is trained on data it should not see

Thanks to R. Caruana
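
As a small illustration of the first item, detecting variable types and missing values, here is a sketch that profiles one column of raw string values. The set of missing-value markers and the two-way numeric/nominal split are simplifying assumptions, not OpenML's actual analysis.

```python
# Markers treated as missing; this list is an assumption and would need
# adapting to the data source.
MISSING_MARKERS = {"", "?", "NA", "NaN", "null"}


def is_number(text):
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile_column(values):
    """Summarise one column of raw strings: inferred type, missing and distinct counts."""
    stripped = [value.strip() for value in values]
    present = [value for value in stripped if value not in MISSING_MARKERS]
    if not present:
        kind = "empty"
    elif all(is_number(value) for value in present):
        kind = "numeric"
    else:
        kind = "nominal"
    return {
        "type": kind,
        "missing": len(stripped) - len(present),
        "distinct": len(set(present)),
    }


numeric_profile = profile_column(["1.5", "2", "?", " 3 "])
nominal_profile = profile_column(["red", "blue", "", "red"])
```

A profile like this is also the kind of input an auto-coding step would need, since nominal columns usually have to be encoded differently for different learning algorithms.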

Page 54:

Towards an AI for data-driven research

Symbiotic AI: learns from thousands of human data scientists how to analyse data. Learns which workflows work well on which data.

Collaborates with scientists: takes care of tedious, error-prone tasks: clean data, find similar data, try state-of-the-art, algorithm selection, configuration,…

Couple human expertise and machine learning: offer the right information at the right time to the right person, so that she can make informed decisions based on clear evidence.

Page 55:

Towards an AI for data-driven research

AI (novice) - what?
Upload dataset, give 1 day to return best possible model

Man-Computer Symbiosis - why?
Answer specific questions by scientist. Combine machine learning with talents of human mind (hunches, expertise). Explains why recommendation made.

Expert level (expert) - how?
Expert asks system how it did something, system explains all details, shows code, data

Page 56:

Towards an AI for data-driven research

Look at my new data set :)
I analysed it, here’s a full report.

Can you remove the outliers?
I removed them, does this look ok?

Yes, now I want to predict X
OK, let me build you some optimized classification models.

Are there similar data sets?
Here’s a ranked list, and some papers.

What’s the state-of-the-art?
I’m running all state-of-the-art techniques, I’ll send a report soon.

What’s the difference with algo A and B?
Here’s a comparison. B looks most promising on high-dimensional data

Page 57:

THANK YOU

Joaquin Vanschoren

Jan van Rijn

Bernd Bischl

Matthias Feurer

Michel Lang

Nenad Tomašev

Giuseppe Casalicchio

Luis Torgo

You?

#OpenML