Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux


Scikit-learn: an incomplete yearly review

Gael Varoquaux

scikit

machine learning in Python

Trends with

1 The library

2 The community

G Varoquaux 2

1 The library


1 In 0.18: oldies but goodies

New cross-validation objects (V.R. Rajagopalan)

Old API, sklearn.cross_validation:

    from sklearn.cross_validation import StratifiedKFold

    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]

New API, sklearn.model_selection:

    from sklearn.model_selection import StratifiedKFold

    cv = StratifiedKFold(n_splits=2)
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]

⇒ better nested-CV
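Because the new splitters are built without the data, the same kind of object can drive both loops of a nested cross-validation. A minimal sketch, assuming the iris dataset, an SVC, and a small grid over C (none of these come from the talk):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Splitters no longer take y at construction time; the data is passed
# later, when split(X, y) is called by the tools that use them.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose C. Outer loop: score the whole selection procedure.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```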

PCA == Randomized PCA (G. Patrini)

Heuristic to switch PCA to random linear algebra

Fights global warming

Huge speed gains for biggish data
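A sketch of what the merged PCA looks like from the user side, assuming synthetic, nearly low-rank data (the data and sizes are made up for illustration). The `svd_solver` parameter selects the solver explicitly; the default, "auto", lets the library choose from the data shape and `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 2000 samples, 200 features, close to rank 10.
X = rng.randn(2000, 10) @ rng.randn(10, 200) + 0.01 * rng.randn(2000, 200)

# Exact solver and randomized solver, requested explicitly.
full = PCA(n_components=5, svd_solver="full").fit(X)
randomized = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

# Default: svd_solver="auto", the heuristic picks a solver.
auto = PCA(n_components=5).fit(X)

print(full.explained_variance_ratio_)
print(randomized.explained_variance_ratio_)
```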

1 Coming soon: merged in master

Memory in pipeline (G. Lemaitre)

    make_pipeline(PCA(), LinearSVC(), memory='/tmp/joe')

Limits recomputation (e.g. in grid search)
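A minimal sketch of the grid-search case, assuming the digits dataset, a temporary cache directory in place of '/tmp/joe', and a small grid over the SVM's C. Since only the last step's parameter changes, the fitted PCA can be reused from the cache for each training fold.

```python
import shutil
import tempfile

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Hypothetical cache location; any writable directory works.
cache_dir = tempfile.mkdtemp()

# Fitted transformers are cached on disk, keyed on their inputs.
pipe = make_pipeline(PCA(n_components=16), LinearSVC(dual=False), memory=cache_dir)

# Only the classifier's C varies, so the PCA step is not refit per candidate.
grid = GridSearchCV(pipe, {"linearsvc__C": [0.01, 0.1, 1.0]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)

# Remove the cache once it is no longer needed.
shutil.rmtree(cache_dir)
```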

New solver for logistic regression: SAGA (A. Mensch)

    linear_model.LogisticRegression(solver='saga')

Fast linear model on biggish data

[Figure: training objective for SAGA and Liblinear on RCV1]
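A small usage sketch for the new solver, assuming synthetic data rather than RCV1 (dataset, sizes and `max_iter` are illustrative choices, not the benchmark from the slide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=5000, n_features=50, n_informative=10, random_state=0
)

# Stochastic solvers such as SAGA converge faster when features share a scale.
X = StandardScaler().fit_transform(X)

clf = LogisticRegression(solver="saga", max_iter=200, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
print(train_accuracy)
```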

Quantile transformer (G. Lemaitre)

[Figure: two scatter plots of number of households against median income, color-mapped by the values of y; the first on the original scale, the second with both axes spanning roughly 0 to 1]
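A sketch of the transformer on skewed data, assuming log-normal samples as a stand-in for the housing features in the figure. Each feature is mapped through its empirical quantiles, so the output is spread over [0, 1] whatever the input scale.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
# Hypothetical heavy-tailed features, two columns.
X = rng.lognormal(size=(1000, 2))

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=0)
X_uniform = qt.fit_transform(X)

# Raw range is wide and skewed; transformed range is the unit interval.
print(X.min(axis=0), X.max(axis=0))
print(X_uniform.min(axis=0), X_uniform.max(axis=0))
```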

Local outlier factor (N. Goix)

[Figure: points labelled normal and abnormal]
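A minimal sketch of the estimator, assuming a made-up 2D dataset: one Gaussian cluster plus five hand-placed distant points. `fit_predict` labels inliers 1 and outliers -1; the `contamination` value is an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Hypothetical data: 200 "normal" points and 5 isolated "abnormal" ones.
normal = rng.randn(200, 2)
abnormal = np.array([[6.0, 6.0], [-6.0, 5.0], [7.0, -6.0], [-7.0, -7.0], [0.0, 8.0]])
X = np.vstack([normal, abnormal])

# Points whose local density is much lower than their neighbors' are flagged.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)

print(labels[-5:])  # labels of the five hand-placed points
```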

Memory savings

Avoid casting (work with float32): J. Massich, A. Imbert

T-SNE (in progress): T. Moreau

1 To come (maybe)

ColumnsTransformer (J. Van den Bossche)

Pandas in ... feature engineering ... array out

    transformer = make_column_transformer({
        StandardScaler(): ['age'],
        OneHotEncoder(): ['company'],
    })

    array = transformer.fit_transform(data_frame)
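The dict-based call on the slide was a proposal. In the scikit-learn releases that ship this feature, as far as I know, `make_column_transformer` lives in `sklearn.compose` and takes (transformer, columns) tuples instead. A sketch under that assumption, with a made-up data frame reusing the slide's column names:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data frame; only the column names come from the slide.
data_frame = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "company": ["acme", "globex", "acme", "initech"],
})

# Each tuple pairs a transformer with the columns it applies to.
transformer = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(), ["company"]),
)

# Pandas in, array out: one scaled column plus one column per company.
array = transformer.fit_transform(data_frame)
print(array.shape)
```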

Faster trees, forests & boosting (V.R. Rajagopalan, G. Lemaitre)

Teachings from XGBoost, lightgbm:

bin features for discrete values

depth-first tree, for access locality
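To illustrate the "bin features" idea only: a NumPy sketch that maps a continuous feature to at most 256 integer bins using quantile edges, so a tree would need to consider splits at bin boundaries alone. The helper `bin_feature` is hypothetical and is not scikit-learn's, XGBoost's or LightGBM's implementation.

```python
import numpy as np


def bin_feature(values, n_bins=256):
    """Map a continuous feature to integer bin indices (illustrative helper)."""
    # Interior bin edges at evenly spaced quantiles, so bins hold
    # roughly equal numbers of samples; duplicates are dropped.
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.unique(np.percentile(values, quantiles))
    # With at most 255 edges, indices fall in 0..255 and fit in uint8.
    binned = np.searchsorted(edges, values, side="right").astype(np.uint8)
    return binned, edges


rng = np.random.RandomState(0)
feature = rng.lognormal(size=10_000)  # made-up continuous feature
binned, edges = bin_feature(feature)
print(binned.dtype, binned.min(), binned.max())
```

Storing one byte per value instead of eight also keeps more of the data in cache, which is in the same spirit as the access-locality point above.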

1 Scaling out

Infrastructure

Using many computers: cloud, elastic computing
    Orchestration, data distribution

Integration in corporate infrastructure
    Hadoop, queues, services

joblib backends

Parallel computing
    Loky (robust single-machine process pool)
    Distributed (Yarn, dask, CMFActivity)

Storage (S3, HDFS)
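A sketch of how a joblib backend is selected, assuming the `parallel_backend` context manager. The "threading" backend is used here only because it needs no extra setup; "loky" is another built-in name, and distributed backends are expected to register their own names before they can be selected the same way.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend

# The parallel loop itself does not change; only the backend named in
# the context manager decides where the work runs.
with parallel_backend("threading", n_jobs=2):
    roots = Parallel()(delayed(sqrt)(i) for i in range(10))

print(roots)
```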

1 Continuous integration

Testing under numpy & scipy dev (A. Mueller)

1 Scikit-learn-contrib

Scaling the scikit-learn universe quicker

https://github.com/scikit-learn-contrib

py-earth: multivariate adaptive regression splines
imbalanced-learn: under-sampling and over-sampling
lightning: fast linear models
polylearn: factorization machines and polynomial networks
hdbscan: high-performance clustering
forest-confidence-interval: confidence interval for forests
boruta_py: boruta feature selection

sklearn.utils.estimator_checks.check_estimator
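A sketch of how `check_estimator` can be called, assuming a built-in estimator as the subject; a contrib project would pass an instance of its own estimator instead. The set of checks varies between scikit-learn versions, and the full run can take several seconds.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.utils.estimator_checks import check_estimator

# Runs scikit-learn's API-conformance checks on the estimator and
# raises an exception if one of them fails.
check_estimator(LogisticRegression())

# Reached only if no check raised.
checks_passed = True
```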


2 The community

Users & developers

2 User base

350 000 returning users

5 000 citations

[Pie charts: OS (Windows, Mac, Linux; slices of 50%, 20% and 30%) and employer (Industry, Academia, Other; slices of 63%, 3% and 34%)]


2 User base

[Figure: number of PyPI downloads over time, June 2016 to June 2017, for numpy, pandas, scikit-learn, django and flask; y-axis from 0 to 100 000]

2 In the Python ecosystem

[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10 000), both on log scales, with numpy, scikit-learn, joblib, simplejson, six and setuptools marked]

2 Core software is infrastructure

Everybody uses it every day

In industry, education, & research

“Roads and Bridges”: Ford Foundation report

Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw

2 Community-based development in scikit-learn

Active development team

[Figure: monthly contributors, 2010 to 2016; y-axis from 0 to 50]

https://www.openhub.net/p/scikit-learn

2 Funding & spending, 2015 & 2016

New York (A. Mueller)
$350 000 Moore-Sloan grant
A. Mueller (full time). Students: M. Kumar, V. Birodkar

Telecom ParisTech (A. Gramfort)
200 000 € WendelinIA grant + 12 000 € CDS
Programmers: T. Guillemot, T. Dupre
Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix

Inria Parietal (G. Varoquaux)
120 000 € Inria + 100 000 € WendelinIA + 50 000 € ANR + 30 000 € CDS
Programmers: O. Grisel, L. Esteve, G. Lemaitre, J. Van den Bossche
Students: A. Mensch, J. Schreiber, G. Patrini

> 400 000 €/yr


2 Sustainability

Educating decision makers
Not funding your infrastructure is a risk

A foundation
Danger: governance, focus on features for the rich
We need partners, good ones

@GaelVaroquaux

Scikit-learn

Machine learning for everyone, from beginner to expert

Ongoing progress
Faster models (algorithmics, float32)
Easier usage (better pandas integration)
Coupling to infrastructure (via joblib)
Thinking about sustainability & partnership