1 In 0.18 oldies but goodies
New cross-validation objects (V.R. Rajagopalan)
PCA == Randomized PCA (G. Patrini)
  Heuristic to switch PCA to random linear algebra
  Fights global warming
  Huge speed gains for biggish data
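The solver switch can be tried directly through the `svd_solver` parameter of `PCA`. The sketch below is illustrative only: the data is synthetic, the sizes are arbitrary, and the exact rule that `"auto"` applies depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic, nearly rank-10 data (illustrative sizes, not from the talk).
X = rng.randn(2000, 10) @ rng.randn(10, 300) + 0.01 * rng.randn(2000, 300)

# Exact solver, randomized solver, and the heuristic choice between solvers.
pca_full = PCA(n_components=5, svd_solver="full").fit(X)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)
pca_auto = PCA(n_components=5, svd_solver="auto", random_state=0).fit(X)

ratio_full = pca_full.explained_variance_ratio_
ratio_rand = pca_rand.explained_variance_ratio_
ratio_auto = pca_auto.explained_variance_ratio_
```

On data like this the three solvers should agree closely on the leading components; the randomized solver trades a small approximation error for speed on larger inputs.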
G Varoquaux 4
New cross-validation objects, old API (sklearn.cross_validation):
    from sklearn.cross_validation import StratifiedKFold
    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]
New cross-validation objects, new API in 0.18 (sklearn.model_selection):
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=2)
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]
⇒ better nested-CV
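Because the new splitters no longer take the data in their constructor, the same kind of object can be reused for an inner and an outer loop. A minimal nested-CV sketch, assuming the iris dataset, an `SVC`, and a small grid over `C` purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The splitters are data-independent, so they can be built before seeing X and y.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: hyper-parameter selection. Outer loop: score the whole procedure.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
```

Each entry of `scores` is the accuracy of a model whose `C` was tuned only on the corresponding outer training fold.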
1 Coming soon: merged in master
Memory in pipeline (G. Lemaitre)
    make_pipeline(PCA(), LinearSVC(), memory='/tmp/joe')
  Limits recomputation (e.g. in grid search)
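A sketch of how the cached pipeline might be used inside a grid search. The dataset, the number of PCA components, the grid values, and the temporary cache directory are assumptions made for this example; only the `memory` argument comes from the slide.

```python
import shutil
import tempfile

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# A throwaway cache directory (the slide uses '/tmp/joe').
cache_dir = tempfile.mkdtemp()
pipe = make_pipeline(PCA(n_components=16), LinearSVC(), memory=cache_dir)

# Only the classifier's C varies, so the fitted PCA for each training fold
# can be loaded from the cache instead of being recomputed.
search = GridSearchCV(pipe, {"linearsvc__C": [0.01, 0.1, 1.0]}, cv=3)
search.fit(X, y)
best_C = search.best_params_["linearsvc__C"]

shutil.rmtree(cache_dir, ignore_errors=True)
```

The saving grows with the cost of the cached steps: a cheap PCA on a small dataset gains little, an expensive transformer on a large one gains a lot.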
New solver for logistic regression: SAGA (A. Mensch)
    linear_model.LogisticRegression(solver='saga')
  Fast linear model on biggish data
  [Figure: training objective, SAGA vs. Liblinear, on RCV1]
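A minimal sketch of selecting the SAGA solver. The synthetic dataset, the scaling step, and `max_iter=200` are choices made for this example, not settings from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
# Stochastic solvers such as SAGA converge faster on features with similar scales.
X = StandardScaler().fit_transform(X)

clf = LogisticRegression(solver="saga", max_iter=200, random_state=0).fit(X, y)
accuracy = clf.score(X, y)
```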
Quantile transformer (G. Lemaitre)
  [Figure: number of households vs. median income, colored by the value of y; first panel with axes from 0 to 12 and 0 to 6, second panel with both axes rescaled to roughly 0 to 1]
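The effect shown in the figure can be reproduced in a few lines. This sketch uses synthetic skewed data rather than the housing data from the slide, and `n_quantiles=100` is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
# Heavy-tailed synthetic features standing in for skewed real-world columns.
X = rng.lognormal(size=(1000, 2))

# Map each feature to a uniform distribution on [0, 1] through its empirical quantiles.
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=0)
X_uniform = qt.fit_transform(X)
```

After the transform each column is spread evenly over [0, 1], which removes the influence of the long tail on scale-sensitive estimators.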
Local outlier factor (N. Goix)
  [Figure: normal vs. abnormal points]
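A small sketch of the estimator on synthetic points. The cluster, the three far-away points, and `n_neighbors=20` are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# A tight cluster of "normal" points plus three isolated "abnormal" ones.
inliers = 0.3 * rng.randn(100, 2)
outliers = np.array([[4.0, 4.0], [-4.0, 4.0], [4.0, -4.0]])
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
# fit_predict returns 1 for inliers and -1 for outliers.
labels = lof.fit_predict(X)
```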
Memory savings
  Avoid casting (work with float32) (J. Massich, A. Imbert)
  T-SNE (in progress) (T. Moreau)
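The saving at stake is easy to quantify with NumPy alone. Whether a given estimator keeps float32 input without casting depends on the estimator and the scikit-learn version, so this sketch only measures the arrays themselves.

```python
import numpy as np

rng = np.random.RandomState(0)
X64 = rng.randn(10000, 100)      # float64 by default
X32 = X64.astype(np.float32)     # same values at half the storage

bytes64 = X64.nbytes
bytes32 = X32.nbytes
# Largest absolute rounding error introduced by the narrower dtype.
max_error = np.abs(X64 - X32).max()
```

An estimator that casts float32 input back to float64 allocates the larger copy again, which is why avoiding the cast matters.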
1 To come (maybe)
ColumnsTransformer (J. Van den Bossche)
  Pandas in ... feature engineering ... array out
    transformer = make_column_transformer({
        StandardScaler(): ['age'],
        OneHotEncoder(): ['company']
    })
    array = transformer.fit_transform(data_frame)
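The slide shows a proposed, dict-based call. The sketch below instead uses the tuple form of `make_column_transformer` from `sklearn.compose` in released scikit-learn versions, which differs from that proposal. The toy data frame and `sparse_threshold=0` are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical column.
data_frame = pd.DataFrame({
    "age": [23, 35, 47, 59],
    "company": ["acme", "globex", "acme", "globex"],
})

transformer = make_column_transformer(
    (StandardScaler(), ["age"]),      # scale the numeric column
    (OneHotEncoder(), ["company"]),   # one-hot encode the categorical column
    sparse_threshold=0,               # always return a dense array
)
array = transformer.fit_transform(data_frame)
```

The result is a plain NumPy array with one scaled column followed by one column per company.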
Faster trees, forests & boosting (V.R. Rajagopalan, G. Lemaitre)
  Lessons from XGBoost, lightgbm:
    bin features for discrete values
    depth-first tree, for access locality
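The binning idea can be sketched with NumPy. This is an illustration of the general technique under simple assumptions (quantile-based edges, at most 256 bins), not the algorithm used by XGBoost, lightgbm, or scikit-learn; `bin_feature` is a hypothetical helper.

```python
import numpy as np

def bin_feature(values, n_bins=256):
    """Map a continuous feature to small integer bin indices (n_bins <= 256)."""
    # Interior quantile edges; np.unique drops duplicates caused by repeated values.
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.unique(np.percentile(values, quantiles))
    # For each value, the index of the bin it falls into.
    binned = np.searchsorted(edges, values, side="right").astype(np.uint8)
    return binned, edges

rng = np.random.RandomState(0)
feature = rng.lognormal(size=10000)
binned, edges = bin_feature(feature, n_bins=16)
counts = np.bincount(binned)
```

Once a feature is binned, a tree only has to consider one candidate split per bin edge instead of one per distinct value, and the data fits in a single byte per entry.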
1 Scaling out
Infrastructure
  Using many computers: cloud, elastic computing
  Orchestration, data distribution
  Integration in corporate infrastructure
    Hadoop, queues, services
joblib backends
  Parallel computing
    Loky (robust single-machine process pool)
    Distributed (Yarn, dask, CMFActivity)
  Storage (S3, HDFS)
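A minimal sketch of choosing a joblib parallel backend by name, assuming a joblib version that ships the loky backend. The `square` and `parallel_squares` functions are hypothetical helpers for the example.

```python
from joblib import Parallel, delayed

def square(x):
    """Toy task standing in for an expensive computation."""
    return x * x

def parallel_squares(values, n_jobs=2):
    # The backend is selected by name; the calling code stays the same
    # whichever backend runs the tasks.
    return Parallel(n_jobs=n_jobs, backend="loky")(
        delayed(square)(v) for v in values
    )

results = parallel_squares(range(8))
```

Swapping the `backend` argument is the only change needed to move the same loop to another registered backend, which is the point of the pluggable design.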
1 Scikit-learn-contrib
Scaling the scikit-learn universe quicker
https://github.com/scikit-learn-contrib
  py-earth: multivariate adaptive regression splines
  imbalanced-learn: under-sampling and over-sampling
  lightning: fast linear models
  polylearn: factorization machines and polynomial networks
  hdbscan: high-performance clustering
  forest-confidence-interval: confidence interval for forests
  boruta_py: boruta feature selection
sklearn.utils.estimator_checks.check_estimator
2 User base
350 000 returning users, 5 000 citations
[Pie charts: OS (Windows, Mac, Linux) and employer (Industry, Academia, Other). Extracted shares, in source order: 50%, 20%, 30% and 63%, 3%, 34%]
[Figure: number of PyPI downloads (0 to 100 000), June to June 2017, for numpy, pandas, scikit-learn, django, flask]
2 In the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10 000), with numpy, scikit-learn, joblib, simplejson, six, setuptools marked]
2 Core software is infrastructure
Everybody uses it every day
  In industry, education, & research
"Roads and Bridges": Ford Foundation report
Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw
2 Community-based development in scikit-learn
Active development team
[Figure: monthly contributors (0 to 50), 2010 to 2016]
https://www.openhub.net/p/scikit-learn
2 Funding & spending, 2015 & 2016
New York (A. Mueller)
  $350 000 Moore-Sloan grant
  A. Mueller (full time). Students: M. Kumar, V. Birodkar
Telecom ParisTech (A. Gramfort)
  200 000 € WendelinIA grant + 12 000 € CDS
  Programmers: T. Guillemot, T. Dupre
  Students: M. Kumar, D. Sullivan, V.R. Rajagopalan, N. Goix
Inria Parietal (G. Varoquaux)
  120 000 € Inria + 100 000 € WendelinIA + 50 000 € ANR + 30 000 € CDS
  Programmers: O. Grisel, L. Esteve, G. Lemaitre, J. Van den Bossche
  Students: A. Mensch, J. Schreiber, G. Patrini
> 400 000 €/yr
2 Sustainability
Educating decision makers
  Not funding your infrastructure is a risk
A foundation
  Danger: governance, focus on features for the rich
  We need partners, good ones