Scikit-learn: Machine learning for the small and the many
Gaël Varoquaux

Transcript of slides

Page 1:

scikit-learn — machine learning in Python

Scikit-learn: Machine learning for the small and the many

Gaël Varoquaux

In this meeting, I represent low performance computing

Page 2:

scikit-learn — machine learning in Python

Scikit-learn: Machine learning for the small and the many

Gaël Varoquaux

In this meeting, I represent low performance computing

What I do: bridging psychology to neuroscience via machine learning on brain images

Page 3:

1 Scikit-learn

2 Statistical algorithms

3 Scaling up / scaling out?

G Varoquaux 2

Page 4:

1 Scikit-learn

Goals and tradeoff

G Varoquaux 3

Page 5:

Scikit-learn’s vision: Machine learning for everyone

Outreach: across scientific fields,
applications, communities

Enabling: foster innovation

Minimal prerequisites & assumptions

G Varoquaux 4

Page 7:

1 scikit-learn user base

350 000 returning users 5 000 citations

OS:        Windows 50%   Mac 20%   Linux 30%
Employer:  industry 63%  academia 34%  other 3%

G Varoquaux 5

Page 8:

1 A Python library

Python
High-level language, for users and developers
General-purpose: suitable for any application
Excellent interactive use

Scipy
Vibrant scientific stack
numpy arrays = wrappers on C pointers
pandas for columnar data
scikit-image for images

G Varoquaux 6

Page 9:

1 A Python library

Python
High-level language, for users and developers
General-purpose: suitable for any application
Excellent interactive use

Slow ⇒ compiled code as a backend
Python's primitive virtual machine makes it easy

Scipy
Vibrant scientific stack
numpy arrays = wrappers on C pointers
pandas for columnar data
scikit-image for images

G Varoquaux 6

Page 11:

1 A Python library: users like Python

Web searches: Google trends

G Varoquaux 7

Page 12:

1 A Python library: and developers like Python

[Plot: number of contributors active in a week, 2010–2016, y-axis 0–50]

⇒ Huge set of features (∼ 160 different statistical models)

G Varoquaux 8

Page 14:

1 API: simplify, but do not dumb down

Universal estimator interface:

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)
    # or
    X_red = classifier.transform(X_test)

classifier often has hyperparameters
Finding good defaults is crucial, and hard

A lot of effort on the documentation
Example-driven development

G Varoquaux 9
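As an aside (not on the slide): a self-contained sketch of the same estimator interface on synthetic data; train_test_split is standard scikit-learn, the data and labels are made up for illustration.

    import numpy as np
    from sklearn import svm
    from sklearn.model_selection import train_test_split

    # toy data: 200 samples, 5 features, binary labels
    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

    classifier = svm.SVC()               # hyperparameters: C, kernel, gamma, ...
    classifier.fit(X_train, Y_train)     # learn from the training data
    Y_pred = classifier.predict(X_test)  # the same interface for every estimator
    print("accuracy:", classifier.score(X_test, Y_test))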

Page 17:

1 Tradeoffs

Algorithms and models with good failure modes
Avoid parameters that are hard to set, or fragile convergence
Statistical computing = ill-posed & data-dependent

Little or no dependencies
Easy build everywhere

All compiled code generated from Cython
High-level languages give features (Spark)
Low-level gives speed (e.g. cache-friendly code)

G Varoquaux 10

Page 18:

2 Statistical algorithms
Fast algorithms accept statistical error

Models most used in scikit-learn:

1. Logistic regression, SVM
2. Random forests
3. PCA
4. K-means
5. Naive Bayes
6. Nearest neighbors

G Varoquaux 11

Page 20:

“Big” data

Many samples: web behavior data, cheap sensors (cameras)

Many features: medical patients, scientific experiments

[Figure: data matrices — samples as rows, features as columns]

G Varoquaux 12

Page 21:

2 Linear models

min_w Σ_i l(y_i, x_i w)

Many features → Coordinate descent
Iteratively optimize w.r.t. w_j separately

It works because:
Features are redundant
Sparse models can guess which w_j are zero

Progress = better selection of features

Many samples → Stochastic gradient descent

G Varoquaux 13

Page 22:

2 Linear models

min_w Σ_i l(y_i, x_i w)

Many features → Coordinate descent
Iteratively optimize w.r.t. w_j separately

Many samples → Stochastic gradient descent: min_w E[l(y, x w)]

Gradient descent:             w ← w − α E[∇_w l]
Stochastic gradient descent:  w ← w − α ∇_w l, a cheap estimate of E[∇_w l] (e.g. by subsampling)

Progress = second-order schemes

G Varoquaux 13

Page 24:

2 Linear models

min_w Σ_i l(y_i, x_i w)

Many features → Coordinate descent
Iteratively optimize w.r.t. w_j separately

Many samples → Stochastic gradient descent: min_w E[l(y, x w)]

Gradient descent:             w ← w − α E[∇_w l]
Stochastic gradient descent:  w ← w − α ∇_w l, a cheap estimate of E[∇_w l] (e.g. by subsampling)

Data-access locality

Deep learning: composition of linear models,
optimized jointly (non-convex) with stochastic gradient descent

G Varoquaux 13
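A minimal sketch of the many-samples strategy, assuming scikit-learn's SGDClassifier and made-up streaming mini-batches (not code from the slides): each partial_fit call performs stochastic gradient steps on one batch.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    clf = SGDClassifier(loss="hinge", alpha=1e-4)      # linear model fit by SGD

    classes = np.array([0, 1])
    for _ in range(100):                               # stream 100 mini-batches
        X_batch = rng.randn(50, 20)                    # 50 samples, 20 features
        y_batch = (X_batch[:, 0] > 0).astype(int)
        clf.partial_fit(X_batch, y_batch, classes=classes)

    X_test = rng.randn(1000, 20)
    y_test = (X_test[:, 0] > 0).astype(int)
    print("held-out accuracy:", clf.score(X_test, y_test))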

Page 25:

2 Trees & (random) forests

Compute simple bi-variate statistics (on subsets of the data)
Split data accordingly...

Speed-ups:
Share computing between trees, or precompute
Cache-friendly access ⇒ optimize traversal order
Approximate histograms / statistics

LightGBM, XGBoost

G Varoquaux 14
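A rough illustration of the tree ensembles above with scikit-learn's random forest on synthetic data (LightGBM and XGBoost have their own APIs, not shown here):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # synthetic data: 1 000 samples, 20 features
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # trees are independent, so they can be grown in parallel (n_jobs)
    forest = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0)
    forest.fit(X, y)
    print("training accuracy:", forest.score(X, y))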

Page 26:

2 PCA: principal component analysis
Truncated SVD (singular value decomposition)

X = U s Vᵀ

Randomized linear algebra → 20x speed-ups

    for i in [1, ..., k]:
        X_i = random_projection(X)      # e.g. subsampling
        U_i, s_i, V_iᵀ = SVD(X_i)
    V_red, R = QR([V_1, ..., V_k])
    X_red = V_redᵀ X
    U', s', V'ᵀ = SVD(X_red)
    Vᵀ = V'ᵀ V_redᵀ

The X_i summarize the data well; each SVD is on local data

[Halko... 2011]

G Varoquaux 15
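A small usage sketch (synthetic data, not from the slides): scikit-learn's PCA exposes a randomized truncated SVD in the spirit of [Halko... 2011] through svd_solver='randomized'.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = rng.randn(2000, 500)                  # 2 000 samples, 500 features

    # randomized truncated SVD: much cheaper than a full SVD for few components
    pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
    X_red = pca.fit_transform(X)              # shape (2000, 10)
    print(X_red.shape, pca.explained_variance_ratio_.sum())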

Page 29:

2 Stochastic factorization of huge matrices

Factorization of dense matrices ∼ 200 000 × 2 000 000

[Figure: data matrix X factored as U Vᵀ]

min_{U,V} ‖X − U Vᵀ‖² + ‖V‖₁

G Varoquaux 16

Page 31:

2 Stochastic factorization of huge matrices

Factorization of dense matrices ∼ 200 000 × 2 000 000

[Figure: stream columns of the data matrix — code computation, dictionary update, data access; columns seen at t, seen at t+1, unseen at t]

Online matrix factorization
Alternating minimization

[Mairal... 2010]   out of core, huge speed-ups

G Varoquaux 16

Page 32:

2 Stochastic factorization of huge matrices

Factorization of dense matrices ∼ 200 000 × 2 000 000

[Figure: stream columns and subsample rows of the data matrix — code computation, dictionary update, data access; columns seen at t, seen at t+1, unseen at t]

Online matrix factorization + new subsampling algorithm
Alternating minimization

[Mensch... 2017]   10x speed-ups, or more

G Varoquaux 16
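A minimal sketch of the online (mini-batch) factorization available in scikit-learn, after [Mairal... 2010]; the data stream here is made up, and the row-subsampling variant of [Mensch... 2017] is not shown.

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    rng = np.random.RandomState(0)
    mbdl = MiniBatchDictionaryLearning(n_components=20, alpha=1.0,
                                       batch_size=100, random_state=0)

    for _ in range(50):                       # stream mini-batches of samples
        X_batch = rng.randn(100, 500)         # 100 samples, 500 features
        mbdl.partial_fit(X_batch)             # alternate code / dictionary updates

    V = mbdl.components_                      # dictionary, shape (20, 500)
    U = mbdl.transform(rng.randn(10, 500))    # sparse codes for new samples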

Page 33:

3 Scaling up / scaling out?

G Varoquaux 17

Page 34:

3 Dataflow is key to scale

Array computing (CPU)

Data parallel

Streaming

Parallel computing: data + code transfer, out-of-memory persistence

[Figure: data matrices illustrating the array-computing, data-parallel, and streaming access patterns]

These patterns can yield horrible code

G Varoquaux 18

Page 35:

3 Parallel-computing engine: joblib
sklearn.Estimator(n_jobs=2)

Under the hood: joblib
Parallel for loops   (concurrency is hard)

Queues are the central abstraction

New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel

    import distributed.joblib
    from joblib import Parallel, parallel_backend

    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        # normal joblib code

G Varoquaux 19
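A minimal joblib sketch (not from the slides): a parallel for loop over local workers; wrapping the same code in parallel_backend(...) redirects it to a distributed backend.

    from math import sqrt
    from joblib import Parallel, delayed

    # run the loop body on 2 worker processes
    results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
    print(results)    # [0.0, 1.0, 2.0, ..., 9.0]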

Page 38:

3 Parallel-computing engine: joblib
sklearn.Estimator(n_jobs=2)

Under the hood: joblib
Parallel for loops   (concurrency is hard)

New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel

    import distributed.joblib
    from joblib import Parallel, parallel_backend

    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        # normal joblib code

Middleware to plug in distributed infrastructures

G Varoquaux 19

Page 40:

3 Distributed data flow and storage

Moving data around is costly

[Figure: read-only data store, node's memory, parameter database]

Why databases and not files?
Maintain integrity themselves
Know how to do data replication & distribution
Fast lookup via indexes
Not bound by POSIX FS specs

Very big data calls for coupling a database to a computing engine

G Varoquaux 20

Page 42:

3 joblib.Memory as a storage pool
A caching / function-memoizing system

Stores results of function executions

Out-of-memory computing:

    >>> result = mem.cache(g).call_and_shelve(a)
    >>> result
    MemorizedResult(cachedir="...", func="g", argument_hash="...")
    >>> c = result.get()

S3 / HDFS / cloud backend:
    joblib.Memory('uri', backend='s3')

https://github.com/joblib/joblib/pull/397

G Varoquaux 21
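A small usage sketch with the standard on-disk backend (the S3/HDFS backends above are the subject of the linked pull request and are not assumed here):

    import time
    from joblib import Memory

    mem = Memory("/tmp/joblib_cache", verbose=0)

    @mem.cache
    def slow_square(x):
        time.sleep(2)        # pretend this is expensive
        return x ** 2

    slow_square(3)           # computed and persisted to disk (~2 s)
    slow_square(3)           # returned from the cache (fast)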

Page 45:

Challenges and dreams

[Figure: read-only data store, node's memory, parameter database]

High-level constructs for distributed computation & data exchange

MPI feels too low level and without data concepts

Goal: reusable algorithms from laptops to datacenters
Capturing data-access patterns is the missing piece

Dask project:
Limit to purely-functional code
Lazy computation / compilation
Build a data flow + execution graph

Also: deep-learning engines, for GPUs

G Varoquaux 22
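A toy sketch of the lazy data-flow idea, assuming dask.delayed and made-up chunk functions: pure functions are composed into a task graph that only runs on .compute(), locally or on a distributed scheduler.

    from dask import delayed

    @delayed
    def load_chunk(i):                    # stand-in for reading a data chunk
        return list(range(i * 1000, (i + 1) * 1000))

    @delayed
    def chunk_sum(chunk):
        return sum(chunk)

    # build the task graph lazily: nothing is executed yet
    total = delayed(sum)([chunk_sum(load_chunk(i)) for i in range(4)])
    print(total.compute())                # run the graph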

Page 47:

@GaelVaroquaux

Lessons from scikit-learn
Small-computer machine learning trying to scale

Python gets us very far
Enables focusing on algorithmic optimization
Great to grow a community
Can easily drop to compiled code

Statistical algorithmics

Distributed data computing

If you know what you're doing, you can scale scikit-learn
The challenge is to make this easy and generic

Page 48:

@GaelVaroquaux

Lessons from scikit-learn
Small-computer machine learning trying to scale

Python gets us very far

Statistical algorithmics
Algorithms operate on expectations
Stochastic gradient descent
Random projections

Can bring data locality

Distributed data computing

If you know what you're doing, you can scale scikit-learn
The challenge is to make this easy and generic

Page 49:

@GaelVaroquaux

Lessons from scikit-learn
Small-computer machine learning trying to scale

Python gets us very far

Statistical algorithmics

Distributed data computing
Data access is central
Must be optimized for the algorithm
File system and memory no longer suffice

If you know what you're doing, you can scale scikit-learn
The challenge is to make this easy and generic

Page 51:

4 References I

N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53, 2011. doi: 10.1137/090771806. URL http://dx.doi.org/10.1137/090771806.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19, 2010.

A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. arXiv preprint, 2017.