
scikit-learn: machine learning in Python

Scikit-learn: machine learning for the small and the many

Gaël Varoquaux

In this meeting, I represent low performance computing


What I do: bridging psychology to neuroscience via machine learning on brain images

1 Scikit-learn

2 Statistical algorithms

3 Scaling up / scaling out?


1 Scikit-learn

Goals and tradeoffs


Scikit-learn’s vision: machine learning for everyone

Outreach: across scientific fields, applications, communities

Enabling: foster innovation

Minimal prerequisites & assumptions


1 scikit-learn user base

350 000 returning users; 5 000 citations

OS:        Windows 50%, Mac 20%, Linux 30%
Employer:  industry 63%, academia 34%, other 3%

1 A Python library

Python: a high-level language, for users and developers. General-purpose: suitable for any application. Excellent interactive use.

Slow ⇒ compiled code as a backend; Python's primitive virtual machine makes this easy.

Scipy: a vibrant scientific stack. numpy arrays = wrappers on C pointers; pandas for columnar data; scikit-image for images.

1 A Python library: users like Python

[Figure: web searches for Python, from Google Trends]

1 A Python library: and developers like Python

[Figure: number of contributors active in a week, 2010 to 2016, rising to about 50]

⇒ Huge set of features (∼ 160 different statistical models)

1 API: simplify, but do not dumb down

Universal estimator interface:

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # or
    X_red = classifier.transform(X_test)

classifier often has hyperparameters. Finding good defaults is crucial, and hard.

A lot of effort goes into the documentation: example-driven development.
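Because good hyperparameter values are data-dependent, the same estimator interface composes with search tools. A minimal, hedged sketch (the dataset and grid are made up) using scikit-learn's GridSearchCV:

    from sklearn import svm
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)
    # Try a small grid of values for the regularization parameter C.
    search = GridSearchCV(svm.SVC(), {'C': [0.1, 1, 10]}, cv=5)
    search.fit(X, y)
    print(search.best_params_)

The search object itself follows the estimator interface: after fit, it predicts with the best model found.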

1 Tradeoffs

Algorithms and models with good failure modes: avoid parameters that are hard to set, and fragile convergence. Statistical computing is ill-posed & data-dependent.

Little or no dependencies: easy to build everywhere.

All compiled code is generated from Cython: high-level languages give features (Spark); low-level gives speed (e.g. cache-friendly code).

2 Statistical algorithms: fast algorithms accept statistical error

Models most used in scikit-learn:
1. Logistic regression, SVM
2. Random forests
3. PCA
4. K-means
5. Naive Bayes
6. Nearest neighbors

“Big” data

Many samples (tall data matrix): web behavior data, cheap sensors (cameras).

Many features (wide data matrix): medical patients, scientific experiments.

[Figure: data matrices, samples as rows and features as columns]

2 Linear models

    min_w ∑_i l(y_i, x_i · w)

Many features → coordinate descent: iteratively optimize with respect to each w_j separately. It works because features are redundant, and sparse models can guess which w_j are zero. Progress = better selection of features.

Many samples → stochastic gradient descent:

    min_w E[l(y, x · w)]

Gradient descent: w ← w − α ∇_w l. Stochastic gradient descent: w ← w − α ĝ, with ĝ a cheap estimate of E[∇_w l] (e.g. from subsampling). Progress = second-order schemes, and data-access locality.

Deep learning: a composition of linear models, optimized jointly (non-convex) with stochastic gradient descent.
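A hedged illustration of the many-samples regime (the data here is synthetic): scikit-learn's SGDClassifier exposes partial_fit, so the model can be updated from mini-batches, each one a cheap, local estimate of the full gradient:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(100000, 20)
    y = (X[:, 0] > 0).astype(int)

    # Logistic regression fit by stochastic gradient descent.
    # (Older scikit-learn versions spell the loss 'log' instead of 'log_loss'.)
    clf = SGDClassifier(loss='log_loss')
    for start in range(0, len(X), 1000):
        batch = slice(start, start + 1000)  # stream one mini-batch at a time
        clf.partial_fit(X[batch], y[batch], classes=[0, 1])

Each partial_fit call touches only one contiguous slice of X, which is exactly the data-access locality mentioned above.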

2 Trees & (random) forests

Compute simple bivariate statistics (on subsets of the data); split the data accordingly; recurse.

Speed-ups: share computation between trees, or precompute. Cache-friendly access ⇒ optimize traversal order. Approximate histograms / statistics (LightGBM, XGBoost).
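As a hedged sketch of the approximate-histogram idea: scikit-learn later added a histogram-based gradient boosting estimator inspired by LightGBM, which bins features before searching for splits. (HistGradientBoostingClassifier postdates this talk; this assumes a recent scikit-learn.)

    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier

    X, y = make_classification(n_samples=10000, random_state=0)
    # max_bins caps the histogram resolution: fewer candidate splits,
    # much faster training, a small and controlled statistical error.
    clf = HistGradientBoostingClassifier(max_bins=255)
    clf.fit(X, y)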

2 PCA: principal component analysis

Truncated SVD (singular value decomposition): X = U s Vᵀ

Randomized linear algebra → 20× speed-ups [Halko... 2011]:

    for i in [1, ..., k]:
        X_i = random_projection(X)   # e.g. subsampling
        U_i, s_i, V_iᵀ = SVD(X_i)
    V_red, R = QR([V_1, ..., V_k])
    X_red = V_redᵀ X
    U′, s′, V′ᵀ = SVD(X_red)
    Vᵀ = V′ᵀ V_redᵀ

Works if each X_i summarizes the data well; each SVD is on local data.
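A minimal numpy sketch of the randomized range-finder behind this (a Gaussian projection rather than subsampling, and without the power iterations and blocking that production code such as scikit-learn's randomized_svd adds):

    import numpy as np

    def randomized_svd(X, k, n_oversamples=10, seed=0):
        rng = np.random.RandomState(seed)
        # Sketch the range of X with a random projection.
        Q = rng.randn(X.shape[1], k + n_oversamples)
        Q, _ = np.linalg.qr(X @ Q)        # orthonormal basis of the sketch
        B = Q.T @ X                       # small matrix: its SVD is cheap
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ Ub)[:, :k], s[:k], Vt[:k]

    X = np.random.RandomState(1).randn(2000, 500)
    U, s, Vt = randomized_svd(X, k=10)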

2 Stochastic factorization of huge matrices

Factorization of dense matrices of size ∼ 200 000 × 2 000 000:

    min_{U,V} ‖X − U Vᵀ‖² + ‖V‖₁

Online matrix factorization [Mairal... 2010]: stream the columns of the data matrix; for each block, alternate code computation and dictionary update (alternating minimization). Out of core, huge speed-ups.

New subsampling algorithm [Mensch... 2017]: additionally subsample the rows. 10× speed-ups, or more.

[Figure: data matrix with parts seen at t, seen at t+1, unseen at t]
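The online algorithm of [Mairal... 2010] is what scikit-learn ships as MiniBatchDictionaryLearning; a hedged usage sketch on random data (the sizes are made up):

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    X = np.random.RandomState(0).randn(1000, 50)
    # Streams mini-batches of samples instead of loading everything:
    # the out-of-core pattern of the slide.
    dico = MiniBatchDictionaryLearning(n_components=10, batch_size=64,
                                       random_state=0)
    U = dico.fit_transform(X)   # sparse codes for each sample
    D = dico.components_        # the learned dictionary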

3 Scaling up / scaling out?


3 Dataflow is key to scale

Array computing (on one CPU), data-parallel computing, streaming.

[Figure: a data matrix processed as a whole, split into chunks, and streamed block by block]

Parallel computing: data + code transfer; out-of-memory persistence.

These patterns can yield horrible code.

3 Parallel-computing engine: joblib

    sklearn.Estimator(n_jobs=2)

Under the hood: joblib. Parallel for loops, because concurrency is hard; queues are the central abstraction.

New: distributed computing backends: Yarn, dask.distributed, IPython.parallel.

    import distributed.joblib
    from joblib import Parallel, parallel_backend

    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        ...   # normal joblib code

Middleware to plug in distributed infrastructures.
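For reference, the core joblib pattern that estimators use internally; a minimal sketch with a toy function:

    from math import sqrt
    from joblib import Parallel, delayed

    # Ten independent tasks, dispatched to two worker processes
    # through a queue; results come back in order.
    results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))

Swapping the backend with parallel_backend, as above, changes where these tasks run without touching this code.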

3 Distributed data flow and storage

Moving data around is costly.

[Figure: read-only data store, node's memory, parameter database]

Why databases and not files? They maintain integrity themselves; they know how to do data replication & distribution; fast lookup via indexes; not bound by POSIX file-system specs.

Very big data calls for coupling a database to a computing engine.

3 joblib.Memory as a storage pool

A caching / function-memoizing system: stores the results of function executions.

Out-of-memory computing:

    >>> result = mem.cache(g).call_and_shelve(a)
    >>> result
    MemorizedResult(cachedir="...", func="g", argument_hash="...")
    >>> c = result.get()

S3/HDFS/cloud backends: joblib.Memory('uri', backend='s3')
https://github.com/joblib/joblib/pull/397
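To make the snippet above self-contained, a hedged sketch of the setup it assumes (the cache directory and the function g are placeholders):

    import numpy as np
    from joblib import Memory

    mem = Memory('/tmp/joblib_cache')   # on-disk store for results

    def g(x):
        return x ** 2                   # stand-in for an expensive function

    a = np.arange(1000000.0)
    result = mem.cache(g).call_and_shelve(a)  # compute, or reuse the cache
    c = result.get()                    # load the value only when needed

call_and_shelve keeps the result on disk and hands back a lightweight reference, which is what makes the out-of-memory pattern work.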

Challenges and dreams

[Figure: read-only data store, node's memory, parameter database]

High-level constructs for distributed computation & data exchange: MPI feels too low-level, and without data concepts.

Goal: reusable algorithms from laptops to datacenters. Capturing data-access patterns is the missing piece.

Dask project: limit to purely-functional code; lazy computation / compilation; build a data flow + execution graph.

Also: deep-learning engines, for GPUs.
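A hedged glimpse of the dask style (array sizes made up): operations build a lazy task graph over chunks, and nothing executes until compute() is called:

    import dask.array as da

    # A chunked, lazy array: each 10000 x 1000 block is a graph node.
    x = da.random.random((100000, 1000), chunks=(10000, 1000))
    m = (x - x.mean(axis=0)).std(axis=0)   # builds the data-flow graph
    print(m.compute())                     # schedules and runs it

The same graph can run on a laptop's threads or on a dask.distributed cluster, which is the laptop-to-datacenter goal above.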

@GaelVaroquaux

Lessons from scikit-learn: small-computer machine learning trying to scale

Python gets us very far: it enables focusing on algorithmic optimization, is great for growing a community, and can easily drop to compiled code.

Statistical algorithmics: algorithms operate on expectations (stochastic gradient descent, random projections) and can bring data locality.

Distributed data computing: data access is central and must be optimized for the algorithm; the file system and memory no longer suffice.

If you know what you're doing, you can scale scikit-learn. The challenge is to make this easy and generic.

4 References

N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53, 2011. doi: 10.1137/090771806. URL http://dx.doi.org/10.1137/090771806.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19, 2010.

A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. ArXiv preprint, 2017.