Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations...

26
scikit machine learning in Python Scikit-learn open, easy, yet versatile machine learning Ga¨ el Varoquaux

Transcript of Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations...

Page 1: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

scikit

machine learning in Python

Scikit-learnopen, easy, yet versatile machine learning

Gael Varoquaux

Page 2: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

1 The vision

Goals and tradeoff

G Varoquaux 2

Page 3: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

Scikit-learn’s vision: Machine learning for everyone

Outreachacross scientific fields,

applications, communities

Enablingfoster innovation

G Varoquaux 3

Page 4: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

1 scikit-learn user base

350 000 returning users 5 000 citations

OS Employer

Windows Mac Linux industry academia other

50%

20%

30%

63%

3%

34%

G Varoquaux 4

Page 5: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

1 A Python library

PythonHigh-level language, for users and developersGeneral-purpose: suitable for any applicationExcellent interactive use

Slow ⇒ compiled code as a backend

ScipyVibrant scientific stacknumpy arrays = wrappers on

C pointerspandas for columnar datascikit-image for images

G Varoquaux 5

Page 6: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

1 A Python library

PythonHigh-level language, for users and developersGeneral-purpose: suitable for any applicationExcellent interactive use

Slow ⇒ compiled code as a backend

ScipyVibrant scientific stacknumpy arrays = wrappers on

C pointerspandas for columnar datascikit-image for images

G Varoquaux 5

Page 7: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

1 A Python libraryUsers like Python

Web searches: Google trends

G Varoquaux 6

Page 8: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

1 Tradeoffs for outreach

Algorithms and models with good failure modeAvoid parameters hard to set or fragile convergenceStatistical computing = ill-posed & data-dependent

Didactic documentationCourse on machine learningRich examples

G Varoquaux 7

Page 9: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 The toolA Python library for machine learning

c©Theodore W. GrayG Varoquaux 8

Page 10: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 API:

Simplify, but do not dumb down

G Varoquaux 9

Page 11: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 API: specifying a model

A central concept: the estimatorInstanciated without dataBut specifying the parameters

from s k l e a r n . n e i g h b o r s importK N e a r e s t N e i g h b o r s

e s t i m a t o r = K N e a r e s t N e i g h b o r s (n n e i g h b o r s =2)

Models often have hyperparametersFinding good defaults is crucial, and hard

G Varoquaux 10

Page 12: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 API: training a model

Training from datae s t i m a t o r . f i t ( X t r a i n , Y t r a i n )

with:X a numpy array with shape

nsamples × nfeatures

y a numpy 1D array, of ints or float, with shapensamples

G Varoquaux 11

Page 13: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 API: using a model

Prediction: classification, regressionY t e s t = e s t i m a t o r . p r e d i c t ( X t e s t )

Transforming: dimension reduction, filterX new = e s t i m a t o r . t r a n s f o r m ( X t e s t )

Test score, density estimationt e s t s c o r e = e s t i m a t o r . s c o r e ( X t e s t )

G Varoquaux 12

Page 14: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 Vectorizing: from raw data to a sample matrix X

For text data: counting word occurences- Input data: list of documents (string)- Output data: numerical matrixfrom s k l e a r n . f e a t u r e e x t r a c t i o n . t e x t

import H a s h i n g V e c t o r i z e rh a s h e r = H a s h i n g V e c t o r i z e r ()

X = h a s h e r . f i t t r a n s f o r m ( documents )

G Varoquaux 13

Page 15: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 Very rich feature set: 160 estimators

Supervised learningDecision trees (Random-Forest, Boosted Tree)Linear models SVMGaussian processes ...

Unsupervised LearningClustering Mixture modelsDictionary learning ICAOutlier detection ...

Model selectionCross-validationParameter optimization

G Varoquaux 14

Page 16: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 Models most used in scikit-learn1. Logistic regression, SVM

SAG/SAGA solver

2. Random forestsParallel computing, feature importance

3. PCARandomized linear algebra, online solver

4. KmeansKMeans++ initialization, Elkan solver, online solver

5. Naive BayesOn-line solver

6. Nearest neighborKD-tree / ball-tree

Multi-class / multi-outputDense or sparse data

G Varoquaux 15

Page 17: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 Models most used in scikit-learn1. Logistic regression, SVM

SAG/SAGA solver

2. Random forestsParallel computing, feature importance

3. PCARandomized linear algebra, online solver

4. KmeansKMeans++ initialization, Elkan solver, online solver

5. Naive BayesOn-line solver

6. Nearest neighborKD-tree / ball-tree

Multi-class / multi-outputDense or sparse data

G Varoquaux 15

Page 18: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

“Big” data

Efficient processing on large datasets

G Varoquaux 16

Page 19: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 Many samples: on-line algorithms

e s t i m a t o r . p a r t i a l f i t ( X t r a i n , Y t r a i n )

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

0387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

200387

8794

7979

27

0179

0752

7015

78

9407

1746

1247

97

5497

0718

7178

87

1365

3490

4951

90

7475

4265

3580

98

4872

1546

3490

84

9034

5673

2456

14

7895

7187

7456

20

Supervised models:Naive Bayes, linear models (various loss),multi-layer perceptronClustering:KMeans, BirchLinear decompositions:PCA, Dictionnary learning, Latent Dirichlet Allocation

G Varoquaux 17

Page 20: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

2 Many features: on-the-fly data reductionX s m a l l = e s t i m a t o r . t r a n s f o r m ( X big , y)

Random projections (will average features)sklearn.random projection

random linear combinations of the features

Feature clusteringsklearn.cluster.FeatureAgglomeration

on images: super-pixel strategy

Feature hashingsklearn.feature extraction.text.

HashingVectorizerstateless: can be used in parallel

G Varoquaux 18

Page 21: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

3 The development

G Varoquaux 19

Page 22: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

3 Community-based development in scikit-learn

2010 2012 2014 20160

25

50Huge feature set:benefits of a large team

Project growth:

More than 400 contributors∼ 20 core contributors

https://www.openhub.net/p/scikit-learn

Community-driven project

G Varoquaux 20

Page 23: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

3 Many eyes makes code fast

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer

G Varoquaux 21

Page 24: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

3 Quality assuranceCode review: pull requestsWe read each others code

Everything is discussed:- Should the algorithm go in?- Are there good defaults?- Are the numerics stable?- Could it be faster?

Unit testingEverything is testedGreat for numerics

G Varoquaux 22

Page 25: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

3 Quality assuranceCode review: pull requestsWe read each others code

Everything is discussed:- Should the algorithm go in?- Are there good defaults?- Are the numerics stable?- Could it be faster?

Unit testingEverything is testedGreat for numerics

G Varoquaux 22

Page 26: Scikit-learn - paris-bigdata.org · 1 scikit-learn user base 350000 returning users 5000 citations OS Employer Windows Mac Linux industry academia other 50% 20% 30% 63% 3% 34% G Varoquaux

@GaelVaroquaux

Scikit-learn

Machine learning for everyone– from beginner to expert

A design challenge: hiding complexityA development challenge: keeping qualityA research challenge: robust methods without surprisesSustainability (funding) is an issue