Scikit-learn for easy machine learning: the vision, the tool, and the project
Gael Varoquaux
scikit-learn: machine learning in Python
1 Scikit-learn: the vision — an enabler

Machine learning for everybody and for everything
Machine learning without learning the machinery
G Varoquaux 2
1 Machine learning: a historical perspective

Artificial intelligence, the 80s: building decision rules (Eatable? Mobile? Tall?)
Machine learning, the 90s: learn these rules from observations
Statistical learning, 2000s: model the noise in the observations
Big data, today: many observations, simple rules

“Big data isn’t actually interesting without machine learning”
Steve Jurvetson, VC, Silicon Valley
1 Machine learning in a nutshell: an example

Face recognition: Andrew, Bill, Charles, Dave
Given labelled example images, which name goes with a new, unseen face?
1 Machine learning in a nutshell

A simple method:
1 Store all the known (noisy) images and the names that go with them.
2 For a new (noisy) image, find the stored image that is most similar.

The “nearest neighbor” method

How many errors on already-known images? ... 0: no errors
But test data ≠ train data
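The two-step method above can be sketched with scikit-learn's nearest-neighbor classifier. The data here is purely synthetic (random vectors standing in for flattened images), chosen only to illustrate the point about train vs. test error:

```python
# A minimal sketch of the "nearest neighbor" method: store the known
# examples, then label a new one by its most similar stored example.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# 40 synthetic "images" flattened to 10 features each, 4 names
X_train = rng.rand(40, 10)
y_train = np.repeat(["Andrew", "Bill", "Charles", "Dave"], 10)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# On already-known images, 1-nearest-neighbor makes zero errors by
# construction, which is exactly why accuracy must be measured on
# held-out test data instead.
print(knn.score(X_train, y_train))  # 1.0
```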
1 Machine learning in a nutshell: regression

A single descriptor: one dimension
[figure: two candidate fits of y as a function of x]
Which model to prefer?
The problem of “over-fitting”: minimizing the training error is not always the best strategy (it learns the noise).
Test data ≠ train data
Prefer simple models: the concept of “regularization”.
Balance the number of parameters to learn with the amount of data.
The bias/variance tradeoff.
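A minimal sketch of over-fitting and regularization on 1D data, assuming synthetic points from a noisy sine curve (the degree and penalty strength are illustrative choices, not from the slides):

```python
# Over-fitting vs. regularization on a 1D regression problem.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)
X = x[:, None]

# A degree-15 polynomial has many parameters for 30 points: it can
# drive the training error very low by fitting the noise.
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)

# Same features, but with a penalty on the coefficients (regularization):
# a simpler effective model, at the cost of higher training error.
regular = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-2)).fit(X, y)

print(overfit.score(X, y), regular.score(X, y))
# The unregularized fit always wins on the training data; on held-out
# test data the regularized model is usually the better one.
```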
Two descriptors: two dimensions
[figure: y as a function of X_1 and X_2]
More parameters ⇒ need more data: the “curse of dimensionality”
1 Machine learning in a nutshell: classification

Example: recognizing hand-written digits.
Represent each digit with 2 numerical features (X1, X2).
Classification is about finding separating lines between the classes.
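The digits example above, reduced to a runnable sketch: a linear classifier learns separating hyperplanes between the ten digit classes (here using all 64 pixel features rather than 2, for simplicity):

```python
# Classifying hand-written digits with a linear model.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# Held-out test set: accuracy is measured on data the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically above 0.9
```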
1 Machine learning in a nutshell: models

Fit with a staircase of 10 constant values.
Fit a new staircase on the errors; keep going: 1 staircase, 2 staircases combined, 3 staircases combined, ... 300 staircases combined.

Boosted regression trees.

Complexity trade-offs: computational and statistical.
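The “staircases” above are shallow regression trees, and fitting each new one on the errors of the previous ones is gradient boosting. A sketch with scikit-learn, on synthetic sine data (the problem and hyperparameters are illustrative):

```python
# Boosted regression trees: each stage is a shallow tree ("staircase"),
# fit on the residual errors of the stages before it.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(n_estimators=300, max_depth=2).fit(X, y)

# Training error shrinks as staircases are added, stage by stage
errors = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]
print(errors[0] > errors[-1])  # True
```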
1 Machine learning in a nutshell: unsupervised learning

Example: stock market structure.
Unlabeled data is more common than labeled data.
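A minimal sketch of what “unsupervised” means: structure is recovered from the data alone, with no labels given. The two synthetic point clouds here stand in for, e.g., groups of correlated stocks:

```python
# Clustering: grouping unlabeled data by similarity.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated groups of points; no labels are provided
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(len(set(labels)))  # 2: the two groups are found from the data alone
```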
Machine learning: mathematics and algorithms for fitting predictive models

Regression, classification, unsupervised learning...
Notions of overfit, test error, regularization, model complexity
Machine learning is everywhere

Image recognition
Marketing (click-through rate)
Movie / music recommendation
Medical data
Logistics chains (e.g. supermarkets)
Language translation
Detecting industrial failures
Real statisticians use R
And real astronomers use IRAF
Real economists use Gauss
Real coders use C / assembler
Real experiments are controlled in LabVIEW
Real Bayesians use BUGS / Stan
Real text processing is done in Perl
Real deep learning is best done with Torch (Lua)
And medical doctors only trust SPSS
1 My stack

The scientific Python stack: numpy arrays
Mostly a float**: no annotation / structure
Universal across applications
Easily shared with C / Fortran
[figure: a numpy array of raw numbers]
Connecting to scipy, scikit-image, pandas, ...
It’s about plugging things together: being Pythonic and SciPythonic.
1 scikit-learn vision

Machine learning for all: no specific application domain, no requirements in machine learning.
A high-quality, Pythonic software library: interfaces designed for users.
Community-driven development: BSD-licensed, very diverse contributors.

http://scikit-learn.org
1 Between research and applications

Machine learning research: conceptual complexity is not an issue; new and bleeding edge is better; simple problems are old science.
In the field: tried and tested (aka boring) is good; little sophistication from the user; the API is more important than the maths.

Solving simple problems matters. Solving them really well matters a lot.
2 A Python library

A library, not a program: more expressive and flexible, easy to include in an ecosystem.

As easy as py:

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)
2 API: specifying a model

A central concept: the estimator.
Instantiated without data, but specifying the parameters:

    from sklearn.neighbors import KNeighborsClassifier
    estimator = KNeighborsClassifier(n_neighbors=2)
2 API: training a model

Training from data:

    estimator.fit(X_train, Y_train)

with:
X a numpy array with shape n_samples × n_features
y a numpy 1D array, of ints or floats, with shape n_samples
2 API: using a model

Prediction: classification, regression

    Y_test = estimator.predict(X_test)

Transforming: dimension reduction, filtering

    X_new = estimator.transform(X_test)

Test score, density estimation

    test_score = estimator.score(X_test)
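The three API calls above chain together end to end. A runnable sketch on the built-in iris data (the dataset and the SVC classifier are illustrative choices):

```python
# The estimator API, end to end: specify, fit, predict, score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = SVC()                    # specify the model, no data yet
estimator.fit(X_train, y_train)      # train it
y_pred = estimator.predict(X_test)   # use it on new data
print(estimator.score(X_test, y_test))
```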
2 Vectorizing

From raw data to a sample matrix X.
For text data: counting word occurrences.
- Input data: list of documents (strings)
- Output data: numerical matrix

    from sklearn.feature_extraction.text import HashingVectorizer
    hasher = HashingVectorizer()
    X = hasher.fit_transform(documents)
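The vectorizer above, run on a couple of toy documents (the documents are made up for illustration; the output is a scipy sparse matrix with one row per document):

```python
# From raw text to a numerical sample matrix X.
from sklearn.feature_extraction.text import HashingVectorizer

documents = ["machine learning in Python",
             "Python for machine learning"]
hasher = HashingVectorizer()
X = hasher.fit_transform(documents)  # sparse matrix: rows = documents
print(X.shape[0])  # 2
```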
2 Scikit-learn: a very rich feature set

Supervised learning: decision trees (random forests, boosted trees), linear models, SVMs
Unsupervised learning: clustering, dictionary learning, outlier detection
Model selection: built-in cross-validation, parameter optimization
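“Built-in cross-validation” and “parameter optimization” in practice: one of scikit-learn's tools for this is GridSearchCV, sketched here on iris with an SVM (the parameter grid is an illustrative choice):

```python
# Cross-validated parameter search: try each C with 5-fold CV,
# keep the value with the best mean validation score.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X, y)
print(search.best_params_)
```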
2 Computational performance

Benchmark times (s):

                 scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
    SVM          5.2            9.47    17.5      11.52    40.48   5.63
    LARS         1.17           105.3   -         37.35    -       -
    Elastic Net  0.52           73.7    -         1.44     -       -
    kNN          0.57           1.41    -         0.56     0.58    1.36
    PCA          0.18           -       -         8.93     0.47    0.33
    k-Means      1.34           0.79    ∞         -        35.75   0.68

Algorithmic optimizations; minimizing data copies.

Random forest fit time (s):

    Scikit-Learn-RF (Python, Cython)     203.01
    Scikit-Learn-ETs (Python, Cython)    211.53
    OpenCV-RF (C++)                      4464.65
    OpenCV-ETs (C++)                     3342.83
    OK3-RF (C)                           1518.14
    OK3-ETs (C)                          1711.94
    Weka-RF (Java)                       1027.91
    R-RF (randomForest: R, Fortran)      13427.06
    Orange-RF (Python)                   10941.72

Figure: Gilles Louppe
What if the data does not fit in memory?

“Big data”: petabytes, distributed storage, computing clusters.
Mere mortals: gigabytes, Python programming, off-the-shelf computers.

See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
2 On-line algorithms

    estimator.partial_fit(X_train, Y_train)

[figure: the data arriving chunk by chunk]
2 On-line algorithms

Linear models: sklearn.linear_model.SGDRegressor, sklearn.linear_model.SGDClassifier
Clustering: sklearn.cluster.MiniBatchKMeans, sklearn.cluster.Birch (new in 0.16)
PCA (new in 0.16): sklearn.decomposition.IncrementalPCA
2 On-the-fly data reduction

Many features ⇒ reduce the data as it is loaded:

    X_small = estimator.fit_transform(X_big, y)
2 On-the-fly data reduction

Random projections (will average features): sklearn.random_projection
    random linear combinations of the features
Fast clustering of features: sklearn.cluster.FeatureAgglomeration
    on images: a super-pixel strategy
Hashing when observations have varying size (e.g. words):
    sklearn.feature_extraction.text.HashingVectorizer
    stateless: can be used in parallel
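The first strategy above, sketched: random projections compress many features into a few random linear combinations before learning (the array sizes are illustrative):

```python
# On-the-fly reduction: project 10000 features down to 50.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X_big = rng.randn(100, 10000)              # many features
reducer = GaussianRandomProjection(n_components=50, random_state=0)
X_small = reducer.fit_transform(X_big)     # random linear combinations
print(X_small.shape)  # (100, 50)
```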
3 Community-based development in scikit-learn

Huge feature set: the benefit of a large team.
Project growth: more than 200 contributors, ∼12 core contributors, 1 full-time INRIA programmer from the start.

Estimated cost of development: $6 million
(COCOMO model, http://www.ohloh.net/p/scikit-learn)
3 Many eyes make code fast

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
3 Six steps to a community-driven project

1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors: give them credit and decision power

http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
3 Quality assurance: code review via pull requests

Can include newcomers.
We read each other’s code.
Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are names meaningful?
- Are the numerics stable?
- Could it be faster?
3 Quality assurance: unit testing

Everything is tested; great for numerics.
Overall tests enforce, on all estimators:
- consistency with the API
- basic invariances
- good handling of various inputs

If it ain’t tested, it’s broken.
3 The tragedy of the commons

Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.
Wikipedia

Make it work, make it right, make it boring.
Core (boring) projects are taken for granted ⇒ hard to fund, less excitement.
They need citation, in papers and on corporate web pages.
And it is so hard to scale: user support, a growing codebase.
@GaelVaroquaux

Scikit-learn: the vision
Machine learning as a means, not an end.
A versatile library: the “right” level of abstraction.
Close to research, but seeking different tradeoffs.

Scikit-learn: the tool
A simple API, uniform across learners.
Numpy matrices as data containers.
Reasonably fast.