Scikit-learn for easy machine learning: the vision, the tool, and the project
Gael Varoquaux
scikit-learn: machine learning in Python
1 Scikit-learn: the vision — an enabler

Machine learning for everybody and for everything
Machine learning without learning the machinery
G Varoquaux 2
1 Machine learning: a historical perspective

Artificial intelligence, the 80s: building decision rules (Eatable? Mobile? Tall?)
Machine learning, the 90s: learn these rules from observations
Statistical learning, 2000s: model the noise in the observations
Big data, today: many observations, simple rules

“Big data isn’t actually interesting without machine learning”
Steve Jurvetson, VC, Silicon Valley
1 Machine learning in a nutshell: an example

Face recognition: Andrew, Bill, Charles, Dave
Given labelled example images, which name goes with a new, unseen face?
1 Machine learning in a nutshell

A simple method:
1 Store all the known (noisy) images and the names that go with them.
2 For a new (noisy) image, find the stored image that is most similar.

The “nearest neighbor” method

How many errors on already-known images? ... 0: no errors
But test data ≠ train data
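The two-step method above can be sketched with scikit-learn's nearest-neighbor classifier. The data here is purely synthetic (random vectors standing in for flattened images), chosen only to illustrate the point about train vs. test error:

```python
# A minimal sketch of the "nearest neighbor" method: store the known
# examples, then label a new one by its most similar stored example.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# 40 synthetic "images" flattened to 10 features each, 4 names
X_train = rng.rand(40, 10)
y_train = np.repeat(["Andrew", "Bill", "Charles", "Dave"], 10)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# On already-known images, 1-nearest-neighbor makes zero errors by
# construction, which is exactly why accuracy must be measured on
# held-out test data instead.
print(knn.score(X_train, y_train))  # 1.0
```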
1 Machine learning in a nutshell: regression

A single descriptor: one dimension
[figure: two candidate fits of y as a function of x]
Which model to prefer?
The problem of “over-fitting”: minimizing the training error is not always the best strategy (it learns the noise).
Test data ≠ train data
Prefer simple models: the concept of “regularization”.
Balance the number of parameters to learn with the amount of data.
The bias/variance tradeoff.
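A minimal sketch of over-fitting and regularization on 1D data, assuming synthetic points from a noisy sine curve (the degree and penalty strength are illustrative choices, not from the slides):

```python
# Over-fitting vs. regularization on a 1D regression problem.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)
X = x[:, None]

# A degree-15 polynomial has many parameters for 30 points: it can
# drive the training error very low by fitting the noise.
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)

# Same features, but with a penalty on the coefficients (regularization):
# a simpler effective model, at the cost of higher training error.
regular = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-2)).fit(X, y)

print(overfit.score(X, y), regular.score(X, y))
# The unregularized fit always wins on the training data; on held-out
# test data the regularized model is usually the better one.
```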
Two descriptors: two dimensions
[figure: y as a function of X_1 and X_2]
More parameters ⇒ need more data: the “curse of dimensionality”
1 Machine learning in a nutshell: classification

Example: recognizing hand-written digits.
Represent each digit with 2 numerical features (X1, X2).
Classification is about finding separating lines between the classes.
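The digits example above, reduced to a runnable sketch: a linear classifier learns separating hyperplanes between the ten digit classes (here using all 64 pixel features rather than 2, for simplicity):

```python
# Classifying hand-written digits with a linear model.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# Held-out test set: accuracy is measured on data the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically above 0.9
```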
1 Machine learning in a nutshell: models

Fit with a staircase of 10 constant values.
Fit a new staircase on the errors; keep going: 1 staircase, 2 staircases combined, 3 staircases combined, ... 300 staircases combined.

Boosted regression trees.

Complexity trade-offs: computational and statistical.
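The “staircases” above are shallow regression trees, and fitting each new one on the errors of the previous ones is gradient boosting. A sketch with scikit-learn, on synthetic sine data (the problem and hyperparameters are illustrative):

```python
# Boosted regression trees: each stage is a shallow tree ("staircase"),
# fit on the residual errors of the stages before it.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(n_estimators=300, max_depth=2).fit(X, y)

# Training error shrinks as staircases are added, stage by stage
errors = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]
print(errors[0] > errors[-1])  # True
```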
1 Machine learning in a nutshell: unsupervised learning

Example: stock market structure.
Unlabeled data is more common than labeled data.
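A minimal sketch of what “unsupervised” means: structure is recovered from the data alone, with no labels given. The two synthetic point clouds here stand in for, e.g., groups of correlated stocks:

```python
# Clustering: grouping unlabeled data by similarity.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated groups of points; no labels are provided
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(len(set(labels)))  # 2: the two groups are found from the data alone
```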
Machine learning: mathematics and algorithms for fitting predictive models

Regression, classification, unsupervised learning...
Notions of overfit, test error, regularization, model complexity
Machine learning is everywhere

Image recognition
Marketing (click-through rate)
Movie / music recommendation
Medical data
Logistics chains (e.g. supermarkets)
Language translation
Detecting industrial failures
Real statisticians use R
And real astronomers use IRAF
Real economists use Gauss
Real coders use C / assembler
Real experiments are controlled in LabVIEW
Real Bayesians use BUGS / Stan
Real text processing is done in Perl
Real deep learning is best done with Torch (Lua)
And medical doctors only trust SPSS
1 My stack

The scientific Python stack: numpy arrays
Mostly a float**: no annotation / structure
Universal across applications
Easily shared with C / Fortran
[figure: a numpy array of raw numbers]
Connecting to scipy, scikit-image, pandas, ...
It’s about plugging things together: being Pythonic and SciPythonic.
1 scikit-learn vision

Machine learning for all: no specific application domain, no requirements in machine learning.
A high-quality, Pythonic software library: interfaces designed for users.
Community-driven development: BSD-licensed, very diverse contributors.

http://scikit-learn.org
1 Between research and applications

Machine learning research: conceptual complexity is not an issue; new and bleeding edge is better; simple problems are old science.
In the field: tried and tested (aka boring) is good; little sophistication from the user; the API is more important than the maths.

Solving simple problems matters. Solving them really well matters a lot.
2 A Python library

A library, not a program: more expressive and flexible, easy to include in an ecosystem.

As easy as py:

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)
2 API: specifying a model

A central concept: the estimator.
Instantiated without data, but specifying the parameters:

    from sklearn.neighbors import KNeighborsClassifier
    estimator = KNeighborsClassifier(n_neighbors=2)
2 API: training a model

Training from data:

    estimator.fit(X_train, Y_train)

with:
X a numpy array with shape n_samples × n_features
y a numpy 1D array, of ints or floats, with shape n_samples
2 API: using a model

Prediction: classification, regression

    Y_test = estimator.predict(X_test)

Transforming: dimension reduction, filtering

    X_new = estimator.transform(X_test)

Test score, density estimation

    test_score = estimator.score(X_test)
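The three API calls above chain together end to end. A runnable sketch on the built-in iris data (the dataset and the SVC classifier are illustrative choices):

```python
# The estimator API, end to end: specify, fit, predict, score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = SVC()                    # specify the model, no data yet
estimator.fit(X_train, y_train)      # train it
y_pred = estimator.predict(X_test)   # use it on new data
print(estimator.score(X_test, y_test))
```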
2 Vectorizing

From raw data to a sample matrix X.
For text data: counting word occurrences.
- Input data: list of documents (strings)
- Output data: numerical matrix

    from sklearn.feature_extraction.text import HashingVectorizer
    hasher = HashingVectorizer()
    X = hasher.fit_transform(documents)
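The vectorizer above, run on a couple of toy documents (the documents are made up for illustration; the output is a scipy sparse matrix with one row per document):

```python
# From raw text to a numerical sample matrix X.
from sklearn.feature_extraction.text import HashingVectorizer

documents = ["machine learning in Python",
             "Python for machine learning"]
hasher = HashingVectorizer()
X = hasher.fit_transform(documents)  # sparse matrix: rows = documents
print(X.shape[0])  # 2
```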
2 Scikit-learn: a very rich feature set

Supervised learning: decision trees (random forests, boosted trees), linear models, SVMs
Unsupervised learning: clustering, dictionary learning, outlier detection
Model selection: built-in cross-validation, parameter optimization
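“Built-in cross-validation” and “parameter optimization” in practice: one of scikit-learn's tools for this is GridSearchCV, sketched here on iris with an SVM (the parameter grid is an illustrative choice):

```python
# Cross-validated parameter search: try each C with 5-fold CV,
# keep the value with the best mean validation score.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X, y)
print(search.best_params_)
```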
2 Computational performance

Benchmark times (s):

                 scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
    SVM          5.2            9.47    17.5      11.52    40.48   5.63
    LARS         1.17           105.3   -         37.35    -       -
    Elastic Net  0.52           73.7    -         1.44     -       -
    kNN          0.57           1.41    -         0.56     0.58    1.36
    PCA          0.18           -       -         8.93     0.47    0.33
    k-Means      1.34           0.79    ∞         -        35.75   0.68

Algorithmic optimizations; minimizing data copies.

Random forest fit time (s):

    Scikit-Learn-RF (Python, Cython)     203.01
    Scikit-Learn-ETs (Python, Cython)    211.53
    OpenCV-RF (C++)                      4464.65
    OpenCV-ETs (C++)                     3342.83
    OK3-RF (C)                           1518.14
    OK3-ETs (C)                          1711.94
    Weka-RF (Java)                       1027.91
    R-RF (randomForest: R, Fortran)      13427.06
    Orange-RF (Python)                   10941.72

Figure: Gilles Louppe
What if the data does not fit in memory?

“Big data”: petabytes, distributed storage, computing clusters.
Mere mortals: gigabytes, Python programming, off-the-shelf computers.

See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
2 On-line algorithms

    estimator.partial_fit(X_train, Y_train)

[figure: the data arriving chunk by chunk]
2 On-line algorithms

Linear models: sklearn.linear_model.SGDRegressor, sklearn.linear_model.SGDClassifier
Clustering: sklearn.cluster.MiniBatchKMeans, sklearn.cluster.Birch (new in 0.16)
PCA (new in 0.16): sklearn.decomposition.IncrementalPCA
2 On-the-fly data reduction

Many features ⇒ reduce the data as it is loaded:

    X_small = estimator.fit_transform(X_big, y)
2 On-the-fly data reduction

Random projections (will average features): sklearn.random_projection
    random linear combinations of the features
Fast clustering of features: sklearn.cluster.FeatureAgglomeration
    on images: a super-pixel strategy
Hashing when observations have varying size (e.g. words):
    sklearn.feature_extraction.text.HashingVectorizer
    stateless: can be used in parallel
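The first strategy above, sketched: random projections compress many features into a few random linear combinations before learning (the array sizes are illustrative):

```python
# On-the-fly reduction: project 10000 features down to 50.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X_big = rng.randn(100, 10000)              # many features
reducer = GaussianRandomProjection(n_components=50, random_state=0)
X_small = reducer.fit_transform(X_big)     # random linear combinations
print(X_small.shape)  # (100, 50)
```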
3 Community-based development in scikit-learn

Huge feature set: the benefit of a large team.
Project growth: more than 200 contributors, ∼12 core contributors, 1 full-time INRIA programmer from the start.

Estimated cost of development: $6 million
(COCOMO model, http://www.ohloh.net/p/scikit-learn)
3 Many eyes make code fast

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
3 Six steps to a community-driven project

1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors: give them credit and decision power

http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
3 Quality assurance: code review via pull requests

Can include newcomers.
We read each other’s code.
Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are names meaningful?
- Are the numerics stable?
- Could it be faster?
3 Quality assurance: unit testing

Everything is tested; great for numerics.
Overall tests enforce, on all estimators:
- consistency with the API
- basic invariances
- good handling of various inputs

If it ain’t tested, it’s broken.
3 The tragedy of the commons

Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.
Wikipedia

Make it work, make it right, make it boring.
Core (boring) projects are taken for granted ⇒ hard to fund, less excitement.
They need citation, in papers and on corporate web pages.
And it is so hard to scale: user support, a growing codebase.
@GaelVaroquaux

Scikit-learn: the vision
Machine learning as a means, not an end.
A versatile library: the “right” level of abstraction.
Close to research, but seeking different tradeoffs.

Scikit-learn: the tool
A simple API, uniform across learners.
Numpy matrices as data containers.
Reasonably fast.