scikit-learn - paris-bigdata.org
TRANSCRIPT
scikit
machine learning in Python
Scikit-learn: open, easy, yet versatile machine learning
Gael Varoquaux
1 The vision
Goals and tradeoff
G Varoquaux 2
Scikit-learn’s vision: Machine learning for everyone
Outreach: across scientific fields, applications, communities
Enabling: fostering innovation
1 scikit-learn user base
350,000 returning users; 5,000 citations
[Chart: users by OS - Windows 50%, Mac 20%, Linux 30% - and by employer - industry 63%, academia 34%, other 3%]
1 A Python library
Python: a high-level language, for users and developers; general-purpose, suitable for any application; excellent for interactive use.
Slow ⇒ compiled code as a backend.
Scipy: a vibrant scientific stack; numpy arrays are wrappers on C pointers; pandas for columnar data; scikit-image for images.
1 A Python library
Users like Python (web searches: Google Trends)
1 Tradeoffs for outreach
Algorithms and models with good failure modes: avoid parameters that are hard to set or fragile convergence. Statistical computing is ill-posed and data-dependent.
Didactic documentation: a course on machine learning, rich examples.
2 The tool: a Python library for machine learning
© Theodore W. Gray
2 API: simplify, but do not dumb down
2 API: specifying a model
A central concept: the estimator. It is instantiated without data, but with its parameters specified:
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier(n_neighbors=2)
Models often have hyperparameters. Finding good defaults is crucial, and hard.
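The estimator pattern above can be sketched in a few lines; the class and parameter are from scikit-learn, the inspection step is only illustrative:

```python
# Minimal sketch of the estimator concept: the object is created without
# data, holding only its (hyper)parameters.
from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier(n_neighbors=2)

# The parameters are stored on the object and can be inspected at any time
params = estimator.get_params()
```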
2 API: training a model
Training from data:
estimator.fit(X_train, y_train)
with:
X a numpy array with shape n_samples × n_features
y a numpy 1D array, of ints or floats, with shape n_samples
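A toy illustration of the shapes above (the data values are made up for the example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X: shape (n_samples, n_features); y: 1D array of length n_samples
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1, 1])

estimator = KNeighborsClassifier(n_neighbors=1)
estimator.fit(X_train, y_train)  # learns from the (X, y) pair
```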
2 API: using a model
Prediction (classification, regression):
y_test = estimator.predict(X_test)
Transforming (dimension reduction, filtering):
X_new = estimator.transform(X_test)
Test score, density estimation:
test_score = estimator.score(X_test)
2 Vectorizing: from raw data to a sample matrix X
For text data: counting word occurrences
- Input data: list of documents (strings)
- Output data: numerical matrix
from sklearn.feature_extraction.text import HashingVectorizer
hasher = HashingVectorizer()
X = hasher.fit_transform(documents)
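A runnable version of the snippet above; the toy corpus and the reduced n_features value are assumptions for the example (the default is much larger):

```python
from sklearn.feature_extraction.text import HashingVectorizer

documents = ["the cat sat", "the dog barked", "cats and dogs"]  # toy corpus
hasher = HashingVectorizer(n_features=2**10)
X = hasher.fit_transform(documents)  # sparse matrix, one row per document
```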
2 Very rich feature set: 160 estimators
Supervised learning: decision trees (random forests, boosted trees), linear models, SVM, Gaussian processes, ...
Unsupervised learning: clustering, mixture models, dictionary learning, ICA, outlier detection, ...
Model selection: cross-validation, parameter optimization
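Model selection in practice combines the two items above; a sketch with cross-validated grid search (the dataset and grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation over a small hyperparameter grid
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5]}, cv=5)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```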
2 Models most used in scikit-learn
1. Logistic regression, SVM: SAG/SAGA solvers
2. Random forests: parallel computing, feature importance
3. PCA: randomized linear algebra, online solver
4. KMeans: k-means++ initialization, Elkan solver, online solver
5. Naive Bayes: online solver
6. Nearest neighbors: KD-tree / ball tree
Multi-class / multi-output; dense or sparse data
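As one example from the list, random forests combine parallel fitting with feature importances; a sketch on a standard dataset (the parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_jobs=-1 fits the trees in parallel over all cores
forest = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
forest.fit(X, y)

# one importance per feature, normalized to sum to 1
importances = forest.feature_importances_
```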
“Big” data
Efficient processing on large datasets
2 Many samples: on-line algorithms
estimator.partial_fit(X_train, y_train)
Supervised models: naive Bayes, linear models (various losses), multi-layer perceptron
Clustering: KMeans, Birch
Linear decompositions: PCA, dictionary learning, latent Dirichlet allocation
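A sketch of the on-line pattern using one of the linear models above (SGDClassifier); the stream of random mini-batches and the toy target are assumptions for the example:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first call

# Stream of mini-batches: each call updates the model in place,
# so the full dataset never has to fit in memory.
for _ in range(20):
    X_batch = rng.randn(50, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy target: sign of feature 0
    clf.partial_fit(X_batch, y_batch, classes=classes)
```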
2 Many features: on-the-fly data reduction
X_small = estimator.fit_transform(X_big, y)
Random projections (sklearn.random_projection): random linear combinations of the features
Feature clustering (sklearn.cluster.FeatureAgglomeration): will average features; on images, a super-pixel strategy
Feature hashing (sklearn.feature_extraction.text.HashingVectorizer): stateless, can be used in parallel
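The first strategy can be sketched with a Gaussian random projection; the array sizes are illustrative:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X_big = rng.randn(100, 1000)  # many features

# project onto 50 random linear combinations of the 1000 features
reducer = GaussianRandomProjection(n_components=50, random_state=0)
X_small = reducer.fit_transform(X_big)
```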
3 The development
3 Community-based development in scikit-learn
[Chart: project growth, 2010-2016]
Huge feature set: the benefit of a large team
Project growth: more than 400 contributors, ~20 core contributors
https://www.openhub.net/p/scikit-learn
A community-driven project
3 Many eyes make code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
3 Quality assurance
Code review: pull requests. We read each other's code.
Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are the numerics stable?
- Could it be faster?
Unit testing: everything is tested; great for numerics.
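A sketch of the kind of numerical unit test this implies: check an estimator's output against a property computed independently (the estimator and data here are illustrative):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
scaled = StandardScaler().fit_transform(X)

# after scaling, each column should have mean 0 and unit variance
assert_allclose(scaled.mean(axis=0), 0.0, atol=1e-12)
assert_allclose(scaled.std(axis=0), 1.0)
```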
@GaelVaroquaux
Scikit-learn
Machine learning for everyone – from beginner to expert
A design challenge: hiding complexity
A development challenge: keeping quality
A research challenge: robust methods without surprises
Sustainability (funding) is an issue