scikit-learn - paris-bigdata.org
TRANSCRIPT
scikit
machine learning in Python
Scikit-learn: open, easy, yet versatile machine learning
Gael Varoquaux
1 The vision
Goals and tradeoff
G Varoquaux 2
Scikit-learn’s vision: Machine learning for everyone
Outreach: across scientific fields, applications, communities
Enabling: fostering innovation
1 scikit-learn user base
350,000 returning users; 5,000 citations
[Chart: users by OS - Windows 50%, Mac 20%, Linux 30% - and by employer - industry 63%, academia 34%, other 3%]
1 A Python library
Python: a high-level language, for users and developers; general-purpose, suitable for any application; excellent for interactive use.
Slow ⇒ compiled code as a backend.
Scipy: a vibrant scientific stack; numpy arrays are wrappers on C pointers; pandas for columnar data; scikit-image for images.
1 A Python library
Users like Python (web searches: Google Trends)
1 Tradeoffs for outreach
Algorithms and models with good failure modes: avoid parameters that are hard to set or fragile convergence. Statistical computing is ill-posed and data-dependent.
Didactic documentation: a course on machine learning, rich examples.
2 The tool: a Python library for machine learning
© Theodore W. Gray
2 API: simplify, but do not dumb down
2 API: specifying a model
A central concept: the estimator. It is instantiated without data, but with its parameters specified:
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier(n_neighbors=2)
Models often have hyperparameters. Finding good defaults is crucial, and hard.
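The estimator pattern above can be sketched in a few lines; the class and parameter are from scikit-learn, the inspection step is only illustrative:

```python
# Minimal sketch of the estimator concept: the object is created without
# data, holding only its (hyper)parameters.
from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier(n_neighbors=2)

# The parameters are stored on the object and can be inspected at any time
params = estimator.get_params()
```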
2 API: training a model
Training from data:
estimator.fit(X_train, y_train)
with:
X a numpy array with shape n_samples × n_features
y a numpy 1D array, of ints or floats, with shape n_samples
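A toy illustration of the shapes above (the data values are made up for the example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X: shape (n_samples, n_features); y: 1D array of length n_samples
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1, 1])

estimator = KNeighborsClassifier(n_neighbors=1)
estimator.fit(X_train, y_train)  # learns from the (X, y) pair
```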
2 API: using a model
Prediction (classification, regression):
y_test = estimator.predict(X_test)
Transforming (dimension reduction, filtering):
X_new = estimator.transform(X_test)
Test score, density estimation:
test_score = estimator.score(X_test)
2 Vectorizing: from raw data to a sample matrix X
For text data: counting word occurrences
- Input data: list of documents (strings)
- Output data: numerical matrix
from sklearn.feature_extraction.text import HashingVectorizer
hasher = HashingVectorizer()
X = hasher.fit_transform(documents)
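A runnable version of the snippet above; the toy corpus and the reduced n_features value are assumptions for the example (the default is much larger):

```python
from sklearn.feature_extraction.text import HashingVectorizer

documents = ["the cat sat", "the dog barked", "cats and dogs"]  # toy corpus
hasher = HashingVectorizer(n_features=2**10)
X = hasher.fit_transform(documents)  # sparse matrix, one row per document
```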
2 Very rich feature set: 160 estimators
Supervised learning: decision trees (random forests, boosted trees), linear models, SVM, Gaussian processes, ...
Unsupervised learning: clustering, mixture models, dictionary learning, ICA, outlier detection, ...
Model selection: cross-validation, parameter optimization
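Model selection in practice combines the two items above; a sketch with cross-validated grid search (the dataset and grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation over a small hyperparameter grid
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5]}, cv=5)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```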
2 Models most used in scikit-learn
1. Logistic regression, SVM: SAG/SAGA solvers
2. Random forests: parallel computing, feature importance
3. PCA: randomized linear algebra, online solver
4. KMeans: k-means++ initialization, Elkan solver, online solver
5. Naive Bayes: online solver
6. Nearest neighbors: KD-tree / ball tree
Multi-class / multi-output; dense or sparse data
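As one example from the list, random forests combine parallel fitting with feature importances; a sketch on a standard dataset (the parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_jobs=-1 fits the trees in parallel over all cores
forest = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
forest.fit(X, y)

# one importance per feature, normalized to sum to 1
importances = forest.feature_importances_
```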
“Big” data
Efficient processing on large datasets
2 Many samples: on-line algorithms
estimator.partial_fit(X_train, y_train)
Supervised models: naive Bayes, linear models (various losses), multi-layer perceptron
Clustering: KMeans, Birch
Linear decompositions: PCA, dictionary learning, latent Dirichlet allocation
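A sketch of the on-line pattern using one of the linear models above (SGDClassifier); the stream of random mini-batches and the toy target are assumptions for the example:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first call

# Stream of mini-batches: each call updates the model in place,
# so the full dataset never has to fit in memory.
for _ in range(20):
    X_batch = rng.randn(50, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy target: sign of feature 0
    clf.partial_fit(X_batch, y_batch, classes=classes)
```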
2 Many features: on-the-fly data reduction
X_small = estimator.fit_transform(X_big, y)
Random projections (sklearn.random_projection): random linear combinations of the features
Feature clustering (sklearn.cluster.FeatureAgglomeration): will average features; on images, a super-pixel strategy
Feature hashing (sklearn.feature_extraction.text.HashingVectorizer): stateless, can be used in parallel
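The first strategy can be sketched with a Gaussian random projection; the array sizes are illustrative:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X_big = rng.randn(100, 1000)  # many features

# project onto 50 random linear combinations of the 1000 features
reducer = GaussianRandomProjection(n_components=50, random_state=0)
X_small = reducer.fit_transform(X_big)
```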
3 The development
3 Community-based development in scikit-learn
[Chart: project growth, 2010-2016]
Huge feature set: the benefit of a large team
Project growth: more than 400 contributors, ~20 core contributors
https://www.openhub.net/p/scikit-learn
A community-driven project
3 Many eyes make code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
3 Quality assurance
Code review: pull requests. We read each other's code.
Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are the numerics stable?
- Could it be faster?
Unit testing: everything is tested; great for numerics.
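A sketch of the kind of numerical unit test this implies: check an estimator's output against a property computed independently (the estimator and data here are illustrative):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
scaled = StandardScaler().fit_transform(X)

# after scaling, each column should have mean 0 and unit variance
assert_allclose(scaled.mean(axis=0), 0.0, atol=1e-12)
assert_allclose(scaled.std(axis=0), 1.0)
```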
@GaelVaroquaux
Scikit-learn
Machine learning for everyone – from beginner to expert
A design challenge: hiding complexity
A development challenge: keeping quality
A research challenge: robust methods without surprises
Sustainability (funding) is an issue