scikit-learn · 3 scaling up / scaling out? g varoquaux 2. ... (spark) low-level gives speed (eg...
TRANSCRIPT
scikit
machine learning in Python
Scikit-learnMachine learning for the small and the many
Gael Varoquaux
In this meeting, I represent low performance computing
scikit
machine learning in Python
Scikit-learnMachine learning for the small and the many
Gael Varoquaux
In this meeting, I represent low performance computing
What I do: bridging psychology to neuroscience via machine learningon brain images
1 Scikit-learn
2 Statistical algorithms
3 Scaling up / scaling out?
G Varoquaux 2
1 Scikit-learn
Goals and tradeoff
G Varoquaux 3
Scikit-learn’s vision: Machine learning for everyone
Outreachacross scientific fields,
applications, communities
Enablingfoster innovation
Minimal prerequisites & assumptions
G Varoquaux 4
Scikit-learn’s vision: Machine learning for everyone
Outreachacross scientific fields,
applications, communities
Enablingfoster innovation
Minimal prerequisites & assumptionsG Varoquaux 4
1 scikit-learn user base
350 000 returning users 5 000 citations
OS Employer
Windows Mac Linux industry academia other
50%
20%
30%
63%
3%
34%
G Varoquaux 5
1 A Python library
PythonHigh-level language, for users and developersGeneral-purpose: suitable for any applicationExcellent interactive use
ScipyVibrant scientific stacknumpy arrays = wrappers on
C pointerspandas for columnar datascikit-image for images
G Varoquaux 6
1 A Python library
PythonHigh-level language, for users and developersGeneral-purpose: suitable for any applicationExcellent interactive use
Slow ⇒ compiled code as a backendPython’s primitive virtual machine
makes it easy
ScipyVibrant scientific stacknumpy arrays = wrappers on
C pointerspandas for columnar datascikit-image for images
G Varoquaux 6
1 A Python library
PythonHigh-level language, for users and developersGeneral-purpose: suitable for any applicationExcellent interactive use
ScipyVibrant scientific stacknumpy arrays = wrappers on
C pointerspandas for columnar datascikit-image for images
G Varoquaux 6
1 A Python libraryUsers like Python
Web searches: Google trends
G Varoquaux 7
1 A Python libraryAnd developpers like Python
2010 2012 2014 20160
25
50
Number of contributors active in a week
⇒ Huge set of features(∼ 160 different statistical models)
G Varoquaux 8
1 A Python libraryAnd developpers like Python
2010 2012 2014 20160
25
50
Number of contributors active in a week
⇒ Huge set of features(∼ 160 different statistical models)
G Varoquaux 8
1 API: simplify, but do not dumb down
Universal estimator interfacefrom s k l e a r n import svmc l a s s i f i e r = svm.SVC()c l a s s i f i e r . f i t ( X t r a i n , Y t r a i n )Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t )# orX r e d = c l a s s i f i e r . t r a n s f o r m ( X t e s t )
classifier often has hyperparametersFinding good defaults is crucial, and hard
A lot of effort on the documentationExample-driven development
G Varoquaux 9
1 API: simplify, but do not dumb down
Universal estimator interfacefrom s k l e a r n import svmc l a s s i f i e r = svm.SVC()c l a s s i f i e r . f i t ( X t r a i n , Y t r a i n )Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t )# orX r e d = c l a s s i f i e r . t r a n s f o r m ( X t e s t )
classifier often has hyperparametersFinding good defaults is crucial, and hard
A lot of effort on the documentationExample-driven development
G Varoquaux 9
1 API: simplify, but do not dumb down
Universal estimator interfacefrom s k l e a r n import svmc l a s s i f i e r = svm.SVC()c l a s s i f i e r . f i t ( X t r a i n , Y t r a i n )Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t )# orX r e d = c l a s s i f i e r . t r a n s f o r m ( X t e s t )
classifier often has hyperparametersFinding good defaults is crucial, and hard
A lot of effort on the documentationExample-driven development
G Varoquaux 9
1 Tradeoffs
Algorithms and models with good failure modeAvoid parameters hard to set or fragile convergenceStatistical computing = ill-posed & data-dependent
Little or no dependenciesEasy build everywhere
All compiled code generated from CythonHigh-level languages give features (Spark)
Low-level gives speed (eg cache-friendly code)
G Varoquaux 10
2 Statistical algorithmsFast algorithms accept statistical error
Models most used in scikit-learn:
1. Logistic regression, SVM2. Random forests3. PCA
4. Kmeans5. Naive Bayes6. Nearest neighbor
G Varoquaux 11
2 Statistical algorithmsFast algorithms accept statistical error
Models most used in scikit-learn:
1. Logistic regression, SVM2. Random forests3. PCA
4. Kmeans5. Naive Bayes6. Nearest neighbor
G Varoquaux 11
“Big” data
Many samples or
0307809070
7907
0079075270
0578
9407100600
0797
0097000800
7000
1000040040
0090
0005020500
8000
samples
features
Web behavior dataCheap sensors (cameras)
Many features
0307809070
7907
0079075270
0578
9407100600
0797
0097000800
7000
1000040040
0090
0005020500
8000
samples
features 03078
090707907
0079075270
0578
9407100600
0797
0097000800
7000
1000040040
0090
0005020500
8000
Medical patientsScientific experiments
G Varoquaux 12
2 Linear models
minw∑i
l(yi , xi w)Many features Coordinate descent
Iteratively optimize w.r.t. wj separately
It works because:Features are redundantSparse models can guess which wj are zero
Progress = better selection of features
Many samples Stochastic gradient descent
G Varoquaux 13
2 Linear models
minw∑i
l(yi , xi w)Many features Coordinate descent
Iteratively optimize w.r.t. wj separately
Many samples Stochastic gradient descentminw E[l(y , x w)]
Gradient descent: w← w + α∇wlStochastic gradient descent w← w + αE[∇wl ]
Use a cheap estimate of E[∇wl ] (e.g. subsampling)
Progress = second order schemesG Varoquaux 13
2 Linear models
minw∑i
l(yi , xi w)Many features Coordinate descent
Iteratively optimize w.r.t. wj separately
Many samples Stochastic gradient descentminw E[l(y , x w)]
Gradient descent: w← w + α∇wlStochastic gradient descent w← w + αE[∇wl ]
Use a cheap estimate of E[∇wl ] (e.g. subsampling)
Data-access localityG Varoquaux 13
2 Linear models
minw∑i
l(yi , xi w)Many features Coordinate descent
Iteratively optimize w.r.t. wj separately
Many samples Stochastic gradient descentminw E[l(y , x w)]
Gradient descent: w← w + α∇wlStochastic gradient descent w← w + αE[∇wl ]
Use a cheap estimate of E[∇wl ] (e.g. subsampling)
Data-access locality
Deep learningComposition of linear modelsoptimized jointly (non-convex)with stochastic gradient descent
G Varoquaux 13
2 Trees & (random) forests
(on subsets of the data)Compute simple bi-variatestatisticsSplit data accordingly ...Speed ups
Share computing between trees or precomputeCache friendly access ⇒ optimize traversal orderApproximate histograms / statistics
LightGBM, XGBoost
G Varoquaux 14
2 PCA: principal component analysisTruncated SVD (singular value decomposition)
X = U s VT
Randomized linear algebra → 20x speed upsfor i in [1, . . . k]:
X =random projection(X) # e.g. subsamplingUi, si , VT
i = SVD(X)Vred,R = QR([V1, . . . , Vk ])Xred = VT
redXU′ s′V′T = SVD(Xred)VT = V′TVT
red
X summarize well the dataEach SVD is on local data
[Halko... 2011]
G Varoquaux 15
2 PCA: principal component analysisTruncated SVD (singular value decomposition)
X = U s VT
Randomized linear algebra → 20x speed upsfor i in [1, . . . k]:
X =random projection(X) # e.g. subsamplingUi, si , VT
i = SVD(X)Vred,R = QR([V1, . . . , Vk ])Xred = VT
redXU′ s′V′T = SVD(Xred)VT = V′TVT
red
X summarize well the dataEach SVD is on local data
[Halko... 2011]
G Varoquaux 15
2 PCA: principal component analysisTruncated SVD (singular value decomposition)
X = U s VT
Randomized linear algebra → 20x speed upsfor i in [1, . . . k]:
X =random projection(X) # e.g. subsamplingUi, si , VT
i = SVD(X)Vred,R = QR([V1, . . . , Vk ])Xred = VT
redXU′ s′V′T = SVD(Xred)VT = V′TVT
red
X summarize well the dataEach SVD is on local data
[Halko... 2011]G Varoquaux 15
2 Stochastic factorization of huge matrices
Factorization of dense matrices ∼ 200 000× 2 000 000Datamatrix
XU
V
minU,V‖X−UVT‖2 + ‖V‖1
G Varoquaux 16
2 Stochastic factorization of huge matrices
Factorization of dense matrices ∼ 200 000× 2 000 000
- Data access
- Dictionary update
Streamcolumns
- Code com- putation
Datamatrix
minU,V‖X−UVT‖2 + ‖V‖1
G Varoquaux 16
2 Stochastic factorization of huge matrices
Factorization of dense matrices ∼ 200 000× 2 000 000
- Data access
- Dictionary update
Streamcolumns
- Code com- putation
Online matrixfactorization
Alternatingminimization
Seen at t Seen at t+1 Unseen at t
Datamatrix
[Mairal... 2010]out of core, huge speed upsG Varoquaux 16
2 Stochastic factorization of huge matrices
Factorization of dense matrices ∼ 200 000× 2 000 000
- Data access
- Dictionary update
Streamcolumns
- Code com- putation Subsample
rows
Online matrixfactorization
New subsamplingalgorithm
Alternatingminimization
Seen at t Seen at t+1 Unseen at t
Datamatrix
[Mensch... 2017]10X speed ups, or moreG Varoquaux 16
3 Scaling up / scaling out?
G Varoquaux 17
3 Dataflow is key to scale
Array computingCPU
03878794797927
01790752701578
03878794797927
01790752701578
Data parallel
03878794797927
03878794797927
Streaming
0387
8794
7979
27
0179
0752
7015
78
9407
1746
1247
97
5497
0718
7178
87
1365
3490
4951
90
7475
4265
3580
98
4872
1546
3490
84
9034
5673
2456
14
7895
7187
7456
200387
8794
7979
27
0179
0752
7015
78
9407
1746
1247
97
5497
0718
7178
87
1365
3490
4951
90
7475
4265
3580
98
4872
1546
3490
84
9034
5673
2456
14
7895
7187
7456
20
0387
8794
7979
27
0179
0752
7015
78
9407
1746
1247
97
5497
0718
7178
87
1365
3490
4951
90
7475
4265
3580
98
4872
1546
3490
84
9034
5673
2456
14
7895
7187
7456
200387
8794
7979
27
0179
0752
7015
78
9407
1746
1247
97
5497
0718
7178
87
1365
3490
4951
90
7475
4265
3580
98
4872
1546
3490
84
9034
5673
2456
14
7895
7187
7456
20
0387
8794
7979
27
0179
0752
7015
78
9407
1746
1247
97
5497
0718
7178
87
1365
3490
4951
90
7475
4265
3580
98
4872
1546
3490
84
9034
5673
2456
14
7895
7187
7456
200387
8794
7979
27
0179
0752
7015
78
9407
1746
1247
97
5497
0718
7178
87
1365
3490
4951
90
7475
4265
3580
98
4872
1546
3490
84
9034
5673
2456
14
7895
7187
7456
20
Parallel computingData + code transfer Out-of-memory persistence
These patterns can yield horrible codeG Varoquaux 18
3 Parallel-computing engine: joblibsklearn.Estimator(n jobs=2)
Under the hood: joblibParallel for loops concurrency is hard
Queues are the central abstraction
New: distributed computing backends:Yarn, dask.distributed, IPython.parallel
import distributed.joblibfrom joblib import Parallel, parallel backendwith parallel backend(’dask.distributed’,
scheduler host=’HOST:PORT’):# normal Joblib code
G Varoquaux 19
3 Parallel-computing engine: joblibsklearn.Estimator(n jobs=2)
Under the hood: joblibParallel for loops concurrency is hard
New: distributed computing backends:Yarn, dask.distributed, IPython.parallel
import distributed.joblibfrom joblib import Parallel, parallel backend
with parallel backend(’dask.distributed’,scheduler host=’HOST:PORT’):
# normal Joblib code
G Varoquaux 19
3 Parallel-computing engine: joblibsklearn.Estimator(n jobs=2)
Under the hood: joblibParallel for loops concurrency is hard
New: distributed computing backends:Yarn, dask.distributed, IPython.parallel
import distributed.joblibfrom joblib import Parallel, parallel backendwith parallel backend(’dask.distributed’,
scheduler host=’HOST:PORT’):# normal Joblib code
G Varoquaux 19
3 Parallel-computing engine: joblibsklearn.Estimator(n jobs=2)
Under the hood: joblibParallel for loops concurrency is hard
New: distributed computing backends:Yarn, dask.distributed, IPython.parallel
import distributed.joblibfrom joblib import Parallel, parallel backendwith parallel backend(’dask.distributed’,
scheduler host=’HOST:PORT’):# normal Joblib code
Middleware to plug in distributed infrastructures
G Varoquaux 19
3 Distributed data flow and storage
Moving data aroundis costly
03878794797927
Why databases and not files?Maintain integrity themselvesKnow how to do data replication & distributionFast lookup via indexesNot bound by POSIX FS specs
Very big data calls for couplinga database to a computing engine
G Varoquaux 20
3 Distributed data flow and storage
Moving data aroundis costly
03878794797927Read-onlydata store
Node'smemory
Parameterdatabase
Why databases and not files?Maintain integrity themselvesKnow how to do data replication & distributionFast lookup via indexesNot bound by POSIX FS specs
Very big data calls for couplinga database to a computing engine
G Varoquaux 20
3 Distributed data flow and storage
Moving data aroundis costly
03878794797927Read-onlydata store
Node'smemory
Parameterdatabase
Why databases and not files?Maintain integrity themselvesKnow how to do data replication & distributionFast lookup via indexesNot bound by POSIX FS specs
Very big data calls for couplinga database to a computing engine
G Varoquaux 20
3 joblib.Memory as a storage poolA caching / function memoizing system
Stores results of function executions
Out-of-memory computing>>>>>>>>> result = mem.cache(g).call and shelve(a)>>>>>>>>> result
MemorizedResult(cachedir=”...”, func=”g”, argument hash=”...”)>>>>>>>>> c = result.get()
S3/HDFS/cloud backend:joblib.Memory(’uri’, backend=’s3’)
https://github.com/joblib/joblib/pull/397
G Varoquaux 21
3 joblib.Memory as a storage poolA caching / function memoizing system
Stores results of function executions
Out-of-memory computing>>>>>>>>> result = mem.cache(g).call and shelve(a)>>>>>>>>> result
MemorizedResult(cachedir=”...”, func=”g”, argument hash=”...”)>>>>>>>>> c = result.get()
S3/HDFS/cloud backend:joblib.Memory(’uri’, backend=’s3’)
https://github.com/joblib/joblib/pull/397
G Varoquaux 21
3 joblib.Memory as a storage poolA caching / function memoizing system
Stores results of function executions
Out-of-memory computing>>>>>>>>> result = mem.cache(g).call and shelve(a)>>>>>>>>> result
MemorizedResult(cachedir=”...”, func=”g”, argument hash=”...”)>>>>>>>>> c = result.get()
S3/HDFS/cloud backend:joblib.Memory(’uri’, backend=’s3’)
https://github.com/joblib/joblib/pull/397
G Varoquaux 21
Challenges and dreams 03878794797927Read-onlydata store
Node'smemory
Parameterdatabase
High-level constructs fordistributed computation & data exchange
MPI feels too low level and without data concepts
Goal: reusable algorithms from laptops to datacentersCapturing data access patterns is the missing piece
Dask project:Limit to purely-functional codeLazy computation / compilationBuild a data flow + execution graph
Also: deep-learning engines, for GPUs
G Varoquaux 22
Challenges and dreams 03878794797927Read-onlydata store
Node'smemory
Parameterdatabase
High-level constructs fordistributed computation & data exchange
MPI feels too low level and without data concepts
Goal: reusable algorithms from laptops to datacentersCapturing data access patterns is the missing piece
Dask project:Limit to purely-functional codeLazy computation / compilationBuild a data flow + execution graph
Also: deep-learning engines, for GPUsG Varoquaux 22
@GaelVaroquaux
Lessons from scikit-learnSmall-computer machine-learning trying to scale
Python gets us very farEnables focusing on algorithmic optimizationGreat to grow a communityCan easily drop to compiled code
Statistical algorithmics
Distributed data computing
If you know what your doing, you can scale scikit-learnThe challenge is to make this easy and generic
@GaelVaroquaux
Lessons from scikit-learnSmall-computer machine-learning trying to scale
Python gets us very far
Statistical algorithmicsAlgorithms operate on expectancies
Stochastic Gradient DescentRandom projections
Can bring data locality
Distributed data computing
If you know what your doing, you can scale scikit-learnThe challenge is to make this easy and generic
@GaelVaroquaux
Lessons from scikit-learnSmall-computer machine-learning trying to scale
Python gets us very far
Statistical algorithmics
Distributed data computingData acces is centralMust be optimized for algorithmFile system and memory no longer suffice
03878794797927
03878794797927
If you know what your doing, you can scale scikit-learnThe challenge is to make this easy and generic
@GaelVaroquaux
Lessons from scikit-learnSmall-computer machine-learning trying to scale
Python gets us very far
Statistical algorithmics
Distributed data computing
If you know what your doing, you can scale scikit-learnThe challenge is to make this easy and generic
4 References I
N. Halko, P. G. Martinsson, and J. A. Tropp. Findingstructure with randomness: Probabilistic algorithms forconstructing approximate matrix decompositions. SIAMRev., 53, 2011. ISSN 0036-1445. doi: 10.1137/090771806.URL http://dx.doi.org/10.1137/090771806.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learningfor matrix factorization and sparse coding. Journal ofMachine Learning Research, 11:19, 2010.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux.Stochastic subsampling for factorizing huge matrices. Arxivpreprint, 2017.