Scikit-learn: the state of the union 2016
![Page 1: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/1.jpg)
Scikit-learn: The state of the union
Gael Varoquaux
Open Source Innovation Spring 2016
Personal point of view, as an opening to scikit-learn days 2016 in Paris
![Page 2: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/2.jpg)
1 Some history
Scikit-learn "canal historique" (the historical branch)
G Varoquaux 2
![Page 3: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/3.jpg)
1 scikit-learn growth: users
Website users (weekly): Google analytics
Debian popcon: ∼ 1% of the Debian users
Web searches: Google trends
![Page 5: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/5.jpg)
1 scikit-learn growth: lines of code
Lines of code:
Huge feature set
https://www.openhub.net/p/scikit-learn
![Page 6: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/6.jpg)
1 scikit-learn growth: contributors
Contributors:
759 contributors
https://www.openhub.net/p/scikit-learn
![Page 7: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/7.jpg)
1 Started as David Cournapeau’s failed PhD project
David then preferred improving numpy/scipy
That's David sprinting in 2011
![Page 8: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/8.jpg)
1 2009: We (Inria Parietal) need machine learning
My team takes over the development
Hire a young guy (Fabian Pedregosa)
Put post-docs and PhD students on it (Alexandre Gramfort, Vincent Michel...)
Work in the open
Pythonic, fast, documented
![Page 9: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/9.jpg)
1 2010: ICML MLOSS workshop
Machine Learning Open Source Software
“The examples in the tutorial are pretty, but not particularly useful for the serious user.”
“For the sustainability of the project it might be better to narrow the focus...”
![Page 10: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/10.jpg)
1 2011: NIPS sprint
People that I didn't know were solving my problems
The project took off because of the community...
![Page 12: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/12.jpg)
2 Upcoming cool stuff
Upcoming 0.18 release
![Page 14: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/14.jpg)
2 Less code: Cython no longer embedded
Lines of code:
Generated C no longer embedded in git
⇒ opens the door to fused types (polymorphism)
⇒ multiple dtype support in algorithms
= memory saver
Arthur Mensch
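The memory argument can be illustrated with plain NumPy. This is only a sketch of why keeping the input dtype matters, not the Cython fused-type code itself; the array shape is an arbitrary assumption.

```python
import numpy as np

# Hypothetical dataset: 10,000 samples x 100 features.
X32 = np.ones((10000, 100), dtype=np.float32)

# Code that only supports float64 has to upcast float32 input,
# which creates a second array twice the size of the original.
X64 = X32.astype(np.float64)

bytes32 = X32.nbytes  # 4 bytes per value
bytes64 = X64.nbytes  # 8 bytes per value
print(bytes64 / bytes32)
```

An algorithm written once for several dtypes can work on `X32` directly and skip the upcast copy.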
![Page 15: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/15.jpg)
2 Faster code: better algorithmics
RandomizedPCA → PCA
Automatic choice of solver: randomized linear algebra, power iteration (arpack), full (lapack)
For large data: up to 20× speed up
https://github.com/scikit-learn/scikit-learn/issues/5243
Giorgio Patrini
Elkan's K means
For large data: ∼ 2× speed up.
https://github.com/scikit-learn/scikit-learn/pull/5414
Andreas Muller
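A minimal sketch of how these two changes appear to a user, assuming the released API (`svd_solver` on `PCA`, `algorithm="elkan"` on `KMeans`). The synthetic data and all sizes are illustrative assumptions, and no timing is claimed here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic low-rank data (assumption for the demo): rank 10 plus small noise.
X = rng.randn(2000, 10) @ rng.randn(10, 200) + 0.01 * rng.randn(2000, 200)

# PCA covers the former RandomizedPCA use case through svd_solver.
pca_full = PCA(n_components=10, svd_solver="full").fit(X)
pca_rand = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)
pca_auto = PCA(n_components=10, svd_solver="auto", random_state=0).fit(X)

# Elkan's variant of K-means is selected with the algorithm parameter.
Xb, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.5, random_state=0)
km = KMeans(n_clusters=3, algorithm="elkan", n_init=10, random_state=0).fit(Xb)
```

With `svd_solver="auto"` the estimator picks a solver from the data shape and `n_components`, so existing `PCA` code benefits without changes.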
![Page 17: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/17.jpg)
2 New cross-validation objects
    # Old API (sklearn.cross_validation): the CV object is built from the labels y
    from sklearn.cross_validation import StratifiedKFold
    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]
Data-independent nested-CV possible
https://github.com/scikit-learn/scikit-learn/pull/4294
Raghav R V
![Page 18: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/18.jpg)
2 New cross-validation objects
    # New API (sklearn.model_selection): the CV object is data-independent
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=2)  # n_folds on the slide; n_splits as of the 0.18 release
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]
Data-independent ⇒ nested-CV possible
https://github.com/scikit-learn/scikit-learn/pull/4294
Raghav R V
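Because the splitter no longer holds the data, the same kind of object can be reused at two levels. A minimal nested-CV sketch, assuming the released `sklearn.model_selection` API; the dataset, estimator and parameter grid are illustrative choices, not from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two independent splitters: neither is tied to X or y at construction time.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: choose C on each outer training set.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: score the whole selection procedure on held-out folds.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```

The inner splitter is applied to each outer training subset, which the old data-bound objects could not do.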
![Page 19: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/19.jpg)
2 Sequential / Bayesian search CV
See hyper-parameter selection as a Bayesian optimization / noisy fit problem.
⇒ choose hyper-parameters cleverly, not on a grid
Pull request stalled
https://github.com/scikit-learn/scikit-learn/pull/5491
Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
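The idea can be sketched with a Gaussian process fitted to the scores seen so far. This is a hand-rolled illustration, not the API proposed in the pull request; the search range, the initial points, the kernel and the upper-confidence-bound rule are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)


def objective(log_gamma):
    """Noisy score to maximize: mean 3-fold CV accuracy of an SVC."""
    return cross_val_score(SVC(gamma=10.0 ** log_gamma), X, y, cv=3).mean()


# Candidate values of log10(gamma) and a small initial design (assumptions).
candidates = np.linspace(-6, 0, 61).reshape(-1, 1)
tried = [-6.0, -3.0, 0.0]
scores = [objective(g) for g in tried]

for _ in range(5):
    # Model the score as a smooth function of log10(gamma) plus noise.
    kernel = Matern(nu=2.5) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
    gp.fit(np.array(tried).reshape(-1, 1), scores)

    # Upper confidence bound: favor points that look good or are uncertain.
    mean, std = gp.predict(candidates, return_std=True)
    next_log_gamma = float(candidates[np.argmax(mean + 1.96 * std), 0])
    tried.append(next_log_gamma)
    scores.append(objective(next_log_gamma))

best_log_gamma = tried[int(np.argmax(scores))]
```

Each new point is chosen from the model of past results, so the eight evaluations are not spread evenly over the range as a grid would be.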
![Page 20: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/20.jpg)
3 Vision(s): the future
![Page 21: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/21.jpg)
Mission statement
Enable progress via data science
Lower the costs, less technicalities
Machine learning for everybody and for everything
Small hardware, medium data
![Page 23: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/23.jpg)
3 Deep learning
sklearn.neural_network.MLPClassifier
Architecture-specification language, GPUs: unbounded technicality
keras, caffe...
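For comparison, a minimal sketch of what `MLPClassifier` offers: the architecture is a single tuple of hidden-layer sizes and training runs on CPU. The dataset, layer size and iteration budget are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
# Pixel values range from 0 to 16; rescale them to [0, 1].
X_train, X_test, y_train, y_test = train_test_split(X / 16.0, y, random_state=0)

# The whole "architecture" is one tuple: a single hidden layer of 64 units.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```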
![Page 25: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/25.jpg)
3 AutoML
Automatic model selection
Better hyper-parameter selection
Better description and uniformization of estimators
Integrate feedback from auto-sklearn
![Page 26: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/26.jpg)
3 Better, faster, stronger
Faster models
From lightning, back to sklearn
Inspiration from XGBoost: the paper is out!
Larger data
More partial_fit
Online forests?
Fewer copies
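A minimal sketch of the existing `partial_fit` pattern that "more partial_fit" would extend to other estimators. The model sees one mini-batch at a time, so the full dataset never has to sit in memory; here the batches are sliced from an in-memory array only to keep the example short, and the data and batch size are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

# Two well-separated classes (synthetic data for the demo).
X, y = make_blobs(n_samples=2000, centers=[[-3, -3], [3, 3]], random_state=0)
classes = np.unique(y)  # must be given up front: a batch may miss a class

clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 200):
    batch = slice(start, start + 200)
    # Each call updates the model in place with one mini-batch.
    clf.partial_fit(X[batch], y[batch], classes=classes)

accuracy = clf.score(X, y)
print(accuracy)
```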
![Page 28: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/28.jpg)
3 Scaling up (out?)
I don't want java/scala
Less fluid prototyping
Cross-VM debugging hard
Numerics in Java slower than LAPACK
Need C somewhere
![Page 30: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/30.jpg)
3 Scaling up (out?)
I don’t want java/scala
They have:
Coupling distributed store to computation
Distributed job management
Create new stack? Ride on this one?
Blaze, Ibis, dask: require rewrite of algorithms
dask promising for ETL
New backends for joblib parallel and storage: distributed, ssh
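In joblib the backend is selected by name, which is what makes new backends pluggable without touching the calling code. A minimal sketch using the built-in `"threading"` backend as a stand-in; the distributed and ssh backends mentioned above are not shown here.

```python
from math import sqrt

from joblib import Parallel, delayed

# Same calling code whatever the backend; only the backend name changes.
roots = Parallel(n_jobs=2, backend="threading")(
    delayed(sqrt)(i ** 2) for i in range(5)
)
print(roots)
```

Results come back in submission order, so swapping the backend does not change what the caller receives.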
![Page 31: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/31.jpg)
Sustainable growth
Reviewing is the bottleneck
User support drowns core devs
Users need stability (Airbus)
Coding is not the only thing: sprints, GSoC management, tutorials...
Structure & stability
How to organize funding and governance?
Process / meetings / reports / funding proposals... ≠ work on project
Passionate coders get a lot done, unless they get drowned by meetings
![Page 33: Scikit-learn: the state of the union 2016](https://reader034.vdocuments.net/reader034/viewer/2022042605/58ee61ae1a28ab406d8b4645/html5/thumbnails/33.jpg)
@GaelVaroquaux
Funding: Inria, Nexedi, Paris-Saclay CDS, NYU CDS, GSoC