New Capabilities in the PyData Ecosystem
![Page 1: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/1.jpg)
New Capabilities in the PyData Ecosystem
Peter Wang, Continuum Analytics, @pwang
![Page 2: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/2.jpg)
Data Science @ NYT
![Page 3: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/3.jpg)
Python & SciPy
• High-performance linear algebra, image processing, optimization via NumPy, optimized C++, FORTRAN
• Large structured data via HDF5, memmap
• Out-of-core processing, streaming & realtime
• Distributed computing via MPI, IPython Parallel, etc.
• GPU & heterogeneous computing via OpenCL, PyCUDA, others
• Massive adoption in research, national labs, industry (engineering, finance, etc.)
• IPython Notebook: 2005-2011
• pandas: 2008-2009
• scikit-learn: 2007
• NumPy: 2006
• matplotlib: 2002
• IPython: 2001
• Numarray: 2001
• SciPy: 1999
• Numeric: 1995
Python has a >15-year history in scientific computing
![Page 4: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/4.jpg)
"Python's Scientific Ecosystem"
@jakevdp
![Page 5: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/5.jpg)
"Many More Tools"
@jakevdp
![Page 6: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/6.jpg)
Focus On
• Bokeh
• Dask
![Page 7: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/7.jpg)
Focus On
• Bokeh
• Dask
Not Gonna Talk About...
• Blaze, odo
• dynd
• xray
• NumPy
• Pandas
• PyTables & h5py
• Beaker Notebook
• IPython widgets, JupyterHub
• conda, Anaconda Cluster
• Docker
• Docker
• Docker
![Page 9: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/9.jpg)
Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• No need to write JavaScript
• Support for R, Scala, Julia, Lua
http://bokeh.pydata.org
![Page 10: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/10.jpg)
Dashboards & Data Apps
![Page 11: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/11.jpg)
Static Notebooks/HTML, Interactive Plots
http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00%20-%20intro.ipynb#Interaction
![Page 12: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/12.jpg)
Extensible Architecture
[Diagram labels: server.py, Browser, App Model, BokehJS object graph, bokeh-server, bokeh.py object graph, JSON]
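The diagram labels suggest that an object graph built in Python is serialized to JSON and rebuilt as an object graph on the browser side. The snippet below is only a toy sketch of that general idea using the standard library; the `Plot` and `Glyph` classes are hypothetical stand-ins and do not reflect Bokeh's actual models or wire format.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Glyph:
    # Hypothetical model class for illustration, not a Bokeh class.
    kind: str
    x: list
    y: list


@dataclass
class Plot:
    # Hypothetical container holding a list of glyphs.
    title: str
    glyphs: list = field(default_factory=list)


# Build a small object graph on the "Python side".
plot = Plot(title="demo", glyphs=[Glyph("line", [1, 2, 3], [4, 5, 6])])

# Serialize the whole graph to a JSON string, as if sending it to a browser.
payload = json.dumps(asdict(plot))

# The receiving side parses the JSON and can rebuild its own objects from it.
received = json.loads(payload)
print(received["glyphs"][0]["kind"])
```

Keeping the exchanged format to plain JSON is what would let a client in another language produce or consume the same graph.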
![Page 13: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/13.jpg)
![Page 15: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/15.jpg)
Dask
![Page 16: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/16.jpg)
Example: Ocean Temp Data
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
• Every 1/4 degree, 720x1440 array each day
![Page 19: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/19.jpg)
Bigger Data
36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed
If you don't have this much RAM...
... better start chunking.
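The arithmetic on this slide can be checked in a few lines. The sketch below assumes the trailing "x 4" means 4 bytes per value (for example 32-bit floats) and uses decimal gigabytes; the one-year chunk size is a hypothetical choice for illustration, not something stated in the talk.

```python
# Grid size, number of daily records, and bytes per value, as on the slide.
LAT, LON, DAYS, BYTES_PER_VALUE = 720, 1440, 12341, 4

total_bytes = LAT * LON * DAYS * BYTES_PER_VALUE
total_gb = total_bytes / 1e9  # decimal gigabytes

# Hypothetical chunking: 365 daily grids per chunk.
DAYS_PER_CHUNK = 365
chunk_bytes = LAT * LON * DAYS_PER_CHUNK * BYTES_PER_VALUE
n_chunks = -(-DAYS // DAYS_PER_CHUNK)  # ceiling division

print(f"total: {total_gb:.1f} GB")
print(f"chunk: {chunk_bytes / 1e9:.2f} GB, {n_chunks} chunks")
```

With these assumptions each chunk is about 1.5 GB, so a reduction can stream through the data one chunk at a time instead of holding all 51 GB in memory.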
![Page 20: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/20.jpg)
DAG of Computation
![Page 31: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/31.jpg)
Dask: Out of Core Scheduler for Python
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler coming
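To make "blocked algorithms and task scheduling" concrete, here is a minimal sketch of a task graph written as a plain dictionary, in the style of Dask's graph convention (keys map either to data or to a tuple of a callable and its arguments). The `get` function is a toy evaluator written for this illustration: it runs serially, does no caching, and is not Dask's scheduler. The key names are made up.

```python
from operator import add


def get(graph, key):
    """Recursively evaluate `key` in a dict-based task graph (toy, serial)."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name other keys are computed first;
        # anything else is passed through as a literal.
        resolved = [
            get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # plain data stored in the graph


# A blocked sum: reduce each chunk separately, then combine the partial results.
graph = {
    "chunk-0": [1, 2, 3],
    "chunk-1": [4, 5, 6],
    "sum-0": (sum, "chunk-0"),
    "sum-1": (sum, "chunk-1"),
    "total": (add, "sum-0", "sum-1"),
}

result = get(graph, "total")
print(result)
```

Because "sum-0" and "sum-1" do not depend on each other, a real scheduler is free to run them in parallel and to discard each chunk once its partial sum is done.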
![Page 32: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/32.jpg)
Simple Architecture
![Page 33: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/33.jpg)
Core Concepts
![Page 34: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/34.jpg)
dask.array: OOC, parallel, ND array
Arithmetic: +, *, ...
Reductions: mean, max, ...
Slicing: x[10:, 100:50:-2]
Fancy indexing: x[:, [3, 1, 2]]
Some linear algebra: tensordot, qr, svd
Parallel algorithms (approximate quantiles, topk, ...)
Slightly overlapping arrays
Integration with HDF5
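A short sketch of a few of the operations listed above, assuming `dask` and `numpy` are installed. The array here is a small in-memory stand-in (8 "days" of a 72x144 grid) rather than the ocean temperature data, and the chunk shape is an arbitrary choice for illustration.

```python
import numpy as np
import dask.array as da

# Small in-memory stand-in for a stack of daily grids.
data = np.arange(8 * 72 * 144, dtype="float64").reshape(8, 72, 144)

# Wrap it as a dask array with two "days" per chunk.
x = da.from_array(data, chunks=(2, 72, 144))

# Operations build a lazy task graph; nothing is computed yet.
mean_over_time = x.mean(axis=0)   # reduction
shifted = x + 1                   # arithmetic
picked = x[:, [3, 1, 2]]          # fancy indexing along one axis

# compute() runs the graph and returns NumPy arrays.
mean_result = mean_over_time.compute()
picked_result = picked.compute()

print(x.numblocks, mean_result.shape, picked_result.shape)
```

The same expressions work on an array backed by an on-disk dataset, which is where the out-of-core behaviour matters.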
![Page 35: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/35.jpg)
dask.dataframe: OOC, parallel dataframe
Elementwise operations: df.x + df.y
Row-wise selections: df[df.x > 0]
Aggregations: df.x.max()
groupby-aggregate: df.groupby(df.x).y.max()
Value counts: df.x.value_counts()
Drop duplicates: df.x.drop_duplicates()
Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
![Page 36: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/36.jpg)
More Complex Graphs
cross validation
![Page 37: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/37.jpg)
http://continuum.io/blog/xray-dask
![Page 38: New Capabilities in the PyData Ecosystem](https://reader030.vdocuments.net/reader030/viewer/2022032513/55d09b9fbb61ebb0058b45ff/html5/thumbnails/38.jpg)
PyData's Future
• Dozens of international meetup groups
• Intl conferences each year, including collab with EuroPython, Strata, and others
• More companies investing in the ecosystem
  • Dato - SFrame, SGraph, ...
  • Cloudera - Impyla, Ibis, ...
  • Microsoft - Python in AzureML
  • Databricks - PySpark
  • Continuum - *.*