Python Data Ecosystem: Thoughts on Building for the Future
![Page 1: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/1.jpg)
1 © Cloudera, Inc. All rights reserved.
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21
![Page 2: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/2.jpg)
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
![Page 3: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/3.jpg)
In process: Python for Data Analysis, 2nd Edition
Coming early 2017
![Page 4: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/4.jpg)
Building open source communities
![Page 5: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/5.jpg)
Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals.
Wikipedia
![Page 6: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/6.jpg)
Step 1 Be open and transparent
![Page 7: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/7.jpg)
Step 2 Reach out to others
![Page 8: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/8.jpg)
Step 3 Strive for consensus
![Page 9: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/9.jpg)
Step 4 Value contributions extending beyond lines of code
![Page 10: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/10.jpg)
Step 5 Make things harder for bad actors
![Page 11: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/11.jpg)
![Page 12: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/12.jpg)
Handling problems carefully
![Page 13: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/13.jpg)
http://numfocus.org
http://apache.org
![Page 14: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/14.jpg)
Python packaging
![Page 15: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/15.jpg)
Packaging is hard
• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
![Page 16: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/16.jpg)
Reflecting on the past
![Page 17: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/17.jpg)
![Page 18: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/18.jpg)
conda-forge
• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools
conda config --add channels conda-forge
![Page 19: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/19.jpg)
What’s important to me right now?
![Page 20: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/20.jpg)
Important things
• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
![Page 21: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/21.jpg)
RAM as the new disk?
• SSD – DRAM performance convergence
• NVM developments (3D XPoint)
[Diagram: one memory working set read by three consumers]
![Page 22: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/22.jpg)
Problems
• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
![Page 23: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/23.jpg)
NumPy solved this problem for Python scientists
• Common memory representation
  • ndarray: strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
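A minimal sketch of what "strided, homogeneous buffer" and "common metadata" mean in practice. It assumes NumPy is installed; the array contents and variable names are illustrative only.

```python
import numpy as np

# A 3x4 array of 64-bit integers stored in one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The dtype is the shared metadata: every element is an 8-byte integer.
print(a.dtype, a.itemsize)   # int64 8

# Strides give the byte step per dimension: 32 bytes to the next row,
# 8 bytes to the next column.
print(a.strides)             # (32, 8)

# Transposing and slicing only change shape and strides; no data is copied.
t = a.T
col = a[:, 1]
print(t.strides)             # (8, 32)
print(np.shares_memory(a, t), np.shares_memory(a, col))  # True True

# Writing through the view is visible in the original array.
col[0] = 100
print(a[0, 1])               # 100
```

Any library that understands the dtype and strides can work on the same buffer, which is the property the slide credits NumPy with.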
![Page 24: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/24.jpg)
Problems NumPy doesn’t solve as well
• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
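A short illustration of the first three bullets, assuming NumPy is installed. The sample values are made up; the point is only how NumPy's dtype changes when data is missing, nested, or textual.

```python
import numpy as np

# Integers with no gaps get a compact integer dtype.
ints = np.array([1, 2, 3])
print(ints.dtype.kind)        # i

# NumPy integers have no NULL. Using NaN as a marker forces float64.
with_nan = np.array([1, np.nan, 3])
print(with_nan.dtype)         # float64

# Using None instead falls back to an array of Python object pointers.
with_none = np.array([1, None, 3])
print(with_none.dtype)        # object

# Nested, JSON-like records also end up as opaque Python objects.
nested = np.array([{"a": [1, 2]}, {"a": []}])
print(nested.dtype)           # object

# Fixed-width strings pad every element to the longest value:
# 8 characters x 4 bytes = 32 bytes per element.
names = np.array(["a", "abcdefgh"])
print(names.itemsize)         # 32
```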
![Page 25: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/25.jpg)
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
![Page 26: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/26.jpg)
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
![Page 27: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/27.jpg)
Focus on CPU Efficiency

| | session_id | timestamp | source_ip |
| --- | --- | --- | --- |
| Row 1 | 1331246660 | 3/8/2012 2:44PM | 99.155.155.225 |
| Row 2 | 1331246351 | 3/8/2012 2:38PM | 65.87.165.114 |
| Row 3 | 1331244570 | 3/8/2012 2:09PM | 71.10.106.181 |
| Row 4 | 1331261196 | 3/8/2012 6:46PM | 76.102.156.138 |

[Diagram: the Traditional Memory Buffer lays these values out row by row (Row 1, Row 2, Row 3, Row 4); the Arrow Memory Buffer lays them out column by column (session_id, timestamp, source_ip)]
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
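The row-versus-column contrast on this slide can be sketched with the standard library alone. This is only an analogy using Python lists and `array`, not Arrow's actual memory layout; the variable names are illustrative.

```python
from array import array

# The four records from the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented: values of one column are interleaved with the other fields,
# so a scan of session_id has to step over every record.
max_row_wise = max(row[0] for row in rows)

# Column-oriented: each column is its own contiguous sequence.
session_id = array("q", (r[0] for r in rows))   # packed 64-bit integers
timestamp = [r[1] for r in rows]
source_ip = [r[2] for r in rows]

# The same scan now touches one contiguous buffer only.
max_column_wise = max(session_id)

print(max_row_wise == max_column_wise)   # True
print(max_column_wise)                   # 1331261196
```

Keeping each column packed and contiguous is what makes the cache-locality and vectorization bullets possible.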
![Page 28: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/28.jpg)
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
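A rough in-process analogy for the two columns above, using only the standard library. The "systems" here are just variables in one Python process, and `pickle` stands in for whatever wire format two real systems would use; none of this is Arrow code.

```python
import pickle
from array import array

# A column of one million 64-bit integers owned by "system A".
column = array("q", range(1_000_000))

# Today: each hand-off serializes and deserializes, producing a full copy.
payload = pickle.dumps(column)
copy_in_b = pickle.loads(payload)

# Shared format: "system B" reads the same bytes through a zero-copy view.
view_in_b = memoryview(column)

# A change made by system A is visible through the view but not in the copy.
column[0] = -1
print(view_in_b[0])   # -1
print(copy_in_b[0])   # 0
```

The view costs no extra memory or CPU per element, whereas the serialized path pays for encoding, decoding, and a second copy of the data.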
![Page 29: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/29.jpg)
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
![Page 30: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/30.jpg)
Real World Example: Feather File Format for Python and R
R
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
Python
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
![Page 31: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/31.jpg)
More on Feather
[Diagram: a Feather File holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library reads and writes the file; Rcpp connects it to an R data.frame and Cython connects it to a pandas DataFrame]
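The "arrays first, metadata last" shape of the diagram can be imitated with a toy container. This is not the Feather format: the JSON footer, the 4-byte size field, and the function names `write_columns` and `read_column` are all invented here for illustration.

```python
import io
import json
import struct
from array import array

def write_columns(buf, columns):
    """Write each column as raw 64-bit integers, then a JSON metadata footer."""
    meta = []
    for name, values in columns.items():
        data = array("q", values).tobytes()
        meta.append({"name": name, "offset": buf.tell(), "length": len(values)})
        buf.write(data)
    footer = json.dumps(meta).encode("utf-8")
    buf.write(footer)
    # The final 4 bytes record the footer size so a reader can find it.
    buf.write(struct.pack("<I", len(footer)))

def read_column(buf, name):
    """Locate one column through the footer and read only its bytes."""
    buf.seek(-4, io.SEEK_END)
    (footer_size,) = struct.unpack("<I", buf.read(4))
    buf.seek(-4 - footer_size, io.SEEK_END)
    meta = json.loads(buf.read(footer_size).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    buf.seek(entry["offset"])
    out = array("q")
    out.frombytes(buf.read(entry["length"] * out.itemsize))
    return out.tolist()

buf = io.BytesIO()
write_columns(buf, {"x": [1, 2, 3], "y": [10, 20, 30]})
print(read_column(buf, "y"))   # [10, 20, 30]
```

Because the metadata records where each array starts, a reader can pull a single column without scanning the rest of the file.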
![Page 32: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/32.jpg)
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
![Page 33: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/33.jpg)
Apache Parquet: Python support is coming
• Collaborating with Uwe Korn from Blue Yonder
[Diagram: pandas, Arrow (C++ / Python), Parquet (C++)]
![Page 34: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/34.jpg)
Shared needs for Python, R, Julia, ...
• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
![Page 35: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/35.jpg)
Real World Example: Python With Spark, Drill, Impala
[Diagram: the SQL Engine supplies in partition 0 ... in partition n - 1; each partition is the input to a Python function (user-supplied Python code), and each output becomes out partition 0 ... out partition n - 1, which goes back to the SQL Engine]
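The partition-in, partition-out flow in the diagram can be sketched as follows. The engine here is a hypothetical stand-in built on a thread pool; `run_engine` and `user_function` are invented names, not an API of Spark, Drill, or Impala.

```python
from concurrent.futures import ThreadPoolExecutor

def user_function(partition):
    """Hypothetical user-supplied code: keep even values and double them."""
    return [2 * x for x in partition if x % 2 == 0]

def run_engine(in_partitions, func):
    """Stand-in for the SQL engine: apply func to every input partition."""
    with ThreadPoolExecutor() as pool:
        # map preserves order, so out partition i comes from in partition i.
        return list(pool.map(func, in_partitions))

in_partitions = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
out_partitions = run_engine(in_partitions, user_function)
print(out_partitions)   # [[4, 8], [12], [16]]
```

In a real deployment the cost of moving each partition between the engine and Python is the part a shared memory format is meant to reduce.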
![Page 36: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/36.jpg)
Get Involved in Arrow
• Join the community
  • [email protected]
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
![Page 37: Python Data Ecosystem: Thoughts on Building for the Future](https://reader033.vdocuments.net/reader033/viewer/2022052514/587adcf11a28ab542b8b59a3/html5/thumbnails/37.jpg)
Thank you
Wes McKinney @wesmckinn
Views are my own