Blaze: a large-scale, array-oriented infrastructure for Python
DESCRIPTION
This talk gives a high-level overview of the motivation, design goals, and status of the Blaze project from Continuum Analytics, which is building a large-scale array object for Python.
TRANSCRIPT
Blaze: a large-scale, array-oriented infrastructure for Python
PyData Silicon Valley 2013
Travis E. Oliphant
Tuesday, March 19, 13
Brief History
Person                                      Package                  Year
Jim Fulton                                  Matrix Object in Python  1994
Jim Hugunin                                 Numeric                  1995
Perry Greenfield, Rick White, Todd Miller   Numarray                 2001
Travis Oliphant                             NumPy                    2005
Early pieces of SciPy
cephesmodule June 1998
fftw wrappers November 1998
stats.py (Gary Strangman) December 1998
1999: Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
Gaussian quadrature 5 Jan 1999
cephes 1.0 30 Jan 1999
sigtools 0.40 23 Feb 1999
Numeric docs March 1999
cephes 1.1 9 Mar 1999
multipack 0.3 13 Apr 1999
Helper routines 14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad) 29 Apr 1999
sparse plan described 30 May 1999
multipack 0.7 14 Jun 1999
SparsePy 0.1 5 Nov 1999
cephes 1.2 (vectorize) 29 Dec 1999
Plotting??
Gist, XPLOT, DISLIN, Gnuplot
Helping with f2py
SciPy 2001
Founded in 2001 with Travis Vaught
Eric Jones: weave, cluster, GA*
Pearu Peterson: linalg, interpolate, f2py
Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc
Community effort
many, many others --- forgive me!
• Chuck Harris
• Pauli Virtanen
• David Cournapeau
• Stefan van der Walt
• Jake Vanderplas
• Josef Perktold
• Anne Archibald
• Dag Sverre Seljebotn
• Robert Kern
• Matthew Brett
• Warren Weckesser
• Ralf Gommers
• Joe Harrington --- Documentation effort
• Andrew Straw --- www.scipy.org
1,000,000 to 2,000,000 users of NumPy!
Now What?
What is good about NumPy?
• Array-oriented
• Extensive DType System (including structures)
• C-API --- lots of libraries
• Simple to understand data-structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Ufuncs and more
• Broadcasting
• Easy to interface C/C++/Fortran code
What is wrong with NumPy
• Dtype system is difficult to extend
• Immediate mode creates huge temporaries (spawning Numexpr)
• “Almost” an in-memory data-base comparable to SQLite (missing indexes)
• Integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU
Improvements needed
• NDArray improvements
  • Indexes (esp. for Structured arrays)
  • SQL front-end
  • Multi-level, hierarchical labels
  • selection via mappings (labeled arrays)
  • Memory spaces (array made up of regions)
  • Distributed arrays (global array)
  • Compressed arrays
  • Standard distributed persistence
  • fancy indexing as view and optimizations
  • streaming arrays
Improvements needed
• Dtype improvements
  • Enumerated types (including dynamic enumeration)
  • Derived fields
  • Specification as a class (or JSON)
  • Pointer dtype (i.e. C++ object, or varchar)
  • Missing data: masks and bit-patterns
  • Parameterized field names
  • Computed fields
Improvements needed
• Ufunc improvements
  • Generalized ufuncs support more than just contiguous arrays
  • Specification of ufuncs in Python
  • Move most dtype “array functions” to ufuncs
  • Unify error-handling for all computations
  • Allow lazy-evaluation and remote computation --- streaming and generator data
  • Structured and string dtype ufuncs
  • Multi-core and GPU optimized ufuncs
  • Group-by reduction
More Improvements needed
• Miscellaneous improvements
  • ABI-management
  • Eventual Move to library (NDLib)?
  • NDLib could serve as base for Javascript and other high-level languages?
  • Integration with LLVM
  • Possible dtype / shape / stride unification into a “table interface”
  • Remote computation
  • Fast I/O for CSV and Excel
New Project
Blaze
NumPy
Out of Core, Distributed and Optimized NumPy
NumPy Array
shape
Blaze: Different kinds of Arrays
Indexable
NDTable (Record Type): Deferred, Concrete
NDArray (Primitive Type): Deferred, Concrete
Blaze Deferred Arrays
    +
   / \
  A   *
     / \
    B   C
A + B*C
• Symbolic objects which build a graph
• Represents deferred computation
Usually what you have when you have a Blaze Array
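The graph for A + B*C described on this slide can be sketched in plain Python. This is only an illustration of the idea, not Blaze's API: the `Node`, `Leaf`, and `evaluate` names are hypothetical, and Python lists stand in for arrays.

```python
class Node:
    """A deferred operation: records the operator and operands, computes nothing."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def __add__(self, other):
        return Node("+", self, other)

    def __mul__(self, other):
        return Node("*", self, other)

    def __repr__(self):
        return f"({self.left!r} {self.op} {self.right!r})"


class Leaf(Node):
    """A named input array (a plain list here, purely for illustration)."""

    def __init__(self, name, data):
        self.name, self.data = name, data

    def __repr__(self):
        return self.name


def evaluate(node):
    """Walk the graph recursively and compute the result."""
    if isinstance(node, Leaf):
        return list(node.data)
    left, right = evaluate(node.left), evaluate(node.right)
    if node.op == "+":
        return [x + y for x, y in zip(left, right)]
    return [x * y for x, y in zip(left, right)]


A = Leaf("A", [1, 2, 3])
B = Leaf("B", [4, 5, 6])
C = Leaf("C", [7, 8, 9])

expr = A + B * C       # builds a graph; no arithmetic happens yet
print(expr)            # (A + (B * C))
print(evaluate(expr))  # [29, 42, 57]
```

Because the expression is only a description until `evaluate` runs, an evaluator is free to choose how and where to compute it.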
Deferred allows handling large arrays
Can be handled out-of-core using chunks to stream through memory.
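A rough sketch of what streaming chunks through memory might look like for an expression such as A + B*C. The `chunks` and `evaluate_chunked` helpers are hypothetical names for this illustration, and `range` objects stand in for inputs too large to load at once.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def evaluate_chunked(a, b, c, size=4):
    """Yield A + B*C one chunk at a time, holding only one chunk of each input."""
    for ca, cb, cc in zip(chunks(a, size), chunks(b, size), chunks(c, size)):
        yield [x + y * z for x, y, z in zip(ca, cb, cc)]


for block in evaluate_chunked(range(6), range(6), range(6), size=4):
    print(block)
# [0, 2, 6, 12]
# [20, 30]
```

Peak memory here depends on the chunk size rather than on the length of the inputs, which is the property the slide points at.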
Blaze Concrete Array
Data Descriptor (URL, URL, URL, URL, URL; Indexes): Where are the bytes?
DataShape (Extensible Type System which includes shape): What do the bytes mean?
MetaData Dictionary: Labels, provenance, etc.
Multiple URLs comprising an array
URLs Provide Bytes
Memory-Like: arbitrarily sliced, random seeks
File-Like: deal with in chunks, random seeks
Stream-Like: deal with in chunks, sequential seeks
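The three access patterns named on this slide can be illustrated with the standard library. This is a sketch under assumptions: the function names are made up for this example, and an in-memory `io.BytesIO` stands in for whatever actually sits behind a URL.

```python
import io

data = bytes(range(16))  # stand-in for the bytes behind one URL


def slice_memory(buf, start, stop):
    """Memory-like: the whole buffer is addressable, so any slice works."""
    return bytes(memoryview(buf)[start:stop])


def read_chunk(f, index, size=4):
    """File-like: data comes in fixed-size chunks, but any chunk can be sought."""
    f.seek(index * size)
    return f.read(size)


def stream_chunks(f, size=4):
    """Stream-like: chunks arrive strictly in order, with no seeking back."""
    while True:
        block = f.read(size)
        if not block:
            return
        yield block


print(slice_memory(data, 3, 6))
print(read_chunk(io.BytesIO(data), 2))
print(len(list(stream_chunks(io.BytesIO(data)))))  # 4 chunks of 4 bytes each
```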
Blaze Data Container
ByteProvider
Data Buffer, Index Operation
NumPy, BLZ, Persistent Format
RDBMS
Data Descriptor Protocol
CSV, Data Stream
Indexes
Contiguous / Strided
Chunked / Tiled
Opaque Element-only
NumPy-Like
Opaque Iterator-access
Special Access
Indexes allow for many orderings
DataShape Type System
• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility
Shape DType
DataShape
Allows for all kinds of containers
Advanced Types
type Point = { x : int; y : int}
type Space = { a: Point; b: Point}
5, 10, Space
type SquareMatrix T = N, N, T
type IntMatrix N = N, N, int32
Parametrized Types
Alias Types
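To make the type declarations on this slide concrete, here is one way they might be modeled in plain Python. This is not the DataShape implementation: `Record`, `DataShape`, `square_matrix`, `int_matrix`, and `leaf_fields` are hypothetical names, and type names are kept as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A record type: an ordered tuple of (field name, field type) pairs."""
    fields: tuple


@dataclass(frozen=True)
class DataShape:
    """Dimensions followed by an element type, e.g. 5, 10, Space."""
    dims: tuple
    measure: object


# type Point = { x : int; y : int }
Point = Record((("x", "int"), ("y", "int")))
# type Space = { a : Point; b : Point }
Space = Record((("a", Point), ("b", Point)))
# 5, 10, Space
grid = DataShape((5, 10), Space)


def square_matrix(n, t):
    """type SquareMatrix T = N, N, T, with N bound to n."""
    return DataShape((n, n), t)


def int_matrix(n):
    """type IntMatrix N = N, N, int32."""
    return square_matrix(n, "int32")


def leaf_fields(t, prefix=""):
    """Flatten nested records into dotted field names."""
    if not isinstance(t, Record):
        return [prefix.rstrip(".")]
    names = []
    for name, sub in t.fields:
        names.extend(leaf_fields(sub, prefix + name + "."))
    return names


print(leaf_fields(grid.measure))  # ['a.x', 'a.y', 'b.x', 'b.y']
print(int_matrix(3))              # DataShape(dims=(3, 3), measure='int32')
```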
Advanced Shapes
{1,2,4,2,1}, int32
Could Represent
[[1], [1,2], [1,3,2,9], [3,2], [3]]
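One common way to store a ragged shape like this is a flat value list plus row offsets. The sketch below shows that technique as an assumption for illustration; the slide does not say how Blaze stores such data, and `pack` and `row` are hypothetical helpers.

```python
from itertools import accumulate


def pack(rows):
    """Flatten ragged rows into row lengths, row start offsets, and flat values."""
    lengths = [len(r) for r in rows]
    offsets = [0, *accumulate(lengths)]
    values = [v for r in rows for v in r]
    return lengths, offsets, values


def row(values, offsets, i):
    """Recover row i from the flat storage."""
    return values[offsets[i]:offsets[i + 1]]


rows = [[1], [1, 2], [1, 3, 2, 9], [3, 2], [3]]
lengths, offsets, values = pack(rows)

print(lengths)                  # [1, 2, 4, 2, 1]
print(row(values, offsets, 2))  # [1, 3, 2, 9]
```

The `lengths` list is exactly the variable dimension `{1,2,4,2,1}` from the slide.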
Execution Model
• Graphs dispatch to specialized library code that is “registered with the system” based on type and meta-data of array (Blaze Modules)
• Many operations can be compiled with LLVM to machine-code
  • BLIR (simple typed expression syntax)
  • Numba (Python compiler)
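A minimal sketch of type-based dispatch to registered implementations, as the first bullet describes. The registry, the `register` decorator, and `dispatch` are hypothetical and only illustrate the pattern; they are not Blaze's module system, and dtypes are plain strings here.

```python
import math

_registry = {}  # maps (operation name, dtype name) to an implementation


def register(op, dtype):
    """Decorator that registers a specialized implementation for (op, dtype)."""
    def decorator(func):
        _registry[(op, dtype)] = func
        return func
    return decorator


@register("sum", "int32")
def sum_int32(values):
    return sum(values)


@register("sum", "float64")
def sum_float64(values):
    # A type-specific kernel: compensated summation for floating point.
    return math.fsum(values)


def dispatch(op, dtype, *args):
    """Look up the implementation registered for this operation and type."""
    try:
        func = _registry[(op, dtype)]
    except KeyError:
        raise TypeError(f"no implementation of {op!r} for {dtype!r}") from None
    return func(*args)


print(dispatch("sum", "int32", [1, 2, 3]))     # 6
print(dispatch("sum", "float64", [0.1] * 10))  # 1.0
```

A graph evaluator would call something like `dispatch` at each node, using the type and metadata of the operands to select the code to run.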
Blaze Agents
MongoDB, Vertica, HDFS, CSV Directory (one Blaze Agent each)
Code Graph with Blaze Arrays
Code
Data
“I think you should be more explicit here in step two.”
How?
Team
Travis Oliphant: NumPy, SciPy
Peter Wang: Chaco, Bokeh
Mark Wiebe: NumPy, DyND
Stephen Diehl
Mark Florisson: Numba
Francesc Alted: PyTables
Oscar Villellas
DARPA providing help
DARPA-BAA-12-38: XDATA
TA-1: Scalable analytics and data processing technology
TA-2: Visual user interface technology
Status
Type System = DataShape
Best Type System this side of Haskell!
BLZ persistence
BLZ layout at a glance
Blosc format (Chunk): Header, Offset 0, Offset 1, Offset 2, ..., Block 0, Block 1, Block 2
Bloscpack (BLP) format (Super-Chunk): Header, Offset 0, Offset 1, Offset 2, ..., Chunk 0, Chunk 1, Chunk 2
Blaze (BLZ) format (Dataset): root containing meta and data; data holds __0__.blp, __1__.blp
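The header / offsets / chunks arrangement in this diagram can be illustrated with a toy container. The byte layout below is invented for this sketch and is not the real Blosc, Bloscpack, or BLZ format; `zlib` stands in for the compressor, and `pack_chunks` and `read_chunk` are hypothetical names.

```python
import struct
import zlib


def pack_chunks(chunks):
    """Serialize as: chunk count, one offset per chunk, then the compressed chunks."""
    compressed = [zlib.compress(c) for c in chunks]
    header = struct.pack("<I", len(compressed))  # 4-byte chunk count
    table_size = 8 * len(compressed)             # one 8-byte offset per chunk
    offsets, pos = [], len(header) + table_size
    for blob in compressed:
        offsets.append(pos)
        pos += len(blob)
    table = struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + table + b"".join(compressed)


def read_chunk(buf, i):
    """Use the offset table to decompress chunk i without touching the others."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offsets = struct.unpack_from(f"<{count}Q", buf, 4)
    end = offsets[i + 1] if i + 1 < count else len(buf)
    return zlib.decompress(buf[offsets[i]:end])


buf = pack_chunks([b"a" * 100, b"b" * 50, b"c" * 10])
print(read_chunk(buf, 1) == b"b" * 50)  # True
```

The offset table is what makes random access to a single chunk possible without decompressing everything that precedes it.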
Blaze Server
https://wakari.io/nb/urls/raw.github.com/ContinuumIO/blaze-web/master/example/notebooks/Kiva-Tiny%20Example.ipynb
Computed Columns!
Out-of-core calculations
Distributed Array
Coming soon....
Roadmap
• 0.1 release expected in May
• 0.3 release at end of August
• 1.0 release by PyData west-coast 2014
• For now, only get involved if you want to develop
• Then, continue building the PyData ecosystem around a scalable array.
NumFOCUS
Num(Py) Foundation for Open Code for Usable Science
http://www.numfocus.org