Blaze: a large-scale, array-oriented infrastructure for Python

Blaze: a large-scale, array-oriented infrastructure for Python. PyData Silicon Valley 2013. Travis E. Oliphant. Tuesday, March 19, 2013.

Upload: travis-oliphant

Post on 27-Jan-2015


Category: Technology



DESCRIPTION

This talk gives a high-level overview of the motivation, design goals, and status of Blaze, a large-scale array object for Python from Continuum Analytics.

TRANSCRIPT

Page 1: Blaze: a large-scale, array-oriented infrastructure for Python

Blaze: a large-scale, array-oriented infrastructure for Python

PyData Silicon Valley 2013

Travis E. Oliphant

Tuesday, March 19, 13

Page 2: Blaze: a large-scale, array-oriented infrastructure for Python

Brief History

Person | Package | Year
Jim Fulton | Matrix Object in Python | 1994
Jim Hugunin | Numeric | 1995
Perry Greenfield, Rick White, Todd Miller | Numarray | 2001
Travis Oliphant | NumPy | 2005

Page 3: Blaze: a large-scale, array-oriented infrastructure for Python

Early pieces of SciPy

• cephesmodule (June 1998)
• fftw wrappers (November 1998)
• stats.py (December 1998, Gary Strangman)

Page 4: Blaze: a large-scale, array-oriented infrastructure for Python

1999: Early SciPy emerges

Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.

In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.

• Gaussian quadrature: 5 Jan 1999
• cephes 1.0: 30 Jan 1999
• sigtools 0.40: 23 Feb 1999
• Numeric docs: March 1999
• cephes 1.1: 9 Mar 1999
• multipack 0.3: 13 Apr 1999
• Helper routines: 14 Apr 1999
• multipack 0.6 (leastsq, ode, fsolve, quad): 29 Apr 1999
• sparse plan described: 30 May 1999
• multipack 0.7: 14 Jun 1999
• SparsePy 0.1: 5 Nov 1999
• cephes 1.2 (vectorize): 29 Dec 1999

Plotting?? Gist, XPLOT, DISLIN, Gnuplot

Helping with f2py

Helping with f2py

Page 5: Blaze: a large-scale, array-oriented infrastructure for Python

SciPy 2001

Founded in 2001 with Travis Vaught

• Eric Jones: weave, cluster, GA*
• Pearu Peterson: linalg, interpolate, f2py
• Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

Page 6: Blaze: a large-scale, array-oriented infrastructure for Python

Community effort

many, many others --- forgive me!

• Chuck Harris
• Pauli Virtanen
• David Cournapeau
• Stefan van der Walt
• Jake Vanderplas
• Josef Perktold
• Anne Archibald
• Dag Sverre Seljebotn
• Robert Kern
• Matthew Brett
• Warren Weckesser
• Ralf Gommers
• Joe Harrington --- Documentation effort
• Andrew Straw --- www.scipy.org

Page 7: Blaze: a large-scale, array-oriented infrastructure for Python

1,000,000 to 2,000,000 users of NumPy!

Page 8: Blaze: a large-scale, array-oriented infrastructure for Python

Now What?

Page 9: Blaze: a large-scale, array-oriented infrastructure for Python

What is good about NumPy?

• Array-oriented
• Extensive DType System (including structures)
• C-API --- lots of libraries
• Simple to understand data-structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Ufuncs and more
• Broadcasting
• Easy to interface C/C++/Fortran code

Page 10: Blaze: a large-scale, array-oriented infrastructure for Python

What is wrong with NumPy?

• Dtype system is difficult to extend
• Immediate mode creates huge temporaries (spawning Numexpr)
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU

Page 11: Blaze: a large-scale, array-oriented infrastructure for Python

Improvements needed

• NDArray improvements
  • Indexes (esp. for Structured arrays)
  • SQL front-end
  • Multi-level, hierarchical labels
  • Selection via mappings (labeled arrays)
  • Memory spaces (array made up of regions)
  • Distributed arrays (global array)
  • Compressed arrays
  • Standard distributed persistence
  • Fancy indexing as view and optimizations
  • Streaming arrays

Page 12: Blaze: a large-scale, array-oriented infrastructure for Python

Improvements needed

• Dtype improvements
  • Enumerated types (including dynamic enumeration)
  • Derived fields
  • Specification as a class (or JSON)
  • Pointer dtype (i.e. C++ object, or varchar)
  • Missing data: masks and bit-patterns
  • Parameterized field names
  • Computed fields

Page 13: Blaze: a large-scale, array-oriented infrastructure for Python

Improvements needed

• Ufunc improvements
  • Generalized ufuncs support more than just contiguous arrays
  • Specification of ufuncs in Python
  • Move most dtype “array functions” to ufuncs
  • Unify error-handling for all computations
  • Allow lazy-evaluation and remote computation --- streaming and generator data
  • Structured and string dtype ufuncs
  • Multi-core and GPU optimized ufuncs
  • Group-by reduction

Page 14: Blaze: a large-scale, array-oriented infrastructure for Python

More Improvements needed

• Miscellaneous improvements
  • ABI-management
  • Eventual move to library (NDLib)?
  • NDLib could serve as base for Javascript and other high-level languages?
  • Integration with LLVM
  • Possible dtype / shape / stride unification into a “table interface”
  • Remote computation
  • Fast I/O for CSV and Excel

Page 15: Blaze: a large-scale, array-oriented infrastructure for Python

New Project

Blaze

Out-of-core, distributed, and optimized NumPy

Page 16: Blaze: a large-scale, array-oriented infrastructure for Python

NumPy Array

shape

Page 17: Blaze: a large-scale, array-oriented infrastructure for Python

Blaze: Different kinds of Arrays

Indexable

• NDTable (Record Type): Deferred or Concrete
• NDArray (Primitive Type): Deferred or Concrete

Page 18: Blaze: a large-scale, array-oriented infrastructure for Python

Blaze Deferred Arrays

[Expression graph for A + B*C: the root node + has children A and *; the node * has children B and C]

• Symbolic objects which build a graph
• Represents deferred computation

Usually what you have when you have a Blaze Array
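A minimal sketch of the deferred idea, in plain Python rather than Blaze itself: arithmetic on symbolic objects builds a graph for A + B*C, and nothing is computed until an explicit evaluation step. The names `Node`, `Leaf`, `Op`, and `evaluate` are hypothetical and are not taken from the Blaze code base.

```python
class Node:
    """Base class: arithmetic builds graph nodes instead of computing."""

    def __add__(self, other):
        return Op("+", self, other)

    def __mul__(self, other):
        return Op("*", self, other)


class Leaf(Node):
    """A named input array; it carries no data of its own."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


class Op(Node):
    """An interior node: a binary operation over two sub-graphs."""

    def __init__(self, symbol, left, right):
        self.symbol, self.left, self.right = symbol, left, right

    def __repr__(self):
        return f"({self.left} {self.symbol} {self.right})"


def evaluate(node, env):
    """Walk the graph, look leaves up in `env`, and apply ops elementwise."""
    if isinstance(node, Leaf):
        return env[node.name]
    left = evaluate(node.left, env)
    right = evaluate(node.right, env)
    if node.symbol == "+":
        return [x + y for x, y in zip(left, right)]
    return [x * y for x, y in zip(left, right)]


A, B, C = Leaf("A"), Leaf("B"), Leaf("C")
expr = A + B * C   # builds a graph; no arithmetic on data happens here
values = evaluate(expr, {"A": [1, 2], "B": [3, 4], "C": [5, 6]})
```

Because the whole expression is available as a graph before any data is touched, an evaluator is free to decide how to run it (for example in chunks, or on another machine) instead of materializing `B * C` first.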

Page 19: Blaze: a large-scale, array-oriented infrastructure for Python

Deferred allows handling large arrays

Can be handled out-of-core using chunks to stream through memory.
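As a rough illustration of streaming chunks through memory, the sketch below evaluates A + B*C over three on-disk arrays while holding only one chunk of each in memory at a time. It uses only the Python standard library; the raw float64 file layout and the helper names (`write_doubles`, `read_chunks`, `chunked_a_plus_b_times_c`) are assumptions made for this example and are unrelated to Blaze's own storage format.

```python
import os
import tempfile
from array import array


def write_doubles(path, values):
    """Store values on disk as raw float64."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def read_chunks(path, chunk_len):
    """Yield successive chunks of at most `chunk_len` doubles from the file."""
    itemsize = array("d").itemsize
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_len * itemsize)
            if not raw:
                break
            chunk = array("d")
            chunk.frombytes(raw)
            yield chunk


def chunked_a_plus_b_times_c(path_a, path_b, path_c, path_out, chunk_len):
    """Evaluate A + B*C one chunk at a time, appending results to `path_out`."""
    with open(path_out, "wb") as out:
        streams = zip(read_chunks(path_a, chunk_len),
                      read_chunks(path_b, chunk_len),
                      read_chunks(path_c, chunk_len))
        for a, b, c in streams:
            array("d", (x + y * z for x, y, z in zip(a, b, c))).tofile(out)


tmp = tempfile.mkdtemp()
paths = {name: os.path.join(tmp, name + ".bin") for name in "abcr"}
n = 10
write_doubles(paths["a"], range(n))     # A = 0, 1, ..., 9
write_doubles(paths["b"], [2.0] * n)    # B = 2 everywhere
write_doubles(paths["c"], [3.0] * n)    # C = 3 everywhere
chunked_a_plus_b_times_c(paths["a"], paths["b"], paths["c"], paths["r"],
                         chunk_len=4)
result = [x for chunk in read_chunks(paths["r"], 4) for x in chunk]
```

Peak memory here depends on `chunk_len`, not on the length of the arrays, which is the property the slide is pointing at.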

Page 20: Blaze: a large-scale, array-oriented infrastructure for Python

Blaze Concrete Array

URL, URL, URL, URL, URL

Indexes

Data Descriptor: Where are the bytes?

DataShape: What do the bytes mean? An extensible type system which includes shape.

MetaData Dictionary: labels, provenance, etc.

Page 21: Blaze: a large-scale, array-oriented infrastructure for Python

Multiple URLs comprising an array

Page 22: Blaze: a large-scale, array-oriented infrastructure for Python

URLs Provide Bytes

• Memory-Like: arbitrarily sliced, random seeks
• File-Like: deal with in chunks, random seeks
• Stream-Like: deal with in chunks, sequential seeks
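The three access patterns can be illustrated with standard-library Python objects. This is only an analogy chosen for this note: `memoryview`, `io.BytesIO`, and a generator stand in for memory-like, file-like, and stream-like byte sources, and none of it is Blaze's byte-provider interface.

```python
import io

payload = bytes(range(16))

# Memory-like: arbitrary (even strided) slicing over the underlying buffer.
mem = memoryview(payload)
every_fourth = mem[::4].tobytes()

# File-like: data is read in chunks, but any offset can be sought first.
f = io.BytesIO(payload)   # stand-in for a file opened in binary mode
f.seek(8)
chunk_from_middle = f.read(4)


# Stream-like: chunks arrive strictly in order, with no seeking back.
def stream(data, chunk_size):
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


chunks_in_order = list(stream(payload, 6))
```

An algorithm written against the stream-like pattern also works on the other two, while the reverse is not true, so the access pattern a source supports limits which operations can run on it efficiently.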

Page 23: Blaze: a large-scale, array-oriented infrastructure for Python

Blaze Data Container

ByteProvider

Data Buffer, Index Operation

NumPy, BLZ, Persistent Format

RDBMS

Data Descriptor Protocol

CSV, Data Stream

Page 24: Blaze: a large-scale, array-oriented infrastructure for Python

Indexes

Contiguous / Strided

Chunked / Tiled

Opaque Element-only

NumPy-Like

Opaque Iterator-access

Special Access

Page 25: Blaze: a large-scale, array-oriented infrastructure for Python

Indexes allow for many orderings

Page 26: Blaze: a large-scale, array-oriented infrastructure for Python

DataShape Type System

• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility

Shape DType

DataShape

Page 27: Blaze: a large-scale, array-oriented infrastructure for Python

Allows for all kinds of containers

Page 28: Blaze: a large-scale, array-oriented infrastructure for Python

Advanced Types

Alias Types

type Point = { x : int; y : int}

type Space = { a: Point; b: Point}

5, 10, Space

Parametrized Types

type SquareMatrix T = N, N, T

type IntMatrix N = N, N, int32
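For readers more familiar with C-style layouts, here is a loose analogue of these declarations using Python's standard `ctypes` module. It is not DataShape syntax or the Blaze API; it assumes `int` means a 32-bit integer, and `square_matrix` is a hypothetical helper that fixes the dimension N to a concrete number, which DataShape leaves symbolic.

```python
import ctypes


class Point(ctypes.Structure):
    # type Point = { x : int; y : int }   (int assumed to be int32)
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]


class Space(ctypes.Structure):
    # type Space = { a: Point; b: Point }
    _fields_ = [("a", Point), ("b", Point)]


# "5, 10, Space": a 5 x 10 array whose elements are Space records.
Grid = (Space * 10) * 5
grid = Grid()                 # zero-initialized storage
grid[4][9].b.y = 42           # writes through to the shared buffer


def square_matrix(element_type, n):
    """Rough analogue of `SquareMatrix T = N, N, T` for one concrete n."""
    return (element_type * n) * n


IntMatrix3 = square_matrix(ctypes.c_int32, 3)
```

The point of the comparison is that a single declaration carries both the shape (5, 10) and the element layout (nested records), which is what the slides mean by a type system that includes shape.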

Page 29: Blaze: a large-scale, array-oriented infrastructure for Python

Advanced Shapes

{1,2,4,2,1}, int32

Could Represent

[ [1], [1,2], [1,3,2,9], [3,2], [3] ]
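One way to picture a ragged shape such as `{1,2,4,2,1}, int32` is a flat buffer plus row offsets. The sketch below is an assumption made for illustration, not a description of how Blaze stores such arrays; `row_lengths`, `flatten_ragged`, and `get` are hypothetical helpers.

```python
def row_lengths(rows):
    """The per-row lengths, i.e. the variable dimension of the shape."""
    return [len(row) for row in rows]


def flatten_ragged(rows):
    """Pack ragged rows into one flat buffer plus cumulative offsets."""
    flat, offsets = [], [0]
    for row in rows:
        flat.extend(row)
        offsets.append(len(flat))
    return flat, offsets


def get(flat, offsets, i, j):
    """Element j of row i, with bounds checked against that row's length."""
    start, stop = offsets[i], offsets[i + 1]
    if not 0 <= j < stop - start:
        raise IndexError(f"row {i} has only {stop - start} elements")
    return flat[start + j]


rows = [[1], [1, 2], [1, 3, 2, 9], [3, 2], [3]]
lengths = row_lengths(rows)
flat, offsets = flatten_ragged(rows)
last_of_third_row = get(flat, offsets, 2, 3)
```

With this layout the five rows share one contiguous buffer of ten values, and the lengths 1, 2, 4, 2, 1 from the shape are recoverable as differences of consecutive offsets.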

Page 30: Blaze: a large-scale, array-oriented infrastructure for Python

Execution Model

• Graphs dispatch to specialized library code that is “registered with the system” based on type and meta-data of array (blaze Modules)

• Many operations can be compiled with LLVM to machine code
  • BLIR (simple typed expression syntax)
  • Numba (Python compiler)
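The dispatch idea (graph nodes resolved to kernels that were registered for a given type) can be sketched with a small registry. Everything here is hypothetical: the `KERNELS` table, the `register` decorator, and the tuple-based graph are illustrative stand-ins, not the Blaze module system, and plain Python functions take the place of LLVM-compiled code.

```python
KERNELS = {}   # (operation, dtype name) -> callable; hypothetical registry


def register(op, dtype):
    """Decorator that records a kernel for one (operation, dtype) pair."""
    def decorator(func):
        KERNELS[(op, dtype)] = func
        return func
    return decorator


@register("add", "int32")
def add_int32(a, b):
    return [x + y for x, y in zip(a, b)]


@register("mul", "int32")
def mul_int32(a, b):
    return [x * y for x, y in zip(a, b)]


@register("add", "string")
def add_string(a, b):
    # For strings, "add" is taken here to mean concatenation.
    return [x + y for x, y in zip(a, b)]


def dispatch(op, dtype, *args):
    """Look up the kernel registered for this operation and type, then run it."""
    try:
        kernel = KERNELS[(op, dtype)]
    except KeyError:
        raise TypeError(f"no kernel registered for {op!r} on {dtype!r}") from None
    return kernel(*args)


def run(node, dtype, env):
    """Evaluate a nested tuple graph such as ("add", "A", ("mul", "B", "C"))."""
    if isinstance(node, str):
        return env[node]
    op, left, right = node
    return dispatch(op, dtype, run(left, dtype, env), run(right, dtype, env))


graph = ("add", "A", ("mul", "B", "C"))
ints = run(graph, "int32", {"A": [1, 2], "B": [3, 4], "C": [5, 6]})
words = dispatch("add", "string", ["ab", "cd"], ["x", "y"])
```

Keeping the graph separate from the kernels means a new back end only has to register implementations for the operations and types it supports; the graph itself does not change.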

Page 31: Blaze: a large-scale, array-oriented infrastructure for Python

Blaze Agents

[Diagram: four data sources (MongoDB, Vertica, HDFS, CSV Directory), each paired with a Blaze Agent; a Code Graph with Blaze Arrays; labels: Code, Data]

Page 32: Blaze: a large-scale, array-oriented infrastructure for Python

“I think you should be more explicit here in step two.”

How?

Page 33: Blaze: a large-scale, array-oriented infrastructure for Python

Team

• Travis Oliphant: NumPy, SciPy
• Peter Wang: Chaco, Bokeh
• Mark Wiebe: NumPy, DyND
• Stephen Diehl
• Mark Florisson: Numba
• Francesc Alted: PyTables
• Oscar Villellas

Page 34: Blaze: a large-scale, array-oriented infrastructure for Python

DARPA providing help

DARPA-BAA-12-38: XDATA

TA-1: Scalable analytics and data processing technology

TA-2: Visual user interface technology

Page 35: Blaze: a large-scale, array-oriented infrastructure for Python

Status

Page 36: Blaze: a large-scale, array-oriented infrastructure for Python

Type System = DataShape

Best Type System this side of Haskell!

Page 37: Blaze: a large-scale, array-oriented infrastructure for Python

BLZ persistence

BLZ layout at a glance

[Figure: a BLZ dataset is a root directory containing meta and data; data holds the files __0__.blp, __1__.blp, ... One level shows a header, offsets 0, 1, 2, ..., and blocks 0, 1, 2 (a chunk); the next shows a header, offsets 0, 1, 2, ..., and chunks 0, 1, 2 (a super-chunk). Level labels: Blosc format, Bloscpack (BLP) format, Blaze (BLZ) format.]

Page 39: Blaze: a large-scale, array-oriented infrastructure for Python

Out-of-core calculations

Page 40: Blaze: a large-scale, array-oriented infrastructure for Python

Distributed Array

Coming soon....

Page 41: Blaze: a large-scale, array-oriented infrastructure for Python

Roadmap

• 0.1 release expected in May
• 0.3 release at end of August
• 1.0 release by PyData west-coast 2014
• For now, get involved only if you want to develop
• Then, continue building the PyData ecosystem around a scalable array

Page 42: Blaze: a large-scale, array-oriented infrastructure for Python

NumFOCUS

Num(Py) Foundation for Open Code for Usable Science

http://www.numfocus.org
