Python Blaze Overview

Next Generation NumPy blaze.pydata.org

Upload: ian-stokes-rees

Post on 27-Jan-2015


DESCRIPTION

Blaze is a next-generation NumPy library that provides out-of-core access to data, that is, data too large to fit in memory. This talk summarizes its features.

TRANSCRIPT

Page 1: Python Blaze Overview

Next Generation NumPy!

blaze.pydata.org

Page 2: Python Blaze Overview

Blaze: Out of Core, Distributed and Optimized NumPy

Page 3: Python Blaze Overview

NumPy Array

shape

Page 4: Python Blaze Overview

Blaze: Different kinds of Arrays

Indexable

NDTable (Record Type): Deferred or Concrete

NDArray (Primitive Type): Deferred or Concrete

Page 5: Python Blaze Overview

Blaze Deferred Arrays

Expression graph: + is the root, with children A and *; * has children B and C

A + B*C

• Symbolic objects which build a graph
• Represents deferred computation

Usually what you have when you have a Blaze Array
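To make the idea of a deferred array concrete, here is a minimal sketch of a symbolic graph for A + B*C in plain Python. The names `Node`, `Leaf`, `BinOp`, and `evaluate` are hypothetical and are not the Blaze API; plain lists stand in for arrays.

```python
# Hypothetical names throughout: Node, Leaf, BinOp and evaluate are
# illustrative only and are not part of the Blaze API.
import operator


class Node:
    """Base class: arithmetic on nodes builds a graph instead of computing."""

    def __add__(self, other):
        return BinOp("+", self, other)

    def __mul__(self, other):
        return BinOp("*", self, other)


class Leaf(Node):
    """A named input whose data is supplied only at evaluation time."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


class BinOp(Node):
    """An interior graph node holding an operator and two children."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def __repr__(self):
        return f"({self.left} {self.op} {self.right})"


OPS = {"+": operator.add, "*": operator.mul}


def evaluate(node, env):
    """Walk the graph and compute element-wise over plain Python lists."""
    if isinstance(node, Leaf):
        return env[node.name]
    left = evaluate(node.left, env)
    right = evaluate(node.right, env)
    return [OPS[node.op](x, y) for x, y in zip(left, right)]


A, B, C = Leaf("A"), Leaf("B"), Leaf("C")
expr = A + B * C  # no arithmetic happens here; only a graph is built
result = evaluate(expr, {"A": [1, 2], "B": [3, 4], "C": [5, 6]})
```

Because the graph exists before any data is touched, an evaluator is free to choose how and where to run it.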

Page 6: Python Blaze Overview

Deferred allows handling large arrays

Can be handled out-of-core using chunks to stream through memory.
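A rough sketch of chunked streaming, assuming the inputs can be read sequentially: each chunk of A, B, and C is combined and reduced, then discarded, so only one chunk per input is resident at a time. The helper names are hypothetical, and `range` objects stand in for arrays too large for memory.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def stream_a_plus_b_times_c(a, b, c, chunk_size):
    """Evaluate A + B*C one chunk at a time, yielding each partial result."""
    triples = zip(chunks(a, chunk_size), chunks(b, chunk_size), chunks(c, chunk_size))
    for ca, cb, cc in triples:
        yield [x + y * z for x, y, z in zip(ca, cb, cc)]


# range() objects stand in for arrays too large to hold in memory at once.
n = 10
total = 0
for partial in stream_a_plus_b_times_c(range(n), range(n), range(n), chunk_size=4):
    total += sum(partial)  # reduce each chunk, then let it be garbage-collected
```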

Page 7: Python Blaze Overview

Blaze Concrete Array

Data Descriptor (URL, URL, URL, URL, URL, plus Indexes): Where are the bytes?

DataShape (extensible type system which includes shape): What do the bytes mean?

MetaData Dictionary: Labels, provenance, etc.

Page 8: Python Blaze Overview

Multiple URLs comprising an array


Page 9: Python Blaze Overview

URLs Provide Bytes

Memory-Like: arbitrarily sliced, random seeks

File-Like: deal with in chunks, random seeks

Stream-Like: deal with in chunks, sequential seeks
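The three access patterns can be illustrated with the Python standard library. This is only an analogy under stated assumptions: `memoryview` stands in for a memory-like provider, `io.BytesIO` for a file-like one, and a generator (the hypothetical `stream` helper) for a stream-like one.

```python
import io

payload = bytes(range(16))

# Memory-like: arbitrary slicing, including strides. The slice is a view on
# the source; tobytes() copies only the selected elements.
mem = memoryview(payload)
every_fourth = mem[::4].tobytes()

# File-like: seek anywhere, then read a chunk. BytesIO stands in for a file.
f = io.BytesIO(payload)
f.seek(8)
chunk_at_8 = f.read(4)


def stream(data, chunk_size):
    """Stream-like: chunks arrive strictly in order; there is no seeking back."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# A consumer of a stream can only take chunks in the order they are produced.
first_two = []
for i, chunk in enumerate(stream(payload, 4)):
    first_two.append(chunk)
    if i == 1:
        break
```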

Page 10: Python Blaze Overview

Blaze Data Container

ByteProvider

Data Buffer

Index Operation

NumPy

BLZ Persistent Format

RDBMS

Data Descriptor Protocol

CSV

Data Stream

Page 11: Python Blaze Overview

Indexes

Contiguous / Strided

Chunked / Tiled

Opaque Element-only

NumPy-Like

Opaque Iterator-access

Special Access
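As a small illustration of the first two index kinds, the sketch below computes where an element lives under a strided layout and under a one-dimensional chunked layout. The function names and the 3 x 4 example array are assumptions made for illustration, not something taken from Blaze.

```python
def strided_offset(index, strides):
    """Byte offset of an element in a strided layout: sum of index * stride."""
    return sum(i * s for i, s in zip(index, strides))


def chunked_location(i, chunk_len):
    """For a 1-D chunked layout, return (chunk number, position inside chunk)."""
    return divmod(i, chunk_len)


# A 3 x 4 array of 4-byte items in row-major order has strides (16, 4).
row_major = strided_offset((2, 1), (16, 4))
# The same array viewed transposed only swaps the strides; no bytes move.
transposed = strided_offset((1, 2), (4, 16))
# Element 10 of a 1-D array stored in chunks of 4 elements.
where = chunked_location(10, 4)
```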

Page 12: Python Blaze Overview

Indexes allow for many orderings


Page 13: Python Blaze Overview

DataShape Type System

• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility

Shape DType

DataShape
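The "Shape + DType" split can be sketched with a toy parser for the comma-separated notation that appears on the later slides (for example `N, N, int32`). `split_datashape` is a hypothetical helper, not the real DataShape parser, and it ignores record and ragged types.

```python
def split_datashape(text):
    """Split 'dim, dim, ..., dtype' into (shape, dtype).

    Toy parser for the comma-separated notation on the slides. It handles
    only fixed integer dimensions, symbolic dimensions such as N, and a
    simple trailing type name (no record or ragged types).
    """
    parts = [p.strip() for p in text.split(",")]
    *dims, dtype = parts
    shape = tuple(int(d) if d.isdigit() else d for d in dims)
    return shape, dtype


fixed = split_datashape("5, 10, Space")     # fixed dimensions, named type
symbolic = split_datashape("N, N, int32")   # symbolic dimensions
scalar = split_datashape("int32")           # no dimensions at all
```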

Page 14: Python Blaze Overview

Allows for all kinds of containers


Page 15: Python Blaze Overview

Advanced Types

type Point = { x : int; y : int}
type Space = { a: Point; b: Point}
5, 10, Space

type SquareMatrix T = N, N, T

type IntMatrix N = N, N, int32

Parametrized Types

Alias Types

Page 16: Python Blaze Overview

Advanced Shapes

{1,2,4,2,1}, int32

Could Represent

[ [1], [1,2], [1,3,2,9], [3,2], [3] ]
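A short sketch of what the ragged shape means: row i must have exactly the i-th listed length. The flat-buffer-plus-offsets layout shown here is one common way to store such data and is an assumption, not a statement about how Blaze stores it; `matches_ragged` is a hypothetical helper.

```python
def matches_ragged(rows, lengths):
    """True if row i has exactly lengths[i] elements, for every row."""
    return len(rows) == len(lengths) and all(
        len(row) == n for row, n in zip(rows, lengths)
    )


lengths = (1, 2, 4, 2, 1)
data = [[1], [1, 2], [1, 3, 2, 9], [3, 2], [3]]
ok = matches_ragged(data, lengths)

# One possible storage layout (an assumption): a flat buffer plus row offsets.
offsets = [0]
for n in lengths:
    offsets.append(offsets[-1] + n)
flat = [x for row in data for x in row]
third_row = flat[offsets[2]:offsets[3]]  # recover row 2 from the flat buffer
```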

Page 17: Python Blaze Overview

Execution Model

• Graphs dispatch to specialized library code that is “registered with the system” based on type and meta-data of array (blaze Modules)

• Many operations can be compiled with LLVM to machine-code
  • BLIR (simple typed expression syntax)
  • Numba (Python compiler)
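The first bullet, dispatch to code "registered with the system" based on type, can be sketched as a small registry keyed by operation name and argument types. `register`, `dispatch`, and `_REGISTRY` are hypothetical names; the real Blaze mechanism also considers metadata, which this sketch leaves out.

```python
_REGISTRY = {}  # maps (operation name, argument types) to an implementation


def register(op, *types):
    """Decorator that registers a function as the implementation for a key."""
    def decorator(fn):
        _REGISTRY[(op, types)] = fn
        return fn
    return decorator


def dispatch(op, *args):
    """Look up the implementation matching the exact argument types and call it."""
    key = (op, tuple(type(a) for a in args))
    try:
        impl = _REGISTRY[key]
    except KeyError:
        raise TypeError(f"no implementation registered for {key}") from None
    return impl(*args)


@register("add", list, list)
def add_lists(a, b):
    """Element-wise addition for list-backed arrays."""
    return [x + y for x, y in zip(a, b)]


@register("add", int, int)
def add_ints(a, b):
    """Scalar addition."""
    return a + b


vector_sum = dispatch("add", [1, 2], [3, 4])
scalar_sum = dispatch("add", 1, 2)
```

The same graph node ("add") therefore runs different code depending on what kind of data sits underneath it.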

Page 18: Python Blaze Overview

Blaze Agents

Data: MongoDB, Vertica, HDFS, CSV Directory (one Blaze Agent in front of each)

Code: Graph with Blaze Arrays

Page 19: Python Blaze Overview

“I think you should be more explicit here in step two.”

How?

Page 20: Python Blaze Overview

Out-of-core calculations

Page 21: Python Blaze Overview

NumFOCUS

Num(Py) Foundation for Open Code for Usable Science

http://www.numfocus.org

Page 22: Python Blaze Overview

BLZ persistence

BLZ layout at a glance

Blaze (BLZ) format (Dataset): root, containing meta and data; data holds __0__.blp, __1__.blp

Bloscpack (BLP) format (Super-Chunk): Header, Offset 0, Offset 1, Offset 2, ..., Chunk 0, Chunk 1, Chunk 2

Blosc format (Chunk): Header, Offset 0, Offset 1, Offset 2, ..., Block 0, Block 1, Block 2
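The header, offset table, and chunk sequence can be sketched as a toy container. This is not the real Blosc, Bloscpack, or BLZ byte format: `zlib` stands in for Blosc compression, and the magic bytes, field widths, and function names are all assumptions made for illustration.

```python
import struct
import zlib

MAGIC = b"TOY1"  # hypothetical magic bytes, not a real Bloscpack/BLZ signature


def pack(chunks):
    """Serialize as: header (magic, count), offset table, compressed chunks."""
    compressed = [zlib.compress(c) for c in chunks]
    header = MAGIC + struct.pack("<I", len(compressed))
    table_size = 8 * len(compressed)  # one unsigned 64-bit offset per chunk
    offsets, pos = [], len(header) + table_size
    for blob in compressed:
        offsets.append(pos)
        pos += len(blob)
    table = struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + table + b"".join(compressed)


def read_chunk(buf, i):
    """Decompress chunk i alone, using the offset table to locate it."""
    if buf[:4] != MAGIC:
        raise ValueError("bad magic")
    (count,) = struct.unpack_from("<I", buf, 4)
    if not 0 <= i < count:
        raise IndexError(i)
    offsets = struct.unpack_from(f"<{count}Q", buf, 8)
    end = offsets[i + 1] if i + 1 < count else len(buf)
    return zlib.decompress(buf[offsets[i]:end])


blob = pack([b"aaaa" * 10, b"bb", b"c" * 100])
middle = read_chunk(blob, 1)  # only this chunk is decompressed
```

The offset table is what allows a single chunk to be read without decompressing the ones before it.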