Python Blaze Overview
DESCRIPTION
Blaze is a next-generation NumPy library that provides out-of-core data access. This talk summarizes its features.
TRANSCRIPT
Next Generation NumPy !
blaze.pydata.org
Blaze NumPy
Out of Core, Distributed and Optimized
NumPy
NumPy Array
shape
Blaze: Different kinds of Arrays
Indexable
NDTable NDArray
Deferred Concrete Deferred Concrete
Record Type Primitive Type
Blaze Deferred Arrays
+
A *
B C
A + B*C
• Symbolic objects which build a graph
• Represents deferred computation
Usually what you have when you have a Blaze Array
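A minimal sketch of the idea on this slide, not Blaze's actual API: operator overloading builds the graph for A + B*C, and nothing is computed until `evaluate()` is called. The class names `Node`, `Leaf` and `BinOp` are hypothetical, and plain Python lists stand in for arrays.

```python
import operator

# Element-wise kernels used when the graph is finally evaluated.
OPS = {"+": operator.add, "*": operator.mul}


class Node:
    """Base class: arithmetic on nodes builds a graph instead of computing."""

    def __add__(self, other):
        return BinOp("+", self, other)

    def __mul__(self, other):
        return BinOp("*", self, other)


class Leaf(Node):
    """A named array holding concrete data (a plain list in this sketch)."""

    def __init__(self, name, data):
        self.name = name
        self.data = list(data)

    def evaluate(self):
        return self.data

    def __repr__(self):
        return self.name


class BinOp(Node):
    """A deferred binary operation over two sub-expressions."""

    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def evaluate(self):
        fn = OPS[self.op]
        pairs = zip(self.left.evaluate(), self.right.evaluate())
        return [fn(x, y) for x, y in pairs]

    def __repr__(self):
        return f"({self.left!r} {self.op} {self.right!r})"


A = Leaf("A", [1, 2, 3])
B = Leaf("B", [4, 5, 6])
C = Leaf("C", [7, 8, 9])

expr = A + B * C          # builds the graph, computes nothing yet
result = expr.evaluate()  # walks the graph and computes element-wise
```

Here `repr(expr)` shows the graph structure, with `+` at the root and `*` as its right child, matching the tree on the slide.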
Deferred allows handling large arrays
Can be handled out-of-core using chunks to
stream through memory.
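The streaming idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not how Blaze implements it: three arrays live in binary files on disk, and the sum of A + B*C is computed by reading fixed-size chunks, so only one chunk per array is in memory at a time. The helper names `write_doubles`, `read_chunks` and `chunked_sum` are hypothetical.

```python
import os
import tempfile
from array import array


def write_doubles(path, values):
    """Store values on disk as raw 8-byte floats."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def read_chunks(path, chunk_len):
    """Yield successive array('d') chunks of at most chunk_len items."""
    itemsize = array("d").itemsize
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_len * itemsize)
            if not raw:
                break
            chunk = array("d")
            chunk.frombytes(raw)
            yield chunk


def chunked_sum(path_a, path_b, path_c, chunk_len=4):
    """Compute sum(A + B*C) while holding one chunk per array in memory."""
    total = 0.0
    streams = zip(read_chunks(path_a, chunk_len),
                  read_chunks(path_b, chunk_len),
                  read_chunks(path_c, chunk_len))
    for a, b, c in streams:
        total += sum(x + y * z for x, y, z in zip(a, b, c))
    return total


with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("a.bin", "b.bin", "c.bin")]
    n = 10
    write_doubles(paths[0], range(n))    # A = 0, 1, ..., 9
    write_doubles(paths[1], [2.0] * n)   # B = 2 everywhere
    write_doubles(paths[2], range(n))    # C = 0, 1, ..., 9
    total = chunked_sum(*paths)
```

The chunk length is the only knob that controls peak memory use; the result does not depend on it.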
Blaze Concrete Array
URL URL URL URL URL
Indexes   Data Descriptor
Extensible Type System which includes shape
DataShape
Where are the bytes?
What do the bytes mean?
MetaData Dictionary: labels, provenance, etc.
Multiple URLs comprising an array
URLs Provide Bytes
Memory-Like: arbitrarily sliced, random seeks
File-Like: deal with in chunks, random seeks
Stream-Like: deal with in chunks, sequential seeks
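The three access patterns can be illustrated with a rough sketch. The classes `MemoryLike`, `FileLike` and `StreamLike` are hypothetical names chosen for this example and are not Blaze's byte-provider interface; `io.BytesIO` stands in for whatever a URL would actually resolve to.

```python
import io


class MemoryLike:
    """Bytes already in memory: any slice is cheap."""

    def __init__(self, data):
        self.view = memoryview(data)

    def read(self, start, stop):
        return bytes(self.view[start:stop])


class FileLike:
    """Seekable source: random seeks, but data is read a chunk at a time."""

    def __init__(self, fileobj, chunk_size):
        self.f = fileobj
        self.chunk_size = chunk_size

    def read_chunk(self, index):
        self.f.seek(index * self.chunk_size)
        return self.f.read(self.chunk_size)


class StreamLike:
    """Non-seekable source: chunks arrive strictly in order."""

    def __init__(self, fileobj, chunk_size):
        self.f = fileobj
        self.chunk_size = chunk_size

    def __iter__(self):
        while True:
            chunk = self.f.read(self.chunk_size)
            if not chunk:
                return
            yield chunk


payload = bytes(range(12))

mem_slice = MemoryLike(payload).read(3, 6)                    # arbitrary slice
file_chunk = FileLike(io.BytesIO(payload), 4).read_chunk(2)   # jump to chunk 2
stream_chunks = list(StreamLike(io.BytesIO(payload), 5))      # in order only
```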
Blaze Data Container
ByteProvider
Data Buffer   Index Operation
NumPy BLZ Persistent Format
RDBMS
Data Descriptor Protocol
CSV   Data Stream
Indexes
Contiguous / Strided
Chunked / Tiled
Opaque Element-only
NumPy-Like
Opaque Iterator-access
Special Access
Indexes allow for many orderings
DataShape Type System
• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility
Shape DType
DataShape
Allows for all kinds of containers
Advanced Types
type Point = { x : int; y : int}
type Space = { a: Point; b: Point}
5, 10, Space
type SquareMatrix T = N, N, T
type IntMatrix N = N, N, int32
Parametrized Types
Alias Types
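To make the "shape plus dtype" idea concrete, here is a small sketch that splits a simple datashape string such as `N, N, int32` into its dimensions and its element type. It is an assumption-laden toy, not the DataShape grammar: it handles only comma-separated dimensions followed by a type name, with no record types or aliases, and the names `DataShape` and `parse_datashape` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class DataShape:
    """Dimensions (fixed ints or symbolic names) plus an element type name."""
    shape: Tuple[Union[int, str], ...]
    dtype: str


def parse_datashape(text):
    """Parse 'dim, dim, ..., dtype'; the last item is the element type."""
    parts = [p.strip() for p in text.split(",")]
    *dims, dtype = parts
    # Numeric dimensions become ints; anything else stays symbolic (e.g. 'N').
    shape = tuple(int(d) if d.isdigit() else d for d in dims)
    return DataShape(shape, dtype)


fixed = parse_datashape("5, 10, Space")      # fixed dimensions, record dtype
symbolic = parse_datashape("N, N, int32")    # symbolic dimensions
```

Unlike a NumPy dtype, the description carries the dimensions with it, and a dimension may be a symbol rather than a number.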
Advanced Shapes
{1,2,4,2,1}, int32
Could Represent
[ [1], [1,2], [1,3,2,9], [3,2], [3]]
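One way to read the variable-length shape above is as a list of row lengths over a flat buffer of values. The sketch below rebuilds the nested rows from those two pieces; the function name `to_ragged` and the flat-buffer layout are assumptions made for illustration, not a statement about how Blaze stores such arrays.

```python
from itertools import accumulate


def to_ragged(lengths, flat):
    """Split a flat buffer into rows whose sizes are given by lengths."""
    offsets = [0, *accumulate(lengths)]
    if offsets[-1] != len(flat):
        raise ValueError("lengths do not match the number of values")
    return [flat[start:stop] for start, stop in zip(offsets, offsets[1:])]


lengths = [1, 2, 4, 2, 1]                 # the {1,2,4,2,1} dimension
flat = [1, 1, 2, 1, 3, 2, 9, 3, 2, 3]     # all int32 values, row after row
rows = to_ragged(lengths, flat)
```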
Execution Model
• Graphs dispatch to specialized library code that is “registered with the system” based on type and meta-data of array (blaze Modules)
• Many operations can be compiled with LLVM to machine code
  • BLIR (simple typed expression syntax)
  • Numba (Python compiler)
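The "registered with the system" dispatch in the first bullet can be sketched as a lookup table keyed by operation and element type. Everything here is hypothetical (`REGISTRY`, `register`, `dispatch` and the two kernels); it shows only the general pattern, not Blaze's module system, and it leaves out the LLVM compilation step entirely.

```python
import math

# Maps (operation name, dtype name) to the kernel that implements it.
REGISTRY = {}


def register(op, dtype):
    """Decorator that registers a kernel for one operation and dtype."""
    def decorator(func):
        REGISTRY[(op, dtype)] = func
        return func
    return decorator


@register("sum", "int32")
def sum_int32(values):
    # Integers add exactly, so the builtin is enough.
    return sum(values)


@register("sum", "float64")
def sum_float64(values):
    # A specialized kernel: fsum avoids accumulated rounding error.
    return math.fsum(values)


def dispatch(op, dtype, *args):
    """Pick the kernel registered for this operation and dtype, then run it."""
    try:
        kernel = REGISTRY[(op, dtype)]
    except KeyError:
        raise TypeError(f"no kernel registered for {op!r} on {dtype!r}") from None
    return kernel(*args)


int_total = dispatch("sum", "int32", [1, 2, 3])
float_total = dispatch("sum", "float64", [0.1] * 10)
```

A graph node would carry the operation name and the dtype of its inputs, so evaluation reduces to one `dispatch` call per node.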
Blaze Agents
MongoDB
Vertica
HDFS
CSV Directory
Blaze Agent
Blaze Agent
Blaze Agent
Blaze Agent
Code Graph with Blaze Arrays
Code
Data
“I think you should be more explicit here in step two.”
How?
Out-of-core calculations
NumFOCUS
Num(Py) Foundation for Open Code for Usable Science
http://www.numfocus.org
BLZ persistence
BLZ layout at a glance
root
meta data
__0__.blp __1__.blp
Header
Offset 0
Offset 1
Offset 2
-----
Block 0
Block 1
Block 2
Header
Offset 0
Offset 1
Offset 2
-----
Chunk 0
Chunk 1
Chunk 2
Chunk: Blosc format
Super-Chunk: Bloscpack (BLP) format
Dataset: Blaze (BLZ) format
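The header, offsets and chunks pattern in the layout can be sketched in a few lines. This is a toy container, not the Blosc, Bloscpack or BLZ byte format: `zlib` stands in for Blosc, the header is just a chunk count, and the names `pack_chunks` and `read_chunk` are hypothetical. It shows why an offset table allows one chunk to be read without decompressing the others.

```python
import struct
import zlib


def pack_chunks(chunks):
    """Lay out: 4-byte chunk count, 8-byte offset per chunk, compressed data."""
    compressed = [zlib.compress(c) for c in chunks]
    header = struct.pack("<I", len(compressed))
    table_size = 8 * len(compressed)

    # Offsets are absolute positions of each compressed chunk in the buffer.
    offsets = []
    position = len(header) + table_size
    for blob in compressed:
        offsets.append(position)
        position += len(blob)

    table = struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + table + b"".join(compressed)


def read_chunk(buf, index):
    """Decompress a single chunk, located through the offset table."""
    (count,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offsets = struct.unpack_from(f"<{count}Q", buf, 4)
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < count else len(buf)
    return zlib.decompress(buf[start:end])


chunks = [b"a" * 100, b"b" * 50, b"c" * 10]
packed = pack_chunks(chunks)
second = read_chunk(packed, 1)   # touches only the bytes of chunk 1
```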