out-of-core columnar datasets · 2017-07-14 · about me • i am the creator of tools like...
TRANSCRIPT
![Page 1: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/1.jpg)
Out-of-Core Columnar Datasets
Introducing bcolz, an In-Memory/On-Disk Columnar, Chunked and Compressed Data Container
Francesc Alted <[email protected]> Freelance Trainer And Consultant
!EuroPython 2014, July 25, Berlin
![Page 2: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/2.jpg)
About Me• I am the creator of tools like PyTables, Blosc,
bcolz, and a long-term maintainer of Numexpr
• I am an experienced developer and trainer in:
• Python (almost 15 years of experience)
• High Performance Computing and Storage
• Also available for consulting
![Page 3: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/3.jpg)
What? Yet Another Data Container?
• We are bound to live in a world of wildly different instances of data containers
• The NoSQL movement is an example of that
• Why? Mainly because the increasing gap between CPU and memory speeds
![Page 4: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/4.jpg)
See my article: “Why Modern CPUs Are Starving And What You Can Do
About It”
CPU vs Memory Speed
![Page 5: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/5.jpg)
Why Columnar?
• When querying tabular data, only the interesting data is accessed
• Less I/O required
![Page 6: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/6.jpg)
In-memory Row-Wise Table
String …String Int32 Float64 Int16
String …String Int32 Float64 Int16
String …String Int32 Float64 Int16
String …String Int32 Float64 Int16
Interesting column
Interesting Data: N * 4 bytes (Int32) Actual Data Read: N * 64 bytes (cache line)
}N rows
![Page 7: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/7.jpg)
In-memory Column-Wise Table
String …String Int32 Float64 Int16
String …String Int32 Float64 Int16
String …String Int32 Float64 Int16
String …String Int32 Float64 Int16
Interesting column
Interesting Data: N * 4 bytes (Int32) Actual Data Read: N * 4 bytes (Int32)
}N rows
![Page 8: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/8.jpg)
Why Chunking?
• Chunking means more difficulty handling data, so why bother?
• Efficient enlarging and shrinking
• On-flight compression possible
![Page 9: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/9.jpg)
Appending Data in NumPy
Copy!
Array to be enlarged
Final array object
Data to appendNew memory
allocation• Both memory areas have to exist simultaneously
![Page 10: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/10.jpg)
Appending Data in bcolzFinal carray object
chunk 1
chunk 2
new chunk(s)
carray to be enlarged
chunk 1
chunk 2
data to append
XCompress
• Only a compression operation on new data is required
![Page 11: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/11.jpg)
Why Compression (I)?
Compressed Dataset
Original Dataset
3x more data
More data can be stored in the same amount of media
![Page 12: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/12.jpg)
Why Compression (II)?Less data needs to be transmitted to the CPU
Disk or Memory Bus
Decompression
Disk or Memory (RAM)
CPU Cache
Original Dataset
Compressed Dataset
Transmission + decompression faster than direct transfer?
![Page 13: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/13.jpg)
Blosc: Compressing Faster Than Memory Speed
![Page 14: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/14.jpg)
bcolz: Goals and Implementation
![Page 15: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/15.jpg)
–KISS Principle
“Keep It Simple, Stupid”
Feature inclusion driven by the:
![Page 16: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/16.jpg)
What bcolz Is?
• Columnar, chunked, compressed data containers for Python
• Offers `carray ` and `ctable` container flavors
• Uses the powerful Blosc compression library for on-the-flight compression/decompression
• 100% written in Python/Cython
![Page 17: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/17.jpg)
carray: Multidimensional Container for Homogeneous Data
...
NumPy container carray container
chunk 1
chunk 2
chunk N
Contiguous Memory Discontiguous Memory
![Page 18: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/18.jpg)
The ctable Object
.
.
.
.
.
.
.
.
.
.
.
.
chunkcarray
new rows to append
• Chunks follow column order • Very efficient for querying • Adding or removing columns is cheap too
![Page 19: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/19.jpg)
Persistency
• carray and ctable objects can live on disk, not only in memory
• The format for persistency is heavily based in bloscpack, a nascent library for compressing large datasets
• bcolz allows every operation to be executed entirely on-disk (out-of-core operations)
![Page 20: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/20.jpg)
Streaming Analytics With bcolz
bcolz container (disk or memory)
iter(), iterblocks(),where(), whereblocks(),
__getitem__()
map(), filter(), groupby(), sortby(),
reduceby(),join()
bcolz iterators/filters with blocking
itertools, PyToolz, CyToolz
![Page 21: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/21.jpg)
Interacting with Neighbors
bcolz
• Relational Databases • CSV files • HDF5/PyTables • Excel
• HDF5 format • Indexed queries • Long term storage • Blosc support
![Page 22: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/22.jpg)
Some Benchmarks With Real Data: The MovieLens Dataset
Materials in:https://github.com/Blosc/movielens-bench
![Page 23: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/23.jpg)
The MovieLens Dataset
• Datasets for movie ratings
• Different sizes: 100K, 1M, 10M ratings (the 10M will be used in benchmarks ahead)
• The datasets were collected over various periods of time
![Page 24: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/24.jpg)
Querying the MovieLens Dataset
import pandas as pdimport bcolz
# Parse and load CSV files using pandas
# Merge some files in a single dataframelens = pd.merge(movies, ratings)
# The pandas way of queryingresult = lens.query("(title == 'Tom and Huck (1995)') & (rating == 5)”)['user_id']
zlens = bcolz.ctable.fromdataframe(lens)
# The bcolz way of querying (notice the use of the `where` iterator)result = [r.user_id for r in dblens.where( "(title == 'Tom and Huck (1995)') & (rating == 5)", outcols=['user_id'])]
![Page 25: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/25.jpg)
Sizes of Datasets
• Compression means ~20x less space
• The uncompressed ctable is larger than pandas
![Page 26: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/26.jpg)
Query Times (laptop 1-year old)
• Compression leads to better query speeds(15% faster)
• Querying a disk-based ctable is fast!
![Page 27: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/27.jpg)
Query Times(laptop 3-year old, Core2)
• Compression still makes things slower on old boxes (15% slower)
• So, expect better improvements in the future
![Page 28: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/28.jpg)
Status and Overview• Version 0.7.0 released this week. Check it out!
• Focus on refining the API and tweaking knobs for making things even faster
• Better integration with bloscpack (super-chunks)
• bcolz main goal is to demonstrate that compression can help performance, even using in-memory data containers
![Page 29: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/29.jpg)
Tell Us About Your Experience!
• Which is your scenario?
• You are not getting the expected speed or compression ratio?
• Mailing list: http://groups.google.com/group/bcolz
• Bugs/patches, please file them at:http://github.com/Blosc/bcolz
![Page 30: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/30.jpg)
References
• Manual: http://bcolz.blosc.org/
• Bloscpack: https://github.com/Blosc/bloscpack
• The Blosc ecosystem: http://blosc.org/
![Page 31: Out-of-Core Columnar Datasets · 2017-07-14 · About Me • I am the creator of tools like PyTables, Blosc, bcolz, and a long-term maintainer of Numexpr • I am an experienced developer](https://reader033.vdocuments.net/reader033/viewer/2022050605/5facfd3288eeab1c3e7f36f6/html5/thumbnails/31.jpg)
Thank you! !
!
Questions?