Python and HDF5: Overview
DESCRIPTION
From a talk by Andrew Collette to the Boulder Earth and Space Science Informatics Group (BESSIG) on November 20, 2013.

This talk explores how researchers can use the scalable, self-describing HDF5 data format together with the Python programming language to improve the analysis pipeline, easily archive and share large datasets, and improve confidence in scientific results. The discussion will focus on real-world applications of HDF5 in experimental physics at two multimillion-dollar research facilities: the Large Plasma Device at UCLA, and the NASA-funded hypervelocity dust accelerator at CU Boulder. This event coincides with the launch of a new O’Reilly book, Python and HDF5: Unlocking Scientific Data.

As scientific datasets grow from gigabytes to terabytes and beyond, the use of standard formats for data storage and communication becomes critical. HDF5, the most recent version of the Hierarchical Data Format originally developed at the National Center for Supercomputing Applications (NCSA), has rapidly emerged as the mechanism of choice for storing and sharing large datasets. At the same time, many researchers who routinely deal with large numerical datasets have been drawn to Python by its ease of use and rapid development capabilities.

Over the past several years, Python has emerged as a credible alternative to scientific analysis environments like IDL or MATLAB. In addition to stable core packages for handling numerical arrays, analysis, and plotting, the Python ecosystem provides a huge selection of more specialized software, reducing the amount of work necessary to write scientific code while also increasing the quality of results. Python’s excellent support for standard data formats allows scientists to interact seamlessly with colleagues using other platforms.

TRANSCRIPT
Python and HDF5
Andrew Collette
University of Colorado
What makes scientific data special?
It’s meant to be shared - collaborative
Ad-hoc or changing structure - flexible
Archived and preserved - robust
Python and HDF5 together address all three
High-level language
Almost no “boilerplate” code
“Exception” error handling
Fully object-oriented
First-class module/namespace support
Readable
Self-documenting
Free
(the language)
(the platform)
Python itself is “batteries included”
Mature numerical, plotting and scientific modules
Hundreds of specialized science packages
Thousands more general-purpose
Core analysis packages
NumPy - Array objects and basic operations
SciPy - Advanced science & engineering library
Matplotlib - Publication-quality plots (both rendered and interactive)
Thousands of others
Unit testing - unittest module in stdlib
Only need to write code for your problem
Web servers and development - literally hundreds
Interface: F2PY (Fortran), Cython (C), ctypes, others
Distribution - distutils/pip single-command installs
Python highlights
Readable
Iteration
C
IDL
Python
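The code shown on the original iteration slide is not preserved in this transcript. As a minimal sketch of the readability point, iteration in Python needs no index bookkeeping. The channel names and gain values below are illustrative only, not from the talk.

```python
# Illustrative data; not taken from the talk.
channels = ["bdot_x", "bdot_y", "bdot_z"]
gains = [1.0, 2.5, 0.5]

# Iterate over the items directly, with no loop counter to manage.
for name in channels:
    print(name)

# enumerate() supplies an index only when one is actually needed.
for i, name in enumerate(channels):
    print(i, name)

# zip() walks several sequences in step.
gain_by_channel = {name: gain for name, gain in zip(channels, gains)}
```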
Speed
FFTs and optimized routines built in to NumPy/SciPy
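As a rough illustration of the built-in routines, the sketch below finds the dominant frequency of a synthetic signal with NumPy's FFT. The sample rate and the 50 Hz tone are made-up values chosen for the example.

```python
import numpy as np

# Synthetic signal: a 50 Hz sine sampled at 1 kHz for one second
# (illustrative values, not from the talk).
sample_rate = 1000.0
n_samples = 1000
t = np.arange(n_samples) / sample_rate
signal = np.sin(2 * np.pi * 50.0 * t)

# The FFT runs in compiled code; there is no Python-level loop over samples.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)

# Frequency bin with the largest magnitude.
peak_freq = freqs[np.argmax(np.abs(spectrum))]
```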
ctypes and Cython
ctypes
Advanced foreign function interface
Call C libraries from pure Python code
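A minimal sketch of calling a C library from pure Python with ctypes, assuming a Unix-like system where the C standard library can be located. The choice of strlen is only for illustration.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library. find_library() can return None on
# some systems; on Linux, CDLL(None) then exposes the symbols already loaded
# into the process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function with a bytes object.
length = libc.strlen(b"HDF5")
```

Declaring argtypes and restype is optional for simple cases, but it lets ctypes check arguments and convert the return value correctly.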
Cython
Example from the HDF5 C Library:
HDF5
Hierarchical Data Format
File specification and object model
C library
Ecosystem of users and developers
3 things:
Objects
Datasets: homogeneous arrays of data
Groups: containers holding datasets and groups
Attributes: arbitrary metadata on groups & datasets
Standard constructs using these, or make your own!
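The three object types can be sketched with h5py as follows. The file name, group name, dataset name, and attribute values are all placeholders invented for this example, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file: the "core" driver with backing_store=False never touches disk.
with h5py.File("objects_demo.h5", "w", driver="core", backing_store=False) as f:
    # Group: a container holding datasets and other groups.
    shot = f.create_group("shot_0001")

    # Dataset: a homogeneous array of data.
    voltage = shot.create_dataset("voltage", data=np.linspace(0.0, 1.0, 100))

    # Attributes: metadata attached to a group or a dataset.
    shot.attrs["operator"] = "example"
    voltage.attrs["units"] = "V"

    # Read back what was stored.
    member_names = list(shot.keys())
    units = voltage.attrs["units"]
    shape = voltage.shape
```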
Dataset features
Partial I/O: read and write just what you want
Automatic type conversion
On-the-fly compression
(In Python, we even use the array-access syntax!)
Parallel reads & writes with MPI
(Directly from Python!)
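A sketch of partial I/O, type conversion, and compression using h5py, with made-up dataset names and sizes. Parallel I/O is left out because it needs an MPI-enabled build of HDF5 and h5py.

```python
import h5py
import numpy as np

with h5py.File("features_demo.h5", "w", driver="core", backing_store=False) as f:
    # On-the-fly compression: chunks are gzip-compressed as they are written.
    dset = f.create_dataset("samples", shape=(100, 1000), dtype="f4",
                            compression="gzip")
    compression = dset.compression

    # Partial write with array-access syntax: only row 0 is written.
    dset[0, :] = np.arange(1000)

    # Partial read: only these ten values are read from the file.
    first_ten = dset[0, :10]

    # Automatic type conversion: the data is stored as 32-bit floats, and
    # HDF5 converts it while reading into a 64-bit destination array.
    stored_dtype = dset.dtype
    as_double = np.empty(10, dtype="f8")
    dset.read_direct(as_double, np.s_[0, :10])
```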
Metadata & Organization
Groups form a POSIX-style “filesystem” in the file
Attributes can store arbitrary data on arbitrary objects
How should the file be organized?
You decide!
Thousands of domain-specific “application formats”
Anyone can read them because HDF5 is self-describing!
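One way to see what self-describing means in practice is to walk a file without knowing its layout in advance. The sketch below uses h5py's visititems for this; the describe helper and the demo layout are hypothetical, written only for this example.

```python
import h5py
import numpy as np

def describe(h5file):
    """Return one summary line per object found in an open HDF5 file."""
    lines = []

    def visitor(name, obj):
        # Datasets carry their own shape and type; groups are containers.
        if isinstance(obj, h5py.Dataset):
            kind = f"dataset shape={obj.shape} dtype={obj.dtype}"
        else:
            kind = "group"
        lines.append(f"/{name}: {kind} attrs={dict(obj.attrs)}")

    # visititems() calls visitor(name, obj) for every group and dataset.
    h5file.visititems(visitor)
    return lines

# Build a small in-memory file so the sketch is self-contained.
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_group("experiment/run_01")
    density = f.create_dataset("experiment/run_01/density", data=np.zeros((4, 8)))
    density.attrs["units"] = "cm^-3"

    # Objects are addressed with POSIX-style paths.
    shape_by_path = f["/experiment/run_01/density"].shape

    summary = describe(f)
```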
Example
Open an HDF5 file
Extract a particular dataset
Read the data
Make an interactive plot
Close the file
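The five steps above can be sketched with h5py and Matplotlib. The code from the original slides is not preserved here, so the plot_dataset function, the demo file, and the dataset name "wave" are assumptions made for illustration.

```python
import os
import tempfile

import h5py
import matplotlib.pyplot as plt
import numpy as np

def plot_dataset(path, name, show=True):
    """Open a file, read one dataset, plot it, and close the file."""
    # 1. Open an HDF5 file (read-only).
    f = h5py.File(path, "r")
    try:
        # 2. Extract a particular dataset.
        dset = f[name]
        # 3. Read the data into a NumPy array.
        data = dset[...]
        # 4. Make a plot; plt.show() opens the interactive window.
        fig, ax = plt.subplots()
        ax.plot(data)
        ax.set_title(name)
        if show:
            plt.show()
    finally:
        # 5. Close the file.
        f.close()
    return data

# Hypothetical demo file, created so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as demo:
    demo["wave"] = np.sin(np.linspace(0, 2 * np.pi, 200))
```

Calling plot_dataset(demo_path, "wave") then shows the plot. In everyday code a with block would normally replace the explicit close; it is spelled out here to mirror the steps.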
Demo
Real-world use
UCLA Large Plasma Device
Image credit: Basic Plasma Science Facility
Laser Experiment
Image credit: Basic Plasma Science Facility
LAPD Data Products
Acquisition file - “Planes” of data in HDF5
Metadata: timestamps, digitizer settings, probe positions, background plasma conditions…
Packaged into HDF5 following “lab layout”
Users take their data back home and analyze
Visualization
Python 2D plotting
A. Collette et al., Phys. Rev. Lett. 105, 195003 (2010)
Only 160 lines of code!
Python does 3D too!
“MayaVi” 3D visualizer
Development sponsored by Enthought
Both offline (scripted) and interactive modes
A. Collette et al. Phys. Plasmas 18, 055705 (2011)
CU Accelerator
Raw data
HDF5 shot file
Automated speed/mass calculation
MySQL
Data search
HDF5 file for user
Where to get Python
Distributions are the best way to get started
(they include HDF5/h5py!)
Anaconda (Windows, Mac, Linux): http://continuum.io
PythonXY (Windows): http://pythonxy.googlecode.com
Questions?