Python and HDF5: Overview

Python and HDF5 Andrew Collette University of Colorado


Post on 13-May-2015


DESCRIPTION

From a talk by Andrew Collette to the Boulder Earth and Space Science Informatics Group (BESSIG) on November 20, 2013. This talk explores how researchers can use the scalable, self-describing HDF5 data format together with the Python programming language to improve the analysis pipeline, easily archive and share large datasets, and improve confidence in scientific results. The discussion will focus on real-world applications of HDF5 in experimental physics at two multimillion-dollar research facilities: the Large Plasma Device at UCLA, and the NASA-funded hypervelocity dust accelerator at CU Boulder. This event coincides with the launch of a new O’Reilly book, Python and HDF5: Unlocking Scientific Data.

As scientific datasets grow from gigabytes to terabytes and beyond, the use of standard formats for data storage and communication becomes critical. HDF5, the most recent version of the Hierarchical Data Format originally developed at the National Center for Supercomputing Applications (NCSA), has rapidly emerged as the mechanism of choice for storing and sharing large datasets.

At the same time, many researchers who routinely deal with large numerical datasets have been drawn to Python by its ease of use and rapid development capabilities. Over the past several years, Python has emerged as a credible alternative to scientific analysis environments like IDL or MATLAB. In addition to stable core packages for handling numerical arrays, analysis, and plotting, the Python ecosystem provides a huge selection of more specialized software, reducing the amount of work necessary to write scientific code while also increasing the quality of results. Python’s excellent support for standard data formats allows scientists to interact seamlessly with colleagues using other platforms.

TRANSCRIPT

Page 1: Python and HDF5: Overview

Python and HDF5

Andrew Collette

University of Colorado

Page 2: Python and HDF5: Overview

What makes scientific data special?

Page 3: Python and HDF5: Overview

What makes scientific data special?

It’s meant to be shared - collaborative

Ad-hoc or changing structure - flexible

Archived and preserved - robust

Python and HDF5 together address all three

Page 4: Python and HDF5: Overview

Python (the language)

High-level language

Almost no “boilerplate” code

“Exception” error handling

Fully object-oriented

First-class module/namespace support

Readable

Self-documenting

Free

Page 5: Python and HDF5: Overview

Python (the platform)

Python itself is “batteries included”

Mature numerical, plotting and scientific modules

Hundreds of specialized science packages

Thousands more general-purpose

Page 6: Python and HDF5: Overview

Core analysis packages

NumPy - Array objects and basic operations

SciPy - Advanced science & engineering library

Matplotlib - Publication-quality plots (both rendered and interactive)

Page 7: Python and HDF5: Overview

Thousands of others

Unit testing - unittest module in stdlib

Only need to write code for your problem

Web servers and development - literally hundreds

Interface: F2PY (Fortran), Cython (C), ctypes, others

Distribution - distutils/pip single-command installs

Page 8: Python and HDF5: Overview

Python highlights

Page 9: Python and HDF5: Overview

Readable

Page 10: Python and HDF5: Overview

Iteration

C

IDL

Python
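The slide compared how a loop over a sequence looks in C, IDL and Python; the original code images did not survive in this transcript. As a minimal sketch of the Python side only, with made-up names and values:

```python
# Hypothetical sample data; the names and values are illustrative only.
temperatures = [21.5, 22.1, 19.8]

# Iterate directly over the items: no index variable or bounds check needed.
total = 0.0
for t in temperatures:
    total += t

# enumerate() supplies the index when it is actually wanted.
labelled = [(i, t) for i, t in enumerate(temperatures)]

# zip() walks several sequences in step.
names = ["probe_a", "probe_b", "probe_c"]
by_name = dict(zip(names, temperatures))
```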

Page 11: Python and HDF5: Overview

Speed

Page 13: Python and HDF5: Overview

Speed

FFTs and optimized routines built into NumPy/SciPy

ctypes and Cython
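To illustrate the built-in FFT routines mentioned on this slide, here is a small sketch using NumPy's `numpy.fft` module. The test signal (a 50 Hz sine sampled at 1 kHz) is an assumption chosen for the example, not something taken from the talk.

```python
import numpy as np

# Assumed test signal: 1 second sampled at 1 kHz, containing a 50 Hz sine.
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 50 * t)

# Real-input FFT and the matching frequency axis, both computed in compiled code.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate)

# The strongest bin should sit at the 50 Hz input frequency.
peak_freq = freqs[np.argmax(np.abs(spectrum))]
```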

Page 14: Python and HDF5: Overview

ctypes

Advanced foreign function interface

Call C libraries from pure Python code
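A minimal sketch of calling a C library from pure Python with `ctypes`. It assumes a POSIX system (Linux or macOS) where the C standard library can be located; `strlen` is used only because it is available everywhere, not because the talk used it.

```python
import ctypes
import ctypes.util

# Locate the C standard library. Assumes a POSIX system; if the lookup
# fails, fall back to the symbols already loaded into this process.
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path) if libc_path else ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function with a Python bytes object.
length = libc.strlen(b"HDF5")
```

Declaring `argtypes` and `restype` up front lets ctypes check and convert arguments instead of guessing.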

Page 15: Python and HDF5: Overview

Cython

Example from the HDF5 C library:

Page 16: Python and HDF5: Overview

HDF5

Page 18: Python and HDF5: Overview

Hierarchical Data Format

3 things:

File specification and object model

C library

Ecosystem of users and developers

Page 19: Python and HDF5: Overview

Objects

Datasets: homogeneous arrays of data

Groups: containers holding datasets and groups

Attributes: arbitrary metadata on groups & datasets

Standard constructs using these, or make your own!
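The three object types above map directly onto h5py. The following is a minimal sketch; the file name, group name, dataset name and attribute values are all hypothetical, and an in-memory file is used so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file so the sketch leaves nothing on disk; all names are hypothetical.
f = h5py.File("objects_demo.h5", "w", driver="core", backing_store=False)

# Group: a container, much like a directory.
run = f.create_group("run_001")

# Dataset: a homogeneous array stored inside the group.
voltage = run.create_dataset("voltage", data=np.linspace(0.0, 1.0, 5))

# Attributes: small pieces of metadata attached to groups or datasets.
run.attrs["operator"] = "example user"
voltage.attrs["units"] = "V"

# Read a few things back before closing.
dataset_shape = voltage.shape
units = voltage.attrs["units"]
members = list(run.keys())

f.close()
```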

Page 20: Python and HDF5: Overview

Dataset features

Partial I/O: read and write just what you want

Automatic type conversion

On-the-fly compression

(In Python, we even use the array-access syntax!)

Parallel reads & writes with MPI

(Directly from Python!)
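Partial I/O, type conversion and compression can be sketched with h5py as follows. The dataset name and sizes are hypothetical, and the file is held in memory. Parallel MPI access is left out because it needs an MPI-enabled HDF5 build.

```python
import h5py
import numpy as np

# In-memory file; the dataset name and sizes are hypothetical.
f = h5py.File("features_demo.h5", "w", driver="core", backing_store=False)

# On-the-fly compression: request gzip when the dataset is created.
# Compressed datasets are chunked; h5py picks a chunk shape automatically here.
dset = f.create_dataset("samples", shape=(1000,), dtype="f4", compression="gzip")

# Automatic type conversion: the integer array is converted to the
# dataset's float32 type as it is written.
dset[:] = np.arange(1000)

# Partial I/O: array-style slicing reads only the requested elements.
first_ten = dset[0:10]
every_hundredth = dset[::100]

# Partial write: update one slice in place.
dset[10:20] = -1.0
patched = dset[8:12]

f.close()
```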

Page 21: Python and HDF5: Overview

Metadata & Organization

Groups form a POSIX-style “filesystem” in the file

Attributes can store arbitrary data on arbitrary objects

How should the file be organized?

You decide!

Thousands of domain-specific “application formats”

Anyone can read them because HDF5 is self-describing!
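As a sketch of what "POSIX-style" and "self-describing" mean in practice with h5py: objects are addressed by path, and a reader can walk an unfamiliar file to discover its layout. Every group, dataset and attribute name below is hypothetical.

```python
import h5py
import numpy as np

# In-memory file; every group, dataset and attribute name is hypothetical.
f = h5py.File("layout_demo.h5", "w", driver="core", backing_store=False)

# Intermediate groups in a path are created automatically.
f.create_dataset("/experiment/probe1/signal", data=np.zeros(8))
f.create_dataset("/experiment/probe2/signal", data=np.ones(8))
f["/experiment"].attrs["facility"] = "example lab"

# Address objects by absolute path, like files in a directory tree.
probe2_sum = float(f["/experiment/probe2/signal"][:].sum())

# Self-describing: walk the file and record what it contains,
# without knowing the layout in advance.
found = []

def record(name, obj):
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    found.append((name, kind))

f.visititems(record)

f.close()
```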

Page 22: Python and HDF5: Overview

Example

Page 23: Python and HDF5: Overview

Open an HDF5 file

Extract a particular dataset

Read the data

Make an interactive plot

Close the file
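The slide's code did not survive in this transcript. The five steps might look roughly like the sketch below, which uses h5py and Matplotlib. The file path, dataset name and contents are hypothetical, and the first few lines exist only to create a file to read back.

```python
import os
import tempfile

import h5py
import numpy as np

# Setup for the sketch only: write a small file to read back.
# The path, dataset name and contents are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("/shot/density", data=np.sin(np.linspace(0, 2 * np.pi, 100)))

# Open an HDF5 file (read-only).
f = h5py.File(path, "r")

# Extract a particular dataset; no data has been read yet.
dset = f["/shot/density"]

# Read the data into a NumPy array.
data = dset[...]

# Make an interactive plot. Matplotlib is imported inside the function
# so the rest of the sketch runs without a display.
def show_plot(values):
    import matplotlib.pyplot as plt
    plt.plot(values)
    plt.show()

# Close the file; `data` stays valid because it is an in-memory copy.
f.close()
```

Calling `show_plot(data)` opens the plot window.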

Page 29: Python and HDF5: Overview

Demo

Page 30: Python and HDF5: Overview

Real-world use

Page 31: Python and HDF5: Overview

UCLA Large Plasma Device

Page 32: Python and HDF5: Overview

UCLA Large Plasma Device

Image credit: Basic Plasma Science Facility

Page 33: Python and HDF5: Overview

Laser Experiment

Image credit: Basic Plasma Science Facility

Page 34: Python and HDF5: Overview

LAPD Data Products

Acquisition file - “Planes” of data in HDF5

Metadata: timestamps, digitizer settings, probe positions, background plasma conditions…

Packaged into HDF5 following “lab layout”

Users take their data back home and analyze

Page 35: Python and HDF5: Overview

Visualization

Page 36: Python and HDF5: Overview

Python 2D plotting

A. Collette et al., Phys. Rev. Lett. 105, 195003 (2010)

Page 37: Python and HDF5: Overview

Only 160 lines of code!

A. Collette et al., Phys. Rev. Lett. 105, 195003 (2010)

Page 38: Python and HDF5: Overview

Python does 3D too!

“MayaVi” 3D visualizer

Development sponsored by Enthought

Both offline (scripted) and interactive modes

A. Collette et al. Phys. Plasmas 18, 055705 (2011)

Page 39: Python and HDF5: Overview

CU Accelerator

Page 43: Python and HDF5: Overview

CU Accelerator

Raw data

HDF5 shot file

Automated speed/mass calculation

MySQL

Data search

HDF5 file for user

Page 44: Python and HDF5: Overview

Where to get Python

Page 45: Python and HDF5: Overview

Where to get Python

Distributions are the best way to get started

(they include HDF5/h5py!)

Anaconda (Windows, Mac, Linux): http://continuum.io

PythonXY (Windows): http://pythonxy.googlecode.com

Page 46: Python and HDF5: Overview

Questions?