fastensor: pain-free hdf5 data analysis at large scale

15
FasTensor: Pain-free HDF5 data analysis at large scale Bin Dong, PhD, Research Scientist, Lawrence Berkeley National Lab Other Contributors: Kesheng Wu, Suren Byna, Houjun Tang, Weijie Zhao, Florin Rusu HDF5 USER GROUP MEETING 2020 (HUG 2020)

Upload: others

Post on 24-Jan-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

FasTensor: Pain-free HDF5 data analysis at large scale

Bin Dong, PhD, Research Scientist, Lawrence Berkeley National Lab

Other Contributors: Kesheng Wu, Suren Byna, Houjun Tang, Weijie Zhao, Florin Rusu

HDF5 USER GROUP MEETING 2020 (HUG 2020)

Mountains of Scientific Data Wait for Analysis

Light Source180 PB/year (ALS-U at Berkeley Lab )

Genomics 10 PB/year

High Energy Physics 200 PB/year

Climate 100 EB/year

Sources: L. Nowell, D., Ushizima, S. Byna, JGI and ALS at LBNL, etc.

Most are multidimensional arrays, stored in file formats like HDF5, PNetCDF, ADIOS, etc

Challenges (Pains) in Developing Data Analysis Programs

HDF5/NetCDF/ADIOS/ ...

MPI / MPI-IO / ...

Lustre / PVFS / ...

Linux / Ext4 / ..

Data in Storage System

Dee

p So

ftwar

e St

ack

Many Computing Nodes and CPU Cores ( over 7 million cores at FUGAKU, No. 1 of Top500 on June 2020)

… ...

Data Management TasksParallelization

Insight

Infinite (∞) Number of Data Analysis Operations

Parallel Data Programming Model is a Lifesaver Data programming model: an programming abstraction, hiding complexities of hardware/software and being generic to many advanced analysis tasks

User-defined Functions

Data Programming Model e.g., MapReduce

Hardware and Software

Generic Operators

Abstract Data Type

e.g., Map, Reduce

e.g., Key-Value Tuple (as 1D List)

E.g., Searching, PageRank, SVD, Sorting, TF-IDF, BFS, ...

e.g., desktop, grids, HPC system, volunteer computing environments, cloud environments, mobile environment, ...

MapReduce’s Limitations in Tensor Data Analysis

Convolution on a 2 by 4 2D Tensor (Array)

Kernel is 2 by 2

1, Mismatched Data Model -- Convert Tensor to KV list at Map stage

2, Expensive reduce -- Duplicate KV for Reduce stage

HDF5 is Array based data model

More details at "SLOPE: Structural Locality-aware Programming Model for Composing Array Data Analysis", ISC 2019,

FasTensor (previously ArrayUDF): A new data programing model on array

1, An Array-native Data Programming Model namely SLOPE

2, A Stencil-based Abstract Data Type

3, A single Transform operation

4, HPC friendly ● MPI based Single Program Multiple Data (SPMD) Pattern ● Directly on scientific data formats, e.g., HDF5, ADIOS, PNetCDF● Manual/auto-chunking & ghost zone management● Distributed array cache● Async ghost zone exchange● Check-point● In-place modification semantic● Multi-arrays supports

Inspiring by: a Tensor can be defined as a multidimensional array and proper transformation rules

Transform

Transform

Stencil

A B

CBin Dong, Kesheng Wu, Surendra Byna, Jialin Liu, Weijie Zhao, Florin Rusu, "ArrayUDF: User-Defined Scientific Data Analysis on Arrays", HPDC 2017

Stencil: Abstract Data TypeStencil● An abstract data structure to represent a neighborhood of an Array● Definition: S(Base Cell, Neighbor Cells -- relative offsets)

Flexible geometric shapes/size to break an array into small units for atomic, out-of-core, or parallel processing

...

SLOPE Programming Model of FasTensor

Data Model: Multi-dimensional Array

Abstract Data Type: Stencil

Generic Operator: Transform

Data Model: 1D List

Abstract Data Type: Key-Value Pair

Generic Operator: Map, Reduce

v.s. MapReduce

Stencil Array

Transform:

User-Defined Function (UDF): Rules of the Transform

‘Structural Locality’-aware Programming model

S1 ↦ S2 , S1 ⊂ A , S2 ⊂ B

f

An Example of 3-point Moving Average

Vt-1 + Vt + Vt+1

3

Input Array A, a 2D 16 x 16 dataset in HDF5 file, where each row is a time series from a sensor

Output Array B, a 2D dataset in HDF5 file

Rules of the Transform from A to B

Execute the Transform,either sequentially or in parallel

Relative offsets

FasTensor in Earth Science (DAS) Xin Xing, etc, "Automated Parallel Data Processing Engine with Application to Large-Scale Feature Extraction", MLHPC, 2018,

Bin Dong, etc. "DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection", IPDPS, 2020,

Local Similarity Calculation

FasTensor

FasTensor in Plasma physics (VPIC)

Bin Dong, Patrick Frank Heiner Kilian, Xiaocan Li, Fan Guo, Suren Byna and Kesheng Wu, "Terabyte-scale Particle Data Analysis: An ArrayUDF Case Study", SSDBM 2019, July 23, 2019,

Time to Fuse Fields and Particles Data

● Field Data Analysis● Particle Data Analysis● Fusing Particle Data and Field Data

FastTensor can express all these diverse analysis

Comparing FasTensor, TensorFlow, Spark

Single CPU Core

2 layers CNN with forward steps on a 64K by 64K 2D array

FasTensor

FasTensor

FasTensor Scales Perfectly on CNN Computation

Parallel efficiency =

Strong Scaling

Weak Scaling

ti is the time to finish the work with i CPU cores

Summary● A formal release of FasTensor is at

https://bitbucket.org/berkeleylab/fastensor/

● We are looking for more use cases to provide dedicated support and make FasTensor better

● HDF5 feature request ○ Parallel virtual dataset

Acknowledgement● U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing

Research under contract number DE-AC02-05CH11231 ● National Energy Research Scientific Computing Center (NERSC)

Presentation Title | BERKELEY LAB 15

Thank You & Questions

Bin Dong

[email protected]

https://crd.lbl.gov/dong

FasTensor: Pain-free HDF5 data analysis at large scale https://bitbucket.org/berkeleylab/fastensor/