Gabriel Antoniu <[email protected]>
KerData Project-Team, Inria Rennes - Bretagne Atlantique
Damaris: How To Efficiently Leverage Multicore Parallelism To Achieve Scalable, Jitter-Free I/O And Non-Intrusive In-Situ Visualization
Maison de la Simulation, Saclay, January 2014
Work involving Matthieu Dorier (IRISA, ENS Cachan), Gabriel Antoniu (INRIA), Marc Snir (ANL), Franck Cappello (ANL), Leigh Orf (CMICH), Dave Semeraro (NCSA, UIUC), Roberto Sisneros (NCSA, UIUC), Tom Peterka (ANL)
Context: HPC simulations on Blue Waters
• INRIA-UIUC-ANL Joint Lab for Petascale Computing
• Blue Waters: sustained petaflop performance
• Large-scale simulation at unprecedented accuracy
• Terabytes of data produced every minute
Big Data challenge on post-petascale machines
• How to efficiently store and move data?
• How to index, process, and compress these data?
• How to analyze, visualize, and finally understand them?
Outline
1. Big Data challenge at exascale
2. The Damaris approach
3. Achieving scalable I/O
4. Non-impacting in-situ visualization
5. Conclusion
1. Big Data challenges at exascale: when parallel file systems don't scale anymore
The standard I/O flow: offline data analysis
Periodic data generation from the simulation (100,000+ cores)
→ storage in a parallel file system (Lustre, PVFS, GPFS, …)
→ offline data analysis on another cluster (~10,000 cores)
Petabytes of data flow through this pipeline.
Writing from HPC simulations: File per process vs. Collective I/O
Two main approaches to I/O in HPC simulations:

File-per-process:
• Too many files
• High metadata overhead
• Hard to read back

Collective I/O (implemented in MPI-I/O, available in HDF5, NetCDF, …):
• Requires coordination
• Data communications
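To make the contrast concrete, here is a minimal sketch of the two patterns in plain MPI (illustrative only, not the CM1 code; file names and sizes are made up):

/* Sketch: file-per-process vs. collective I/O (illustrative only).
   Build with an MPI compiler, e.g. mpicc. */
#include <mpi.h>
#include <stdio.h>

#define N 1024                 /* elements per process (made-up size) */

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[N];
    for (int i = 0; i < N; i++) data[i] = rank + i * 1e-6;

    /* File-per-process: no coordination, but one file (and its
       metadata traffic) per MPI rank. */
    char fname[64];
    snprintf(fname, sizeof(fname), "out_%06d.dat", rank);
    FILE* f = fopen(fname, "wb");
    if (f) {
        fwrite(data, sizeof(double), N, f);
        fclose(f);
    }

    /* Collective I/O: one shared file; the ranks coordinate and
       write to disjoint offsets (MPI-I/O, as pHDF5 does underneath). */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out_shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}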
Periodic synchronous snapshots lead to I/O bursts and high variability
[Figure: the “cardiogram” of a PVFS data server during a run of the CM1 simulation, showing input and output throughput over time]
Visualizing throughput variability: between cores and between iterations.
2. The Damaris approach: dedicating cores to enable scalable asynchronous I/O
Writing without Damaris
[Diagram: all the cores of an SMP node share the main memory and push their data to the network at the same time]
Huge access contention, degraded performance.
With Damaris: using dedicated I/O cores
[Diagram: the compute cores of an SMP node write to shared memory; a dedicated core processes the data and sends it to the network]
• Dedicate cores to I/O
• Write in shared memory
• Process data asynchronously
• Produce results
• Output results
Leave a core, go faster!
Moving I/O to a dedicated core
[Diagram comparing time-partitioning (all cores periodically stop to write) with space-partitioning (dedicated cores handle I/O while the others keep computing)]
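For readers who want to see the space-partitioning idea in code, here is a minimal sketch using plain MPI-3; it illustrates the pattern only, not Damaris internals, and the choice of node-local rank 0 as the I/O core is an assumption:

/* Sketch: space-partitioning with MPI-3 (pattern only, not Damaris
   internals). One core per node is set aside for I/O. */
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group the ranks that share a node (i.e. shared memory). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Node-local rank 0 becomes the dedicated I/O core (assumption);
       the other cores keep computing. */
    int is_io_core = (node_rank == 0);
    MPI_Comm role_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_io_core, world_rank, &role_comm);

    if (is_io_core) {
        /* ... drain snapshots from node-local shared memory and
           write them asynchronously ... */
    } else {
        /* ... run the simulation; hand snapshots to the I/O core
           instead of hitting the file system directly ... */
    }

    MPI_Comm_free(&role_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}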
Damaris at a glance
• Dedicated Adaptable Middleware for Application Resources Inline Steering
• Main idea: dedicate one or a few cores in each SMP node for data management
• Features:
  - Shared-memory-based communications
  - Plugin system (C, C++, Python)
  - Connection to visualization engines (e.g. VisIt, ParaView)
  - External XML description of data
Damaris: current state of the software
• Version 0.7.2 available at http://damaris.gforge.inria.fr/ (1.0 available soon), along with documentation, tutorials, examples, and a demo!
• Written in C++; uses Boost for IPC, and Xerces-C and XSD for XML parsing
• API for Fortran, C, and C++
• Tested on: Grid’5000 (Linux Debian), Kraken (Cray XT5, NICS), Titan (Cray XK6, Oak Ridge), JYC and Blue Waters (Cray XE6, NCSA), Surveyor and Intrepid (Blue Gene/P, Argonne)
• Tested with: CM1 (climate), OLAM (climate), GTC (fusion), Nek5000 (CFD)
3. Results: achieving scalable I/O
Running the CM1 simulation on Kraken, G5K and BluePrint with Damaris
• The CM1 simulation
  - Atmospheric simulation, one of the Blue Waters target applications
  - Uses HDF5 (file-per-process) and pHDF5 (for collective I/O)
• Kraken: Cray XT5 at NICS, 12 cores/node, 16 GB/node, Lustre file system
• Grid’5000: 24 cores/node, 48 GB/node, PVFS file system
• BluePrint: Power5, 16 cores/node, 64 GB/node, GPFS file system
Damaris achieves almost perfect scalability
[Plot: weak scalability factor (0 to 10,000) vs. number of cores (up to 10,000), comparing perfect scaling, Damaris, file-per-process, and collective I/O]
Weak scalability factor: S = N × Tbase / T, where N is the number of cores, Tbase the time of an iteration on one core without a write, and T the time of an iteration plus a write.
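With purely illustrative numbers (not measurements from the talk): if an iteration takes Tbase = 10 s without I/O and T = 12.5 s once the write is included, then on N = 1024 cores S = 1024 × 10 / 12.5 ≈ 819, whereas perfect scaling (T = Tbase) would give S = N = 1024.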
[Plot: application run time in seconds (50 iterations + 1 write phase) at 576, 2304, and 9216 cores on Kraken (Cray XT5)]
[Left plot: average and maximum time to write (seconds) at 576, 2304, and 9216 cores on Kraken (Cray XT5), with 28 MB per process, comparing collective I/O, file-per-process, and Damaris]
[Right plot: average, maximum, and minimum write time (seconds) vs. total amount of data (GB, up to 30) on BluePrint (Power5, 1024 cores)]
Damaris hides the I/O jitter
Damaris increases effective throughput
[Plot: average aggregate throughput from the writer processes (GB/s, log scale from 0.0625 to 16) vs. number of cores on Kraken (Cray XT5), comparing file-per-process, Damaris, and collective I/O]
More results in: M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf, “Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O”, in Proceedings of IEEE CLUSTER 2012, Beijing, China.
[Plots: time spent by Damaris writing data (“used time”) vs. time spent waiting (“spare time”), on Kraken (Cray XT5, at 576, 2304, and 9216 cores) and on BluePrint (Power5, 1024 cores, for 0.05 to 24.7 GB of data)]
Damaris spares time? Let’s use it!
Damaris spares time for data management
4. Recent work, non-impacting in-situ visualization: getting insights from running simulations
From offline to coupled visualization
• Offline approach
  - I/O performance issues in the simulation
  - I/O performance issues in the visualization software
  - Too much data!
• Coupled approach
  - Direct insight into the simulation
  - Bypasses the file system
  - Interactive, BUT hardly accepted by users
Towards in-situ visualization
• Loosely coupled strategy
  - Visualization runs on a separate, remote set of resources
  - Partially or fully asynchronous
  - Solutions include staging areas and file-format wrappers (HDF5 DSM, ADIOS, …)
• Tightly coupled strategy
  - Visualization is collocated with the simulation
  - Synchronous (time-partitioning): the simulation periodically stops
  - Implemented by instrumenting the simulation code
  - Memory-constrained
Four main challenges
User friendliness:
• Low impact on the simulation code
• Adaptability (to different simulations and visualization scenarios)

Performance:
• Low impact on the simulation run time
• Good resource utilization (low memory footprint, use of GPUs, …)
In-situ visualization strategies
Coupling             Tight   Loose        ?
Impact on code       High    Low          Minimal
Adaptability
Interactivity        Yes     None         Yes
Instrumentation      High    Low          Minimal
Impact on run time   High    Low          Minimal
Resource usage       Good    Non-optimal  Better
• Researchers seldom accept tightly coupled in-situ visualization
  - Because of the development overhead, the performance impact, …
  - “Users are stupid, greedy, lazy slobs” [1]
• Is there a solution achieving all these goals? Yes: Damaris

[1] D. Thompson, N. Fabian, K. Moreland, L. Ice, “Design Issues for Performing In Situ Analysis of Simulation Data”, Technical Report, Sandia National Laboratories.
Connect to a visualization backend
[Diagram: compute nodes and the file system, with the dedicated cores also feeding a visualization backend directly]
Interact with your simulation
[Diagram: the visualization tool connects back through the dedicated cores, letting the user interact with the running simulation]
Let’s take a representative example
// rectilinear grid coordinates
float mesh_x[NX];
float mesh_y[NY];
float mesh_z[NZ];

// temperature field
double temperature[NX][NY][NZ];
“Instrumenting” with Damaris
Damaris_write("mesh_x", mesh_x);
Damaris_write("mesh_y", mesh_y);
Damaris_write("mesh_z", mesh_z);

Damaris_write("temperature", temperature);
(Yes, that’s all)
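To see where these calls sit, here is a hedged sketch of a simulation main loop. Only Damaris_write appears on the slides; the header name and the commented-out init / end-of-iteration / finalize calls are placeholders for whatever the installed Damaris version provides:

#include "damaris.h"   /* header name assumed; adjust to your install */

#define NX 4
#define NY 4
#define NZ 4

float  mesh_x[NX], mesh_y[NY], mesh_z[NZ];  /* rectilinear coordinates */
double temperature[NX][NY][NZ];             /* temperature field */

void compute_one_step(void) { /* ... advance the simulation ... */ }

void simulate(int n_steps) {
    /* damaris_init("config.xml");  -- hypothetical initialization */
    for (int step = 0; step < n_steps; step++) {
        compute_one_step();

        /* Expose the snapshot: data lands in shared memory and is
           processed asynchronously by the dedicated cores. */
        Damaris_write("mesh_x", mesh_x);
        Damaris_write("mesh_y", mesh_y);
        Damaris_write("mesh_z", mesh_z);
        Damaris_write("temperature", temperature);

        /* damaris_end_iteration();  -- hypothetical: signals that
           this iteration's data is complete */
    }
    /* damaris_finalize();  -- hypothetical shutdown */
}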
Now describe your data in an XML file
<parameter name="NX" type="int" value="4"/>
<layout name="px" type="float" sizes="NX"/>
<variable name="mesh_x" layout="px"/>
<!-- idem for NY and NZ, py and pz, mesh_y and mesh_z -->

<layout name="data_layout" type="double" sizes="NX,NY,NZ"/>
<variable name="temperature" layout="data_layout" mesh="my_mesh"/>

<mesh type="rectilinear" name="my_mesh" topology="3">
  <coord name="mesh_x" unit="cm" label="width"/>
  <coord name="mesh_y" unit="cm" label="depth"/>
  <coord name="mesh_z" unit="cm" label="height"/>
</mesh>
• Unified data description for different visualization software
• Damaris translates this description into the right function calls to any such software (right now: Python and VisIt)
• Damaris handles the interactivity through dedicated cores!
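For comparison, here is roughly the kind of code the XML description saves you from writing by hand. This sketch uses VisIt's libsim (SimV2) data interface from memory, so treat the header name and exact signatures as approximations rather than as authoritative API documentation:

#include <VisItDataInterface_V2.h>  /* libsim data interface (header name assumed) */

#define NX 4
#define NY 4
#define NZ 4

extern float mesh_x[NX], mesh_y[NY], mesh_z[NZ];

/* Hand-built equivalent of the <mesh> block above: allocate a
   rectilinear mesh and attach the three coordinate arrays. */
visit_handle make_mesh(void)
{
    visit_handle mesh = VISIT_INVALID_HANDLE;
    if (VisIt_RectilinearMesh_alloc(&mesh) == VISIT_OKAY) {
        visit_handle hx, hy, hz;
        VisIt_VariableData_alloc(&hx);
        VisIt_VariableData_alloc(&hy);
        VisIt_VariableData_alloc(&hz);
        VisIt_VariableData_setDataF(hx, VISIT_OWNER_SIM, 1, NX, mesh_x);
        VisIt_VariableData_setDataF(hy, VISIT_OWNER_SIM, 1, NY, mesh_y);
        VisIt_VariableData_setDataF(hz, VISIT_OWNER_SIM, 1, NZ, mesh_z);
        VisIt_RectilinearMesh_setCoordsXYZ(mesh, hx, hy, hz);
    }
    return mesh;
}

With Damaris, this boilerplate (and the matching callbacks) is generated from the XML description, which is what the line-count comparison below measures.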
[Screenshot: VisIt interacting with the CM1 simulation through Damaris]
[Screenshot: a plugin exposing vector fields in the Nek5000 simulation with VTK]
Damaris has a low impact on the code
           VisIt       Damaris
Curve.c    144 lines   6 lines
Mesh.c     167 lines   10 lines
Var.c      271 lines   12 lines
Life.c     305 lines   8 lines

Number of lines of code required to instrument sample simulations with VisIt and with Damaris.
Nek5000:
• VTK: 600 lines of C and Fortran
• Damaris: 20 lines of Fortran, 60 of XML
Results with the Nek5000 simulation
• The Nek5000 simulation
  - CFD solver developed at ANL
  - Based on the spectral element method
  - Written in Fortran 77 and MPI
  - Scales to over 250,000 cores
• Data in Nek5000
  - A fixed set of elements constituting an unstructured mesh
  - Each element is a curvilinear mesh
• Damaris already knows how to handle curvilinear meshes and pass them to VisIt
• Tested on up to 384 cores with Damaris so far; the traditional approach does not even scale to this number!
Damaris removes the variability inherent to in-situ visualization tasks
Experiments done with the turbChannel test-case in Nek5000, on 48 cores (2 nodes) of Grid’5000’s Reims cluster.
Conclusion: using Damaris for in-situ analysis
Coupling             Tight   Loose        Damaris
Impact on code       High    Low          Minimal
Adaptability
Interactivity        Yes     None         Yes
Instrumentation      High    Low          Minimal
Impact on run time   High    Low          Minimal
Resource usage       Good    Non-optimal  Better
• Impact on code: 1 line per object (variable or event)
• Adaptability to multiple visualization software (Python, VisIt, ParaView, etc.)
• Interactivity through VisIt
• Impact on run time: the simulation run time is independent of the visualization
• Resource usage:
  - preserves the “zero-copy” capability of VisIt thanks to shared memory
  - can asynchronously use the GPUs attached to Cray XK6 nodes
More results in: M. Dorier, R. Sisneros, T. Peterka, G. Antoniu, D. Semeraro, “Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework”, in Proceedings of IEEE LDAV 2013 (IEEE Symposium on Large-Scale Data Analysis and Visualization), Atlanta, United States, October 2013.
Conclusion
The Damaris approach in a nutshell
• Dedicated cores
• Shared memory
• Highly adaptable system thanks to plugins and XML
• Connection to VisIt

Results on I/O
• Fully hides the I/O jitter and I/O-related costs
• 15x sustained write throughput compared to collective I/O
• Almost perfect scalability
• Execution time divided by 3.5 compared to collective I/O
• Enables a 600% compression ratio without any overhead

Recent work on in-situ visualization
• Efficient coupling of simulation and analysis tools
• Perfectly hides the run-time impact of in-situ visualization
• Minimal code instrumentation
• High adaptability
Thank you!