TRANSCRIPT
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Parallel Visualization, Data Formatting, Software Overview
Sean Ahern – Remote Data Analysis and Visualization Center
Friday, July 9, 2010
“Who are you?”
ORNL co-I for the SciDAC Visualization and Analysis Center for Enabling Technologies (VACET)
Director of the University of Tennessee Center for Remote Data Analysis and Visualization
Visualization lead at ORNL’s Leadership Computing Facility
How do we handle the hard cases?
But what about large, distributed data?
Or distributed rendering?
Or distributed displays?
Or all three?
Applications are pushing data set sizes toward Exascale
• Spatial resolution: Astrophysics, Radiation transport, Combustion, Chemistry, Molecular dynamics, Fusion, …
• Temporal resolution: Molecular dynamics, Climate, Fusion, …
• Multivariate: Climate, Radiation transport, V&V, …
• Ensembles: Climate, V&V, …
Data scale limits scientific understanding
• Spatial resolution increasing in many domains
• Too many zones to see on screen (with or without high-res Powerwall)
• Temporal complexity increasing
• Climate simulations will have 100,000s of time steps
• Manually finding temporal patterns is tedious and error-prone
• Multivariate overload
• Climate simulations have 100-200 variables
• Issues of data models and domain-specific data
• E.g. Multigroup radiation fields
End-to-end schematic
[Diagram: a simulation generates data on an HPC system; the scientist submits input and queries against it and receives results.]
Dataset size growth
[Diagram: pyramid of dataset size categories (Small, Medium, Large, Hero) in order of increasing dataset size.]
Data size categories
• Small:
• Data is small enough to easily move anywhere
• Analysis/vis is generally done on local workstations
• Medium:
• Data won’t fit on local workstations
• Generally have to process on “fat” SMP systems
• Large:
• Rather painful to move
• Requires distributed parallelism
• Hero:
• Functionally impossible to move
• Only approachable on largest computational platforms
“Large” data – remote visualization
• Largest datasets require use of institutional resources
• Reduces data movement issues
• Allows exploitation of multiple GPUs
• Provides visualization to remote users
• Exploited by VisIt, ParaView, EnSight
“Hero” data
• Purchasing separate analysis systems at the petascale can be prohibitively expensive: $5-20 million
• Working to move largest vis/analysis tools to HPC architecture
• VisIt on Cray XT4/5
• See Cray User’s Group ‘08
• ParaView on Cray XT4/5
• See Cray User’s Group ’09
• Parallel R on Cray XT5
• This year (hopefully)
A lot of success has been through data flow networks (pipelines)
• Different data formats:
• NetCDF, HDF, text, CSV, PDB, …
• Different types of data operations:
• Slicing, resampling, mesh transforms, filtering, …
• Different ways of plotting:
• Pseudocolor, isosurfaces, volume rendering
These are independent of each other!
Make these independent modules
• Data reading
• Data operations
• Data plotting
Read data → Filter → Filter → Plot
This is a data flow network
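As a concrete sketch, here is the same three-module network in VTK's Python bindings. The class names are real VTK modules; the input file name is hypothetical.

```python
# A data flow network in VTK: reader -> filter -> mapper/actor (plot).
import vtk

reader = vtk.vtkXMLImageDataReader()          # data reading
reader.SetFileName("simulation_output.vti")   # hypothetical input file

contour = vtk.vtkContourFilter()              # data operation
contour.SetInputConnection(reader.GetOutputPort())
contour.SetValue(0, 0.5)                      # extract the 0.5 isosurface

mapper = vtk.vtkPolyDataMapper()              # data plotting
mapper.SetInputConnection(contour.GetOutputPort())
actor = vtk.vtkActor()
actor.SetMapper(mapper)

renderer = vtk.vtkRenderer()
renderer.AddActor(actor)
window = vtk.vtkRenderWindow()
window.AddRenderer(renderer)
window.Render()   # demand-driven: pulls data through the whole network
```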
Many software systems use this method
• VTK
• AVS/Express
• SCIRun
• …
Fine for small to medium sized data
How do we handle the hard cases?
But what about large, distributed data? Or distributed rendering? Or distributed displays? Or all three?
A good solution: data parallelization for all components
• Identical data flow networks on each processor.
• Networks differentiated by portion of data they operate on.
• Minimal synchronization
• “Scatter/gather”
• No distribution (i.e. scatter), because scatter is done through choice of what data to read.
• Gather: done when rendering
• Distribution of data is key
[Diagram: a parallel simulation code (P0-P3) writes data through I/O to a parallel analysis code (P0-P3); results from Proc 0-Proc 2 are merged by rendering/compositing.]
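A minimal sketch of this pattern with mpi4py, under an assumed one-file-per-rank chunk layout: every rank runs the identical pipeline, "scatter" happens purely through what each rank chooses to read, and the only gather is a final reduction (standing in here for image compositing).

```python
# Identical pipeline on every rank; only the data chunk differs.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# "Scatter" happens at read time: each rank chooses its own chunk.
chunk = np.load(f"chunk_{rank:04d}.npy")      # hypothetical chunk file

# Identical network on each processor (a threshold filter stands in
# for the real slice/contour/render stages).
local_count = int((chunk > 0.5).sum())

# The only gather is at the end (in practice: image compositing).
total = comm.reduce(local_count, op=MPI.SUM, root=0)
if rank == 0:
    print("cells above threshold:", total)
```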
Decomposition of data is key issue
• Parallelization strategy: each processor works on piece of data
• Key: Must have good decomposition of data
• Two I/O situations:
• Reading from file, file format enforces decomposition
• Reading from file, file format allows for arbitrary decomposition
• Three processing situations:
• Pre-calculated decomposition
• Must do “overloading” of chunks when there are more chunks than processors (see the sketch after this list)
• Dynamic decomposition at I/O
• Must understand how to read out arbitrary chunks of data
• On-the-fly redistribution
• Processing-dependent
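A sketch of that "overloading" step under a round-robin assignment; the assignment policy is one common choice, not something the talk prescribes.

```python
# Round-robin "overloading": each rank takes every nprocs-th chunk.
def chunks_for_rank(rank, nprocs, nchunks):
    return [c for c in range(nchunks) if c % nprocs == rank]

for r in range(4):                     # e.g. 10 chunks on 4 ranks
    print(r, chunks_for_rank(r, 4, 10))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]
```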
This can take us a long way
• Weak scaling study (isocontouring, volume rendering): ~63M cells/core (e.g. 2T cells on 32K procs)

Machine    Type        Problem size   # cores
Jaguar     Cray XT5    2T             32k
Franklin   Cray XT4    1T, 2T         16k, 32k
Dawn       BG/P        4T             64k
Juno       Linux       1T             16k, 32k
Ranger     Sun Linux   1T             16k
Purple     AIX         0.5T           8k

Since this work, people have reached 4 trillion and 8 trillion cells.
This is only part of the puzzle
[Diagram: the same pipeline as before: parallel simulation code, I/O, parallel analysis code, rendering/compositing.]
The job of turning geometry into imagery is complex
• One or more geometry generators
• Fed to one or more displays
[Diagram: three configurations: many geometry generators feeding many displays; many geometry generators feeding one display; one geometry generator feeding many displays.]
Rendering to a single tile (sort last)
• Depth “sort” – Only good for opaque geometry (per-pixel merge sketched after this list)
• For each set of geometry, generate image and depth
• When merging, compare depth and choose pixel that’s closest to the viewer
• Alpha blending – Parallel volume rendering / transparent geometry
• Process all geometry in back-to-front order (not always easy to do in parallel)
• Blend geometry with transparency, creating a final image
• Compositing algorithms:
• Binary tree
• Direct send
• Binary swap / n-way swap
• 2-3 swap
• Radix-swap
• …
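A minimal numpy sketch of the per-pixel depth-compare step that these compositing algorithms build on; the real algorithms (binary swap, direct send, and so on) differ in how partial images are communicated between processors, not in this merge rule.

```python
# Per-pixel depth compare for opaque sort-last compositing.
import numpy as np

def depth_composite(rgb_a, z_a, rgb_b, z_b):
    """Keep, per pixel, the fragment closest to the viewer (smaller z)."""
    a_wins = z_a < z_b
    rgb = np.where(a_wins[..., None], rgb_a, rgb_b)
    return rgb, np.minimum(z_a, z_b)

# Two ranks' partial renderings of a 2x2 image:
rgb0, z0 = np.zeros((2, 2, 3)), np.array([[0.2, 0.9], [0.9, 0.2]])
rgb1, z1 = np.ones((2, 2, 3)),  np.array([[0.5, 0.5], [0.5, 0.5]])
rgb, z = depth_composite(rgb0, z0, rgb1, z1)  # per-pixel winner
```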
Rendering to multiple tiles (sort first)
• Break up geometry based upon camera frustum
• Select only geometry that corresponds to each tile
• Send subset of geometry over network to be rendered
Used for many Powerwalls
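A sketch of the tile-selection test at the heart of sort-first rendering, assuming a hypothetical 2x2 tile layout in normalized screen coordinates.

```python
# Bin geometry to tiles by overlapping screen-space bounding boxes.
def tiles_for_bbox(bbox, tiles):
    xmin, ymin, xmax, ymax = bbox
    return [tid for tid, (tx0, ty0, tx1, ty1) in tiles.items()
            if xmin < tx1 and xmax > tx0 and ymin < ty1 and ymax > ty0]

# Hypothetical 2x2 wall in normalized screen coordinates:
tiles = {0: (0.0, 0.0, 0.5, 0.5), 1: (0.5, 0.0, 1.0, 0.5),
         2: (0.0, 0.5, 0.5, 1.0), 3: (0.5, 0.5, 1.0, 1.0)}
print(tiles_for_bbox((0.4, 0.4, 0.6, 0.6), tiles))  # spans all four: [0, 1, 2, 3]
```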
Both at the same time
• Complicated combinations of sort first and sort last
• Some libraries to make this easier:
• Chromium
• IceT
• Equalizer
• Paracomp
• SAGE
Performance is highly dependent upon I/O rates
• Reading data into memory takes the most time
• Much more than data processing
• Much more than rendering and compositing
• Parallel filesystems are essential for large data
• Data formats designed for parallel filesystems are essential for large data
“Joule metric” run
Problem size   Isocontour time   % time spent in I/O
103 M cells    2.48 s            87.9%
321 M cells    11.62 s           94.9%

• Weak scaling study (isocontouring, volume rendering): 103M-321M cells on 4k-12k cores
• Superlinear scaling in processing
• But wallclock time is dominated by I/O
Trillion cell study confirms that I/O dominates
2T cells, 32K procs
• Weak scaling study (isocontouring, volume rendering): ~63M cells/core
• Approx I/O time: 2-5 minutes
• Approx processing time: 10 seconds
VisIt: an end user visualization and analysis tool for extremely large data
• 5 major use cases: data exploration, data analysis, visual debugging, comparison, and presentation
• Demonstrated scaling to >100k cores
• Strong emphasis on building a usable, robust tool for scientists.
• Plugin model allows for wide range of capabilities
• Open source project, with direct support from DOE (SciDAC, NEAMS, ASC, BER) and NSF (TeraGrid XD)
• visitusers.org wiki with 400+ pages of documentation and over 80,000 page views
• Over 100,000 downloads
• R&D100 Award in 2005
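For flavor, a minimal session in VisIt's Python CLI (run with "visit -cli"); the database path and variable name here are hypothetical.

```python
# Minimal VisIt CLI session; the compute engine can launch in parallel.
OpenDatabase("localhost:/path/to/run.silo")   # hypothetical database
AddPlot("Pseudocolor", "density")             # hypothetical variable
AddOperator("Slice")
DrawPlots()
SaveWindow()
```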
ParaView: open-source, multi-platform data analysis and visualization application
• Based on VTK: Deep pipeline modification
• Quantitative and qualitative use cases
• Developed to process large datasets using distributed memory computing resources: laptops to supercomputers
• Superb, best-in-class rendering infrastructure
• Customizable GUI
• Access Grid integration for collaboration
• Commercial support from Kitware
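A comparable minimal sketch of ParaView's scripted pipeline, run with pvpython; the file and field names are hypothetical.

```python
# Minimal pvpython pipeline: read, contour, render, save.
from paraview.simple import *

reader = OpenDataFile("/path/to/run.vtu")     # hypothetical dataset
contour = Contour(Input=reader)
contour.ContourBy = ["POINTS", "pressure"]    # hypothetical field
contour.Isosurfaces = [0.5]
Show(contour)
Render()
SaveScreenshot("contour.png")
```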
EnSight: leading commercial parallel visualization tool
• Specialized for engineering analysis
• CFD, FEA, etc.
• Commercial support from CEI
• Distributed rendering infrastructure
• Direct Powerwall support
• Virtual reality
• Distributed parallel data processing and rendering
Rmpi: Parallel statistical analysis
[Diagram: analysis nodes (Node 0, Node 1, Node 2) communicate over MPI (via a modified Rmpi) and read from a Lustre file system; Node 0 produces the graphics files.]
• Write a data reader to bring data in from the parallel file system
• Develop data-parallel analysis with the full capability of R on every node
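Purely as an illustration of the same pattern (the talk's actual tool is R via a modified Rmpi), here is a hypothetical mpi4py analog; the paths are made up.

```python
# Each node reads its slice, analyzes locally; node 0 makes graphics.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

slab = np.load(f"/lustre/run/slab_{rank:04d}.npy")  # hypothetical reader
local_mean = float(slab.mean())                     # data-parallel analysis

means = comm.gather(local_mean, root=0)
if rank == 0:                                       # node 0 produces graphics
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    plt.plot(means)
    plt.savefig("means.png")
```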
Central file store
• Scalable parallel throughput
• Lustre, GPFS, Panasas, ...
• Shared between source and destination
• Provides “zero copy” access to simulation results
• Allows smart deployment of remote visualization capabilities
Data formatting is important to data understanding
The purpose of computing is insight, not numbers.
– Richard Hamming
The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' (I found it!) but 'That's funny ...'
– Isaac Asimov
We want our data model to reflect how the scientist “thinks” of the data
[Diagram: a spectrum from simple to complex: bits, bytes, words, arrays, named arrays, multidimensional arrays, coordinate fields, scalars, meshes, vectors, atoms, materials, bonds, mesh refinement, collections, time interpolation.]
Each step gets us closer to the ideal. But each step is more domain-specific.
Meshes
• All data lives on a mesh
• Discretizes space into points and cells
• 1D, 2D, 3D
• All of these over time (up to 4D)
• Can have lower-dimensional meshes in a higher-dimensional space (e.g. 2D surface in 3D space)
• Provides a place for data to be located
• Defines how data is interpolated
Mesh structure
• 1D Curves
• 2D/3D meshes
• Rectilinear
• Curvilinear
• Unstructured
• Points
• AMR
• Molecular
• Domain decomposed
• Ghost zones/nodes
Variables
• Scalars, Vectors, Tensors
• Sit on points or cells of a mesh
• Points: linear interpolation
• Cells: piecewise constant
• Could have different dimensionality than the mesh (e.g. 3D vector data on a 2D mesh)
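A one-dimensional numpy sketch of the two interpolation rules; the mesh and values are made up.

```python
# Point-centered data: linear interpolation between nodes.
# Cell-centered data: constant within each cell.
import numpy as np

nodes = np.array([0.0, 1.0, 2.0])           # 1D mesh: cells [0,1] and [1,2]
point_data = np.array([10.0, 20.0, 40.0])   # one value per node
cell_data = np.array([1.0, 2.0])            # one value per cell

x = 0.25
print(np.interp(x, nodes, point_data))           # 12.5 (linear)
print(cell_data[np.searchsorted(nodes, x) - 1])  # 1.0 (piecewise constant)
```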
Materials
• Sometimes used in engineering codes
• Describes disjoint spatial regions at a sub-grid level
• Volume/area fractions
Parallel meshes
• Provides aggregation for meshes
• A mesh may be composed of hundreds of thousands of mesh “blocks”.
• Allows data parallelism
AMR meshes
• Mesh blocks can be associated with patches and levels.
• Allows for aggregation of meshes into AMR hierarchy levels.
Data model goals
• Rich
• Encompasses computational domain as much as possible
• Sharability
• Being able to easily move data between tools
• Being able to give data to other people
• Self description
• Changes to data shouldn’t require a change in the format
• New data should be discoverable without human intervention
• Rapid I/O
• Minimize time dumping out data
• Minimize time reading data
• Parallel I/O
• Support access by multiple threads at once on parallel platforms.
• More…
Data Storage
NetCDF, HDF5, Silo, EnSight Gold, Fluent, Flash, ADIOS, XDMF, PDB, S3D, VTK, OpenFOAM, CGNS, Chombo, Tecplot, ViSUS, STL
Underlying libraries
• NetCDF:
• Named array/scalar storage
• Generalized attributes
• Parallel access
• HDF5:
• Named array/scalar storage
• Hierarchical storage
• Parallel read and write (see the sketch after this list)
• ASCII
• Simple for humans to read and write
• Horrible for computers to read, especially in parallel
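A minimal sketch of parallel HDF5 access through h5py's MPI ("mpio") driver, which requires an MPI-enabled h5py build and a run under mpirun; the file name and dataset shape are hypothetical.

```python
# Each rank writes its own slab of one shared dataset.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

with h5py.File("fields.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("density", shape=(nprocs, 1024), dtype="f8")
    dset[rank, :] = np.random.rand(1024)   # shared file, independent slabs
```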
Higher-level libraries and formats
• ADIOS:
• Used for very fast I/O off HPC systems
• HDF5 and native storage
• Silo:
• Very rich (parallel) data model on top of HDF5
• PDB (Protein Data Bank)
• Used for storage of molecular data
• Based on ASCII
• VTK:
• Native storage for the VTK pipeline visualization system
• Both ASCII and binary formats
Higher-level libraries and formats (cont)
• XDMF:
• Neat format that splits “data” and “metadata”
• ASCII XML file for metadata
• HDF5 file for heavyweight data.
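A sketch of that split: a hypothetical heavyweight array written to HDF5, plus a small XDMF XML file pointing at it. All names and sizes are made up.

```python
# Heavy data in HDF5; light XML metadata referencing it.
import h5py
import numpy as np

with h5py.File("heavy.h5", "w") as f:
    f["pressure"] = np.random.rand(8, 8, 8)    # hypothetical field

xml = """<?xml version="1.0" ?>
<Xdmf Version="2.0">
  <Domain>
    <Grid Name="box" GridType="Uniform">
      <Topology TopologyType="3DCoRectMesh" Dimensions="8 8 8"/>
      <Geometry GeometryType="ORIGIN_DXDYDZ">
        <DataItem Dimensions="3">0 0 0</DataItem>
        <DataItem Dimensions="3">1 1 1</DataItem>
      </Geometry>
      <Attribute Name="pressure" Center="Node">
        <DataItem Dimensions="8 8 8" Format="HDF">heavy.h5:/pressure</DataItem>
      </Attribute>
    </Grid>
  </Domain>
</Xdmf>
"""
with open("light.xmf", "w") as f:
    f.write(xml)
```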
Community standards
• NetCDF Climate and Forecast (CF) convention:
• Promotes sharing of data among climate researchers
• Standard naming
• Coordinate systems
• Dimensions, missing data, units, etc.
• Many communities build formats on top of others:
• S3D is over NetCDF
• GTC, Chombo, Enzo, PFLOTRAN, Flash, etc. are over HDF5
• …
Summary
• Parallel visualization is an important capability for gaining insight from petascale (and soon to be exascale) data.
• It’s as difficult a job as doing the original HPC simulation itself.
• Rendering is not yet a solved problem.
• Data storage and data models are critical to high performance and scientific understanding.