
Page 1: On the Path to Petascale: Top Challenges to Scientific Discovery

Presented by

On the Path to Petascale: Top Challenges to Scientific Discovery

Scott A. Klasky

NCCS, Scientific Computing, End-to-End Task Lead

Page 2: On the Path to Petascale: Top Challenges to Scientific Discovery


1. Code Performance

From 2004 to 2008, computing power for codes like GTC will go up three orders of magnitude!

Two paths to petascale computing for most simulations: more physics and larger problems, or code coupling.

My personal definition of leadership-class computing: "the simulation runs on >50% of the cores, running for >10 hours." One 'small' simulation will cost $38,000 on a petaflop computer. Science scales with processors.

XGC and GTC fusion simulations will run on 80% of the cores for 80 hours ($400,000/simulation).
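As a back-of-the-envelope check on these cost figures, here is a minimal sketch; the machine size (~125,000 cores) and the roughly $0.05 per core-hour charge are assumptions chosen only to reproduce the order of magnitude, not actual NCCS accounting rates.

# Rough cost model for leadership-class runs on a petaflop machine.
# ASSUMPTIONS (not from the slides): ~125,000 cores, ~$0.05 per core-hour.
TOTAL_CORES = 125_000
DOLLARS_PER_CORE_HOUR = 0.05

def run_cost(core_fraction, hours):
    """Cost of using a fraction of the machine for a given wall-clock time."""
    return core_fraction * TOTAL_CORES * hours * DOLLARS_PER_CORE_HOUR

# A 'small' leadership run: >50% of the cores for >10 hours.
print(f"small run : ${run_cost(0.5, 10):,.0f}")   # ~$31,000, same order as the $38,000 figure
# An XGC/GTC-style run: 80% of the cores for 80 hours.
print(f"fusion run: ${run_cost(0.8, 80):,.0f}")   # $400,000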

Page 3: On the Path to Petascale: Top Challenges to Scientific Discovery


Data Generated.

Mean time to failure will be ~2 days.

Restarts contain the critical information needed to replay the simulation from different times. A typical restart is 1/10 of memory, dumped every hour. (The big three apps support this claim.)

Analysis files are dumped every physical timestep, typically every 5 minutes of simulation. Analysis files vary; we estimate that for ITER-size simulations the data output will be roughly 1 GB every 5 minutes.

Demand I/O < 5% of the calculation.

The total simulation will potentially produce ~1,280 TB of restart data (80 hourly dumps of 16 TB each) plus 960 GB of analysis data (80 hours at 12 GB/hour).

Need > (16 × 1024 + 12) / (3600 × 0.05) ≈ 91 GB/s.

Asynchronous I/O is needed! The big three apps (combustion, fusion, astro) allow buffering, which reduces the required I/O rate to (16 × 1024 + 12) / 3600 ≈ 4.5 GB/s, with lower overhead. Get the data off the HPC machine and over to another system!

Produce HDF5 files on another system (too expensive on the HPC system).
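The bandwidth requirement above is plain arithmetic on the hourly dump sizes; the sketch below redoes the same calculation, using the 16 TB restart dump and 12 GB/hour of analysis output implied by the numbers on this slide.

# I/O bandwidth needed per hourly dump cycle.
restart_gb  = 16 * 1024   # hourly restart dump (~1/10 of memory), in GB
analysis_gb = 12          # 1 GB every 5 minutes -> 12 GB per hour
period_s    = 3600        # seconds between restart dumps
io_budget   = 0.05        # demand I/O < 5% of the calculation

# Synchronous I/O: everything must be written within 5% of the hour.
sync_rate = (restart_gb + analysis_gb) / (period_s * io_budget)
# Asynchronous I/O: buffered data can drain over the whole hour.
async_rate = (restart_gb + analysis_gb) / period_s

print(f"synchronous : {sync_rate:.0f} GB/s")    # ~91 GB/s
print(f"asynchronous: {async_rate:.2f} GB/s")   # ~4.55 GB/s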

Page 4: On the Path to Petascale: Top Challenges to Scientific Discovery


Workflow automation is desperately needed (with high-speed data-in-transit techniques). Need to integrate autonomics into the workflows.

Need to make it easy for the scientists.

Need to make it fault tolerant/robust.
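As one minimal sketch of what "fault tolerant and easy for the scientist" could mean for a single workflow step, the code below retries a transfer command with a delay instead of waiting for a human to notice a failed copy. The command, host name, and path are hypothetical placeholders; a real workflow engine would add logging, provenance, and notification.

import subprocess
import time

def transfer_with_retry(cmd, max_attempts=5, wait_s=60):
    """Run a transfer command (e.g. scp or bbcp), retrying on failure."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed (rc={result.returncode}); retrying in {wait_s} s")
        time.sleep(wait_s)
    return False

# Hypothetical usage: pull one analysis file from the HPC system.
ok = transfer_with_retry(
    ["scp", "user@jaguar.ccs.ornl.gov:/scratch/run42/analysis_0001.h5", "."])
if not ok:
    print("transfer failed after all retries; flag for human attention")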

Page 5: On the Path to Petascale: Top Challenges to Scientific Discovery


A few days in the life of a simulation scientist. Day 1: morning.

8:00 AM  Get coffee; check whether my jobs are running. ssh into jaguar.ccs.ornl.gov (job 1). ssh into seaborg.nersc.gov (job 2): this one is running, yay! Run gnuplot to see whether the run on seaborg is going OK. This looks OK.

9:00 AM  Look at data from an old run for post-processing. Legacy code (IDL, Matlab) to analyze most of the data. Visualize some of the data to see whether there is anything interesting. Is my job running on jaguar? I submitted this 4K-processor job two days ago!

10:00 AM  scp some files from seaborg to my local cluster. Luckily I only have 10 files (which are only 1 GB/file).

10:30 AM  The first file appears on my local machine for analysis. Visualize the data with Matlab. Seems to be OK.

11:30 AM  See that the second file had trouble coming over. scp the files over again... D'oh!

Page 6: On the Path to Petascale: Top Challenges to Scientific Discovery


Day 1: evening.

1:00 PM  Look at the output from the second file. Oops, I had a mistake in my input parameters. ssh into seaborg, kill the job. Emacs the input, submit the job. ssh into jaguar, check the status. Cool, it's running. bbcp two files over to my local machine (8 GB/file). Gnuplot the data. This looks OK too, but I still need to see more information.

1:30 PM  The files are on my cluster. Run Matlab on the HDF5 output files. Looks good. Write down some information about the run in my notebook. Visualize some of the data. All looks good. Go to meetings.

4:00 PM  Return from meetings. ssh into jaguar. Run gnuplot. Still looks good. ssh into seaborg. My job still isn't running...

8:00 PM  Are my jobs running? ssh into jaguar. Run gnuplot. Still looks good. ssh into seaborg. Cool, my job is running. Run gnuplot. Looks good this time!

Page 7: On the Path to Petascale: Top Challenges to Scientific Discovery


And later...

4:00 AM  Yawn... is my job on jaguar done? ssh into jaguar. Cool, the job is finished. Start bbcp-ing the files over to my work machine (2 TB of data).

8:00 AM  @@!#!@. bbcp is having trouble. Resubmit some of my bbcp transfers from jaguar to my local cluster.

8:00 AM (next day)  Oops, I still need to get the remaining 200 GB of data over to my machine.

3:00 PM  My data is finally here! Run Matlab. Run EnSight. Oops... something's wrong! Where did that instability come from?

6:00 PM  Finish screaming!

Page 8: On the Path to Petascale: Top Challenges to Scientific Discovery


Need metadata integrated into the high-performance I/O and into simulation monitoring.

Typical monitoring: look at volume-averaged quantities. At 4 key times this quantity looks good, yet the code had one error that did not appear in the typical ASCII output used to generate this graph. Typically users run gnuplot/grace to monitor output.

More advanced monitoring:
• 5 seconds to move 600 MB and process the data.
• Really need to use an FFT on the 3D data, and then process the data plus the particles (a sketch follows after this list).
• 50 seconds (10 time steps) to move and process the data.
• 8 GB for 1/100 of the 30 billion particles.
• Demand low overhead, <5%!
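To illustrate the FFT point in the list above, here is a minimal sketch that reduces a 3D field snapshot to a shell-averaged power spectrum, the kind of small derived quantity a monitoring service can move and plot every few timesteps. The 64^3 random array is an illustrative stand-in, not the 600 MB production payload.

import numpy as np

def field_power_spectrum(field):
    """Reduce a 3D field to a shell-averaged power spectrum."""
    power = np.abs(np.fft.fftn(field)) ** 2
    # Integer wavenumbers along each axis, then the radial shell index.
    kx, ky, kz = np.meshgrid(*[np.fft.fftfreq(n) * n for n in field.shape],
                             indexing="ij")
    shell = np.sqrt(kx**2 + ky**2 + kz**2).astype(int)
    sums = np.bincount(shell.ravel(), weights=power.ravel())
    counts = np.bincount(shell.ravel())
    return sums / np.maximum(counts, 1)

# Illustrative 64^3 snapshot standing in for a real field dump.
snapshot = np.random.rand(64, 64, 64)
print(field_power_spectrum(snapshot)[:10])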

Page 9: On the Path to Petascale: Top Challenges to Scientific Discovery


Parallel Data Analysis.

Most applications use serial (scalar) data analysis: IDL, Matlab, NCAR Graphics.

Need techniques such as PCA (a sketch follows below).

Need help, since data analysis code is written quickly and changed often... there are no hardened versions. Maybe...
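For reference, PCA comes down to a singular value decomposition of the centered data matrix; the NumPy sketch below shows the idea, with a synthetic input standing in for real analysis output.

import numpy as np

def pca(data, n_components=3):
    """PCA via SVD: rows are samples (e.g. timesteps), columns are quantities."""
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]       # projected samples
    explained = s[:n_components] ** 2 / (len(data) - 1)   # variance per component
    return scores, vt[:n_components], explained

# Illustrative stand-in for analysis output: 1,000 samples of 50 quantities.
sample = np.random.rand(1000, 50)
scores, components, variance = pca(sample)
print(scores.shape, components.shape, variance)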

Page 10: On the Path to Petascale: Top Challenges to Scientific Discovery


New Visualization Challenges.

Finding the needle in the haystack. Feature identification/tracking!

Analysis of 5D+time phase space (with 1×10^12 particles)! A binning sketch follows at the end of this slide.

Real-time visualization of codes during execution.

Debugging Visualization.
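A hedged sketch of one common way to make a 10^12-particle phase space tractable for feature identification: bin the particles into a coarse multi-dimensional histogram and look for anomalous cells, so the visualization never has to touch every particle. The particle sample, coordinates, and bin counts below are illustrative only.

import numpy as np

# Illustrative particle sample: columns might be (x, y, z, v_parallel, v_perp).
# A production run would stream ~10^12 particles through this binning in chunks.
particles = np.random.randn(1_000_000, 5)

# Coarse 5-D histogram: small enough to move and visualize, yet it preserves
# enough structure for feature identification/tracking between timesteps.
hist, edges = np.histogramdd(particles, bins=(16, 16, 16, 32, 32))

# A crude "needle in the haystack" query: unusually dense phase-space cells.
threshold = hist.mean() + 5 * hist.std()
hot_cells = np.argwhere(hist > threshold)
print(f"{len(hot_cells)} cells above threshold")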

Page 11: On the Path to Petascale: Top Challenges to Scientific Discovery


Where is my data?

ORNL, NERSC, HPSS (NERSC,ORNL), local cluster, laptop?

We need to keep track of multiple copies.

We need to query the data: query-based visualization methods.

Don’t want to distinguish between different disks/tapes.
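A minimal sketch of the kind of replica catalog this implies: logical file names mapped to physical copies at each site, queryable without caring whether a copy sits on disk or tape. SQLite and the site/path names here are illustrative assumptions, not the actual NCCS data-management infrastructure.

import sqlite3

con = sqlite3.connect(":memory:")  # a real catalog would be a shared, persistent service
con.execute("""CREATE TABLE replicas (
    logical_name TEXT,   -- what the scientist asks for
    site         TEXT,   -- ORNL, NERSC, HPSS, local cluster, laptop, ...
    physical_uri TEXT)""")

con.executemany("INSERT INTO replicas VALUES (?, ?, ?)", [
    ("run42/restart_0080.h5",  "ORNL-HPSS",     "hpss://ornl/run42/restart_0080.h5"),
    ("run42/restart_0080.h5",  "local-cluster", "/data/run42/restart_0080.h5"),
    ("run42/analysis_0500.h5", "NERSC",         "/scratch/run42/analysis_0500.h5"),
])

# Query by logical name; the user never has to distinguish disks from tapes.
for site, uri in con.execute(
        "SELECT site, physical_uri FROM replicas WHERE logical_name = ?",
        ("run42/restart_0080.h5",)):
    print(site, uri)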