VTF Applications Performance and Scalability
Sharon Brunett
CACR/Caltech
ASCI Site Review
October 28-29, 2003
ASCI Platform Specifics
• LLNL's IBM SP3 (Frost)
  – 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node)
  – 16 GB memory/node
  – ~20 TB global parallel file system
  – SP Switch2 (Colony switch)
    • 2 GB/s node-to-node bandwidth, bi-directional
• LANL's HP/Compaq AlphaServer ES45 (QSC)
  – 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
  – 16 GB memory/node
  – ~12 TB global file system
  – Quadrics (QsNet) interconnect
    • 2 μs latency
    • 300 MB/s bandwidth (see the transfer-time sketch below)
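A first-order latency/bandwidth model makes the QsNet numbers above concrete. The sketch below is an illustrative calculation only; the T = latency + size/bandwidth model and the sample message sizes are assumptions, not measurements from the VTF:

```c
#include <stdio.h>

/* First-order model of point-to-point message time on QSC's QsNet:
 * T(n) = latency + n / bandwidth.  Illustrative sketch only; real
 * Quadrics performance depends on protocol crossover points. */
static double transfer_time(double bytes)
{
    const double latency   = 2.0e-6;   /* 2 microseconds */
    const double bandwidth = 300.0e6;  /* 300 MB/s       */
    return latency + bytes / bandwidth;
}

int main(void)
{
    double sizes[] = { 1e3, 1e5, 1e7 };  /* hypothetical 1 KB, 100 KB, 10 MB messages */
    for (int i = 0; i < 3; i++)
        printf("%8.0f bytes: %.3f ms\n", sizes[i],
               1e3 * transfer_time(sizes[i]));
    return 0;
}
```

For small messages the 2 μs latency dominates; for multi-megabyte halo exchanges the 300 MB/s bandwidth term does.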
Multiscale Polycrystal Studies
• Quantitative assessment of microstructural effects in macroscopic material response through the computation of full-field solutions of polycrystals
• Inhomogeneous plastic deformation fields
• Grain-boundary effects:
  – Stress concentration
  – Dislocation pile-up
  – Constraint-induced multislip
• Size dependence: (inverse) Hall-Petch effect
• Resolve (as opposed to model) mesoscale behavior, exploiting the power of high-performance computing
• Enable full-scale simulation of engineering systems incorporating micromechanical effects
Mesh Generation
• In-grain subdivision behavior can be simulated in both single crystals and polycrystals
  – texture simulation results agree well with experimental results
• Mesh generation method keeps the topology of individual grain shapes
  – enables effective interactions between grains
• Increasing the grain count in polycrystals gives a more stable mechanical response
[Figure: a single grain corresponding to a single cell in a crystal]
1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation
Simulation carried out on 1024 processors of LLNL's IBM SP3, Frost
Multiscale Polycrystal Performance
• Aggregate parallel performance
  – LANL's QSC
    • Floating point operations: 10.67% of peak
    • Integer operations: 15.39% of peak
    • Memory operations: 22.08% of peak
    • DCPI hardware counters used to collect data; Qopcounter tool used to analyze the DCPI database
  – LLNL's Frost
    • L1 cache hit rate: 98% (load/store instructions executed without main memory access)
    • Load/Store Unit idle: 36%
    • Floating point operations: 4.47% of peak (peak denominators sketched below)
    • hpmcount tool used to count hardware events during program execution
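For context on the "% of peak" figures, the denominator is clock rate times flops issued per cycle. The sketch below reproduces that arithmetic under the assumption of 2 fused multiply-adds (4 flops) per cycle on the Power3 and 2 flops per cycle on the Alpha; these issue widths are not stated on the slide:

```c
#include <stdio.h>

int main(void)
{
    /* Theoretical per-CPU peak = clock rate * flops per cycle.
     * Assumption: Power3 sustains 2 FMAs/cycle (4 flops); the Alpha
     * issues one FP add and one FP multiply per cycle (2 flops). */
    double frost_peak = 375e6 * 4;   /* 1.5 GFlop/s per Power3 CPU */
    double qsc_peak   = 1.25e9 * 2;  /* 2.5 GFlop/s per Alpha CPU  */

    printf("Frost: %.2f GFlop/s/CPU; 4.47%% of peak = %.0f MFlop/s\n",
           frost_peak / 1e9, 0.0447 * frost_peak / 1e6);
    printf("QSC:   %.2f GFlop/s/CPU; 10.67%% of peak = %.0f MFlop/s\n",
           qsc_peak / 1e9, 0.1067 * qsc_peak / 1e6);
    return 0;
}
```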
Multiscale Polycrystal Performance II
• MPI routines can consume ~30% of runtime for large runs on Frost
  – workload imbalance as grains are distributed across nodes
  – MPI_Waitall every step dominates communication time: nearest-neighbor sends take longer from nodes with computationally heavy grains (see the exchange sketch below)
• Routines taking the most CPU time on QSC:
  – resolved_fcc_cuitino 18.85%
  – upslip_fcc_cuitino_explicit 11.74%
  – setafcc 9.16%
  – matvec 8.5%
  – ~50% of execution time in these 4 routines
• Room for performance improvement with better load balancing and routine-level optimization
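A minimal sketch of the exchange pattern behind the MPI_Waitall cost above, assuming a generic nonblocking nearest-neighbor exchange (buffer names and counts are hypothetical; this is not the VTF source):

```c
#include <mpi.h>
#include <stdlib.h>

/* Generic nonblocking halo exchange with nnbr neighbors.  All waiting
 * is concentrated in MPI_Waitall, so compute imbalance on any
 * neighboring rank shows up there as communication time. */
void halo_exchange(int nnbr, const int *nbr_rank,
                   double **sendbuf, double **recvbuf, const int *count)
{
    MPI_Request *reqs = malloc(2 * nnbr * sizeof(MPI_Request));

    for (int i = 0; i < nnbr; i++)
        MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, nbr_rank[i],
                  0, MPI_COMM_WORLD, &reqs[i]);
    for (int i = 0; i < nnbr; i++)
        MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, nbr_rank[i],
                  0, MPI_COMM_WORLD, &reqs[nnbr + i]);

    /* Blocks until every neighbor's messages complete; a neighbor
     * busy with heavy grains posts its send late, and we wait here. */
    MPI_Waitall(2 * nnbr, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```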
Multiscale Polycrystal Scaling on LLNL’s IBM SP3, Frost
[Plot: seconds per iteration vs. processors (log2 scale, 16 to 1024) for meshes of 196,608; 384,000; 663,552; 1,053,696; 1,572,864; 3,072,000; and 8,429,568 elements. Parallel efficiency computation sketched below.]
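Curves like the ones above are often reduced to a strong-scaling efficiency number; the helper below shows the standard computation with hypothetical timings, since the plotted values are not tabulated here:

```c
#include <stdio.h>

/* Strong-scaling efficiency relative to a base run:
 * eff = (t_base * p_base) / (t_p * p). */
double efficiency(double t_base, int p_base, double t_p, int p)
{
    return (t_base * p_base) / (t_p * (double)p);
}

int main(void)
{
    /* Hypothetical placeholders, not values read off the plot:
     * 2.0 s/iter on 64 CPUs vs 0.3 s/iter on 512 CPUs. */
    printf("efficiency = %.2f\n", efficiency(2.0, 64, 0.3, 512));
    return 0;
}
```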
Multiscale Polycrystal Scaling on LANL’s HP/Compaq, QSC
[Plot: seconds per iteration vs. processors (log2 scale, 16 to 1024) for meshes of 196,608 and 1,572,864 elements]
Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration
[Plot: seconds per iteration vs. processors (log2 scale, 64 to 512) on LANL's HP/Compaq QSC system, for meshes of 150,000; 1,200,000; and 9,600,000 elements]
3D Converging Shock Simulations in a Wedge
• 1024-processor run on ASCI Frost of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases.
  – The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock
• Resolution: 2000x400x400, with over 1 TB of data generated
[Figure panels: Density, Pressure]
Density Field in a 3D Wedge
Density field in the wedge. The transmitted shock front appears to be stable, while the gas interface is Richtmyer-Meshkov unstable. The simulation ran on 1024 processors of LLNL's IBM SP3, Frost, with a 2000x400x400 initial grid.
Wedge3D Performance on LLNL’s IBM SP3, Frost
• Aggregate parallel performance for a 1400x280x280 grid on LLNL's Frost
  – Floating point operations: 5.8 to 10% of peak, depending on node
    • hpmcount tool used to count hardware events during program execution
  – Most time-consuming communication calls: MPI_Wait() and MPI_Allreduce()
    • accounting for 3 to 30% of runtime on a 128-way run (175x70x70 grid per processor); see the MPI_Allreduce sketch below
  – Occasional high MPI time on a few nodes seems to be caused by system daemons competing for resources
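One common source of MPI_Allreduce time in explicit shock codes is the global stable-timestep reduction; the sketch below illustrates that pattern generically and is an assumption, not code from Wedge3D:

```c
#include <mpi.h>

/* Generic stable-timestep reduction: every rank computes its local
 * CFL-limited dt, then all ranks agree on the global minimum.  Ranks
 * delayed by system daemons arrive late, and everyone else waits. */
double global_timestep(double local_dt)
{
    double dt;
    MPI_Allreduce(&local_dt, &dt, 1, MPI_DOUBLE, MPI_MIN,
                  MPI_COMM_WORLD);
    return dt;
}
```

Because the reduction runs every step, even a few milliseconds of per-node jitter accumulates into the 3 to 30% runtime share noted above.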
Wedge3D Scaling on LLNL's IBM SP3, Frost
[Plot: seconds per iteration vs. CPUs (log2 scale, 128 to 1024) for grid sizes (XxYxZ) 1600x320x320, 1800x360x360, and 2000x400x400]
Fragmentation 2D Scaling on LANL’s HP/Compaq, QSC
[Plot: seconds per iteration vs. processors (log2 scale, 32 to 128) for subdivision levels 6 through 9; element counts per curve range from 61K up to 1.1M, growing with level of subdivision]
Crack Patterns in the Configuration Occurring During Scalability Studies on QSC
Fragmentation 2D Performance on LANL’s HP/Compaq, QSC
• Procedures with highest CPU cycle consumption:
  – element_driver 14.9%
  – assemble 13.9%
  – NewNeohookean 8.12%
  • 16-processor run with 2 levels of subdivision (60K elements)
  • dcpiprof tool used to profile the run
• Problems processing DCPI-database FLOP rates for large runs
  – reported to LANL support
  – small runs yield ~3% of FLOP peak
  – only ~10% of time spent in fragmentation routines!
• Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch[1,2]); see the collective-I/O sketch below
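A standard route to better dump performance on a parallel file system is collective MPI-IO, where ranks write disjoint slabs of one shared file in a single collective call. The sketch below is a generic illustration; the file name, data layout, and function are hypothetical, not the VTF's actual output path:

```c
#include <mpi.h>

/* Hypothetical collective dump: each rank writes its contiguous slab
 * of n doubles at a rank-computed offset in one shared file, letting
 * the MPI-IO layer aggregate requests to the parallel file system. */
void dump_field(const double *data, int n, MPI_Comm comm)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(comm, &rank);

    MPI_File_open(comm, "/scratch1/frag2d.dump",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```

Compared with one file per rank (or funneling everything through rank 0), the collective write gives the I/O layer a chance to coalesce requests into large, aligned stripes on /scratch[1,2].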