VTF Applications Performance and Scalability
Sharon Brunett
CACR/Caltech
ASCI Site Review
October 28-29, 2003
ASCI Platform Specifics
• LLNL's IBM SP3 (Frost)
  – 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node)
  – 16 GB memory/node
  – ~20 TB global parallel file system
  – SP Switch2 (Colony switch)
    • 2 GB/s node-to-node bandwidth, bi-directional
• LANL's HP/Compaq AlphaServer ES45 (QSC)
  – 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
  – 16 GB memory/node
  – ~12 TB global file system
  – Quadrics (QsNet) interconnect
    • 2 μs latency
    • 300 MB/s bandwidth (see the transfer-time sketch below)
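A first-order latency/bandwidth model makes the QsNet numbers above concrete. The sketch below is an illustrative calculation only; the T = latency + size/bandwidth model and the sample message sizes are assumptions, not measurements from the VTF:

```c
#include <stdio.h>

/* First-order model of point-to-point message time on QSC's QsNet:
 * T(n) = latency + n / bandwidth.  Illustrative sketch only; real
 * Quadrics performance depends on protocol crossover points. */
static double transfer_time(double bytes)
{
    const double latency   = 2.0e-6;   /* 2 microseconds */
    const double bandwidth = 300.0e6;  /* 300 MB/s       */
    return latency + bytes / bandwidth;
}

int main(void)
{
    double sizes[] = { 1e3, 1e5, 1e7 };  /* hypothetical 1 KB, 100 KB, 10 MB messages */
    for (int i = 0; i < 3; i++)
        printf("%8.0f bytes: %.3f ms\n", sizes[i],
               1e3 * transfer_time(sizes[i]));
    return 0;
}
```

For small messages the 2 μs latency dominates; for multi-megabyte halo exchanges the 300 MB/s bandwidth term does.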
Multiscale Polycrystal Studies
• Quantitative assessment of microstructural effects in macroscopic material response through the computation of full-field solutions of polycrystals
• Inhomogeneous plastic deformation fields
• Grain-boundary effects:
  – Stress concentration
  – Dislocation pile-up
  – Constraint-induced multislip
• Size dependence: (inverse) Hall-Petch effect
• Resolve (as opposed to model) mesoscale behavior, exploiting the power of high-performance computing
• Enable full-scale simulation of engineering systems incorporating micromechanical effects
Mesh Generation
• In-grain subdivision behavior can be simulated in both single crystals and polycrystals
  – texture simulation results agree well with experimental results
• Mesh generation method keeps the topology of individual grain shapes
  – enables effective interactions between grains
• Increasing the grain count in polycrystals gives a more stable mechanical response
[Figure: a single grain corresponding to a single cell in a crystal]
1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation
Simulation carried out on 1024 processors of LLNL's IBM SP3, Frost
Multiscale Polycrystal Performance
• Aggregate parallel performance
  – LANL's QSC
    • Floating point operations: 10.67% of peak
    • Integer operations: 15.39% of peak
    • Memory operations: 22.08% of peak
    • DCPI hardware counters used to collect data; Qopcounter tool used to analyze the DCPI database
  – LLNL's Frost
    • L1 cache hit rate: 98% (load/store instructions executed without main memory access)
    • Load/Store Unit idle: 36%
    • Floating point operations: 4.47% of peak (peak denominators sketched below)
    • hpmcount tool used to count hardware events during program execution
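For context on the "% of peak" figures, the denominator is clock rate times flops issued per cycle. The sketch below reproduces that arithmetic under the assumption of 2 fused multiply-adds (4 flops) per cycle on the Power3 and 2 flops per cycle on the Alpha; these issue widths are not stated on the slide:

```c
#include <stdio.h>

int main(void)
{
    /* Theoretical per-CPU peak = clock rate * flops per cycle.
     * Assumption: Power3 sustains 2 FMAs/cycle (4 flops); the Alpha
     * issues one FP add and one FP multiply per cycle (2 flops). */
    double frost_peak = 375e6 * 4;   /* 1.5 GFlop/s per Power3 CPU */
    double qsc_peak   = 1.25e9 * 2;  /* 2.5 GFlop/s per Alpha CPU  */

    printf("Frost: %.2f GFlop/s/CPU; 4.47%% of peak = %.0f MFlop/s\n",
           frost_peak / 1e9, 0.0447 * frost_peak / 1e6);
    printf("QSC:   %.2f GFlop/s/CPU; 10.67%% of peak = %.0f MFlop/s\n",
           qsc_peak / 1e9, 0.1067 * qsc_peak / 1e6);
    return 0;
}
```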
Multiscale Polycrystal Performance II
• MPI routines can consume ~30% of runtime for large runs on Frost
  – workload imbalance as grains are distributed across nodes
  – MPI_Waitall every step dominates communication time: nearest-neighbor sends take longer from nodes with computationally heavy grains (see the exchange sketch below)
• Routines taking the most CPU time on QSC:
  – resolved_fcc_cuitino 18.85%
  – upslip_fcc_cuitino_explicit 11.74%
  – setafcc 9.16%
  – matvec 8.5%
  – ~50% of execution time in these 4 routines
• Room for performance improvement with better load balancing and routine-level optimization
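A minimal sketch of the exchange pattern behind the MPI_Waitall cost above, assuming a generic nonblocking nearest-neighbor exchange (buffer names and counts are hypothetical; this is not the VTF source):

```c
#include <mpi.h>
#include <stdlib.h>

/* Generic nonblocking halo exchange with nnbr neighbors.  All waiting
 * is concentrated in MPI_Waitall, so compute imbalance on any
 * neighboring rank shows up there as communication time. */
void halo_exchange(int nnbr, const int *nbr_rank,
                   double **sendbuf, double **recvbuf, const int *count)
{
    MPI_Request *reqs = malloc(2 * nnbr * sizeof(MPI_Request));

    for (int i = 0; i < nnbr; i++)
        MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, nbr_rank[i],
                  0, MPI_COMM_WORLD, &reqs[i]);
    for (int i = 0; i < nnbr; i++)
        MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, nbr_rank[i],
                  0, MPI_COMM_WORLD, &reqs[nnbr + i]);

    /* Blocks until every neighbor's messages complete; a neighbor
     * busy with heavy grains posts its send late, and we wait here. */
    MPI_Waitall(2 * nnbr, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```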
Multiscale Polycrystal Scaling on LLNL’s IBM SP3, Frost
[Plot: seconds per iteration vs. processors (log2 scale, 16 to 1024) for meshes of 196,608; 384,000; 663,552; 1,053,696; 1,572,864; 3,072,000; and 8,429,568 elements. Parallel efficiency computation sketched below.]
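Curves like the ones above are often reduced to a strong-scaling efficiency number; the helper below shows the standard computation with hypothetical timings, since the plotted values are not tabulated here:

```c
#include <stdio.h>

/* Strong-scaling efficiency relative to a base run:
 * eff = (t_base * p_base) / (t_p * p). */
double efficiency(double t_base, int p_base, double t_p, int p)
{
    return (t_base * p_base) / (t_p * (double)p);
}

int main(void)
{
    /* Hypothetical placeholders, not values read off the plot:
     * 2.0 s/iter on 64 CPUs vs 0.3 s/iter on 512 CPUs. */
    printf("efficiency = %.2f\n", efficiency(2.0, 64, 0.3, 512));
    return 0;
}
```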
Multiscale Polycrystal Scaling on LANL’s HP/Compaq, QSC
[Plot: seconds per iteration vs. processors (log2 scale, 16 to 1024) for meshes of 196,608 and 1,572,864 elements]
Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration
[Plot: seconds per iteration vs. processors (log2 scale, 64 to 512) on LANL's HP/Compaq QSC system, for meshes of 150,000; 1,200,000; and 9,600,000 elements]
3D Converging Shock Simulations in a Wedge
• 1024-processor run on ASCI Frost of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases.
  – The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock
• Resolution: 2000x400x400, with over 1 TB of data generated
[Figure panels: Density, Pressure]
Density Field in a 3D Wedge
Density field in the wedge. The transmitted shock front appears to be stable, while the gas interface is Richtmyer-Meshkov unstable. The simulation ran on 1024 processors of LLNL's IBM SP3, Frost, with a 2000x400x400 initial grid.
Wedge3D Performance on LLNL’s IBM SP3, Frost
• Aggregate parallel performance for a 1400x280x280 grid on LLNL's Frost
  – Floating point operations: 5.8 to 10% of peak, depending on node
    • hpmcount tool used to count hardware events during program execution
  – Most time-consuming communication calls: MPI_Wait() and MPI_Allreduce()
    • accounting for 3 to 30% of runtime on a 128-way run (175x70x70 grid per processor); see the MPI_Allreduce sketch below
  – Occasional high MPI time on a few nodes seems to be caused by system daemons competing for resources
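One common source of MPI_Allreduce time in explicit shock codes is the global stable-timestep reduction; the sketch below illustrates that pattern generically and is an assumption, not code from Wedge3D:

```c
#include <mpi.h>

/* Generic stable-timestep reduction: every rank computes its local
 * CFL-limited dt, then all ranks agree on the global minimum.  Ranks
 * delayed by system daemons arrive late, and everyone else waits. */
double global_timestep(double local_dt)
{
    double dt;
    MPI_Allreduce(&local_dt, &dt, 1, MPI_DOUBLE, MPI_MIN,
                  MPI_COMM_WORLD);
    return dt;
}
```

Because the reduction runs every step, even a few milliseconds of per-node jitter accumulates into the 3 to 30% runtime share noted above.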
Wedge3D Scaling on LLNL's IBM SP3, Frost
[Plot: seconds per iteration vs. CPUs (log2 scale, 128 to 1024) for grid sizes (XxYxZ) 1600x320x320, 1800x360x360, and 2000x400x400]
Fragmentation 2D Scaling on LANL’s HP/Compaq, QSC
[Plot: seconds per iteration vs. processors (log2 scale, 32 to 128) for subdivision levels 6 through 9; element counts per curve range from 61K up to 1.1M, growing with level of subdivision]
Crack Patterns in the Configuration Occurring During Scalability Studies on QSC
Fragmentation 2D Performance on LANL’s HP/Compaq, QSC
• Procedures with highest CPU cycle consumption:
  – element_driver 14.9%
  – assemble 13.9%
  – NewNeohookean 8.12%
  • 16-processor run with 2 levels of subdivision (60K elements)
  • dcpiprof tool used to profile the run
• Problems processing DCPI-database FLOP rates for large runs
  – reported to LANL support
  – small runs yield ~3% of FLOP peak
  – only ~10% of time spent in fragmentation routines!
• Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch[1,2]); see the collective-I/O sketch below
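A standard route to better dump performance on a parallel file system is collective MPI-IO, where ranks write disjoint slabs of one shared file in a single collective call. The sketch below is a generic illustration; the file name, data layout, and function are hypothetical, not the VTF's actual output path:

```c
#include <mpi.h>

/* Hypothetical collective dump: each rank writes its contiguous slab
 * of n doubles at a rank-computed offset in one shared file, letting
 * the MPI-IO layer aggregate requests to the parallel file system. */
void dump_field(const double *data, int n, MPI_Comm comm)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(comm, &rank);

    MPI_File_open(comm, "/scratch1/frag2d.dump",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```

Compared with one file per rank (or funneling everything through rank 0), the collective write gives the I/O layer a chance to coalesce requests into large, aligned stripes on /scratch[1,2].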