
Page 1:

Computational Research Division

BIPS

Scientific Application Performance on Candidate PetaScale Applications

Lenny Oliker

http://crd.lbl.gov/~oliker

Future Technologies Group, Lawrence Berkeley National Laboratory

Winner Best Paper, IPDPS 07, Long Beach, CA, March 26-30, 2007

Page 2:

Overview

Stagnating application performance is a well-known problem in scientific computing

By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels

Many HEC platforms are poorly balanced for the demands of leading applications: memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology

Traditional superscalar trends are slowing down: most of the benefits of ILP and pipelining have been mined, and clock frequency is limited by power concerns

Page 3:

Traditional Sources of Performance Improvement are Flat-Lining: New Constraints

15 years of exponential clock rate growth has ended

But Moore's Law continues: how do we use all of those transistors to keep performance increasing at historical rates?

Industry response: the number of cores per chip doubles every 18 months instead of the clock frequency!

Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith

Page 4:

Hardware: What are the problems?

Current Hardware/Lithography Constraints:

Power limits leading-edge chip designs

• Intel Tejas Pentium 4 cancelled due to power issues

Yield on leading-edge processes is dropping dramatically

• IBM quotes yields of 10-20% on the 8-processor Cell

Design/validation of a leading-edge chip is becoming unmanageable

• Verification teams > design teams on leading-edge processors

Solution: Small Is Beautiful

Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector, and SIMD PEs

• Small cores are not much slower than large cores

Parallelism is the energy-efficient path to performance

• Lower threshold and supply voltages lower the energy per op

Redundant processors can improve chip yield

• Cisco Metro: 188 CPUs + 4 spares; Sun Niagara sells parts with 6 or 8 CPUs

Small, regular processing elements are easier to verify

Page 5:

How Small is "Small"?

Core (market segment)                Die area          Power @ clock
Power5 (server)                      389 mm^2          120 W @ 1900 MHz
Intel Core2 sc (laptop)              130 mm^2          15 W @ 1000 MHz
ARM Cortex A8 (automobiles)          5 mm^2            0.8 W @ 800 MHz
Tensilica DP (cell phones/printers)  0.8 mm^2          0.09 W @ 600 MHz
Tensilica Xtensa (Cisco router)      0.32 mm^2 for 3!  0.05 W @ 600 MHz

[Figure: relative die areas of Power5, Intel Core2, ARM, Tensilica DP, and three Xtensa cores]

Each small core operates at 1/3 to 1/10th the efficiency of the largest chip, but you can pack 100x more cores onto a chip and consume 1/20th the power

Page 6:

The Future of HPC System Concurrency

[Chart: total number of processors across the Top15 systems, June 1993 through June 2006, y-axis 0 to 350,000]

Must ride exponential wave of increasing concurrency for foreseeable future

Page 7:

Manycore Concurrency Levels

[Chart: projected number of processors (log scale, 1 to 1,000,000,000), June 1993 through June 2020]

Page 8:

Application Evaluation

Microbenchmarks, algorithmic kernels, and performance modeling and prediction are important components of understanding and improving architectural efficiency

However, full-scale application performance is the final arbiter of system utility, and is necessary as a baseline to support all complementary approaches

Our evaluation work emphasizes full applications, with real input data, at the appropriate scale

Requires coordination of computer scientists and application experts from highly diverse backgrounds

Our initial efforts have focused on comparing performance between high-end vector and scalar platforms

Currently evaluating ultra-scale systems (soon to be petascale):
NERSC 0.1 PF/s (XT4) in FY07, 1 PF/s in FY10
ORNL 1 PF/s in FY08/09 (Baker)
SNL 1 PF/s in FY08 (XT4)
ANL 0.5 PF/s in FY08/09 (BG/P)
LANL 1 PF/s in FY08 (RoadRunner)
A hypothetical system sustaining 1 PF/s (2011) will contain >1M processors!

It is important to understand which algorithmic aspects allow or prevent petascale computation

Page 9:

Benefits of Evaluation

Full-scale application evaluation leads to more efficient use of community resources, for both current installations and future designs

Head-to-head comparisons on full applications:

Help identify the suitability of a particular architecture for a given application class

Give application scientists information about how well various numerical methods perform across systems

Reveal performance-limiting system bottlenecks that can aid designers of next-generation systems (science-driven architecture)

In-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale to achieve high performance.

Page 10:

Application Overview

NAME       Discipline           Problem/Method                    Structure
CACTUS     Astrophysics         Theory of GR, ADM-BSSN            Grid
ELB3D      Fluid Dynamics       Lattice Boltzmann, Navier-Stokes  Lattice/Grid
GTC        Magnetic Fusion      Particle-in-Cell, Vlasov-Poisson  Particle/Grid
BB3D       High Energy Physics  Particle-in-Cell, Vlasov-Poisson  Particle/Grid
PARATEC    Materials Science    Density Functional Theory, FFT    Fourier/Grid
HyperCLaw  Gas Dynamics         Hyperbolic, High-order Godunov    Grid AMR

Examining a set of applications with a variety of numerical methods and communication patterns, each with the potential to run at petascale

Page 11:

Architectural Comparison

Name      Node Type  Where  Network     Topology  Total Procs  CPU/Node  Clock GHz  Peak GFlop/s  Stream BW GB/s/P  Stream byte/flop  MPI BW GB/s/P  MPI Latency usec
Bassi     Power5     NERSC  Federation  Fat-tree       888        8         1.9        7.6             6.8              0.85             0.69            4.7
Jaguar    Opteron    ORNL   XT3         3D-Torus    10,404        2         2.6        5.2             2.5              0.48             1.2             5.5
Jacquard  Opteron    NERSC  InfiniBand  Fat-tree       640        2         2.2        4.4             2.3              0.51             0.7             5.2
BG/L      PPC440     ANL    Custom      3D-Torus     2,048        2         0.7        2.8             0.9              0.31             0.16            2.2
BGW       PPC440     TJW    Custom      3D-Torus    40,960        2         0.7        2.8             0.9              0.31             0.16            2.2
Phoenix   X1E        ORNL   Custom      4D-Hcube       768        8         1.1       18.0             9.7              0.54             2.9             5.0

These architectures represent mainstream petaflop candidate systems

Power5 aggregates 4 DDR233 channels for a high 6.8 GB/s Stream bandwidth

Jaguar: XT3, dual-core with SeaStar routing (Catamount)

Jacquard: single-core Opteron with non-custom communication integration

BG/L: power efficient, 5 networks, dual-core without L1 coherence, 512 MB/node, 2 SIMD FPUs

Phoenix: X1E custom HPC platform, an X1 upgrade (double the MSPs without a BW increase)

Yes, cost is a critical metric - however we are unable to provide such data: pricing is proprietary and varies based on customer and time frame

Poorly balanced systems cannot solve important problems/resolutions regardless of cost!

Page 12:

Astrophysics: CACTUS

Numerical solution of Einstein’s equations from theory of general relativity

Among most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms

CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes

Evolves PDEs on a regular grid using finite differences (see the stencil sketch below)

Visualization of grazing collision of two black holes

Developed at Max Planck Institute, vectorized/ported by John Shalf & Tom Goodale
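To make the finite-difference evolution pattern concrete, here is a minimal sketch of an explicit stencil update on a regular 3D grid. It is not CACTUS code: it evolves a simple scalar wave equation with periodic boundaries rather than the coupled ADM-BSSN system, but the memory-access pattern (read a neighborhood, write a point, sweep the grid every time step) is the same.

```python
import numpy as np

def evolve_wave(u_prev, u_curr, c=1.0, dx=1.0, dt=0.5):
    """One explicit finite-difference step of the 3D scalar wave equation.
    Illustrative stand-in for the stencil updates CACTUS applies to its
    evolved variables; periodic boundaries via np.roll, for simplicity."""
    lap = (
        np.roll(u_curr, 1, 0) + np.roll(u_curr, -1, 0) +
        np.roll(u_curr, 1, 1) + np.roll(u_curr, -1, 1) +
        np.roll(u_curr, 1, 2) + np.roll(u_curr, -1, 2) -
        6.0 * u_curr
    ) / dx**2
    return 2.0 * u_curr - u_prev + (c * dt) ** 2 * lap

# usage: march a 64^3 grid forward a few steps
u0 = np.zeros((64, 64, 64)); u0[32, 32, 32] = 1.0
u1 = u0.copy()
for _ in range(10):
    u0, u1 = u1, evolve_wave(u0, u1)
```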

Page 13:

Cactus Performance: Weak Scaling

Bassi shows the highest raw and sustained performance; it remains to be seen if scalability will continue to thousands of processors

Jacquard/Jaguar show more modest scaling; Phoenix X1 shows the lowest % of peak (Cactus crashes on the X1E)

A small fraction of unvectorizable code (boundary condition) leads to a large penalty: only 1 of 4 SSPs within an MSP is used for the unvectorizable code portion

BG/L has the lowest raw performance - expected from the simple (power efficient) dual-issue in-order PPC440

Achieves near perfect scalability to the highest concurrency ever attained

Virtual node mode (32K procs) showed no performance degradation (smaller problem)

Topology mapping attempts did not improve performance

Overall, results show the potential of Cactus to run at petascale

[Charts: Gflop/s per processor and percent of peak vs. processor count (16 to 16,384) for Bassi, Jacquard, Jaguar, BG/L, and Phoenix]

Page 14:

Fluid Dynamics: ELB3D

LBM concept: develop simplified kinetic model with incorporated physics, to reproduce correct macroscopic averaged properties

Unlike explicit LBM methods, entropic approach not prone to nonlinear instabilities

ELBM was developed to simulate Navier-Stokes turbulence

A nonlinear equation is solved at each grid point and time step to satisfy constraints, followed by streaming of data

The equation is solved via a Newton-Raphson iteration with heavy use of the log function (a sketch of this kind of solve appears below)

Spatial grid is overlaid with a phase-space velocity lattice

Block distributed over the processor grid

Developed by George Vahala's group at the College of William & Mary, ported by Jonathan Carter

Evolution of vorticity into turbulent structures

[Figure: phase-space velocity lattice with numbered discrete velocity directions]
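For illustration, here is a minimal, hypothetical sketch of a per-gridpoint Newton-Raphson solve dominated by log evaluations. The target function x*log(x) - target is a stand-in chosen for brevity, not the actual entropic constraint equation used by ELB3D.

```python
import math

def newton_log_root(target, x0=1.0, tol=1e-12, max_iter=50):
    """Solve f(x) = x*log(x) - target = 0 by Newton-Raphson.
    Stand-in for the per-gridpoint entropic constraint solve, which is
    similarly dominated by log evaluations."""
    x = x0
    for _ in range(max_iter):
        f = x * math.log(x) - target
        df = math.log(x) + 1.0          # f'(x)
        step = f / df
        x -= step
        if abs(step) < tol:
            break
    return x

print(newton_log_root(2.0))  # x such that x*log(x) = 2
```

Because a solve like this runs at every grid point and time step, replacing the scalar log with vectorized math libraries (MASSV, ACML) pays off, as noted on the next slide.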

Page 15:

ELB3D Performance: Strong Scaling

[Charts: Gflop/s per processor and percent of peak vs. processor count (64 to 1,024) for Bassi, Jacquard, Jaguar, BG/L, and Phoenix]

Specialized log functions (MASSV, ACML) used on all platforms with significant improvement

Shows a high % of peak and good scaling across all platforms - good load balance

Results show that the high computational cost of the entropic algorithm can be handled in an efficient manner

X1E attains the highest raw performance: the innermost gridpoint loop is taken inside the nonlinear equation to allow vectorization

Bassi shows the highest fraction of peak: high memory BW, large caches, advanced prefetch

BG/L shows the lowest efficiency, but the value is measured against a peak that counts the double hummer: except for highly-tuned libraries, achievable performance will almost always be limited to 50% of that nominal peak

Overall, results show promising performance of ELB methods at petascale

Page 16:

LBMHD Cell Performance

[Chart: LBMHD GFLOP/s by processor. Cell: 16 SPEs (3.2 GHz) 17.27, 8 SPEs (3.2 GHz) 8.69, 1 PPE (3.2 GHz) 0.07. Vector processors: SX8 (2.0 GHz) 9.66, X1E (1.13 GHz) 5.65, Earth Simulator (1 GHz) 5.45. Scalar processors (Power5 1.9 GHz, Opteron 2.2 GHz, Itanium2 1.4 GHz, BlueGene/L 0.7 GHz) all below 1 GFLOP/s.]

Multi-core scientific kernel optimization work by Samuel Williams (UCB/LBNL)

Cell achieves impressive performance for the MHD problem (explicit LB method)

Ability to explicitly control local memory (more programming complexity)

Results shown are for the collision phase only (>>85% of overall time)

Multi-blade (MPI) Cell results to come

Page 17:

Magnetic Fusion: GTC

Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)

The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy

GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach

PIC scales as N instead of N^2 - particles interact with the electromagnetic field on a grid

Allows solving the equations of particle motion with ODEs (instead of nonlinear PDEs)

Vectorization is inhibited since multiple particles may attempt to concurrently update the same grid point (see the deposition sketch below)

Whole volume and cross section of electrostatic potential field, showing elongated turbulence eddies

Developed at PPPL, vectorized/optimized by Stephane Ethier
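The vectorization difficulty comes from the scatter step of PIC: many particles can map to the same grid cell. A minimal 1D sketch of charge deposition with nearest-grid-point weighting (not GTC's gyroaveraged 3D deposition) illustrates the read-modify-write conflict:

```python
import numpy as np

def deposit_charge(positions, charges, n_cells, dx=1.0):
    """Scatter particle charge onto a 1D periodic grid (nearest-grid-point
    weighting). Several particles may land in the same cell, which is exactly
    the concurrent-update conflict that inhibits naive vectorization in PIC."""
    density = np.zeros(n_cells)
    cells = (positions / dx).astype(int) % n_cells
    np.add.at(density, cells, charges)   # conflict-safe accumulation
    return density

# usage: 1,000 unit charges on a 64-cell grid
rng = np.random.default_rng(0)
pos = rng.uniform(0, 64, size=1000)
rho = deposit_charge(pos, np.ones(1000), n_cells=64)
```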

Page 18:

GTC Performance: Weak Scaling

[Charts: Gflop/s per processor and percent of peak vs. processor count (64 to 32K) for Bassi, Jacquard, Jaguar, BG/L, and Phoenix]

Generally expect a low % of peak due to scatter/gather

Extensive X1E optimization (such as reversing array dimensions) paid off, although performance drops at higher concurrency

The three commodity superscalar systems (Power5, Opteron) achieve similar raw performance; however, Bassi attains only 1/2 the % of peak of the Opterons

High Opteron performance is partially due to low main-memory latency

Bassi, Jaguar, and BG/L show excellent scalability

BG/L scales to 32K processors: virtual node mode is only 5% slower than coprocessor mode

Optimizations: vector MASSV functions (60%) and mapping onto the 3D torus (30%)

Results show GTC is a prime candidate for petascale systems: perfect load balancing up to 32K processors and low communication

Page 19:

High Energy Physics: BeamBeam3D

BB3D models collisions of counter-rotating charged particle beams

Used to study beam-beam collisions at the world's highest energy accelerators

Particle-in-cell method, where particles are deposited on a 3D grid to calculate the charge density distribution

At collision points, electric/magnetic fields are calculated from Vlasov-Poisson via FFT (a minimal sketch of an FFT-based field solve follows below)

Particles are advanced using the computed fields and accelerator forces

High communication requirements: global gather of charge density, broadcast of electric/magnetic fields, global FFT transpose

Developed by Ji Qiang, ported/optimized by Hongzhang Shan

GTC and BB3D are both PIC codes with charged particle interactions, but:

GTC requires relatively small particle movements, allowing a local solve of the Poisson equation

BB3D has longer-range particle forces, requiring high communication and global FFTs

Limited number of subdomains available (2,048 in our case); pure domain decomposition is not possible
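To show where the global FFT transposes come from, here is a minimal serial 2D sketch of an FFT-based Poisson solve, assuming periodic boundaries for simplicity. BB3D's actual field solve is 3D and parallel, which is what turns these FFTs into all-to-all transposes.

```python
import numpy as np

def poisson_fft_2d(rho, dx=1.0):
    """Solve laplacian(phi) = -rho on a periodic 2D grid via FFT.
    Serial stand-in for the parallel 3D Vlasov-Poisson field solve in BB3D,
    where the FFTs require global transposes (all-to-all communication)."""
    ny, nx = rho.shape
    kx = 2 * np.pi * np.fft.fftfreq(nx, d=dx)
    ky = 2 * np.pi * np.fft.fftfreq(ny, d=dx)
    KX, KY = np.meshgrid(kx, ky)
    k2 = KX**2 + KY**2
    k2[0, 0] = 1.0                      # avoid division by zero for the mean mode
    phi_hat = np.fft.fft2(rho) / k2     # -k^2 phi_hat = -rho_hat
    phi_hat[0, 0] = 0.0                 # fix the arbitrary constant (zero-mean potential)
    return np.real(np.fft.ifft2(phi_hat))

# usage: point-like charge density on a 128x128 grid
rho = np.zeros((128, 128)); rho[64, 64] = 1.0
phi = poisson_fft_2d(rho)
```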

Page 20:

BeamBeam3D Performance: Strong Scaling

[Charts: Gflop/s per processor and percent of peak vs. processor count (64 to 2,048) for Bassi, Jacquard, Jaguar, BG/L, and Phoenix]

Phoenix attains the fastest per-processor performance; however, scalability is poor at high P

At P=256, communication exceeds 50% of the runtime; at high P, vector length decreases (fixed-size problem)

At P=512, Bassi surpasses Phoenix - outperforming the Opterons by 1.8X and BG/L by 4.5X

Jacquard and Jaguar show similar behavior, even though they have vastly different interconnects

Notice that all platforms achieve < 5% of peak at high concurrencies!

Indirect addressing, global all-to-all communication, extensive (non-flop) data movement

BG/L achieves just 1% of peak at high concurrencies

Higher scalability experiments were not possible due to the limited number of domains

To reach petascale, BB3D requires extensive code reengineering: additional decomposition schemes to reduce the communication bottleneck

Page 21:

Materials Science: PARATEC

PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane wave basis set

Density Functional Theory (DFT) calculates the structure and electronic properties of new materials from QM principles

DFT calculations are one of the largest consumers of supercomputer cycles in the world

Roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90

Part of the calculation is performed in real space, the rest in Fourier space; a specialized 3D FFT transforms the wavefunction (see the sketch below)

Quantum dots have important photo-luminescent properties

Visualization: conduction band minimum electron state for a CdSe quantum dot

Developed by Andrew Canning with Louie and Cohen’s groups (UCB, LBNL)
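The real-space/Fourier-space split is the usual plane-wave trick: the kinetic term is diagonal in Fourier space and the local potential is diagonal in real space, so the wavefunction is shuttled between the two with 3D FFTs. A minimal sketch under those assumptions (dense NumPy FFTs over a full cube; PARATEC itself uses a specialized parallel 3D FFT over the plane-wave sphere):

```python
import numpy as np

def apply_hamiltonian(psi_r, v_local_r, dx=1.0):
    """Apply H = -1/2 laplacian + V_local to a wavefunction on an n^3 grid.
    Kinetic term applied in Fourier space, local potential in real space,
    with 3D FFTs moving psi between the two representations (the pattern
    behind the 3D-FFT share of PARATEC's runtime)."""
    n = psi_r.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    KX, KY, KZ = np.meshgrid(k, k, k, indexing="ij")
    k2 = KX**2 + KY**2 + KZ**2
    psi_g = np.fft.fftn(psi_r)                   # real space -> Fourier space
    kinetic_r = np.fft.ifftn(0.5 * k2 * psi_g)   # back to real space
    return kinetic_r + v_local_r * psi_r

# usage: Gaussian orbital in a constant potential on a 32^3 grid
n = 32
x = np.arange(n) - n / 2
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
psi = np.exp(-(X**2 + Y**2 + Z**2) / 8.0).astype(complex)
h_psi = apply_hamiltonian(psi, v_local_r=np.full((n, n, n), -0.5))
```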

Page 22:

PARATEC Performance: Strong Scaling

[Charts: Gflop/s per processor and percent of peak vs. processor count (64 to 2,048) for Bassi, Jacquard, Jaguar, BG/L, and Phoenix]

PARATEC attains a high % of peak due to 1D FFTs and BLAS3

Bassi achieves the highest performance and good scaling up to 512 processors (the 1,024-processor Power5 result is from the LLNL Purple system)

Jaguar outperforms Jacquard, due partially to higher interconnect bandwidth

BG/L has the lowest overall raw performance, and uses a smaller system size due to memory constraints; performance drops between 512 and 1,024 processors due to moving off the topology half-plane

Phoenix shows the lowest % of peak: low scalar/vector ratio, and vector length drops for the custom F90 routines and BLAS3/FFT

PARATEC scaling decreases and is limited to ~2K processors due to its one-level decomposition

For petascale, a second level of decomposition over electronic band indices must be introduced; this will increase scaling and reduce memory requirements for systems like BG/L

Purple data courtesy of Tom Spelce and Bronis de Supinski

Page 23:

Gas Dynamics: HyperCLaw

Adaptive Mesh Refinement (AMR): a powerful technique to solve otherwise intractable problems, typically applied to physical systems governed by PDEs

AMR dynamically refines the underlying grid in regions of scientific interest; naively increasing grid resolution is prohibitive in computation and memory

The price paid for this additional power is complexity: significant software infrastructure for regridding, interpolation between coarse/fine grids, and dynamic load balancing

Generally written in a flexible/modular fashion, causing a performance reduction even for unrefined grid calculations; the communication pattern looks more like many-to-many than a stencil exchange

Berger-Colella hyperbolic conservation laws, C++/Fortran hybrid programming model:

Godunov - advances the solution at a given refinement level
TimeStep - prepares grids at the next level for the Godunov solver
Regrid - replaces the existing grid hierarchy to maintain numerical accuracy
Knapsack - load balancing algorithm (see the sketch below)

Developed by CCSE LBNL, ported/optimized by Michael Lijewski and Michael Welcome
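To illustrate the Knapsack step, here is a minimal greedy load-balancing sketch: grids of unequal cost are handed out largest-first to the least-loaded processor. This is the usual heuristic for this kind of problem, not the HyperCLaw implementation itself.

```python
import heapq

def knapsack_balance(grid_costs, n_procs):
    """Assign grids (work = cost per grid) to processors, largest first,
    always to the currently least-loaded processor. Greedy stand-in for
    the Knapsack load-balancing phase of an AMR framework."""
    heap = [(0.0, p) for p in range(n_procs)]     # (current load, proc id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    for gid, cost in sorted(enumerate(grid_costs), key=lambda g: -g[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(gid)
        heapq.heappush(heap, (load + cost, p))
    return assignment

# usage: 10 grids of uneven size onto 4 processors
print(knapsack_balance([120, 80, 300, 40, 200, 60, 90, 150, 30, 110], 4))
```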

Page 24:

X1E Optimization

[Chart: HyperCLaw runtime breakdown in seconds (Godunov, TimeStep-MPI, TimeStep-Comp, Knapsack, Regrid, Other) on Altix, Power5, and X1E at P=32, 64, and 128]

Two X1E optimizations have been undertaken since our original study in CF06

The Knapsack and Regridding phases originally prevented X1E scalability

Knapsack was optimized by swapping pointers instead of copying lists, resulting in an almost cost-free X1E Knapsack algorithm

Regridding originally required an O(N^2) box-list intersection calculation; an updated hashing scheme reduced the overhead to O(N log N), significantly reducing X1E regridding time (a rough sketch of the idea follows below)
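The regridding fix replaces the all-pairs box-intersection test with a hashing scheme so that each box is only compared against nearby candidates. A rough sketch of that idea using coarse spatial buckets (the actual HyperCLaw data structures differ):

```python
from collections import defaultdict
from itertools import product

def intersecting_pairs(boxes, bucket=64):
    """boxes: list of ((xlo, ylo), (xhi, yhi)) index ranges.
    Hash each box into the coarse buckets it touches, then test overlap
    only within a bucket, avoiding the O(N^2) all-pairs comparison."""
    grid = defaultdict(list)
    for i, ((xlo, ylo), (xhi, yhi)) in enumerate(boxes):
        for bx, by in product(range(xlo // bucket, xhi // bucket + 1),
                              range(ylo // bucket, yhi // bucket + 1)):
            grid[(bx, by)].append(i)
    pairs = set()
    for members in grid.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                i, j = members[a], members[b]
                (ixlo, iylo), (ixhi, iyhi) = boxes[i]
                (jxlo, jylo), (jxhi, jyhi) = boxes[j]
                if ixlo <= jxhi and jxlo <= ixhi and iylo <= jyhi and jylo <= iyhi:
                    pairs.add((min(i, j), max(i, j)))
    return pairs

# usage: two overlapping boxes and one disjoint box
print(intersecting_pairs([((0, 0), (10, 10)), ((5, 5), (20, 20)), ((100, 100), (110, 110))]))
```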

Page 25:

HyperCLaw Performance: Weak Scaling

[Charts: Gflop/s per processor and percent of peak vs. processor count (16 to 1,024) for Bassi, Jacquard, Jaguar, BG/L, and Phoenix]

Due to code complexity, all platforms achieve < 5% of peak

Increasing the adaptivity level introduces grid-management overheads, reducing performance

Bassi shows the highest raw performance, BG/L the lowest; Jacquard and Jaguar show similar raw performance

The X1E shows the lowest fraction of peak, ~1%; before optimization this was much lower

Hierarchical data structure management is scalar work; higher concurrency reduces vector length

AMR is designed for small grid patches, causing short vector lengths; adding more complexity (elliptic solves, chemistry) would certainly reduce vector performance further

It may be possible to design an AMR parallel vector code, but only from the ground up

Despite the low efficiency, the systems show good scalability - potential for petascale systems

Page 26:

Performance Summary

Evaluating HPC systems on realistic problems is a complex task, requiring a diverse group of domain scientists; this work presents one of the most extensive performance analyses to date

On average, the Power5 (Bassi) achieves the highest raw and sustained performance: high memory BW, tight interconnect integration, latency hiding via advanced prefetching

The Phoenix X1E achieved high performance for GTC and ELB3D, but showed poor results for several codes

The Opteron systems show similar performance; however, Jaguar's XT3 interconnect outperformed Jacquard for GTC and PARATEC

BG/L: lowest performance, but achieved unprecedented concurrency on several codes

Potential to run at petascale:
GTC, Cactus - very encouraging, linear scaling to 32K processors
ELB3D, HyperCLaw - promising scaling at lower concurrencies
BB3D and PARATEC - require reengineering to incorporate additional levels of parallelism

[Charts: relative performance and percent of peak by platform (Bassi Power5, Jacquard Opteron, Jaguar Opteron, BG/L PPC440, Phoenix X1E MSP) for HCLaw (P=128), BB3D (P=512), Cactus (P=256), GTC (P=512), ELB3D (P=512), PARATEC (P=512), and the average]

Page 27:

System Balance

Halving the memory bandwidth has virtually no effect on performance for a broad variety of applications

[Chart: single- vs. dual-core XT3 performance - wallclock time at fixed concurrency and problem size for CAM, MILC, GTC, GAMESS, PARATEC, PMEMD, MadBench, BB3D, and Cactus]

Page 28:

Memory Bandwidth (maintaining system balance)

[Chart: distribution of time spent (memory contention, flops, other) for CAM, MILC, GTC, GAMESS, PARATEC, PMEMD, and MadBench on a dual-core Opteron/XT4 system]

• Neither memory bandwidth nor FLOPs dominate runtime
• The "other" category is dominated by memory latency stalls
• Points to inadequacies in current CPU core design (inability to tolerate latency)

Page 29:

Future and Related Work

Future evaluation work:
Explore more complex methods and irregular data structures
Continue evaluating leading HPC systems as they come online
Perform in-depth application characterizations

Related research activities:
Developing an application-derived I/O benchmark based on CMB
Investigating automatic tuning for stencil computations
Investigating scientific kernel performance on multi-core systems
Investigating the potential of dynamically reconfigurable interconnects to achieve fat-tree performance with a fraction of the switch components

Papers available at http://crd.lbl.gov/~oliker

Page 30:

CMB Science

The CMB is a unique probe of the very early Universe

Tiny fluctuations in its temperature (1 in 100K) and polarization (1 in 100M) encode the fundamental parameters of cosmology, including the geometry, composition (mass-energy content), and ionization history of the Universe

Combined with complementary supernova measurements tracing the dynamical history of the Universe, we have an entirely new “concordance” cosmology: 70% dark energy + 25% dark matter + 5% ordinary matter

[Figure: pie chart of the concordance cosmology - dark energy, dark matter, and ordinary matter]