scientific computations on modern parallel vector systems

Scientific Computations on Modern Parallel Vector Systems. Leonid Oliker, Computer Staff Scientist, Future Technologies Group, Computational Research Division, Lawrence Berkeley National Laboratory. loliker@lbl.gov http://crd.lbl.gov/~oliker/paperlinks.html


Post on 10-Feb-2016


Page 1: Scientific Computations  on Modern Parallel  Vector Systems

Scientific Computations on Modern Parallel Vector Systems

Leonid Oliker
Computer Staff Scientist
Future Technologies Group
Computational Research Division
Lawrence Berkeley National Laboratory
loliker@lbl.gov
http://crd.lbl.gov/~oliker/paperlinks.html

Page 2: Scientific Computations  on Modern Parallel  Vector Systems

Overview

Superscalar cache-based architectures dominate the US HPC market
Leading architectures are commodity-based SMPs, due to generality and perception of cost effectiveness
The growing gap between peak & sustained performance is well known in scientific computing
Modern parallel vector systems may bridge this gap for many important applications
In April 2002, the Earth Simulator (ES) became operational:
• Peak ES performance > all DOE and DOD systems combined
• Demonstrated high sustained performance on demanding scientific apps
Currently conducting an evaluation study of DOE applications on modern parallel vector architectures (final year of a three-year project):
• 09/2003: MOU between NERSC and ES was completed
• Visited ES center December 8th-17th, 2003
• First international team to conduct a performance evaluation study at ES

Page 3: Scientific Computations  on Modern Parallel  Vector Systems

Vector Paradigm

High memory bandwidth
• Allows systems to effectively feed ALUs (high byte to flop ratio)
Flexible memory addressing modes
• Supports fine-grained strided and irregular data access
Vector registers
• Hide memory latency via deep pipelining of memory load/stores
Vector ISA
• Single instruction specifies large number of identical operations
Vector architectures allow for:
• Reduced control complexity
• Efficient utilization of a large number of computational resources
• Potential for automatic discovery of parallelism
However: only effective if sufficient regularity is discoverable in program structure
• Suffers greatly even if a small % of code is non-vectorizable (Amdahl's Law)
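The Amdahl's Law penalty for non-vectorizable code can be made concrete with a small sketch. The function name and the sample fractions are illustrative assumptions; the 8:1 and 32:1 vector:scalar ratios are the ones quoted later in this talk for ES and X1.

```python
def effective_speedup(vector_fraction, vector_to_scalar_ratio):
    """Amdahl's Law: speedup over all-scalar execution when only
    `vector_fraction` of the work runs on the (faster) vector unit."""
    scalar_time = 1.0 - vector_fraction
    vector_time = vector_fraction / vector_to_scalar_ratio
    return 1.0 / (scalar_time + vector_time)


# Illustrative sweep (sample fractions are assumptions, not measurements).
for ratio in (8, 32):
    for fraction in (0.90, 0.95, 0.99):
        speedup = effective_speedup(fraction, ratio)
        print(f"ratio {ratio}:1, {fraction:.0%} vectorized -> "
              f"{speedup:.1f}x ({speedup / ratio:.0%} of vector peak)")
```

The higher the vector:scalar ratio, the larger the share of vector peak lost to the same scalar remainder.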

Page 4: Scientific Computations  on Modern Parallel  Vector Systems

ES Processor Overview

8 Gflops per CPU
8 CPU per SMP
8-way replicated vector pipe
72 vec registers, 256 64-bit words
Divide unit
32 GB/s pipe to FPLRAM
4-way superscalar o-o-o @ 1 Gflop
64KB I$ & D$
Earth Simulator: 640 nodes

ES: newly developed FPLRAM (Full Pipelined RAM); SX6: DDR-SDRAM 128/256Mb
ES: uses IN, 12.3 GB/s bi-dir btw any 2 nodes, 640 nodes; SX6: uses IXS, 8 GB/s bi-dir btw any 2 nodes, max 128 nodes

Page 5: Scientific Computations  on Modern Parallel  Vector Systems

Earth Simulator Overview

Machine type: 640 nodes, each node is an 8-way SMP of vector processors (5120 total procs)
Machine peak: 40 TF/s (proc peak 8 GF/s)
OS: extended version of Super-UX, a 64-bit Unix OS based on System V-R3
Connection structure: a single-stage crossbar network (1500 miles of cable), 83,000 copper cables
• 7.9 TB/s aggregate switching capacity
• 12.3 GB/s bi-dir between any two nodes
• Global Barrier Counter within interconnect allows global barrier synch < 3.5 usec
Storage: 480 TB disk, 1.5 PB tape
Compilers: Fortran 90, HPF, ANSI C, C++
Batch: similar to NQS, PBS
Parallelization: vectorization at processor level; OpenMP, Pthreads, MPI, HPF

Page 6: Scientific Computations  on Modern Parallel  Vector Systems

Earth Simulator Cost

Approx costs

Development: $400M
Building: $70M
Maintenance: $50M/year
Electricity: $8M/year

Page 7: Scientific Computations  on Modern Parallel  Vector Systems

ES Programming Environment

Only benchmarking-size runs were submitted (no production runs)
ES not connected to Internet
Interactive, S cluster, L cluster (2 nodes, 14 nodes, 624 nodes)
No global file system
Few numerical libraries
Using >10 nodes requires minimum:
• Vectorization ratio: 95%
• Parallelization efficiency: 50%
• Examples of required parallelization ratio (as defined by Amdahl's Law): 16 nodes 99.21%; 64 nodes 99.80%; 256 nodes 99.95%
Lack of external access and usage hurdles inhibit scientific productivity
All codes were ported/vectorized on single-node SX6 at ARSC (Oliker et al, SC2003)
Multi-node vector runs first attempted at ES center
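The required parallelization ratios quoted above can be reproduced from Amdahl's Law. The slide gives only the results, so the derivation below is a reconstruction resting on two assumptions: the 50% efficiency threshold applies to the total CPU count, and each node contributes 8 CPUs. Under those assumptions the numbers match.

```python
CPUS_PER_NODE = 8  # assumption: each ES node counted as 8 CPUs


def required_parallel_fraction(procs, efficiency=0.5):
    """Smallest parallel fraction f such that the Amdahl speedup
    1 / ((1 - f) + f / procs) is at least efficiency * procs.
    Solving for f gives f >= (1 - 1/(efficiency*procs)) / (1 - 1/procs)."""
    return (1.0 - 1.0 / (efficiency * procs)) / (1.0 - 1.0 / procs)


for nodes in (16, 64, 256):
    f = required_parallel_fraction(nodes * CPUS_PER_NODE)
    print(f"{nodes} nodes: {100 * f:.2f}%")
```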

Page 8: Scientific Computations  on Modern Parallel  Vector Systems

Cray X1 Overview

SSP: 3.2 GF computational core
• VL = 64, dual pipes (800 MHz)
• 2-way scalar, 0.4 GF (400 MHz)
MSP: 12.8 GF, combines 4 SSPs
• Shares 2MB data cache (unique)
Node: 4 MSPs w/ flat shared mem
Interconnect: modified 2D torus
• Fewer links than full crossbar, but smaller bisection bandwidth
Globally addressable: procs can directly read/write to global mem
Parallelization:
• Vectorization (SSP)
• Multistreaming (MSP)
• Shared mem (OMP, Pthreads)
• Inter-node (MPI2, CAF, UPC)

[Figure: diagrams of X1 node, SSP, and MSP]

Page 9: Scientific Computations  on Modern Parallel  Vector Systems

Altix3000 Overview

Itanium2 @ 1.5GHz (peak 6 GF/s)
128 FP registers, 32K L1, 256K L2, 6MB L3
• Cannot store FP values in L1
EPIC instruction bundles
• Bundles processed in-order, instructions within bundle processed in parallel
Consists of “Cbricks”: 4 Itanium2, memory, 2 controller ASICs called SHUB
Uses high-bandwidth, low-latency Numalink3 interconnect (fat-tree)
Implements CCNUMA protocol in hardware
• A cache miss causes data to be communicated/replicated via SHUB
Uses 64-bit Linux with single system image (256 processors / few for OS services)
Scalability to large numbers of processors?

Page 10: Scientific Computations  on Modern Parallel  Vector Systems

Architectural Comparison

Node    Where  CPU/  Clock  Peak   Mem BW  Peak       Netwk BW  Bisect BW  MPI Latency  Network
Type           Node  MHz    GFlop  GB/s    byte/flop  GB/s/P    byte/flop  usec         Topology
Power3  NERSC  16    375    1.5    1.0     0.47       0.13      0.087      16.3         Fat-tree
Power4  ORNL   32    1300   5.2    2.3     0.44       0.13      0.025      7.0          Fat-tree
Altix   ORNL   2     1500   6.0    6.4     1.1        0.40      0.067      2.8          Fat-tree
ES      ESC    8     500    8.0    32.0    4.0        1.5       0.19       5.6          Crossbar
X1      ORNL   4     800    12.8   34.1    2.7        6.3       0.088      7.3          2D-torus

Custom vector architectures have:
• High memory bandwidth relative to peak
• Superior interconnect: latency, point-to-point, and bisection bandwidth
Overall, ES appears to be the most balanced architecture, while Altix shows the best architectural balance among superscalar architectures
A key ‘balance point’ for vector systems is the scalar:vector ratio

Page 11: Scientific Computations  on Modern Parallel  Vector Systems

Memory Performance

Triad Mem Test: A(i) = B(i) + s*C(i)
NO machine-specific optimizations

• For strided access, SX6 achieves 10X, 100X, 1000X improvement over X1, Pwr4, Pwr3
• For gather/scatter, SX6/X1 show similar performance and exceed scalar at larger data sizes
• Performance of all machines can be improved via architecture-specific optimizations
• Example: on X1, using non-cachable & unroll(4) pragmas improves strided BW by 20X

[Figure: Triad bandwidth (MB/s, log scale) vs. stride (0 to 500) for Power3, Power4, SX-6, X1]
[Figure: Triad gather/scatter bandwidth (MB/s, log scale) vs. data size (100 bytes to 13M bytes) for Power3, Power4, SX-6, X1]
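The three access patterns exercised by the triad test can be sketched as follows. This only illustrates the memory access patterns for A(i) = B(i) + s*C(i). It is not the benchmark code, and the helper names and the form of the index array are assumptions.

```python
def triad(a, b, c, s, indices):
    """Apply a[i] = b[i] + s * c[i] for every i produced by `indices`."""
    for i in indices:
        a[i] = b[i] + s * c[i]


def unit_stride(n):
    """Contiguous access: 0, 1, 2, ..."""
    return range(n)


def strided(n, stride):
    """Fixed-stride access: 0, stride, 2*stride, ..."""
    return range(0, n * stride, stride)


# Gather/scatter: an explicit index array drives every load and store
# (hypothetical example permutation).
index_array = [5, 0, 3, 6, 1]

size = 8
b = [float(i) for i in range(size)]
c = [1.0] * size
for pattern in (unit_stride(4), strided(4, 2), index_array):
    a = [0.0] * size
    triad(a, b, c, 2.0, pattern)
    print(list(pattern), a)
```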

Page 12: Scientific Computations  on Modern Parallel  Vector Systems

Analysis using ‘Architectural Probe’

Developed Architectural Probe: allows stressing the balance points of processor design (PMEO-04)
Tunable parameters mimic the behavior of important scientific kernels

What % of memory accesses can be random before performance decreases by half?
Gather/scatter is expensive on commodity cache-based systems
• Power4: only 1.6% (1 in 64)
• Itanium2: much less sensitive, at 25% (1 in 4)

How much computational intensity (CI) is required to hide the penalty of all-random access?
A huge amount of computation may be required to hide the overhead of irregular data access
• Itanium2 requires CI of about 9 flops/word
• Power4 requires CI of almost 75!

Interested in developing application-driven architectural probes for evaluation of emerging petascale systems

[Figure: "Reducing performance by 50%": % indirection (log scale, 0% to 100%) for Itanium 2, Opteron, Power3, Power4; data labels 1.6%, 25%, 6.3%, 0.8%]
[Figure: "CI required to hide indirection": computational intensity (CI, 0 to 200) for Itanium 2, Opteron, Power3, Power4; data labels 9.3, 149.3, 18.7, 74.7]

Page 13: Scientific Computations  on Modern Parallel  Vector Systems

Sample Kernel Performance

NPB FT Class B
• FFT is computationally intensive with data-parallel operations
• Well suited for vectorization: 17X, 4X faster than Power3/4
• Fixed-cost interprocessor communication hurts scalability

Nbody (Barnes-Hut)
• Nbody requires irregular, unstructured data access, control flow, and communication
• Poorly suited for vectorization: 2X and 5X slower than Power3/4
• Vector architectures are not general-purpose systems

Interested in exploring advanced algorithmic optimizations on emerging systems; preliminary work described in CUG04

[Figure: two panels (NPB FT Class B and Nbody) of GFlops/s vs. processors (1 to 64) for Power3, Power4, SX-6, X1]

Page 14: Scientific Computations  on Modern Parallel  Vector Systems

Applications studied

Applications chosen with potential to run at ultrascale

CACTUS   Astrophysics      100,000 lines  grid based            Solves Einstein’s equations of general relativity
PARATEC  Material Science  50,000 lines   Fourier space/grid    Density Functional Theory electronic structures codes
LBMHD    Plasma Physics    1,500 lines    grid based            Lattice Boltzmann approach for magneto-hydrodynamics
GTC      Magnetic Fusion   5,000 lines    particle based        Particle in cell method for gyrokinetic Vlasov-Poisson equation
MADCAP   Cosmology         5,000 lines    dense linear algebra  Extracts key data from Cosmic Microwave Background Radiation

Page 15: Scientific Computations  on Modern Parallel  Vector Systems

Astrophysics: CACTUS

Numerical solution of Einstein’s equations from theory of general relativity
Among most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes
Evolves PDEs on regular grid using finite differences
Uses ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along time dimension
Communication at boundaries
Expect high parallel efficiency
Exciting new field about to be born: Gravitational Wave Astronomy, offering fundamentally new information about the Universe
What are gravitational waves? Ripples in spacetime curvature, caused by matter motion, causing distances to change
Developed at Max Planck Institute, vectorized by John Shalf

[Figure: visualization of grazing collision of two black holes]

Page 16: Scientific Computations  on Modern Parallel  Vector Systems

CACTUS: Performance

ES achieves fastest performance to date: 45X faster than Power3!
• Vector performance related to x-dim (vector length)
• Scalar performance better on smaller problem size (cache effects)
• Excellent scaling on ES using fixed data size per proc (weak scaling)
X1 surprisingly poor (4X slower than ES): low scalar:vector ratio
• Note: boundary radiation vectorized for X1 but not on ES, giving the X1 an advantage
• Unvectorized boundary required 15-20% of runtime on ES (30+% on X1), versus < 5% for the scalar version: unvectorized code can quickly dominate cost

Problem size      P     Power3          Power4          Altix           ES              X1
(per processor)         Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
80x80x80          16    0.31     21%    0.58     11%    0.89     15%    1.5      18%    0.54     4%
                  64    0.22     14%    0.50     10%    0.70     12%    1.4      17%    0.43     3%
                  256   0.22     14%    0.48     9%     ---      ---    1.4      17%    0.41     3%
250x80x80         16    0.10     6%     0.56     11%    0.51     9%     2.8      35%    0.81     6%
                  64    0.08     6%     ---      ---    0.42     7%     2.7      34%    0.72     6%
                  256   0.07     5%     ---      ---    ---      ---    2.7      34%    0.68     5%

Page 17: Scientific Computations  on Modern Parallel  Vector Systems

Material Science: PARATEC

PARATEC performs first-principles quantum mechanical total energy calculation using pseudopotentials & plane wave basis set

Density Functional Theory to calc structure & electronic properties of new materials

DFT calc are one of the largest consumers of supercomputer cycles in the world

Uses all-band CG approach to obtain wavefunction of electrons
33% 3D FFT, 33% BLAS3, 33% hand-coded F90
Part of calculation in real space, other in Fourier space
• Uses specialized 3D FFT to transform wavefunction
Computationally intensive: generally obtains high percentage of peak
Developed w/ Louie and Cohen’s groups (UCB, LBNL), A. Canning

[Figure: induced current and charge density in crystallized glycine]

Page 18: Scientific Computations  on Modern Parallel  Vector Systems

PARATEC: Wavefunction Transpose

Transpose from Fourier to real space
3D FFT done via 3 sets of 1D FFTs and 2 transposes
Most communication in global transpose (b) to (c); little communication (d) to (e)
Many FFTs done at the same time to avoid latency issues
Only non-zero elements communicated/calculated
Much faster than vendor 3D-FFT

[Figure: wavefunction transpose stages (a) through (f)]
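The "3 sets of 1D FFTs and 2 transposes" structure can be sketched serially. This is a minimal single-process illustration, not PARATEC's implementation: a naive O(n^2) DFT stands in for a real FFT, the transposes are local axis rotations rather than global communication, the non-zero-element optimization is omitted, and all function names are assumptions. Because only two transposes are performed, the result is left in a permuted layout.

```python
import cmath


def dft_1d(v):
    """Naive 1D discrete Fourier transform (stand-in for a real FFT)."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft_last_axis(a):
    """One 'set' of 1D transforms: along the last axis of a[i][j][k]."""
    return [[dft_1d(row) for row in plane] for plane in a]


def rotate_axes(a):
    """Transpose a[i][j][k] -> b[k][i][j], making a new axis the last one."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)]
            for k in range(n2)]


def dft_3d_by_1d_passes(a):
    """3D DFT of a[x][y][z]; returns the result laid out as [ky][kz][kx]."""
    a = dft_last_axis(a)     # set 1: along z  -> [x][y][kz]
    a = rotate_axes(a)       # transpose 1     -> [kz][x][y]
    a = dft_last_axis(a)     # set 2: along y  -> [kz][x][ky]
    a = rotate_axes(a)       # transpose 2     -> [ky][kz][x]
    return dft_last_axis(a)  # set 3: along x  -> [ky][kz][kx]


def dft_3d_direct(a):
    """Reference triple-sum 3D DFT, laid out as [kx][ky][kz]."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])

    def coeff(kx, ky, kz):
        return sum(a[x][y][z] * cmath.exp(
            -2j * cmath.pi * (kx * x / n0 + ky * y / n1 + kz * z / n2))
            for x in range(n0) for y in range(n1) for z in range(n2))

    return [[[coeff(kx, ky, kz) for kz in range(n2)] for ky in range(n1)]
            for kx in range(n0)]
```

In a distributed code each rotation corresponds to a data redistribution across processors, which is where the slide locates most of the communication.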

Page 19: Scientific Computations  on Modern Parallel  Vector Systems

PARATEC: Performance

Data size         P     Power3          Power4          Altix           ES              X1
                        Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
432 Atom          32    0.95     63%    2.0      39%    3.7      62%    4.7      60%    3.0      24%
                  64    0.85     57%    1.7      33%    3.2      54%    4.7      59%    2.6      20%
                  128   0.74     49%    1.5      29%    ---      ---    4.7      59%    1.9      15%
                  256   0.57     38%    1.1      21%    ---      ---    4.2      52%    ---      ---
                  512   0.41     28%    ---      ---    ---      ---    3.4      42%    ---      ---
686 Atom          128                                                   4.9      62%    3.0      24%
                  256                                                   4.6      57%    1.3      10%

ES achieves fastest performance to date! Over 2 Tflop/s on 1024 procs
X1 3.5X slower than ES (although peak is 50% higher)
• Non-vectorizable code can be much more expensive on X1 (32:1 vs 8:1)
• Lower bisection bandwidth to computation ratio
Limited scalability due to increasing cost of global transpose and reduced vector length
• Plan to run larger problem size next ES visit
Scalar architectures generally perform well due to high computational intensity
• Power3 8X slower than ES
• Power4 4X slower: Federation has increased speed 2X compared with Colony
• Altix 1.5X slower: high memory and interconnect bandwidth, low-latency switch

Page 20: Scientific Computations  on Modern Parallel  Vector Systems

PARATEC Scaling: ES vs. Power3

ES can run the same system about 10 times faster than the IBM SP (on any number of processors)
Main advantage of ES for these types of codes is the fast communication network
Fast processors require less fine-grain parallelism in code to get same performance as RISC machines
Vector archs allow opportunity to simulate systems not possible on scalar platforms

[Figure: GFlops (log scale, 10 to 10000) vs. processors (32 to 1024); series: 309 QD - Ideal, 309 QD - Pwr3, 432 Si - Pwr3, 432 Si - ES, 686 Si - ES]

Page 21: Scientific Computations  on Modern Parallel  Vector Systems

Plasma Physics: LBMHD

LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
Performs 2D simulation of high-temperature plasma
Evolves from initial conditions, decaying to form current sheets
2D spatial grid is coupled to octagonal streaming lattice
Block distributed over 2D proc grid
Main computational components:
• Collision: requires coefficients for local gridpoint only, no communication
• Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
• Interpolation: step required between spatial and stream lattices
Developed by George Vahala’s group, College of William and Mary; ported by Jonathan Carter

[Figure: current density decay of two cross-shaped structures]

Page 22: Scientific Computations  on Modern Parallel  Vector Systems

LBMHD: Porting Details

Collision routine rewritten:
• For ES, loop ordering switched so gridpoint loop (~1000 iterations) is inner rather than velocity or magnetic field loops (~10 iterations)
• X1 compiler made this transformation automatically: multistreaming outer loop and vectorizing (via strip mining) inner loop
• Temporary arrays padded to reduce bank conflicts
Stream routine performs well:
• Array shift operations, block copies, 3rd-degree polynomial eval
Boundary value exchange:
• MPI_Isend, MPI_Irecv pairs
• Further work: plan to use ES "global memory" to remove message copies

[Figure: (left) octagonal streaming lattice coupled with square spatial grid; (right) example of diagonal streaming vector updating three spatial cells]
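The loop interchange described for the collision routine can be illustrated with a toy kernel. The weighted sum below is a hypothetical stand-in for the real collision operator, and Python shows only the loop structure. The point is that the long gridpoint loop ends up innermost, where a vectorizing compiler can turn it into long vector operations.

```python
def collide_velocity_inner(f, w):
    """Original-style ordering: the short velocity loop is innermost.
    f[i][v] holds the value for gridpoint i and velocity v."""
    npts, nvel = len(f), len(w)
    out = [0.0] * npts
    for i in range(npts):        # ~1000 gridpoints
        for v in range(nvel):    # ~10 velocities: too short to vectorize well
            out[i] += w[v] * f[i][v]
    return out


def collide_gridpoint_inner(f_t, w):
    """Interchanged ordering: the long gridpoint loop is innermost.
    f_t[v][i] is the transposed layout, so the inner loop is unit-stride."""
    nvel, npts = len(w), len(f_t[0])
    out = [0.0] * npts
    for v in range(nvel):
        fv, wv = f_t[v], w[v]
        for i in range(npts):    # long loop, no cross-iteration dependence
            out[i] += wv * fv[i]
    return out
```

Both orderings accumulate each `out[i]` in the same velocity order, so the results are identical; only the shape of the inner loop changes.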

Page 23: Scientific Computations  on Modern Parallel  Vector Systems

LBMHD: Performance

ES achieves highest performance to date: over 3.3 Tflops for P=1024
X1 comparable absolute speed up to P=64 (lower % peak)
• But performs 1.5X slower at P=256 (decreased scalability)
• CAF improved X1 to slightly exceed ES (up to 4.70 Gflop/P)
ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
• Low CI and high memory requirement (30GB) hurt scalar performance
• Altix best scalar due to: high memory bandwidth, fast interconnect

Data size         P     Power3          Power4          Altix           ES              X1
                        Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
4096 x 4096       16    0.11     7%     0.28     5%     0.60     10%    4.6      58%    4.3      34%
                  64    0.14     9%     0.30     6%     0.62     10%    4.3      54%    4.4      34%
                  256   0.14     9%     0.28     5%     ---      ---    3.2      40%    ---      ---
8192 x 8192       64    0.11     7%     0.27     5%     0.65     11%    4.6      58%    4.5      35%
                  256   0.12     8%     0.28     5%     ---      ---    4.3      53%    2.7      21%
                  1024  0.11     7%     ---      ---    ---      ---    3.3      41%    ---      ---

Page 24: Scientific Computations  on Modern Parallel  Vector Systems

LBMHD on X1 MPI vs CAF

X1 well-suited for one-sided parallel languages (globally addressable mem)
• MPI hinders this feature and requires scalar tag matching
CAF allows much simpler coding of boundary exchange (array subscripting):
• feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
MPI requires non-contiguous data copies into buffer, unpacked at destination
Since communication is about 10% of LBMHD, only slight improvements
However, for P=64 on 4096^2, performance degrades. Tradeoffs:
• CAF reduced total message volume 3X (eliminates user and system buffer copy)
• But CAF used more numerous and smaller-sized messages
Interested in research of CAF and UPC performance and optimization

Data size   P     X1-MPI            X1-CAF
                  Gflops/P  %peak   Gflops/P  %peak
4096^2      16    4.32      34%     4.55      36%
            64    4.35      34%     4.26      33%
8192^2      64    4.48      35%     4.70      37%
            256   2.70      21%     2.91      23%

Page 25: Scientific Computations  on Modern Parallel  Vector Systems

LBMHD: Performance

Preliminary time breakdown shown relative to each architecture
Cray X1 has highest % spent in communication (P=64); CAF version reduced this
ES shows best memory bandwidth performance (stream)
Communication increases at higher scalability (as expected)

[Figure: % time (0 to 80) spent in collision, stream, comm on P3, P4, ES, X1; 8192 x 8192 grid, 64 processors]
[Figure: % time (0 to 80) spent in collision, stream, comm on P3, P4, ES; 8192 x 8192 grid, 256 processors]

Page 26: Scientific Computations  on Modern Parallel  Vector Systems

Magnetic Fusion: GTC

Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
Goal of magnetic fusion is a burning plasma power plant producing cleaner energy
GTC solves 3D gyroaveraged gyrokinetic system w/ particle-in-cell approach (PIC)
PIC scales as N instead of N^2: particles interact w/ electromagnetic field on grid
Allows solving equation of particle motion with ODEs (instead of nonlinear PDEs)
Main computational tasks:
• Scatter: deposit particle charge to nearest point
• Solve: Poisson eqn to get potential for each point
• Gather: calc force based on neighbors' potential
• Move: particles by solving eqn of motion
• Shift: particles moved outside local domain
Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier

[Figure: 3D visualization of electrostatic potential in magnetic fusion device]

Page 27: Scientific Computations  on Modern Parallel  Vector Systems

GTC: Scatter operation

Particle charge deposited amongst nearest grid points
The particles can be anywhere inside the domain
Several particles can contribute to same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
Since particles are randomly localized, scatter also hinders cache reuse
Solution: VLEN copies of charge deposition array w/ reduction after main loop
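The "VLEN copies plus reduction" idea can be sketched as follows. This is a simplified illustration, not GTC's code: each particle deposits its whole charge into a single cell, the function names are assumptions, and `VLEN = 4` is a small illustrative value (the talk lists 256-word vector registers on ES and VL = 64 on X1).

```python
VLEN = 4  # illustrative vector length (assumption)


def deposit_direct(cells, charges, ngrid):
    """Straightforward scatter: two particles in the same cell update the
    same memory location, which blocks vectorization of this loop."""
    grid = [0.0] * ngrid
    for cell, q in zip(cells, charges):
        grid[cell] += q
    return grid


def deposit_private_copies(cells, charges, ngrid, vlen=VLEN):
    """Conflict-free scatter: vector lane k only ever writes to copy k,
    so the lanes of one strip never collide. Copies are summed afterwards."""
    copies = [[0.0] * ngrid for _ in range(vlen)]
    nparticles = len(cells)
    for start in range(0, nparticles, vlen):                 # strip-mined loop
        for lane in range(min(vlen, nparticles - start)):    # vectorizable
            p = start + lane
            copies[lane][cells[p]] += charges[p]
    # Reduction after the main loop.
    return [sum(copies[lane][g] for lane in range(vlen)) for g in range(ngrid)]
```

The price is memory: the deposition array is replicated `vlen` times, which is consistent with the large vector memory footprint reported for the port.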

Page 28: Scientific Computations  on Modern Parallel  Vector Systems

GTC: Porting Details

Large vector memory footprint required to eliminate dependencies
• P=64 uses 42 GB on ES compared w/ 5 GB on Power3
Relatively small mem per processor (ES=2GB, X1=4GB) limits problem size of runs
GTC has second level of parallelism via OpenMP (hybrid programming)
• However, on ES/X1 memory footprint increased: additional 8X, about 320GB
Non-vectorized “Shift” routine accounted for: 54% on X1, 11% on ES
• Due to high penalty of serialized sections on X1 when multistreaming
The shift routine was vectorized on X1, but NOT on ES: X1 has advantage
• Limited time at ES prevented vectorization of shift routine
• Now shift accounts for only 4% of X1 runtime

Page 29: Scientific Computations  on Modern Parallel  Vector Systems

GTC: Performance

Number of         P     Power3          Power4          Altix           ES              X1
particles               Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
10/cell, 20M      32    0.13     9%     0.29     5%     0.29     5%     0.96     12%    1.00     8%
                  64    0.13     9%     0.32     5%     0.26     4%     0.84     10%    0.80     6%
100/cell, 200M    32    0.13     9%     0.29     5%     0.33     6%     1.34     17%    1.50     12%
                  64    0.13     9%     0.29     5%     0.31     5%     1.25     16%    1.36     11%
                  1024  0.06     4%

Vectors achieve fastest per-processor performance of any tested architecture!
• P=64 on X1 is 35% faster than P=1024 on Power3!
• X1 9% faster than ES (but has additional code section vectorized)
Advantage of ES for PIC codes may reside in higher statistical resolution simulations: the greater speed allows for more particles per grid cell
Larger tests could not be performed at ES due to parallelization/vectorization efficiency hurdles
Low Altix performance under investigation (random # generation)

Page 30: Scientific Computations  on Modern Parallel  Vector Systems

GTC: Performance

With increasing processors, and fixed problem size, the vector length decreases

Limited scaling due to decreased vector efficiency rather than communications overhead.

MPI communication by itself has near perfect scaling.

Page 31: Scientific Computations  on Modern Parallel  Vector Systems

Cosmology: MADCAP

Microwave Anisotropy Dataset Computational Analysis Package
Optimal general algorithm for extracting key cosmological data from Cosmic Microwave Background Radiation (CMB)
Anisotropies in the CMB contain early history of the Universe
Calculates maximum likelihood two-point angular correlation function
Recasts problem in dense linear algebra: ScaLAPACK
Steps include: mat-mat, matrix-inv, mat-vec, Cholesky decomp, data redistribution
Porting: ScaLAPACK plus rewrite of Legendre polynomial recursion, such that large batches are computed in inner loop
Developed at NERSC by Julian Borrill

[Figure: temperature anisotropies in CMB measured by Boomerang]
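The Legendre rewrite mentioned under porting can be sketched as below. The sketch uses the standard three-term recurrence (l+1) P_{l+1}(x) = (2l+1) x P_l(x) - l P_{l-1}(x). The function name and the batch layout are assumptions, not MADCAP's code. The recurrence in l is inherently sequential, so it goes in the outer loop, while the inner loop runs over a large batch of independent arguments.

```python
def legendre_batched(xs, lmax):
    """Return a table p with p[l][i] = P_l(xs[i]) for l = 0..lmax."""
    n = len(xs)
    p = [[1.0] * n]                 # P_0(x) = 1
    if lmax >= 1:
        p.append(list(xs))          # P_1(x) = x
    for l in range(1, lmax):        # short, sequential recurrence in l
        prev, cur = p[l - 1], p[l]
        # Long inner loop over the batch: every element is independent,
        # which is the shape a vector unit needs.
        nxt = [((2 * l + 1) * xs[i] * cur[i] - l * prev[i]) / (l + 1)
               for i in range(n)]
        p.append(nxt)
    return p


table = legendre_batched([0.5, 1.0, -1.0], 3)
print(table[2], table[3])
```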

Page 32: Scientific Computations  on Modern Parallel  Vector Systems

MADCAP: Performance

Only partially ported due to code’s requirement of a global file system
All systems sustain relatively low % peak considering MADCAP’s BLAS3 ops
• Complex tradeoffs: architectural paradigm, interconnect technology, and I/O filesystem
• Detailed analysis presented at HiPC 2004
Further work is required to: reduce I/O, remove system calls, and remove global file system requirements
Plan to implement new MADCAP version for next ES visit

P     Power3          Power4          ES              X1
      Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
16    0.62     41%    1.5      29%    4.1      32%    2.2      27%
64    0.54     36%    0.81     16%    1.9      23%    2.0      16%

Page 33: Scientific Computations  on Modern Parallel  Vector Systems

Overview

Tremendous potential of vector architectures: 4 codes running faster than ever before
Vector systems allow resolution not possible with scalar arch (regardless of # procs)
ES shows much higher sustained and often higher raw performance compared with X1
• Limited X1-specific optimization: optimal programming approach still unclear (CAF, etc.)
• Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
Vectors potentially at odds w/ emerging techniques (sparse, irregular, multi-physics)
• GTC is an example of code at odds with data-parallelism
• Much more difficult to evaluate codes poorly suited for vectorization
Return to ES in October: evaluate new codes and higher scalability studies
• Potential opportunity for large-scale scientific runs (not just benchmarking)

Code      % peak (P=64)                   Speedup ES vs. (P=Max avail)
          Pwr3  Pwr4  Altix  ES    X1     Pwr3  Pwr4  Altix  X1
LBMHD     7%    5%    11%    58%   37%    30.6  15.3  7.2    1.5
CACTUS    6%    11%   7%     34%   6%     45.0  5.1   6.4    4.0
GTC       9%    6%    5%     16%   11%    9.4   4.3   4.1    0.9
PARATEC   57%   33%   54%    58%   20%    8.2   3.9   1.4    3.9
MADCAP    61%   40%   ---    53%   19%    3.4   2.3   ---    0.9

Page 34: Scientific Computations  on Modern Parallel  Vector Systems

Future directions

Leverage evaluation suite, (unclassified) application expertise, emerging arch research
Develop application-driven architectural probes for evaluation of emerging petascale systems
Research the enhancement of commodity scalar processors with vector features for increased scientific productivity (including investigation into VIVA2 with IBM)
• Software-controlled scratchpad, programmable prefetch/preload
Investigate algorithmic optimizations for leading vector systems and examine an architecture-independent algorithmic analysis:
• Fundamental resource requirements of key algorithms (FPU, locality, bdwith, latency-tolerance)
Explore new application areas on leading parallel systems
• Evaluate codes traditionally at odds with vector architectures
Study the potential of implicit parallel programming languages: UPC and CAF
• Especially codes difficult to express via MPI (portability requirement tradeoffs)
Evaluate soon-to-be-released supercomputing systems and identify classes of applications best suited for their unique architectural balance
• Cray X1e, XD1 & Red-Storm, IBM Power5, Bluegene/*, Hitachi SR11000, NEC SX8

Page 35: Scientific Computations  on Modern Parallel  Vector Systems

Publications

L. Oliker, A. Canning, J. Carter, J. Shalf, and S. Ethier. “Scientific Computations on Modern Parallel Vector Systems”, Supercomputing 2004, to appear. Nominated for Best Paper award.

J. Carter, J. Borrill, and L. Oliker. “Performance Characteristics of a Cosmology Package on Leading HPC Architectures”, International Conference on High Performance Computing: HiPC 2004, to appear. Nominated for Best Paper award.

L. Oliker, J. Borrill, A. Canning, J. Carter, H. Shan, D. Skinner, R. Biswas, J. Djomehri. “Performance Evaluation of the SX-6 Vector Architecture”, Journal of Concurrency and Computation 2004, to appear.

L. Oliker and R. Biswas. “Performance Modeling and Evaluation of Ultra-Scale Systems”, Minisymposium organized at SIAM Conference on Parallel Processing for Scientific Computing: SIAMPP 2004.

L. Oliker, J. Borrill, A. Canning, J. Carter, H. Shan, D. Skinner, R. Biswas, J. Djomehri. “A Performance Evaluation of the Cray X1 for Scientific Applications”, International Meeting on High Performance Computing for Computational Science: VECPAR 2004.

H. Shan, E. Strohmaier, L. Oliker. “Optimizing Performance of Superscalar Codes For a Single Cray X1 MSP Processor”, 46th Cray User Group Conference, CUG 2004.

G. Griem, L. Oliker, J. Shalf, K. Yelick. “Identifying Performance Bottlenecks on Modern Microarchitectures using an Adaptable Probe”, Performance Modeling, Evaluation, Optimization of Parallel & Distributed Systems: PMEO 2004.

L. Oliker, J. Carter, J. Shalf, D. Skinner, S. Ethier, R. Biswas, J. Djomehri, R. Van der Wijngaart. “Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations”, Supercomputing 2003.