
The Potential of the Cell Processor for Scientific Computing

Leonid Oliker

Samuel Williams, John Shalf, Shoaib Kamil, Parry Husbands, Katherine Yelick

Computational Research Division

Lawrence Berkeley National Laboratory


Motivation

Stagnating application performance is a well-known problem in scientific computing. By the end of the decade, numerous mission-critical applications are expected to have 100x the computational demands of current levels.

Many HEC platforms are poorly balanced for the demands of leading applications:
• Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology

Traditional superscalar trends are slowing down: most of the benefits of ILP and pipelining have been mined, and clock frequency is limited by the power wall.

The specialized HPC market cannot support huge technology investments, so the HPC community is looking for alternative high-performance architectures with a healthy market outside scientific computing (a multi-billion-dollar market):
• The growing sophistication of gaming technology demands more floating-point-intensive computation
• However, games are ultimately limited by the level of human perception, not scientific fidelity

The recently released Cell processor has tremendous computational capability. This work examines four key scientific algorithms on Cell.

Introduction to Cell

Cell will be used in the PS3, which is compelling because it will be produced in high volume. It is a radical departure from conventional designs, including the XBOX 360's Xenon:

            Cell/PS3                        XBOX 360
Design      Heterogeneous                   Homogeneous
Cores       PPC + 8 SIMD cores              3 x PPC
Memory      Software-controlled memory      Conventional cache-based
            architecture                    memory hierarchy (1MB)
Die area    221mm² (+30%)                   168mm²

Key limiting factors on current platforms: off-chip memory bandwidth and power usage. The memory hurdles are latency and bandwidth utilization. Homogeneous cache-based CMPs do not address these deficiencies.

Software-controlled memory improves memory bandwidth and power usage:
• Allows finely tuned deep prefetching
• Double buffering of data hides memory latency, potentially fully utilizing bandwidth
• More efficient cache utilization policies (smaller "caches" required)
• Memory fetching takes advantage of application-level information
• More predictable performance, for performance modeling, real-time systems, etc.
• Less architectural complexity than automatic caching = less power

However, software-controlled memory increases programming complexity.


Cell Processor Architecture

All units are connected via the EIB (Element Interconnect Bus): 4 x 128b rings @ 1.6GHz. One PPC core @ 3.2GHz plus 8 SPEs (128b SIMD cores).

Each SPE is an off-load engine, somewhere between a full processor and a coprocessor:
• 128 x 128b single-cycle register file
• 256KB local store (16K x 128b, 6-cycle latency)
• Private address space; access to global store via DMA
• No unaligned access (only via permutes)
• Dual SIMD issue (private PC): one arithmetic/float/etc. instruction and one load/store/permute/branch/channel instruction per cycle
• Statically scheduled, in-order execution, 7-cycle pipelines
• 3W @ 3.2GHz

Memory controller: 25.6GB/s dual XDR. ~40W total if the PPC is idle.

Cell architecture benefits include:
• Reduced pipeline length: lower branch-mispredict penalty, lower memory latency
• Numerous in-flight DMAs: hide memory latency, utilize available memory bandwidth, overlap computation with communication
• Impressive performance/power ratio


Microarchitectural Issues

PPE (PowerPC Processing Element):
• 512KB cache: coherent with DMAs, but not with the Local Store (LS)
• Dual thread, dual issue, in order
• VMX unit + scalar DP FMA

SPE (Synergistic Processing Element): 7-cycle, in-order, dual SIMD pipelines

Single precision:
• 4 FMA datapaths, 6-cycle latency
• 25.6 Gflop/s per SPE, 204.8 Gflop/s overall peak

Double precision:
• 1 FMA datapath, 13-cycle latency
• The 13-cycle pipeline doesn't fit the 7-cycle forwarding network, so a 6-cycle stall is inserted after issue for correctness: one DP issue every 7 cycles (1/14th of SP peak) = 1.83 Gflop/s per SPE, 14.6 Gflop/s overall peak
• DP instructions are prohibited from dual issuing, so for streaming apps (with loads, permutes, etc.) there is one DP issue every 8 cycles = 1.6 Gflop/s per SPE (12.8 Gflop/s overall)
• DP is obviously not a first-class citizen in the Cell microarchitecture (the rates are worked out below)

Branches rely on software-managed branch hints.
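For concreteness, here is how the quoted DP rates fall out of the issue rates, as a worked check (assuming one 2-wide DP SIMD FMA, i.e. 4 flops, per DP issue at 3.2 GHz):

\[
\frac{4\ \mathrm{flops}}{7\ \mathrm{cycles}} \times 3.2\,\mathrm{GHz} \approx 1.83\ \mathrm{Gflop/s\ per\ SPE}
\quad\Rightarrow\quad 8 \times 1.83 \approx 14.6\ \mathrm{Gflop/s\ peak}
\]

\[
\frac{4\ \mathrm{flops}}{8\ \mathrm{cycles}} \times 3.2\,\mathrm{GHz} = 1.6\ \mathrm{Gflop/s\ per\ SPE}
\quad\Rightarrow\quad 8 \times 1.6 = 12.8\ \mathrm{Gflop/s\ streaming}
\]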


Programming Cell

The local store appears to be the SPU's entire memory space. However, with DMAs it can be programmed as a software (user) controlled memory: a series of DMAs transfers data from global store (DRAM) to local store (SRAM).
• A DMA get-with-list allows expressing a list of addresses & sizes for amenable algorithms
• Analogous to vector loads from DRAM into a large SRAM (a vector register file)

The local store has a constant 6-cycle latency (no cache misses), which greatly simplifies performance estimation vs. out-of-order, cache-based machines. Double buffering allows explicit overlapping of computation and communication (see the sketch below).

Our work does not benchmark the compiler's ability to SIMDize: for all critical sections, we wrote code with SIMD/quadword intrinsics. This gives ideal performance, with somewhat more work than C but far less than assembly.

Programming overhead:
• SpMV: 1 month, 600 lines, including learning the programming model, architecture, compiler, tools, and algorithm choices
• Stencil versions: about 1 week, 450 lines (2 versions) vs. 15 lines originally; required significant loop unrolling and intrinsics use
• SP time-skewed stencil version: one day for a total rewrite of 450 lines; attained 65 Gflop/s!
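As an illustrative sketch only (not the authors' code), a minimal double-buffered DMA loop on an SPE could look like the following, using the standard spu_mfcio.h DMA interface; the chunk size, buffer names, and process() kernel are assumptions:

```c
#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per DMA transfer; an assumed tile size */

/* two local-store buffers so one can fill while the other is consumed */
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, int n);  /* hypothetical compute kernel */

void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    /* prime the pipeline: start fetching the first chunk (tag 0) */
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        /* kick off the next chunk on the other buffer/tag... */
        if (i + 1 < nchunks)
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        /* ...and block only until the current chunk has arrived */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process((char *)buf[cur], CHUNK);  /* overlaps the DMA above */
        cur = nxt;
    }
}
```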


Parallel Programming Model

Possible programming paradigms include:
• Task parallelism, with independent tasks on each SPE
• Pipelined parallelism, where large data blocks are passed from one SPE to the next (similar to a streaming model)
• Data parallelism: identical operations on distinct data (SPMD)

We examine a hierarchical SPMD model: the simplest and most direct way to decompose the problem, and a good match for many scientific algorithms. It is similar to OpenMP or Cray X1(E) multistreaming parallelism. The PPE is used only to partition and load balance, never for computation:
• Allows us to treat the system as a homogeneous parallel machine (a minimal decomposition sketch follows)
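A minimal sketch of the SPMD decomposition under an assumed block distribution of rows (or planes) across the 8 SPEs; the identifiers are illustrative, not the authors' interface:

```c
/* Each SPE runs the same kernel on its own contiguous slice of the
 * problem; the PPE only computes the partition and launches the SPEs. */
void my_range(int spe_id, int nspes, int n, int *lo, int *hi)
{
    int base = n / nspes, rem = n % nspes;
    /* the first `rem` SPEs get one extra element for load balance */
    *lo = spe_id * base + (spe_id < rem ? spe_id : rem);
    *hi = *lo + base + (spe_id < rem ? 1 : 0);
}
```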


Estimation, Simulation and Exploration

Performance Modeling (PM): assumes double buffering, long DMAs, and an in-order machine. The predicted time is MAX(static timing analysis, memory traffic modeling):
• Static timing captures operation latency, issue-width limits, and operand alignment of SIMD/quadwords
• Memory traffic captures DMA load/store overheads (including constraints such as the single DRAM controller)

For regular data structures, this "spreadsheet" modeling works (a minimal sketch follows). Some kernels (SpMV, FFT) require more advanced modeling to capture the input data pattern; e.g., SpMV iteration performance varies with the matrix nonzero format.
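As a hedged illustration of the model above (not the authors' spreadsheet), the core of the prediction reduces to taking the maximum of compute and memory time; the parameter names are assumptions:

```c
#include <math.h>

/* Sketch of the "spreadsheet" model: predicted time is the maximum of
 * in-order compute time (from static timing analysis) and the time to
 * stream the kernel's data through the single XDR memory controller.
 * Double buffering is what lets the two fully overlap. */
double predicted_seconds(double cycles_static, /* static timing analysis */
                         double bytes_moved,   /* total DMA traffic (B)  */
                         double clock_hz,      /* 3.2e9 for Cell         */
                         double mem_bw)        /* 25.6e9 B/s, dual XDR   */
{
    double t_compute = cycles_static / clock_hz;
    double t_memory  = bytes_moved / mem_bw;
    return fmax(t_compute, t_memory);
}
```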

Full System Simulator (FSS): based on IBM's Mambo; cycle accurate; includes a static timing analyzer, compilers, etc.

Cell+: the DP pipeline is not very important for video games. A redesigned pipeline would help HPC but increase design complexity and power consumption, so we propose a modest modification: an alternate forwarding-network design.
• How severely does a DP throughput of 1 SIMD instruction every 7/8 cycles impair execution?
• The Cell+ model fully utilizes the DP datapath: 1 SIMD instruction every 2 cycles
• Allows dual issuing of DP instructions with loads/stores/permutes/branches
• Same SP throughput, frequency, bandwidth, and power as Cell


Comparing Processors

The Cray X1E contains the world's most powerful vector processor. Cell performance figures do not include the PowerPC core. Cell+ has a 51.2 Gflop/s DP peak, 3.5x the performance of Cell. Both show impressive potential in performance and power efficiency.


Dense Matrix-Matrix Multiplication

GEMM is characterized by high computational intensity and regular data access; we expect to come close to peak on most platforms.

We explored two blocking formats: column-major and block data layout.
• Column-major: implicit blocking via gathering stanzas. Issues: fitting the tile in the SPE local store, and multiple short DMAs. The block size is chosen to maximize flop/byte and reduce TLB misses.
• Block data layout (BDL): explicit blocking via a two-stage addressing scheme; requires only a single long DMA per tile.

Choose a block size large enough that the kernel is computation bound: 64x64 in single precision. This is much easier in DP (14x the computation time, but only 2x the transfer time). A sketch of the resulting tile kernel follows below.

Future work: Cannon's algorithm, which would reduce DRAM bandwidth by using the EIB, but could significantly increase the number of pages touched.
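To make the BDL idea concrete, here is a hedged sketch (not the authors' kernel) of the on-chip tile multiply. With tiles stored contiguously, each tile arrives in one long DMA, and the innermost loop is unit stride (SIMD/quadword friendly). B = 64 matches the SP block size named above; everything else is illustrative:

```c
#define B 64  /* 64x64 SP tiles, as chosen above */

/* C_tile += A_tile * B_tile, with all three tiles resident in the
 * local store. In BDL each B x B tile is a contiguous array. */
void tile_gemm(const float *a, const float *b, float *c)
{
    for (int i = 0; i < B; i++)
        for (int k = 0; k < B; k++) {
            float aik = a[i * B + k];
            for (int j = 0; j < B; j++)   /* unit stride in j */
                c[i * B + j] += aik * b[k * B + j];
        }
}
```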


GEMM - Results

CellPM represents results from our analytical performance model; IBM's published hardware numbers come very close to these (within 3%). Cell results are compared with highly optimized vendor GEMM libraries. The performance and power efficiency are impressive (>200x the power efficiency of IA64 in SP!):
• SP: 69x, 26x, 7x faster than the X1E, IA64, AMD64
• DP: 0.9x, 3.7x, 2.7x the speed of the X1E, IA64, AMD64

The Cell+ approach improves DP performance 3.5x with modest architectural modifications: over 50 Gflop/s!

Gflop/s   Cell+PM   CellPM   X1E    AMD64   IA64
DP        51.1      14.6     16.9   4.0     5.4
SP        —         204.7    29.5   7.8     3.0

Mflop/W   Cell+PM   CellPM   X1E    AMD64   IA64
DP        1277      365      141    45      42
SP        —         5117     245    88      23


Sparse Matrix-Vector Multiplication

SpMV is the most expensive step in iterative solvers for sparse linear systems and eigenproblems arising from PDEs. It poses a performance challenge for cache-based systems due to low computational intensity and irregular (indexed) data accesses.

Potential challenges on Cell: no caches and no word-granularity gather/scatter support. Potential advantages:
• Low functional-unit and local-store latency
• Task parallelism across 8 SPEs, with 8 independent load/store units
• Ability to stream nonzeros via DMA
• The local store is not a write-back cache: temporaries can be overwritten without consuming DRAM bandwidth

Our Cell implementation examines CSR and blocked CSR (BCSR). SIMDization requires all row lengths to be a multiple of 4, which simplifies quadword alignment (see the sketch below).

We explicitly cache block the columns to exploit spatial locality within the local store, and implicitly cache block the rows. Cache-block parallelization strategies:
• Partition by rows: potential load imbalance (depends on matrix structure)
• Partition by nonzeros: each SPE holds a copy of the source vector, plus a reduction across SPEs

Nonzeros are double buffered to overlap computation and communication; this requires being able to restart in the middle of a row.
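A minimal plain-C sketch of the CSR kernel under the padding constraint named above; the real kernel used SIMD intrinsics, and all identifiers here are illustrative:

```c
/* CSR SpMV with every row padded (with explicit zeros) to a multiple
 * of 4 entries, so each row is whole quadwords and a SIMD version
 * needs no scalar cleanup. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const float *val, const float *x, float *y)
{
    for (int r = 0; r < nrows; r++) {
        float sum = 0.0f;
        /* rowptr[r+1] - rowptr[r] is a multiple of 4 by construction */
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += val[k] * x[col[k]];  /* x is the cache-blocked column
                                           slice held in the local store */
        y[r] = sum;
    }
}
```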


SpMV - Example Figure

• Partially double buffer the row pointers to discover the matrix structure
• Completely eliminate empty cache blocks; prune empty rows
• Explicitly choose the column blocking via a cost function: the cache block perimeter is fixed by the local store size, so what is the optimal r x c?
• Parallelize the blocks across SPUs (SPU0-SPU7) using a cost function of execution time based on rows + nonzeros

[Figure: an example sparse matrix divided into cache blocks and partitioned across SPU0-SPU7.]


SpMV - FSS Implementation

We used performance model estimates to guide the actual implementation. In DP, row lengths must be even (quadword aligned). There is no BCSR software implementation yet.

Parallelization: dynamically analyze costFunction(rows, NZs) ~ execution time (a sketch follows).

Runtime blocking, driven by the cost function: LS = 256KB = 32K doubles, so the maximum column block is 32K. Thus only a 15b relative index is needed to store each column index (not an absolute 32b one).

Runtime search for structure: skip empty cache blocks and search for the first non-empty row.

The Itanium2/AMD64 versions use the highly optimized OSKI auto-tuner. The Cray version is the best known to date: optimized CSRP and jagged diagonal. We examined a suite of (un)symmetric matrices from real numerical calculations.
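A hedged sketch of the runtime partitioning idea: rows are assigned greedily so each SPE gets roughly an equal share of a cost modeled from rows and nonzeros. The cost weights and interfaces are assumptions, not the authors' code:

```c
/* Estimated execution time of a row range: weighted rows + nonzeros.
 * Unit weights are an assumption; the real weights would be tuned. */
double cost(int rows, int nnz) { return 1.0 * rows + 1.0 * nnz; }

/* Fill start[0..nspes] with row boundaries so that SPE s owns rows
 * [start[s], start[s+1]). */
void partition(int nrows, const int *rowptr, int nspes, int *start)
{
    double total = cost(nrows, rowptr[nrows]);
    int spe = 0;
    start[0] = 0;
    for (int r = 1; r < nrows && spe < nspes - 1; r++) {
        /* close this SPE's range once it accumulates its share */
        if (cost(r - start[spe], rowptr[r] - rowptr[start[spe]])
                >= total / nspes)
            start[++spe] = r;
    }
    while (spe < nspes)      /* any remaining boundaries end at nrows */
        start[++spe] = nrows;
}
```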


SpMV - Results

Gflop/s          CellFSS   Cell+PM   CellPM   X1E    AMD64   IA64
unsymmetric DP   3.04      2.46      2.34     1.14   0.36    0.36
            SP   -         -         4.08     -      0.53    0.41
symmetric   DP   3.38*     4.35      4.00     2.64   0.60    0.67
            SP   -         -         7.68     -      0.80    0.83

Mflop/W          CellFSS   Cell+PM   CellPM   X1E    AMD64   IA64
unsymmetric DP   76.0      61.5      58.5     9.50   4.04    2.77
            SP   -         -         102      -      5.96    3.15
symmetric   DP   84.5*     109       100      22.0   6.74    5.15
            SP   -         -         192      -      8.99    6.38

• In DP, Cell achieves an impressive 6-8x speedup vs. AMD64/Itanium2 and 20x better power efficiency
• Even though Cell's memory bandwidth is only 4x theirs, it achieves higher performance via double buffering; multicore systems will not see a performance increase without improved memory bandwidth
• Cell outperforms the X1E by around 2x and is 5x more power efficient; X1 performance is much more sensitive to the number of nonzeros (which affects vector length)
• PM comes very close to FSS performance for the static implementation, confirming the PM's accuracy; however, FSS is 30% faster due to dynamic partitioning

* Unsymmetric kernel used on a symmetric matrix. FSS = full system simulator, PM = performance model.


SpMV - Future Optimizations

• Auto-tuning over other parallelization strategies, BCSR (better for SIMD, worse for memory traffic), and other storage formats (DIA, JAD, etc.)
• Symmetry (currently only present in the performance model): easier to exploit in single precision and with BCSR; cache blocking limits the benefit (~50%)
• Segmented scan: reduces loop overhead at the expense of nonzero processing time. Good when NZ/row (within a cache block) is small; a single segment (vector length = 1) would be beneficial. Make a runtime decision for each cache block; complicated by the presence of empty rows within a cache block.


Stencils on Structured Grids

Stencil computations represent a wide array of scientific applications: each point in a multidimensional grid is updated from a subset of its neighbors. Such finite-difference operations are used to solve complex numerical systems. We examine the simple heat equation and a 3D hyperbolic PDE. Relatively low computational intensity results in a low percentage of peak on superscalars: the kernel is memory bandwidth bound.

The algorithm requires keeping 4 planes in the local store for best performance: (Z-1,t), (Z,t), (Z+1,t) -> (Z,t+1). The Cell version double buffers the previous output plane with the next input plane: (Z-1,t+1) & (Z+2,t). Otherwise the Cell algorithm is virtually identical to that for traditional architectures; the ISA simply forces explicit memory loads/stores rather than cache misses and evictions.

Parallelization: process one plane at a time, breaking the middle loop up among SPEs (divide each plane 8 ways). This maintains long DMAs (in the unit-stride direction) and double buffering in the Z direction; computational intensity drops slightly in exchange for lower complexity.

SIMDization: permutes are required to pack left & right neighbors into a SIMD register, since neighbor communication is a poor fit for aligned quadword load requirements. Unaligned loads, emulated with permute instructions, are therefore a potential Cell bottleneck:
• The problem can be partially avoided with data padding
• In SP, the permute unit is the second bottleneck after bandwidth, not the FPU

A sketch of the plane-streaming kernel follows.
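A minimal, hedged sketch of one plane update for the 7-point heat-equation stencil; the DMA and SIMD details are elided, and the grid dimensions and coefficient alpha are assumptions:

```c
#define NX 256  /* assumed grid dimensions for illustration */
#define NY 256

/* Produce plane Z at time t+1 from planes Z-1, Z, Z+1 at time t.
 * On Cell the three input planes and the output plane live in the
 * local store; streaming the next input plane and draining the
 * previous output plane are double-buffered DMAs (not shown). */
void sweep_plane(const float *zm1, const float *z0, const float *zp1,
                 float *out, float alpha)
{
    for (int j = 1; j < NY - 1; j++)
        for (int i = 1; i < NX - 1; i++) {
            int c = j * NX + i;
            out[c] = z0[c] + alpha *
                (z0[c - 1] + z0[c + 1] +      /* x neighbors */
                 z0[c - NX] + z0[c + NX] +    /* y neighbors */
                 zm1[c] + zp1[c]              /* z neighbors */
                 - 6.0f * z0[c]);
        }
}
```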


Stencils - Time Skewing

Low computational intensity limits performance to memory bandwidth. Time skewing combines multiple time steps to increase performance: computational intensity rises with almost no additional memory traffic. Note that some numerical methods are not amenable to merging multiple timesteps.


Stencils - Results

Gflop/s   CellFSS        Cell+PM        CellPM   X1E    AMD64   IA64
DP        7.25           21.1 (2-step)  8.2      3.91   0.57    1.19
SP        65.8 (4-step)  -              21.2     3.26   1.07    1.97

Mflop/W   CellFSS        Cell+PM        CellPM   X1E    AMD64   IA64
DP        181            528 (2-step)   205      32.6   6.4     9.15
SP        1645 (4-step)  -              530      27.2   12      15.2

• In DP, Cell is computation bound within a single time step. In SP, 2 timesteps must be combined before it becomes computation bound, as the permute unit quickly saturates (quadword alignment).
• In SP, Cell achieves 6.5x, 11x, 20x the performance of the X1E, Itanium2, and Opteron. With time skewing it reaches 66 Gflop/s: 60x faster and 130x more power efficient than the Opteron! Note the stencil code is highly optimized on the cache-based platforms.
• In DP, Cell achieves 2x, 7x, 14x the performance of the X1E, Itanium2, and Opteron, even though its DP peak is only 1/14th (!) of SP. Recall that Cell prohibits dual issue of DP with loads or permutes, so codes with streaming behavior get one DP SIMD instruction every 8 cycles.
• Unlike on scalar systems, time skewing can at least double performance on Cell.
• Cell's performance demonstrates the value of software-controlled memory for codes with predictable memory accesses.


1D Fast Fourier Transforms

The fast Fourier transform (FFT) is of great importance to a wide variety of applications and is one of the main techniques for solving PDEs. It has relatively low computational intensity with a non-trivial volume of data movement.

1D FFT: a naive algorithm, cooperatively executed across the SPEs. Load the roots of unity, load the data (cyclic layout), then 3 stages: local work, on-chip transpose, local work. No double buffering (i.e., no overlap of communication and computation).

2D FFT (N^2 points): a straightforward algorithm of N simultaneous 1D FFTs followed by a transpose, where each 1D FFT runs on a single SPE (each SPE performs 2 x (N/8) FFTs). Incoming and outgoing rows are double buffered (2 each). The transposes represent about 50% of SP execution time, but only 20% of DP.

Cell performance is compared with the highly optimized FFTW and vendor libraries.


FFT - Results

averaged Gflop/s   Cell+PM   CellPM   X1E    AMD64   IA64
1D  DP             13.4      5.85     4.53   1.61    2.70
    SP             -         33.7     5.30   3.24    1.72
2D  DP             16.2      6.65     7.05   0.69    0.31
    SP             -         38.2     7.93   1.32    0.42

averaged Mflop/W   Cell+PM   CellPM   X1E    AMD64   IA64
1D  DP             335       146      37.8   18.1    20.8
    SP             -         843      44.2   36.4    13.2
2D  DP             405       166      58.8   7.75    2.38
    SP             -         955      66.1   14.8    3.23

• The FSS implementation is in progress.
• In SP, Cell is unparalleled: 91x faster and 300x more power efficient than Itanium2!
• Cell+ offers a significant DP performance advantage (2.5x versus Cell); the Cell+ 2D FFT is 30x faster than Itanium2, 23x faster than AMD64, and >2x faster than the X1E.
• Cell DP performance is approximately equal to the X1E (with a simple Cell implementation).
• Cell's performance underscores the advantage of software-controlled memory: it does not suffer from the associativity issues of cache architectures (powers of 2), behaving instead like a fully associative cache, and it allows communication to be explicitly overlapped with computation.


[Figure: DP performance advantage of Cell and Cell+ over the X1E, AMD64, and IA64 across DGEMM, SpMV(u), SpMV(s), Stencil, 1D FFT, and 2D FFT (y-axis 0-25x; one bar reaches 52.3x).]

[Figure: DP power-efficiency (Mflop/W) advantage of Cell and Cell+ over the X1E, AMD64, and IA64 across the same kernels (y-axis 0-50x; outliers at 52x, 70x, 170x).]

Summary I

• Our work presents the broadest quantitative study of scientific kernels on Cell to date.
• We developed an analytic framework to predict performance and validated its accuracy: kernel times are predictable, as the load time from the LS is constant.
• The results show the tremendous potential of Cell for scientific computing.
• We proposed Cell+, a modest microarchitectural variant designed to improve DP performance: a fully utilizable DP pipeline greatly improves performance and power efficiency.
• Cell's heterogeneous multicore seems more promising than emerging CMP designs: a CMP with 2 cores gives less than 2x performance, compared with 10-20x on Cell, and would need a serious increase in power and memory bandwidth to scale up.
• Of course, we ultimately need to evaluate full application performance.


[Figure: SP performance advantage of Cell over the X1E, AMD64, and IA64 across SGEMM, SpMV(u), SpMV(s), Stencil, 1D FFT, and 2D FFT (y-axis 0-100x).]

[Figure: SP power-efficiency (Mflop/W) advantage of Cell over the X1E, AMD64, and IA64 across the same kernels (y-axis 0-300x).]

Summary II

Cell's three-level, software-controlled memory decouples loads/stores from computation:
• Extremely predictable kernel performance
• Long DMA transfers achieve a high percentage of memory bandwidth (like a fully engaged prefetch)
• Ability to decouple gathers (e.g., streaming nonzeros) with a large number of concurrent memory fetches
• For predictable memory accesses, computation can be overlapped with memory via double buffering
• Future designs would benefit from a larger local store and lower DMA startup cost

Disadvantages:
• Increased programming complexity to specify local memory movement
• Lack of unaligned load support: additional permute instructions are necessary
• The permute pipeline can become a bottleneck (the SP stencil, for example)

Future work: real Cell hardware, more algorithms, and comparisons with modern CMPs.