Message Passing vs. Shared Address Space on a Cluster of SMPs
Leonid Oliker, NERSC/LBNL
www.nersc.gov/~oliker
Hongzhang Shan, Jaswinder Pal Singh
Princeton University
Rupak Biswas
NASA Ames Research Center
Slide 2: Overview
• Scalable computing using clusters of PCs has become an attractive platform for high-end scientific computing
• Currently, MP (message passing) and SAS (shared address space) are the leading programming paradigms
• MPI is more mature and provides performance and portability; however, code development can be very difficult
• SAS provides substantial ease of programming, but performance may suffer due to poor spatial locality and protocol overhead
• We compare the performance of the MP and SAS models using the best implementations available to us (MPI/Pro and GeNIMA SVM)
• We also examine hybrid programming (MPI + SAS)
• Platform: eight 4-way 200 MHz Pentium Pro SMPs (32 processors)
• Applications: regular (LU, OCEAN) and irregular (RADIX, N-BODY)
• We propose and investigate improved collective communication on SMP clusters
Slide 3: Architectural Platform
(Diagram: 32-processor Pentium Pro system built from eight 4-way SMP nodes.)
• 200 MHz processors with 8 KB L1 and 512 KB L2 caches; 512 MB memory per node
• Giganet or Myrinet interconnect through a single crossbar switch
• Network interface with a 33 MHz processor; node-to-network bandwidth constrained by the 133 MB/s PCI bus
Slide 4: Comparison of Programming Models
• Naming for remote data: MPI cannot name remote data directly; SAS names it the same as local variables
• Data replication and coherence: explicit in MPI (both source and destination must participate); implicit in SAS
• Communication: MPI uses explicit send/receive pairs through a communication library; SAS uses ordinary loads and stores (e.g., A1 = A0), as the sketch below shows
(Diagram: in MPI, P0 sends A to P1 through the communication library; in SAS, P1 simply assigns A1 = A0.)
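To make the comparison concrete, here is a minimal C sketch of the A1 = A0 transfer from the diagram in both models; the buffer names follow the slide, while the message tag and element count are illustrative.

    #include <mpi.h>

    /* Message passing: P0 owns A0, P1 keeps its copy in A1.
     * Both sides must participate explicitly. */
    void mp_version(int rank, double *A0, double *A1, int n) {
        if (rank == 0)       /* P0: the source must send */
            MPI_Send(A0, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)  /* P1: the destination must receive */
            MPI_Recv(A1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    /* Shared address space: A0 is named like any local variable;
     * the loads below trigger communication implicitly. */
    void sas_version(const double *A0_shared, double *A1, int n) {
        for (int i = 0; i < n; i++)
            A1[i] = A0_shared[i];  /* remote pages fetched by the SVM layer */
    }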
Slide 5: SAS Programming
• SAS in software: page-based shared virtual memory (SVM)
• We use the GeNIMA protocol built on VMMC over a Myrinet network
• VMMC (Virtual Memory-Mapped Communication): protected, reliable user-level communication with variable-size packets; allows data transfer directly between two virtual memory address spaces
• Single 16-way Myrinet crossbar switch: a high-speed system area network with point-to-point links; each NI connects a node to the network with two unidirectional links of 160 MB/s peak bandwidth
• Question: what is the SVM overhead compared with a hardware-supported cache-coherent system (Origin2000)?
Slide 6: GeNIMA Protocol
• GeNIMA (GEneral-purpose NI support in a shared Memory Abstraction): synchronous home-based lazy release consistency
• Uses the virtual memory management system for page-level coherence
• Most current systems use asynchronous interrupts for both data exchange and protocol handling
• Asynchronous message handling on the network interface (NI) eliminates the need to interrupt the receiving host processor
• Uses general-purpose NI mechanisms to move data between the network and user-level memory, and for mutual exclusion
• Protocol handling runs on the host processor at "synchronous" points, i.e., when a process is sending or receiving messages
• Processes can modify local page copies until synchronization (see the sketch after this slide)
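As a rough illustration of the page-coherence work this implies, here is a minimal, self-contained C sketch of the twin/diff technique used by home-based LRC protocols. The page size and function names are illustrative assumptions, not GeNIMA's actual interface.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096  /* illustrative; real protocol uses VM page size */

    /* On the first write after an acquire, keep a "twin" (pristine copy)
     * of the page so the protocol can later tell what changed. */
    unsigned char *make_twin(const unsigned char *page) {
        unsigned char *twin = malloc(PAGE_SIZE);
        memcpy(twin, page, PAGE_SIZE);
        return twin;
    }

    /* At a synchronization (release) point, compare page against twin and
     * emit a diff of (offset, new byte) pairs; only the diff is shipped to
     * the home node. diff_buf must hold PAGE_SIZE * (sizeof(size_t) + 1)
     * bytes in the worst case. */
    size_t compute_diff(const unsigned char *page, const unsigned char *twin,
                        unsigned char *diff_buf) {
        size_t out = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            if (page[i] != twin[i]) {
                memcpy(diff_buf + out, &i, sizeof i);  /* changed offset */
                out += sizeof i;
                diff_buf[out++] = page[i];             /* new value */
            }
        }
        return out;  /* bytes to send to the home node */
    }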
Slide 7: MP Programming
• We use MPI/Pro, implemented over the VIA interface on Giganet
• VIA (Virtual Interface Architecture): an industry-standard interface for system area networks; protected, zero-copy, user-space inter-process communication
• Giganet NIs (like Myrinet) use a single crossbar switch
• VIA and VMMC have similar communication overhead
Slide 8: Regular Applications: LU and OCEAN
LU factorization: factors a matrix into lower and upper triangular matrices
• Lowest communication requirements among our benchmarks
• One-to-many, non-personalized communication
• In SAS, each process directly fetches the pivot block; in MPI, the block owner sends the pivot block to the other processes (see the sketch after this slide)
OCEAN: models large-scale eddy and boundary currents
• Nearest-neighbor communication patterns in a multigrid formulation
• Red-black Gauss-Seidel multigrid equation solver
• High communication-to-computation ratio
• Partitioning by rows instead of by blocks (fewer but larger messages) increased speedup from 14.1 to 15.2 (on 32 processors)
• MP and SAS partition the subgrids in the same way, but MPI involves more programming
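The LU pivot-block exchange mentioned above might look like the following C sketch, assuming MPI's one-to-many step is expressed as a broadcast rooted at the block owner (the slide does not specify the exact MPI calls used).

    #include <mpi.h>

    /* MPI: the owner of the pivot block pushes it to everyone else;
     * a broadcast is the usual one-to-many, non-personalized pattern. */
    void share_pivot_mpi(double *pivot_block, int block_elems,
                         int owner, MPI_Comm comm) {
        MPI_Bcast(pivot_block, block_elems, MPI_DOUBLE, owner, comm);
    }

    /* SAS: consumers simply read the shared pivot block when they need
     * it; the SVM layer fetches the pages on demand. */
    void use_pivot_sas(const double *shared_pivot, double *local_copy,
                       int block_elems) {
        for (int i = 0; i < block_elems; i++)
            local_copy[i] = shared_pivot[i];
    }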
Slide 9: Irregular Applications: RADIX and N-BODY
RADIX sorting: iterative sorting based on histograms
• Local histograms are combined into a global histogram, which is then used to permute the keys
• Irregular all-to-all communication
• Large communication-to-computation ratio and high memory bandwidth requirement (can exceed the capacity of a PC-SMP)
• SAS uses a global binary prefix tree to collect the local histograms; MPI uses Allgather instead of fine-grained communication (see the sketch after this slide)
N-BODY: simulates body interactions (galaxies, particles, etc.)
• 3D Barnes-Hut hierarchical octree method
• Most complex code; highly irregular, fine-grained communication
• Computes forces on particles, then updates their positions
• Significantly different MPI and SAS tree-building algorithms
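Returning to RADIX: the Allgather step mentioned above might look like this MPI sketch, where the bucket count and the surrounding prefix-sum step are illustrative assumptions.

    #include <mpi.h>

    #define BUCKETS 256  /* e.g., one 8-bit radix digit per pass (illustrative) */

    /* Each process counts its local keys, then all processes exchange
     * histograms in one collective instead of fine-grained messages. */
    void build_global_histogram(const int *local_hist, int *all_hists,
                                MPI_Comm comm) {
        /* all_hists must hold nprocs * BUCKETS ints */
        MPI_Allgather(local_hist, BUCKETS, MPI_INT,
                      all_hists, BUCKETS, MPI_INT, comm);
        /* each rank can now prefix-sum all_hists to find where its keys
         * land in the globally permuted order */
    }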
Slide 10: N-BODY Implementation Differences
(Diagram: in SAS the octree is globally shared; in MPI, cells and particles must be explicitly distributed and collected among processes.)
Slide 11: Improving the N-BODY SAS Implementation
(Diagram: SAS shared tree with high-level cells duplicated on each node.)
• Duplicate the high-level cells
• The algorithm becomes much more like message passing
• Replication is not a "natural" programming style for SAS
Slide 12: Performance of LU
• Communication requirements are small compared to our other applications
• SAS and MPI have similar performance characteristics
• The protocol overhead of the SAS version is a small fraction of overall time (speedups on 32 procs: SAS = 21.78, MPI = 22.43)
• For applications with low communication requirements, it is possible to achieve high scalability on PC clusters using both MPI and SAS
(Chart: execution time in seconds, broken into LOCAL, RMEM, and SYNC components, for SAS vs. MPI; 6144 x 6144 matrix on 32 processors.)
Slide 13: Performance of OCEAN
• SAS performance is significantly worse than MPI (speedups on 32 procs: SAS = 6.49, MPI = 15.20)
• SAS suffers from expensive synchronization overhead: after each nearest-neighbor communication, a barrier synchronization is required
• 50% of the synchronization overhead is spent waiting; the rest is protocol processing
• Synchronization cost in MPI is much lower because it is implicit in the matching send/receive pairs (see the sketch after this slide)
(Chart: execution time in seconds (LOCAL, RMEM, SYNC) for SAS vs. MPI; 514 x 514 grid on 32 processors.)
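A short C sketch of why MPI synchronization is cheap here: a matched send/receive both moves a ghost row and orders the two neighbors, with no global barrier. Neighbor ranks and row length are illustrative.

    #include <mpi.h>

    /* Exchange boundary rows with up/down neighbors; the matching
     * send/receive pairs give pairwise ordering "for free" -- no barrier.
     * Pass MPI_PROC_NULL for a missing neighbor at the grid edge. */
    void exchange_ghost_rows(double *top, double *bottom, int row_len,
                             int up, int down, MPI_Comm comm) {
        MPI_Sendrecv(top,    row_len, MPI_DOUBLE, up,   0,
                     bottom, row_len, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* In the SVM version, each process stores into shared boundary
         * rows and then ALL processes must reach a barrier before anyone
         * reads -- which is where OCEAN's synchronization time goes. */
    }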
Slide 14: Performance of RADIX
• MPI performance is more than three times better than SAS (speedups on 32 procs: SAS = 2.07, MPI = 7.78)
• Poor SAS speedup is due to memory bandwidth contention
• Once again, SAS suffers from the high protocol overhead of maintaining page coherence: computing diffs, creating timestamps, generating write notices, and garbage collection
(Chart: execution time in seconds (LOCAL, RMEM, SYNC) for SAS vs. MPI; 32M integers on 32 processors.)
Slide 15: Performance of N-BODY
• SAS performance is about half that of MPI (speedups on 32 procs: SAS = 14.30, MPI = 26.94)
• Synchronization overhead dominates the SAS runtime
• 82% of barrier time is spent on protocol handling
• If very high performance is the goal, message passing is necessary for commodity SMP clusters
(Chart: execution time in seconds (LOCAL, RMEM, SYNC) for SAS vs. MPI; 128K particles on 32 processors.)
Slide 16: Origin2000 (Hardware Cache Coherence)
(Diagram: node architecture with two R12K processors, each with an L2 cache, connected through a Hub to memory and a directory (extra directory for >32P); communication architecture connects Hubs through routers.)
• Previous results showed that, on a hardware-supported cache-coherent multiprocessor platform, SAS achieved performance comparable to MPI for this set of applications
Slide 17: Hybrid Performance on PC Cluster
• The latest teraflop-scale systems contain large numbers of SMPs; a novel paradigm combines two layers of parallelism
• Allows codes to benefit from loop-level parallelism and shared-memory algorithms in addition to coarse-grained parallelism (see the sketch after this slide)
• Tradeoff: SAS may reduce intra-SMP communication, but it may incur additional overhead for explicit synchronization
• Complexity example: hybrid N-BODY requires two kinds of tree building: a distributed local tree for MPI and a globally shared tree for SAS
• The hybrid performance gain (11% at most) does not compensate for the increased programming complexity
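A minimal C sketch of the two-level structure, with POSIX threads standing in for the intra-node SAS layer (the talk's hybrid uses SVM within a node; the thread count matching the 4-way SMPs is the only number taken from the slides).

    #include <mpi.h>
    #include <pthread.h>

    #define THREADS_PER_NODE 4  /* matches the 4-way SMP nodes */

    static void *intra_node_work(void *arg) {
        /* loop-level / shared-memory parallelism inside one SMP */
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided;
        /* coarse-grained layer: one MPI rank per node; threads never
         * call MPI here, so FUNNELED support suffices */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        pthread_t t[THREADS_PER_NODE];
        for (int i = 0; i < THREADS_PER_NODE; i++)
            pthread_create(&t[i], NULL, intra_node_work, NULL);
        for (int i = 0; i < THREADS_PER_NODE; i++)
            pthread_join(t[i], NULL);

        /* inter-node communication (sends/receives, collectives) happens
         * between these thread-parallel phases */
        MPI_Finalize();
        return 0;
    }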
Slide 18: MPI Collective Function: MPI_Allreduce
• How can collective communication be better structured on PC-SMP clusters?
• We explore algorithms for MPI_Allreduce and MPI_Allgather
• The MPI/Pro version is labeled "Original" (its exact algorithms are undocumented)
• For MPI_Allreduce, the structure of our 4-way SMPs motivates us to modify the deepest level of the binary tree (B-Tree) into a quadtree (B-Tree-4); see the sketch after this slide
• Using SAS or MPI communication at the lowest level makes no difference

Execution time (in secs) on 32 procs for one double-precision variable:
  Original   1117
  B-Tree     1035
  B-Tree-4    981
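Since MPI/Pro's internals are undocumented, the following is only a hedged sketch of the B-Tree-4 idea, expressed with sub-communicators rather than an explicit tree: a 4-way reduction inside each SMP (the quadtree leaves), a tree allreduce among the node leaders, then a 4-way fan-out. It assumes the 4 processes of a node have contiguous ranks.

    #include <mpi.h>

    void btree4_allreduce(double *val, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm node, leaders;
        MPI_Comm_split(comm, rank / 4, rank, &node);   /* 4 procs per node */
        MPI_Comm_split(comm, rank % 4 == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leaders);                /* one leader per node */

        double sum = 0.0;
        /* quadtree leaf: 4-way reduction to the node leader */
        MPI_Reduce(val, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, node);
        /* binary-tree allreduce among the (eight) node leaders */
        if (leaders != MPI_COMM_NULL)
            MPI_Allreduce(MPI_IN_PLACE, &sum, 1, MPI_DOUBLE, MPI_SUM, leaders);
        /* fan the result back out within each node */
        MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, node);
        *val = sum;

        MPI_Comm_free(&node);
        if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    }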
Slide 19: MPI Collective Function: MPI_Allgather
• Several algorithms were explored: initially B-Tree and B-Tree-4
• B-Tree-4*: after a processor at level 0 collects the data, it sends it to level 1 and below; however, level 1 already contains the data from its own subtree
• It is therefore redundant to broadcast ALL the data back; only the necessary data needs to be exchanged (this can be extended down to the lowest level of the tree, bounded by the size of the SMP); see the sketch after this slide
• The improved communication functions yield up to a 9% performance gain (most time is spent in the send/receive functions)
(Chart: time in seconds for P = 32, 8 nodes.)
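A hedged sketch of the exchange-only-what-is-missing idea behind B-Tree-4*, again using sub-communicators instead of the authors' explicit tree. After the intra-node step, every process already holds its own node's block, so the node leader redistributes only the blocks that came from other nodes; the 4-process node size matches the platform, everything else is illustrative.

    #include <mpi.h>

    void smp_allgather(const double *mine, int n, double *all,
                       int nprocs, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm node, leaders;
        MPI_Comm_split(comm, rank / 4, rank, &node);   /* 4-way SMP nodes */
        MPI_Comm_split(comm, rank % 4 == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leaders);

        int blk = 4 * n, before = (rank / 4) * blk;
        /* every node rank gets its whole node block */
        MPI_Allgather(mine, n, MPI_DOUBLE, all + before, n, MPI_DOUBLE, node);
        /* node leaders swap whole node-sized blocks, in place */
        if (leaders != MPI_COMM_NULL)
            MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                          all, blk, MPI_DOUBLE, leaders);
        /* redistribute ONLY the remote blocks: skip [before, before+blk) */
        MPI_Bcast(all, before, MPI_DOUBLE, 0, node);
        MPI_Bcast(all + before + blk, nprocs * n - before - blk,
                  MPI_DOUBLE, 0, node);

        MPI_Comm_free(&node);
        if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    }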
Slide 20: Conclusions
• Examined the performance of several regular and irregular applications using MP (MPI/Pro over VIA on Giganet) and SAS (GeNIMA over VMMC on Myrinet) on a 32-processor PC-SMP cluster
• SAS provides substantial ease of programming, especially for more complex codes that are irregular and dynamic
• Unlike previous research on hardware-supported CC-SAS machines, SAS achieved only about half the parallel efficiency of MPI for most of our applications (LU was an exception, where performance was similar)
• The high overhead of SAS is due to the excessive cost of the SVM protocol for maintaining page coherence and implementing synchronization
• Hybrid codes offered no significant performance advantage over pure MPI, but increased programming complexity and reduced portability
• Presented new algorithms for improved SMP communication functions
• If very high performance is the goal, the difficulty of MPI programming appears to be necessary on commodity SMP clusters of today