Message Passing vs. Shared Address Space on a Cluster of SMPs
Leonid Oliker, NERSC/LBNL
www.nersc.gov/~oliker
Hongzhang Shan, Jaswinder Pal Singh
Princeton University
Rupak Biswas
NASA Ames Research Center
Slide 2: Overview
• Scalable computing using clusters of PCs has become an attractive platform for high-end scientific computing
• Currently, MP (message passing) and SAS (shared address space) are the leading programming paradigms
• MPI is more mature and provides performance and portability; however, code development can be very difficult
• SAS provides substantial ease of programming, but performance may suffer due to poor spatial locality and protocol overhead
• We compare the performance of the MP and SAS models using the best implementations available to us (MPI/Pro and GeNIMA SVM)
• We also examine hybrid programming (MPI + SAS)
• Platform: eight 4-way 200 MHz Pentium Pro SMPs (32 processors)
• Applications: regular (LU, OCEAN) and irregular (RADIX, N-BODY)
• We propose and investigate improved collective communication on SMP clusters
Slide 3: Architectural Platform
(Diagram: 32-processor Pentium Pro system built from eight 4-way SMP nodes.)
• 200 MHz processors with 8 KB L1 and 512 KB L2 caches; 512 MB memory per node
• Giganet or Myrinet interconnect through a single crossbar switch
• Network interface with a 33 MHz processor; node-to-network bandwidth constrained by the 133 MB/s PCI bus
Slide 4: Comparison of Programming Models
• Naming for remote data: MPI cannot name remote data directly; SAS names it the same as local variables
• Data replication and coherence: explicit in MPI (both source and destination must participate); implicit in SAS
• Communication: MPI uses explicit send/receive pairs through a communication library; SAS uses ordinary loads and stores (e.g., A1 = A0), as the sketch below shows
(Diagram: in MPI, P0 sends A to P1 through the communication library; in SAS, P1 simply assigns A1 = A0.)
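To make the comparison concrete, here is a minimal C sketch of the A1 = A0 transfer from the diagram in both models; the buffer names follow the slide, while the message tag and element count are illustrative.

    #include <mpi.h>

    /* Message passing: P0 owns A0, P1 keeps its copy in A1.
     * Both sides must participate explicitly. */
    void mp_version(int rank, double *A0, double *A1, int n) {
        if (rank == 0)       /* P0: the source must send */
            MPI_Send(A0, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)  /* P1: the destination must receive */
            MPI_Recv(A1, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    /* Shared address space: A0 is named like any local variable;
     * the loads below trigger communication implicitly. */
    void sas_version(const double *A0_shared, double *A1, int n) {
        for (int i = 0; i < n; i++)
            A1[i] = A0_shared[i];  /* remote pages fetched by the SVM layer */
    }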
Slide 5: SAS Programming
• SAS in software: page-based shared virtual memory (SVM)
• We use the GeNIMA protocol built on VMMC over a Myrinet network
• VMMC (Virtual Memory-Mapped Communication): protected, reliable user-level communication with variable-size packets; allows data transfer directly between two virtual memory address spaces
• Single 16-way Myrinet crossbar switch: a high-speed system area network with point-to-point links; each NI connects a node to the network with two unidirectional links of 160 MB/s peak bandwidth
• Question: what is the SVM overhead compared with a hardware-supported cache-coherent system (Origin2000)?
Slide 6: GeNIMA Protocol
• GeNIMA (GEneral-purpose NI support in a shared Memory Abstraction): synchronous home-based lazy release consistency
• Uses the virtual memory management system for page-level coherence
• Most current systems use asynchronous interrupts for both data exchange and protocol handling
• Asynchronous message handling on the network interface (NI) eliminates the need to interrupt the receiving host processor
• Uses general-purpose NI mechanisms to move data between the network and user-level memory, and for mutual exclusion
• Protocol handling runs on the host processor at "synchronous" points, i.e., when a process is sending or receiving messages
• Processes can modify local page copies until synchronization (see the sketch after this slide)
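As a rough illustration of the page-coherence work this implies, here is a minimal, self-contained C sketch of the twin/diff technique used by home-based LRC protocols. The page size and function names are illustrative assumptions, not GeNIMA's actual interface.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096  /* illustrative; real protocol uses VM page size */

    /* On the first write after an acquire, keep a "twin" (pristine copy)
     * of the page so the protocol can later tell what changed. */
    unsigned char *make_twin(const unsigned char *page) {
        unsigned char *twin = malloc(PAGE_SIZE);
        memcpy(twin, page, PAGE_SIZE);
        return twin;
    }

    /* At a synchronization (release) point, compare page against twin and
     * emit a diff of (offset, new byte) pairs; only the diff is shipped to
     * the home node. diff_buf must hold PAGE_SIZE * (sizeof(size_t) + 1)
     * bytes in the worst case. */
    size_t compute_diff(const unsigned char *page, const unsigned char *twin,
                        unsigned char *diff_buf) {
        size_t out = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            if (page[i] != twin[i]) {
                memcpy(diff_buf + out, &i, sizeof i);  /* changed offset */
                out += sizeof i;
                diff_buf[out++] = page[i];             /* new value */
            }
        }
        return out;  /* bytes to send to the home node */
    }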
Slide 7: MP Programming
• We use MPI/Pro, implemented over the VIA interface on Giganet
• VIA (Virtual Interface Architecture): an industry-standard interface for system area networks; protected, zero-copy, user-space inter-process communication
• Giganet NIs (like Myrinet) use a single crossbar switch
• VIA and VMMC have similar communication overhead
Slide 8: Regular Applications: LU and OCEAN
LU factorization: factors a matrix into lower and upper triangular matrices
• Lowest communication requirements among our benchmarks
• One-to-many, non-personalized communication
• In SAS, each process directly fetches the pivot block; in MPI, the block owner sends the pivot block to the other processes (see the sketch after this slide)
OCEAN: models large-scale eddy and boundary currents
• Nearest-neighbor communication patterns in a multigrid formulation
• Red-black Gauss-Seidel multigrid equation solver
• High communication-to-computation ratio
• Partitioning by rows instead of by blocks (fewer but larger messages) increased speedup from 14.1 to 15.2 (on 32 processors)
• MP and SAS partition the subgrids in the same way, but MPI involves more programming
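The LU pivot-block exchange mentioned above might look like the following C sketch, assuming MPI's one-to-many step is expressed as a broadcast rooted at the block owner (the slide does not specify the exact MPI calls used).

    #include <mpi.h>

    /* MPI: the owner of the pivot block pushes it to everyone else;
     * a broadcast is the usual one-to-many, non-personalized pattern. */
    void share_pivot_mpi(double *pivot_block, int block_elems,
                         int owner, MPI_Comm comm) {
        MPI_Bcast(pivot_block, block_elems, MPI_DOUBLE, owner, comm);
    }

    /* SAS: consumers simply read the shared pivot block when they need
     * it; the SVM layer fetches the pages on demand. */
    void use_pivot_sas(const double *shared_pivot, double *local_copy,
                       int block_elems) {
        for (int i = 0; i < block_elems; i++)
            local_copy[i] = shared_pivot[i];
    }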
Slide 9: Irregular Applications: RADIX and N-BODY
RADIX sorting: iterative sorting based on histograms
• Local histograms are combined into a global histogram, which is then used to permute the keys
• Irregular all-to-all communication
• Large communication-to-computation ratio and high memory bandwidth requirement (can exceed the capacity of a PC-SMP)
• SAS uses a global binary prefix tree to collect the local histograms; MPI uses Allgather instead of fine-grained communication (see the sketch after this slide)
N-BODY: simulates body interactions (galaxies, particles, etc.)
• 3D Barnes-Hut hierarchical octree method
• Most complex code; highly irregular, fine-grained communication
• Computes forces on particles, then updates their positions
• Significantly different MPI and SAS tree-building algorithms
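Returning to RADIX: the Allgather step mentioned above might look like this MPI sketch, where the bucket count and the surrounding prefix-sum step are illustrative assumptions.

    #include <mpi.h>

    #define BUCKETS 256  /* e.g., one 8-bit radix digit per pass (illustrative) */

    /* Each process counts its local keys, then all processes exchange
     * histograms in one collective instead of fine-grained messages. */
    void build_global_histogram(const int *local_hist, int *all_hists,
                                MPI_Comm comm) {
        /* all_hists must hold nprocs * BUCKETS ints */
        MPI_Allgather(local_hist, BUCKETS, MPI_INT,
                      all_hists, BUCKETS, MPI_INT, comm);
        /* each rank can now prefix-sum all_hists to find where its keys
         * land in the globally permuted order */
    }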
Slide 10: N-BODY Implementation Differences
(Diagram: in SAS the octree is globally shared; in MPI, cells and particles must be explicitly distributed and collected among processes.)
Slide 11: Improving the N-BODY SAS Implementation
(Diagram: SAS shared tree with high-level cells duplicated on each node.)
• Duplicate the high-level cells
• The algorithm becomes much more like message passing
• Replication is not a "natural" programming style for SAS
Slide 12: Performance of LU
• Communication requirements are small compared to our other applications
• SAS and MPI have similar performance characteristics
• The protocol overhead of the SAS version is a small fraction of overall time (speedups on 32 procs: SAS = 21.78, MPI = 22.43)
• For applications with low communication requirements, it is possible to achieve high scalability on PC clusters using both MPI and SAS
(Chart: execution time in seconds, broken into LOCAL, RMEM, and SYNC components, for SAS vs. MPI; 6144 x 6144 matrix on 32 processors.)
Slide 13: Performance of OCEAN
• SAS performance is significantly worse than MPI (speedups on 32 procs: SAS = 6.49, MPI = 15.20)
• SAS suffers from expensive synchronization overhead: after each nearest-neighbor communication, a barrier synchronization is required
• 50% of the synchronization overhead is spent waiting; the rest is protocol processing
• Synchronization cost in MPI is much lower because it is implicit in the matching send/receive pairs (see the sketch after this slide)
(Chart: execution time in seconds (LOCAL, RMEM, SYNC) for SAS vs. MPI; 514 x 514 grid on 32 processors.)
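A short C sketch of why MPI synchronization is cheap here: a matched send/receive both moves a ghost row and orders the two neighbors, with no global barrier. Neighbor ranks and row length are illustrative.

    #include <mpi.h>

    /* Exchange boundary rows with up/down neighbors; the matching
     * send/receive pairs give pairwise ordering "for free" -- no barrier.
     * Pass MPI_PROC_NULL for a missing neighbor at the grid edge. */
    void exchange_ghost_rows(double *top, double *bottom, int row_len,
                             int up, int down, MPI_Comm comm) {
        MPI_Sendrecv(top,    row_len, MPI_DOUBLE, up,   0,
                     bottom, row_len, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* In the SVM version, each process stores into shared boundary
         * rows and then ALL processes must reach a barrier before anyone
         * reads -- which is where OCEAN's synchronization time goes. */
    }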
Slide 14: Performance of RADIX
• MPI performance is more than three times better than SAS (speedups on 32 procs: SAS = 2.07, MPI = 7.78)
• Poor SAS speedup is due to memory bandwidth contention
• Once again, SAS suffers from the high protocol overhead of maintaining page coherence: computing diffs, creating timestamps, generating write notices, and garbage collection
(Chart: execution time in seconds (LOCAL, RMEM, SYNC) for SAS vs. MPI; 32M integers on 32 processors.)
Slide 15: Performance of N-BODY
• SAS performance is about half that of MPI (speedups on 32 procs: SAS = 14.30, MPI = 26.94)
• Synchronization overhead dominates the SAS runtime
• 82% of barrier time is spent on protocol handling
• If very high performance is the goal, message passing is necessary for commodity SMP clusters
(Chart: execution time in seconds (LOCAL, RMEM, SYNC) for SAS vs. MPI; 128K particles on 32 processors.)
Slide 16: Origin2000 (Hardware Cache Coherence)
(Diagram: node architecture with two R12K processors, each with an L2 cache, connected through a Hub to memory and a directory (extra directory for >32P); communication architecture connects Hubs through routers.)
• Previous results showed that, on a hardware-supported cache-coherent multiprocessor platform, SAS achieved performance comparable to MPI for this set of applications
Slide 17: Hybrid Performance on PC Cluster
• The latest teraflop-scale systems contain large numbers of SMPs; a novel paradigm combines two layers of parallelism
• Allows codes to benefit from loop-level parallelism and shared-memory algorithms in addition to coarse-grained parallelism (see the sketch after this slide)
• Tradeoff: SAS may reduce intra-SMP communication, but it may incur additional overhead for explicit synchronization
• Complexity example: hybrid N-BODY requires two kinds of tree building: a distributed local tree for MPI and a globally shared tree for SAS
• The hybrid performance gain (11% at most) does not compensate for the increased programming complexity
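A minimal C sketch of the two-level structure, with POSIX threads standing in for the intra-node SAS layer (the talk's hybrid uses SVM within a node; the thread count matching the 4-way SMPs is the only number taken from the slides).

    #include <mpi.h>
    #include <pthread.h>

    #define THREADS_PER_NODE 4  /* matches the 4-way SMP nodes */

    static void *intra_node_work(void *arg) {
        /* loop-level / shared-memory parallelism inside one SMP */
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided;
        /* coarse-grained layer: one MPI rank per node; threads never
         * call MPI here, so FUNNELED support suffices */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        pthread_t t[THREADS_PER_NODE];
        for (int i = 0; i < THREADS_PER_NODE; i++)
            pthread_create(&t[i], NULL, intra_node_work, NULL);
        for (int i = 0; i < THREADS_PER_NODE; i++)
            pthread_join(t[i], NULL);

        /* inter-node communication (sends/receives, collectives) happens
         * between these thread-parallel phases */
        MPI_Finalize();
        return 0;
    }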
Slide 18: MPI Collective Function: MPI_Allreduce
• How can collective communication be better structured on PC-SMP clusters?
• We explore algorithms for MPI_Allreduce and MPI_Allgather
• The MPI/Pro version is labeled "Original" (its exact algorithms are undocumented)
• For MPI_Allreduce, the structure of our 4-way SMPs motivates us to modify the deepest level of the binary tree (B-Tree) into a quadtree (B-Tree-4); see the sketch after this slide
• Using SAS or MPI communication at the lowest level makes no difference

Execution time (in secs) on 32 procs for one double-precision variable:
  Original   1117
  B-Tree     1035
  B-Tree-4    981
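Since MPI/Pro's internals are undocumented, the following is only a hedged sketch of the B-Tree-4 idea, expressed with sub-communicators rather than an explicit tree: a 4-way reduction inside each SMP (the quadtree leaves), a tree allreduce among the node leaders, then a 4-way fan-out. It assumes the 4 processes of a node have contiguous ranks.

    #include <mpi.h>

    void btree4_allreduce(double *val, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm node, leaders;
        MPI_Comm_split(comm, rank / 4, rank, &node);   /* 4 procs per node */
        MPI_Comm_split(comm, rank % 4 == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leaders);                /* one leader per node */

        double sum = 0.0;
        /* quadtree leaf: 4-way reduction to the node leader */
        MPI_Reduce(val, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, node);
        /* binary-tree allreduce among the (eight) node leaders */
        if (leaders != MPI_COMM_NULL)
            MPI_Allreduce(MPI_IN_PLACE, &sum, 1, MPI_DOUBLE, MPI_SUM, leaders);
        /* fan the result back out within each node */
        MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, node);
        *val = sum;

        MPI_Comm_free(&node);
        if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    }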
Slide 19: MPI Collective Function: MPI_Allgather
• Several algorithms were explored: initially B-Tree and B-Tree-4
• B-Tree-4*: after a processor at level 0 collects the data, it sends it to level 1 and below; however, level 1 already contains the data from its own subtree
• It is therefore redundant to broadcast ALL the data back; only the necessary data needs to be exchanged (this can be extended down to the lowest level of the tree, bounded by the size of the SMP); see the sketch after this slide
• The improved communication functions yield up to a 9% performance gain (most time is spent in the send/receive functions)
(Chart: time in seconds for P = 32, 8 nodes.)
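A hedged sketch of the exchange-only-what-is-missing idea behind B-Tree-4*, again using sub-communicators instead of the authors' explicit tree. After the intra-node step, every process already holds its own node's block, so the node leader redistributes only the blocks that came from other nodes; the 4-process node size matches the platform, everything else is illustrative.

    #include <mpi.h>

    void smp_allgather(const double *mine, int n, double *all,
                       int nprocs, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm node, leaders;
        MPI_Comm_split(comm, rank / 4, rank, &node);   /* 4-way SMP nodes */
        MPI_Comm_split(comm, rank % 4 == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leaders);

        int blk = 4 * n, before = (rank / 4) * blk;
        /* every node rank gets its whole node block */
        MPI_Allgather(mine, n, MPI_DOUBLE, all + before, n, MPI_DOUBLE, node);
        /* node leaders swap whole node-sized blocks, in place */
        if (leaders != MPI_COMM_NULL)
            MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                          all, blk, MPI_DOUBLE, leaders);
        /* redistribute ONLY the remote blocks: skip [before, before+blk) */
        MPI_Bcast(all, before, MPI_DOUBLE, 0, node);
        MPI_Bcast(all + before + blk, nprocs * n - before - blk,
                  MPI_DOUBLE, 0, node);

        MPI_Comm_free(&node);
        if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    }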
Slide 20: Conclusions
• Examined the performance of several regular and irregular applications using MP (MPI/Pro over VIA on Giganet) and SAS (GeNIMA over VMMC on Myrinet) on a 32-processor PC-SMP cluster
• SAS provides substantial ease of programming, especially for more complex codes that are irregular and dynamic
• Unlike previous research on hardware-supported CC-SAS machines, SAS achieved only about half the parallel efficiency of MPI for most of our applications (LU was an exception, where performance was similar)
• The high overhead of SAS is due to the excessive cost of the SVM protocol for maintaining page coherence and implementing synchronization
• Hybrid codes offered no significant performance advantage over pure MPI, but increased programming complexity and reduced portability
• Presented new algorithms for improved SMP communication functions
• If very high performance is the goal, the difficulty of MPI programming appears to be necessary on commodity SMP clusters of today