Parallel Programming Languages and Accelerations
TRANSCRIPT
© 2012 MELLANOX TECHNOLOGIES - MELLANOX CONFIDENTIAL - 1
Parallel Programming Languages and Accelerations
Mellanox ScalableHPC
Offer high-performing and scalable parallel programming libraries for HPC
Support a comprehensive set of MPIs and PGAS languages
• Integration of Mellanox acceleration technology into a broad list of languages
• Provide our own language library package when there is no open source alternative
Integrate Mellanox acceleration components into MPIs/PGAS languages
• MXM – MellanoX Messaging Accelerator
• FCA – Mellanox Fabric Collective Accelerator
The I/O Bottleneck Paradigm
[Figure: two software stacks — Application, Communication Libraries, Network, Server/Storage — with the communication libraries marked as the bottleneck. The libraries must deliver the highest throughput, the lowest latency and message rate, low CPU overhead, and hardware accelerations.]
The I/O Bottleneck Paradigm – Scaling Issues
[Figure: at scale a second bottleneck appears, so both the communication libraries and the network become bottlenecks.]
The I/O Bottleneck Paradigm – Co-Design Architecture
[Figure: a co-design architecture extends I/O communications (RDMA, collectives, synchronization, etc.) across the stack to relieve both bottlenecks.]
MPI/SHMEM/PGAS Architecture
[Figure: software stack — Application, MPI/SHMEM/PGAS, InfiniBand Verbs, InfiniBand Network.]
Mellanox ScalableHPC Architecture
[Figure: the same stack with Mellanox ScalableHPC between MPI/SHMEM/PGAS and InfiniBand Verbs — Mellanox Collectives for collectives accelerations (FCA with CORE-Direct) and Mellanox Messaging (MXM) for one-sided/two-sided communication and intra-node shared memory — over an InfiniBand network with hardware offloading.]
Mellanox Fabric Collective Accelerator (FCA)
High-performing and scalable accelerations for collective operations
• Topology-aware collectives take advantage of optimized message coalescing
• Makes use of the network's powerful multicast capabilities for one-to-many communications
• Runs collectives on a separate service level so they do not interfere with other communications
• Utilizes Mellanox CORE-Direct collective hardware offload to minimize system noise
MellanoX Messaging (MXM)
High performance and scalability for send/receive (or put/get) messages
• Proper management of HCA resources and memory structures
• Optimized intra-node communication
• Hybrid transport technology for large-scale deployments
• Efficient memory registration
• Connection management
• Receive-side tag matching
• Fully utilizes hardware offloads and capabilities
• Incorporated in MLNX_OFED-1.5.3-300 and later
  - Also provided as a stand-alone package
ScalableHPC Communication Libraries
HPC Parallel Programming Models
MPI - Message Passing Interface
• Based on send/receive and collective communication semantics
SHMEM - Shared Memory
• Provides a logically shared memory model and one-sided put/get communications
PGAS - Partitioned Global Address Space
• Message passing abstracted into a partitioned global address space
• UPC (Unified Parallel C) is one example of a PGAS language
[Figure: three memory models —
Distributed Memory Model (message passing): P1, P2, P3 each with private memory; MPI (Message Passing Interface), SHMEM.
Shared Memory Model: P1, P2, P3 sharing one memory; SHMEM, DSM.
Partitioned Global Address Space (PGAS): P1, P2, P3 with private memories presented as one logically shared memory; Global Arrays, UPC, Chapel, CAF, …]
SHMEM Details
SHared MEMory library
• A library of functions somewhat similar to MPI (e.g. shmem_get())
• …but SHMEM supports one-sided communication (puts/gets vs. MPI's send/receive)
SHMEM and PGAS both allow a unique combination of a 'Distributed Memory Model' (like MPI) and a 'Shared Memory Model' (like SMP machines)
Cray first introduced SHMEM in 1993
The OpenSHMEM consortium was formed to consolidate the various SHMEM versions into a widely accepted standard
Mellanox ScalableSHMEM is based on the OpenSHMEM 1.0 specification with FCA/MXM integration
UPC Details
UPC, or 'Unified Parallel C', is another PGAS language
• Higher-level abstraction than MPI or SHMEM
• Allows programmers to directly represent and manipulate distributed data structures
Commercial compilers are available for Cray, SGI and HP machines
An open-source compiler from LBNL/UCB (Berkeley UPC) is available on InfiniBand
Mellanox ScalableUPC is based on BUPC with FCA/MXM integration
FCA Details
What are Collective Operations?
Collective operations are group communications involving all processes in a job
Synchronous operations
• By nature consume many 'wait' cycles on large clusters
Popular examples
• Barrier
• Reduce
• Allreduce
• Gather
• Allgather
• Bcast
Collective Operation Challenges at Large Scale
Collective algorithms are not topology-aware and can be inefficient
Congestion due to many-to-many communications
Slow nodes and OS jitter affect scalability and increase variability
[Figure: ideal vs. actual collective performance diverges as the system grows.]
Mellanox Fabric Collective Accelerator (FCA)
[Figure: FCA architecture —
Mellanox InfiniBand switches: high-performance IB multicast for result distribution;
FCA Manager: topology-based collective tree, separate virtual network, IB multicast for result distribution;
FCA Agents: library integrated with MPI, intra-node optimizations, CORE-Direct integration.]
Collective Example – Allreduce using Recursive Doubling
[Figure: eight ranks holding the values 1–8 exchange and combine partial sums with pairwise partners over log2(8) = 3 stages, until every rank holds the total, 36.]
A 4000-process Allreduce using recursive doubling takes 12 stages
Scalable Collectives with FCA
[Figure: hierarchical reduction over 324 hosts (PPN = 8; 18 hosts per switch, SW-1 … SW-18) — intra-node processing reduces each host's eight ranks, 1st-tier coalescing combines the hosts under each switch (648 per switch in the example), 2nd-tier coalescing produces the result at the root (11664), and IB multicast distributes the result back to all hosts (Host1 … Host324).]
Performance Results
FCA collective performance with OpenMPI
FCA collective scalability for SHMEM
[Figure: three charts versus process count (0–2500 processes, PPN = 8), each comparing runs with and without FCA —
Barrier collective: latency (us), 0–100;
Reduce collective: latency (us), 0–3000;
8-byte Broadcast: bandwidth (KB × processes), 0–10000.]
Mellanox MXM – HPCC Random Ring Latency
Thank You [email protected]