Advanced MPI Capabilities
VSCSE Webinar (May 6-8, 2014), Day 3
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]  http://www.cse.ohio-state.edu/~panda
by
Karen Tomko, Ohio Supercomputer Center
E-mail: [email protected]  http://www.osc.edu/~ktomko
Plans for Thursday
• Tuesday, May 6 – MPI-3 Additions to the MPI Spec
  – Updates to the MPI One-Sided Communication Model (RMA)
  – Non-Blocking Collectives
  – MPI Tools Interface
• Wednesday, May 7 – MPI/PGAS Hybrid Programming
  – MVAPICH2-X: Unified runtime for MPI+PGAS
  – MPI+OpenSHMEM
  – MPI+UPC
• Thursday, May 8 – MPI for many-core processors
  – MVAPICH2-GPU: CUDA-aware MPI for NVIDIA GPUs
  – MVAPICH2-MIC: Design for Clusters with InfiniBand and Intel Xeon Phi
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors are becoming common in high-end systems
• Pushing the envelope for Exascale computing
[Figure: Multi-core Processors; High Performance Interconnects – InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)]
The Multi-core Era
[Chart: performance share of cores/socket in the Top500 systems, 2003–2013]
Advent of Accelerators
[Chart: performance share of accelerators in the Top500 systems (18.1%, 16.2%)]
State-of-the-Art in Interconnects
[Chart: performance share of interconnects in the Top500 systems]
Parallel Programming Models Overview
[Figure: three models – Shared Memory Model (SHMEM, DSM); Distributed Memory Model (MPI – Message Passing Interface); Partitioned Global Address Space (PGAS: Global Arrays, UPC, Chapel, X10, CAF, …). Processes P1–P3 with shared, private, or logically shared memory]
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• We concentrated on MPI during Day 1
• PGAS and Hybrid MPI+PGAS during Day 2
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: software stack – Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.); Middleware: Communication Library or Runtime for Programming Models (point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, fault tolerance); Networking Technologies (InfiniBand, 40/100GigE, Aries, BlueGene); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC). Co-design opportunities and challenges across the various layers for performance, scalability and fault-resilience]
Presentation Overview
• CUDA-Aware MPI (MVAPICH2-GPU)
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• MVAPICH2 and OpenACC
• Programming Models for Intel MIC
• MVAPICH2-MIC Designs
GPU-Direct
• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices register a common host buffer – the GPU copies data to this buffer, and the network adapter can directly read from this buffer (or vice versa)
• Note that GPU-Direct does not allow you to bypass host memory
• The latest GPUDirect RDMA achieves this
[Figure: GPU and NIC attached to the CPU over PCIe; NIC connected to the switch]
Sample Code – Without MPI Integration
• Naïve implementation with standard MPI and CUDA
• High productivity but poor performance

At Sender:
    cudaMemcpy(sbuf, sdev, ...);
    MPI_Send(sbuf, size, ...);

At Receiver:
    MPI_Recv(rbuf, size, ...);
    cudaMemcpy(rdev, rbuf, ...);
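As a rough illustration of this naïve pattern, a minimal sketch is shown below; buffer names, sizes, and the two-rank layout are assumptions for illustration, not from the slides.

    /* Naive GPU-to-GPU transfer: stage through host memory on both sides.
     * Sketch only; assumes two ranks and a device buffer of N floats each. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void exchange_naive(float *d_buf, int N, int rank)
    {
        float *h_buf = (float *)malloc(N * sizeof(float));
        if (rank == 0) {
            /* Device -> host copy, then an ordinary host-to-host MPI send */
            cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Host -> device copy on the receiver */
            cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
        }
        free(h_buf);
    }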
Sample Code – User-Optimized Code
• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• Code at the sender side (and repeated at the receiver side)
• User-level copying may not match the internal MPI design
• High performance but poor productivity
[Figure: GPU and NIC attached to the CPU over PCIe; NIC connected to the switch]

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, ...);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        }
        MPI_Isend(sbuf + j * blksz, blksz, ...);
    }
    MPI_Waitall(...);
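To make the user-level pipeline concrete, here is a self-contained sketch under assumptions (chunk count, one stream per chunk, a pinned host staging buffer); the slide's code elides these details.

    /* User-level pipelined send: overlap device->host copies with MPI sends.
     * Sketch only; h_stage should ideally be pinned (cudaMallocHost). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void pipelined_send(const float *d_src, float *h_stage, int nchunks,
                        size_t chunk_elems, int dst, MPI_Comm comm)
    {
        cudaStream_t *streams = (cudaStream_t *)malloc(nchunks * sizeof(cudaStream_t));
        MPI_Request  *reqs    = (MPI_Request *)malloc(nchunks * sizeof(MPI_Request));
        size_t bytes = chunk_elems * sizeof(float);

        /* Issue all asynchronous device->host copies, one stream per chunk */
        for (int j = 0; j < nchunks; j++) {
            cudaStreamCreate(&streams[j]);
            cudaMemcpyAsync(h_stage + j * chunk_elems, d_src + j * chunk_elems,
                            bytes, cudaMemcpyDeviceToHost, streams[j]);
        }
        /* As each copy completes, hand that chunk to MPI */
        for (int j = 0; j < nchunks; j++) {
            while (cudaStreamQuery(streams[j]) != cudaSuccess) {
                if (j > 0) { int flag; MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE); }
            }
            MPI_Isend(h_stage + j * chunk_elems, (int)chunk_elems, MPI_FLOAT,
                      dst, j, comm, &reqs[j]);
        }
        MPI_Waitall(nchunks, reqs, MPI_STATUSES_IGNORE);

        for (int j = 0; j < nchunks; j++) cudaStreamDestroy(streams[j]);
        free(streams); free(reqs);
    }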
Can this be done within the MPI Library?
• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g., enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer automatically provides optimizations inside the MPI library without user tuning
• A new design incorporated in MVAPICH2 supports this functionality
Sample Code – MVAPICH2-GPU

At Sender:
    MPI_Send(s_device, size, ...);   /* s_device is a GPU buffer */

At Receiver:
    MPI_Recv(r_device, size, ...);   /* r_device is a GPU buffer */

Pipelining and staging are handled inside MVAPICH2.
• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity: CUDA-aware MPI
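For concreteness, a minimal sketch of the same pattern with full arguments follows; the buffer name, count, and rank layout are assumptions.

    /* CUDA-aware MPI: pass device pointers straight to MPI_Send/MPI_Recv.
     * Sketch only; assumes an MPI library built with CUDA support and two ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_cuda_aware(float *d_buf, int N, int rank)
    {
        if (rank == 0)
            MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);           /* device buffer */
        else if (rank == 1)
            MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                    /* device buffer */
    }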
MPI-Level Two-sided Communication
• 45% improvement compared with a naïve user-level implementation (Memcpy+Send) for 4 MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend) for 4 MB messages
[Chart: time (us) vs. message size (32 KB–4 MB) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better]
H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11
MVAPICH2 1.8, 1.9, 2.0a–2.0rc1 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Applications-Level Evaluation
• LBM-CUDA (Courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – One process/GPU per node, 16 nodes
• AWP-ODC (Courtesy: Yifeng Cui, SDSC)
  – A seismic modeling code, Gordon Bell Prize finalist at SC 2010
  – 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node
[Chart: LBM-CUDA step time (s) for domain sizes 256x256x256 to 512x512x512, MPI vs. MPI-GPU – improvements of 13.7%, 12.0%, 11.8% and 9.4%]
[Chart: AWP-ODC total execution time (sec) for 1 GPU/proc per node and 2 GPUs/procs per node, MPI vs. MPI-GPU – improvements of 11.1% and 7.9%]
Presentation Overview
• CUDA-Aware MPI (MVAPICH2-GPU)
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• MVAPICH2 and OpenACC
• Programming Models for Intel MIC
• MVAPICH2-MIC Designs
GPU-Direct RDMA with CUDA 5/6
• Fastest possible communication between the GPU and other PCI-E devices
• Network adapter can directly read data from GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication
[Figure: InfiniBand HCA reading/writing GPU memory directly, bypassing the CPU, chipset and system memory]
MVAPICH2-GDR 2.0b is available for users
• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters
[Figure: GPUDirect RDMA (GDR) with CUDA – IB adapter accessing GPU memory through the chipset; P2P write 5.2 GB/s and P2P read < 1.0 GB/s on SNB E5-2670; P2P write 6.4 GB/s and P2P read 3.5 GB/s on IVB E5-2680V2]
S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)
Performance of MVAPICH2 with GPUDirect-RDMA: Latency
GPU-GPU Internode MPI Latency
[Chart: small-message latency (us), 1 B–4 KB, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR – 67% improvement, down to 5.49 usec]
[Chart: large-message latency (us), 8 KB–2 MB, for the same configurations – 10% improvement]
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch

Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth
GPU-GPU Internode MPI Uni-Directional Bandwidth
[Chart: small-message bandwidth (MB/s), 1 B–4 KB, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR – up to 5x improvement with GDR]
[Chart: large-message bandwidth (MB/s), 8 KB–2 MB – up to 9.8 GB/s]
Based on MVAPICH2-2.0b; same Ivy Bridge / K40c / Connect-IB Dual-FDR / CUDA 5.5 / OFED 2.0 setup

Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth
GPU-GPU Internode MPI Bi-directional Bandwidth
[Chart: small-message bi-bandwidth (MB/s), 1 B–4 KB, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR – up to 4.3x improvement with GDR]
[Chart: large-message bi-bandwidth (MB/s), 8 KB–2 MB – 19% improvement, up to 19 GB/s]
Based on MVAPICH2-2.0b; same Ivy Bridge / K40c / Connect-IB Dual-FDR / CUDA 5.5 / OFED 2.0 setup
Applications-level Benefits: AWP-ODC with MVAPICH2-GPU
• A widely-used seismic modeling application, Gordon Bell Finalist at SC 2010
• An initial version using MPI + CUDA for GPU clusters
• Takes advantage of CUDA-aware MPI; two nodes, 1 GPU/node, 64x32x32 problem
• GPUDirect-RDMA delivers better performance with the newer architecture
Platform A: Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3
Platform B: Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB
[Chart: total time, MV2 vs. MV2-GDR – 11.6% improvement on Platform A and 15.4% on Platform B]
[Chart: computation vs. communication breakdown on Platforms A and B with and without GDR – improvements of 21% and 30%]
Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA patch; two nodes, one GPU/node, one process/GPU
Continuous Enhancements for Improved Point-to-point Performance
GPU-GPU Internode MPI Latency
• Reduced synchronization while avoiding expensive copies
[Chart: small-message latency (us), 1 B–16 KB, MV2-2.0b-GDR vs. improved design – 31% improvement]
Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Dynamic Tuning for Point-to-point Performance
GPU-GPU Internode MPI Performance
[Chart: latency (us), 128 KB–2 MB, for Opt_latency, Opt_bw and Opt_dynamic]
[Chart: bi-bandwidth (MB/s), 1 B–1 MB, for Opt_latency, Opt_bw and Opt_dynamic]
Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Presentation Overview
• CUDA-Aware MPI (MVAPICH2-GPU)
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• MVAPICH2 and OpenACC
• Programming Models for Intel MIC
• MVAPICH2-MIC Designs
One-sided Communication
• Send/Recv semantics incur overheads
  – Distributed buffer information
  – Message matching
  – Additional copies or rendezvous exchange
• One-sided communication
  – Separates synchronization from communication
  – Direct mapping over RDMA semantics
  – Lower overheads and better overlap
Latency (half round trip, usec) on SandyBridge nodes with FDR Connect-IB, 4-byte messages:
  IB send/recv:  Host-Host 0.98, GPU-GPU 1.84
  MPI send/recv: Host-Host 1.25, GPU-GPU 6.95
MPI-3 RMA Support with GPUDirect RDMA
MPI-3 RMA provides flexible synchronization and completion primitives
[Chart: small-message latency (usec), 1 B–4 KB, for Send-Recv, Put+ActiveSync and Put+Flush – Put+Flush stays below 2 usec]
[Chart: small-message rate (msgs/sec), 1 B–4 KB, for Send-Recv and Put+ActiveSync]
Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
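To illustrate the one-sided path the slide above benchmarks, here is a minimal sketch of an MPI-3 Put with a flush completion on a device buffer; the window size, counts and passive-target epoch are assumptions, not taken from the slides.

    /* One-sided MPI-3 RMA from a GPU buffer with a CUDA-aware MPI.
     * Sketch only; assumes two ranks and GDR-capable point-to-point support. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void rma_put_example(int rank, int N)
    {
        float *d_buf;
        cudaMalloc((void **)&d_buf, N * sizeof(float));

        MPI_Win win;
        /* Expose the device buffer as an RMA window (CUDA-aware library assumed) */
        MPI_Win_create(d_buf, N * sizeof(float), sizeof(float),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (rank == 0) {
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);   /* passive-target epoch on rank 1 */
            MPI_Put(d_buf, N, MPI_FLOAT, 1, 0, N, MPI_FLOAT, win);
            MPI_Win_flush(1, win);                      /* complete the Put at the target */
            MPI_Win_unlock(1, win);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Win_free(&win);
        cudaFree(d_buf);
    }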
Communication Kernel Evaluation: 3DStencil and Alltoall
[Chart: 3D Stencil with 16 GPU nodes – latency (usec), 1 B–4 KB, Send/Recv vs. One-sided]
[Chart: Alltoall with 16 GPU nodes – latency (usec), 1 B–4 KB, Send/Recv vs. One-sided]
Improvements of 42% and 50% with the one-sided design
Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Presentation Overview
• CUDA-Aware MPI (MVAPICH2-GPU)
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• MVAPICH2 and OpenACC
• Programming Models for Intel MIC
• MVAPICH2-MIC Designs
Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
[Figure: halo data exchange between neighboring subdomains]
MPI Datatype Processing
• Comprehensive support
  – Targeted kernels for regular datatypes – vector, subarray, indexed_block
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels (a datatype sketch follows this list)
  – Vector
    • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray
    • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_SUBARR_XDIM
    • MV2_CUDA_KERNEL_SUBARR_YDIM
    • MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block
    • MV2_CUDA_KERNEL_IDXBLK_XDIM
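To make the datatype path concrete, here is a minimal sketch of sending one non-contiguous column (a halo boundary) of a GPU-resident grid with MPI_Type_vector; the grid dimensions, buffer names and neighbor rank are assumptions, not from the slides.

    /* Send one non-contiguous column of a row-major grid (NY rows of NX
     * elements) that lives in GPU memory, using an MPI vector datatype.
     * With a CUDA-aware MPI, the library packs the column on the device. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void send_column(float *d_grid, int NX, int NY, int col, int dst)
    {
        MPI_Datatype column;
        /* NY blocks of 1 element each, separated by a stride of NX elements */
        MPI_Type_vector(NY, 1, NX, MPI_FLOAT, &column);
        MPI_Type_commit(&column);

        MPI_Send(d_grid + col, 1, column, dst, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column);
    }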
Application-Level Evaluation (LBMGPU-3D)
• LBM-CUDA (Courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node
[Chart: 3D LBM-CUDA total execution time (sec) on 8–64 GPUs, MPI vs. MPI-GPU – improvements of 5.6%, 8.2%, 13.5% and 15.5%]
How can I get started with GDR Experimentation?
• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 2.1
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 5.5
  – Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Work in progress on optimizing collective and one-sided communication
• Contact the MVAPICH help list with any questions related to the package: [email protected]
• MVAPICH2-GDR-RC1 with additional optimizations coming soon!
Presentation Overview
• CUDA-Aware MPI (MVAPICH2-GPU)
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• MVAPICH2 and OpenACC
• Programming Models for Intel MIC
• MVAPICH2-MIC Designs
OpenACC
• OpenACC is gaining popularity
• Multiple sessions during GTC '14
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections of code onto accelerators

    #pragma acc region
    {
        for (i = 0; i < size; i++) {
            A[i] = B[i] + C[i];
        }
    }

• Routines to allocate/free memory on accelerators

    buffer = acc_malloc(MYBUFSIZE);
    acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers – copy, copyout, private, independent, etc.
Using MVAPICH2 with the new OpenACC 2.0
• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler, where available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Delivers the same performance as with CUDA

    A = malloc(sizeof(int) * N);
    ...
    #pragma acc data copyin(A[0:N])
    {
        #pragma acc parallel loop
        /* compute for loop */
        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    ...
    free(A);
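For completeness, here is a self-contained sketch of the same pattern with both a sender and a receiver; the rank layout, tag, and vector length are assumptions for illustration.

    /* OpenACC + CUDA-aware MPI: communicate from compiler-managed device
     * memory via acc_deviceptr() (OpenACC 2.0). Sketch only. */
    #include <mpi.h>
    #include <openacc.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, N = 1024;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int *A = (int *)malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) A[i] = rank;

        #pragma acc data copy(A[0:N])
        {
            if (rank == 0) {
                #pragma acc parallel loop
                for (int i = 0; i < N; i++) A[i] = 2 * i;          /* compute on the device */
                /* Send straight from the device copy of A */
                MPI_Send(acc_deviceptr(A), N, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(acc_deviceptr(A), N, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }
        free(A);
        MPI_Finalize();
        return 0;
    }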
Presentation Overview
• CUDA-Aware MPI (MVAPICH2-GPU)
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• MVAPICH2 and OpenACC
• Programming Models for Intel MIC
• MVAPICH2-MIC Designs
InfiniBand + MIC Systems
[Figure courtesy: Scott McMillan, Intel Corporation, presentation at TACC-Intel Highly Parallel Computing Symposium '12 (http://www.tacc.utexas.edu/documents/13601/d9d58515-5c0a-429d-8a3f-85014e9e4dab)]

Programming Models for Intel MIC Architecture
[Figure courtesy: Scott McMillan, Intel Corporation, TACC-Intel Highly Parallel Computing Symposium '12]
Offload Model for Intel MIC-based Systems
• MPI processes on the host
• Intel MIC used through offload – OpenMP, Intel Cilk Plus, Intel TBB and Pthreads
• MPI communication – intranode and internode
[Figure courtesy: Scott McMillan, Intel Corporation, TACC-Intel Highly Parallel Computing Symposium '12]
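As a rough sketch of this model, the code below keeps MPI on the host and pushes the compute loop to the coprocessor with Intel's offload pragma; the pragma syntax assumes an Intel compiler with MIC offload support, and the array sizes and reduction are illustrative only.

    /* Offload model sketch: MPI ranks run on the host; each rank offloads
     * its compute to the Xeon Phi. Not from the slides; an assumed example. */
    #include <mpi.h>
    #include <omp.h>

    #define N 4096

    int main(int argc, char **argv)
    {
        int rank;
        float a[N], b[N];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++) a[i] = (float)(rank + i);

        /* Offload the loop to the MIC; data is copied in and results copied out */
        #pragma offload target(mic:0) in(a:length(N)) out(b:length(N))
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++) b[i] = 2.0f * a[i];
        }

        /* MPI communication stays on the host side in the offload model */
        MPI_Allreduce(MPI_IN_PLACE, b, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }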
Many-core Hosted Model for Intel MIC-based Systems
• MPI processes on the Intel MIC
• MPI communication – IntraMIC, InterMIC (host-bypass / directly over the network)
[Figure courtesy: Scott McMillan, Intel Corporation, TACC-Intel Highly Parallel Computing Symposium '12]
Symmetric Model for Intel MIC-based Systems
• MPI processes on the host and the Intel MIC
• MPI communication – Intranode, Internode, IntraMIC, InterMIC, Host-MIC
[Figure courtesy: Scott McMillan, Intel Corporation, TACC-Intel Highly Parallel Computing Symposium '12]
Presentation Overview
• CUDA-Aware MPI (MVAPICH2-GPU)
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• MVAPICH2 and OpenACC
• Programming Models for Intel MIC
• MVAPICH2-MIC Designs
Data Movement on Intel Xeon Phi Clusters
• Connected as PCIe devices – flexibility but complexity
• Critical for runtimes to optimize data movement, hiding the complexity
[Figure: two nodes, each with two CPUs (QPI), MICs on PCIe and an IB adapter; MPI communication paths –
  1. Intra-Socket   2. Inter-Socket   3. Inter-Node   4. Intra-MIC
  5. Intra-Socket MIC-MIC   6. Inter-Socket MIC-MIC   7. Inter-Node MIC-MIC
  8. Intra-Socket MIC-Host   9. Inter-Socket MIC-Host   10. Inter-Node MIC-Host
  11. Inter-Node MIC-MIC with IB adapter on a remote socket … and more]
Efficient MPI Communication on Xeon Phi Clusters
• Offload Mode
• Intranode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Internode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Comparison with Intel's MPI Library
MPI Applications on MIC Clusters
[Figure: programming modes from multi-core centric to many-core centric – Host-only (MPI program on the Xeon only), Offload (MPI program on the Xeon with offloaded computation on the Xeon Phi), Symmetric (MPI programs on both the Xeon and the Xeon Phi), Coprocessor-only (MPI program on the Xeon Phi only)]
• Offload mode: a lower-overhead way to extract performance from the Xeon Phi
• MPI processes run only on the host
MPI Data Movement in Offload Mode
• MPI communication only from the host
• An existing MPI library, like MVAPICH2 1.9 or 2.0rc1, can be used
[Figure: two nodes, each with CPUs connected by QPI, a MIC on PCIe and IB between nodes; paths – 1. Intra-Socket, 2. Inter-Socket, 3. Inter-Node]
One-way Latency: MPI over IB with MVAPICH2
[Chart: small-message latency (us) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR – values ranging from 0.99 to 1.82 us]
[Chart: large-message latency (us) for the same adapters]

Bandwidth: MPI over IB with MVAPICH2
[Chart: unidirectional bandwidth (MBytes/sec) for the same adapters – peak values of 1706, 1917, 3280, 3385, 6343, 12485 and 12810 MBytes/sec]
[Chart: bidirectional bandwidth (MBytes/sec) – peak values of 3341, 3704, 4407, 6521, 11643, 21025 and 24727 MBytes/sec]

Platforms: DDR, QDR – 2.4 GHz Quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR – 2.6 GHz Octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR – 2.6 GHz Octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR – 2.8 GHz Deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
Efficient MPI Communication on Xeon Phi Clusters
• Offload Mode
• Intranode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Internode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Comparison with Intel's MPI Library
MPI Applications on Xeon Phi Clusters – IntraNode
[Figure: programming modes – Host-only, Offload, Symmetric, Coprocessor-only]
• Xeon Phi as a many-core node
MPI Data Movement in Coprocessor-Only Mode
[Figure: Node 0 with two CPUs connected by QPI and a MIC on PCIe; Intra-MIC communication among MPI processes on the coprocessor]
[Figure: Xeon Phi – ring of cores, each with an L2 cache, connected to GDDR memory]
MVAPICH2-MIC Channels for Intra-MIC Communication
• MVAPICH2 provides a hybrid of two channels
  – CH3-SHM
    • Derives from the design for shared memory communication on the host (POSIX shared memory)
    • Tuned for the Xeon Phi architecture
  – CH3-SCIF (Hybrid)
    • SCIF is a lower-level API provided by MPSS
    • Provides user control of the DMA
    • Takes advantage of SCIF to improve performance
[Figure: MVAPICH2 on the Xeon Phi with CH3-SHM (POSIX shared memory) and CH3-SCIF (hybrid) channels over SCIF]
S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and D. K. Panda, Efficient Intra-node Communication on Intel-MIC Clusters, International Symposium on Cluster, Cloud and Grid Computing (CCGrid '13), May 2013.
S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium (TI-HPCS), April 2012 – Best Student Paper.
Experimental Setup
• Stampede @ TACC
  – Dual-socket oct-core Intel Sandy Bridge (E5-2680, 2.70 GHz)
  – 32 GB memory
  – SE10P (B0-KNC) MIC coprocessor
  – Mellanox IB FDR MT4099 HCA
  – CentOS release 6.3 (Final), MPSS 2.1.6720-13, Intel Composer_xe_2013.2.146
• Multi-MIC configuration
  – Dual-socket oct-core Intel Sandy Bridge (E5-2670, 2.70 GHz)
  – 32 GB memory
  – 5110P MIC coprocessor
  – Mellanox IB FDR MT4099 HCA
  – RHEL 6.3 (Santiago), MPSS 2.1.6720-13, Intel Compiler 13.4.183
• MVAPICH2 1.9, MVAPICH2-MIC based on 1.9, Intel MPI 4.1.1.036
Intra-MIC Point-to-Point Communication
[Chart: osu_latency (small), 0 B–2 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: osu_latency (large), 8 KB–2 MB – latency (usec)]
[Chart: osu_bw, 1 B–1 MB – bandwidth (MB/sec), peaking at 9137 MB/sec]
[Chart: osu_bibw, 1 B–1 MB – bi-directional bandwidth (MB/sec), peaking at 14030 MB/sec]

Intra-MIC Collective Communication – Scatter
[Chart: 8 procs – osu_scatter (small), 1 B–4 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: 8 procs – osu_scatter (large), 8 KB–512 KB – latency (usec)]
[Chart: 16 procs – osu_scatter (small), 1 B–4 KB – latency (usec)]
[Chart: 16 procs – osu_scatter (large), 8 KB–512 KB – latency (usec)]

Intra-MIC Collective Communication – Allgather
[Chart: 8 procs – osu_allgather (small), 1 B–4 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: 8 procs – osu_allgather (large), 8 KB–512 KB – latency (usec)]
[Chart: 16 procs – osu_allgather (small), 1 B–4 KB – latency (usec)]
[Chart: 16 procs – osu_allgather (large), 8 KB–512 KB – latency (usec)]
Intra-MIC – 3DStencil Comm. Kernel and P3DFFT
• 3DStencil – near-neighbor communication, up to 6 neighbors, 64 KB message size
• P3DFFT, a popular library for 3D Fast Fourier Transforms
  – MPI + OpenMP version
  – Alltoall communication
  – Test performs a forward transform and a backward transform in each iteration
[Chart: 3DStencil comm. kernel (64 KB) – time per step (msec) for 4, 8 and 16 processes on the MIC, MV2 vs. MV2-MIC]
[Chart: P3DFFT 512x512x512 – total time per step (sec) for 2, 4 and 8 processes on the MIC with 16 threads/proc, MV2 vs. MV2-MIC]
Improvements of up to 71.6% and 24.9%
MPI Applications on MIC Clusters – IntraNode: Symmetric
[Figure: programming modes – Host-only, Offload, MIC-only, Symmetric]
• MPI processes running on the MIC and the host
MPI Data Movement in Symmetric Mode – IntraNode
CPU0 CPU1 QPI
MIC
PC
Ie
Node 0
1. Intra-MIC
62
MPI Process
2. Intra-Socket MIC-Host 3. Inter-Socket MIC-Host
VSCSE-Day3
MVAPICH2-MIC Channels for MIC-Host Communication
• IB Channel (OFA-IB-CH3)
  – High-performance InfiniBand channel in MVAPICH2
  – Uses IB verbs – can use both DirectIB and IB-SCIF via interface selection (MV2_IBA_HCA)
  – DirectIB provides a low-latency path
  – Limited by P2P bandwidth on Sandy Bridge nodes
• CH3-SCIF (Hybrid)
  – SCIF can be used for communication between the Xeon Phi and the host
  – DirectIB is used for low latency
[Figure: MVAPICH2 on the host and on the Xeon Phi, each with OFA-IB-CH3 (IB verbs, mlx4_0) and CH3-SCIF (hybrid) channels over PCI-E; on SNB E5-2670, P2P read / IB read from the Xeon Phi reaches 962.86 MB/s while P2P write / IB write to the Xeon Phi reaches 5280 MB/s]
Intra-Socket Host-MIC Point-to-Point Communication
[Chart: osu_latency (small), 0 B–2 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: osu_latency (large), 8 KB–2 MB – latency (usec)]
[Chart: osu_bw, 1 B–1 MB – bandwidth (MB/sec), up to 6651 MB/sec, limited by IB write bandwidth]
[Chart: osu_bibw, 1 B–1 MB – bi-directional bandwidth (MB/sec), limited by IB read/write bandwidth]
Similar results for Inter-Socket Host-MIC

Intra-Node Symmetric Mode – Collective Performance
16 procs on the host and 16 procs on the MIC
[Chart: osu_scatter (small), 1 B–4 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: osu_scatter (large), 8 KB–512 KB – latency (usec)]
[Chart: osu_allgather (small), 1 B–1 KB – latency (usec)]
[Chart: osu_allgather (large), 8 KB–512 KB – latency (usec)]
IntraNode Symmetric Mode – 3DStencil Comm. Kernel and P3DFFT
• Running across the MIC and the host
• Benefits from shared-memory tuning and the hybrid SCIF+IB design
• Still to explore: the best combination of MPI processes and threads
[Chart: 3DStencil comm. kernel (64 KB) – time per step (msec) for 4+4, 8+8 and 16+16 processes on host+MIC, MV2 vs. MV2-MIC]
[Chart: P3DFFT 512x512x512 – time per step (sec) for 2H-2M, 2H-4M and 2H-8M processes on host-MIC with 8 threads/proc, MV2 vs. MV2-MIC]
Improvements of up to 67.4% and 19.7%
Efficient MPI Communication on Xeon Phi Clusters
• Offload Mode
• Intranode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Internode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Comparison with Intel's MPI Library
MPI Applications on MIC Clusters – InterNode
[Figure: programming modes – Host-only, Offload, Symmetric, Coprocessor-only]
MPI Data Movement in Coprocessor-Only Mode
[Figure: two nodes, each with two CPUs connected by QPI, a MIC on PCIe and IB between nodes; path – 1. MIC-RemoteMIC]
MVAPICH2 Channels for MIC-RemoteMIC Communication
• CH3-IB channel
  – High-performance InfiniBand channel in MVAPICH2
  – Implemented using IB verbs – DirectIB
  – Currently does not support advanced features like hardware multicast, etc.
  – Limited by the P2P bandwidth offered on the Sandy Bridge platform
[Figure: MVAPICH2 on each Xeon Phi using the OFA-IB-CH3 channel over IB verbs (mlx4_0) through the IB HCA]
Host Proxy-based Designs in MVAPICH2-MIC
• The DirectIB channel is limited by P2P read bandwidth
• MVAPICH2-MIC uses a hybrid DirectIB + host proxy-based approach to work around this
[Figure: on SNB E5-2670 – P2P read / IB read from the Xeon Phi: 962.86 MB/s; P2P write / IB write to the Xeon Phi: 5280 MB/s; Xeon Phi-to-host and IB read from the host: 6977 MB/s and 6296 MB/s]
MIC-RemoteMIC Point-to-Point Communication
[Chart: osu_latency (small), 0 B–2 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: osu_latency (large), 8 KB–2 MB – latency (usec)]
[Chart: osu_bw, 1 B–1 MB – bandwidth (MB/sec), up to 5236 MB/sec]
[Chart: osu_bibw, 1 B–1 MB – bi-directional bandwidth (MB/sec), up to 8451 MB/sec]

InterNode Coprocessor-only Mode – Collective Communication
8 MICs with 8 procs/MIC
[Chart: osu_scatter (small), 1 B–4 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: osu_scatter (large), 8 KB–512 KB – latency (usec)]
[Chart: osu_allgather (small), 1 B–4 KB – latency (usec)]
[Chart: osu_allgather (large), 8 KB–512 KB – latency (usec)]
InterNode – Coprocessor-only Mode – P3DFFT Performance
• MV2-INTRA-OPT includes the intra-node tuning and optimizations
• The comparison shows the benefit of the host proxy
[Chart: 3DStencil comm. kernel (64 KB) – time per step (msec) for 4, 8 and 16 processes per MIC (32 MICs), MV2-INTRA-OPT vs. MV2-MIC]
[Chart: P3DFFT 2Kx2Kx2K – execution time (sec) for 4 and 8 processes per MIC, 16 threads/process (64 MICs), MV2-INTRA-OPT vs. MV2-MIC]
Improvements of up to 65.6% and 24.6%
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Internode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Comparison with Intel's MPI Library
MPI Applications on MIC Clusters – InterNode
[Figure: programming modes – Host-only, Offload, Symmetric, Coprocessor-only]
Direct-IB Communication
• IB-verbs based communication from the Xeon Phi
• No porting effort for MPI libraries with IB support
• Data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer transfers
[Figure: CPU, Xeon Phi and IB FDR HCA (MT4099) connected over PCIe/QPI, measured on E5-2670 (SandyBridge) and E5-2680 v2 (IvyBridge); peak IB FDR bandwidth 6397 MB/s. Intra-socket IB write to the Xeon Phi (P2P write): 5280 MB/s (83%) and 6396 MB/s (100%); intra-socket IB read from the Xeon Phi (P2P read): 962 MB/s (15%) and 3421 MB/s (54%). Inter-socket IB write: 1075 MB/s (17%) and 1179 MB/s (19%); inter-socket IB read: 370 MB/s (6%) and 247 MB/s (4%)]
MIC-Remote-MIC P2P Communication
[Chart: intra-socket P2P – large-message latency (usec), 8 KB–2 MB, and bandwidth (MB/sec), up to 5236 MB/sec]
[Chart: inter-socket P2P – large-message latency (usec), 8 KB–2 MB, and bandwidth (MB/sec), up to 5594 MB/sec]
Inter-Node Symmetric Mode – Collective Communication
8 nodes with 8 procs/host and 8 procs/MIC
[Chart: osu_scatter (small), 1 B–4 KB, MV2 vs. MV2-MIC – latency (usec)]
[Chart: osu_scatter (large), 8 KB–512 KB – latency (usec)]
[Chart: osu_allgather (small), 1 B–4 KB – latency (usec)]
[Chart: osu_allgather (large), 8 KB–512 KB – latency (usec)]
InterNode – Symmetric Mode – P3DFFT Performance
• Still to explore: different threading levels for the host and the Xeon Phi
[Chart: 3DStencil comm. kernel – time per step (sec) for 4+4 and 8+8 processes on host+MIC (32 nodes), MV2-INTRA-OPT vs. MV2-MIC]
[Chart: P3DFFT 512x512x4K – communication time per step (sec) for 2+2 and 2+8 processes on host-MIC, 8 threads/proc (8 nodes), MV2-INTRA-OPT vs. MV2-MIC]
Improvements of up to 50.2% and 16.2%
Optimized MPI Collectives for MIC Clusters (Broadcast)
• Up to 50–80% improvement in MPI_Bcast latency with as many as 4864 MPI processes (Stampede)
• 12% and 16% improvement in the performance of Windjammer with 128 processes
[Chart: 64-node Bcast (16H + 60M), small-message latency (usecs), 1 B–4 KB, MV2-MIC vs. MV2-MIC-Opt]
[Chart: 64-node Bcast (16H + 60M), large-message latency (usecs), 8 KB–1 MB, MV2-MIC vs. MV2-MIC-Opt]
[Chart: Windjammer execution time (secs) on 4 nodes (16H + 16M) and 8 nodes (8H + 8M), MV2-MIC vs. MV2-MIC-Opt]
K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda, "Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters", HotI '13
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
[Chart: 32-node Allgather (16H + 16M), small-message latency (usecs), 1 B–1 KB, MV2-MIC vs. MV2-MIC-Opt]
[Chart: 32-node Allgather (8H + 8M), large-message latency (usecs x 1000), 8 KB–1 MB]
[Chart: 32-node Alltoall (8H + 8M), large-message latency (usecs x 1000), 4 KB–512 KB]
[Chart: P3DFFT performance on 32 nodes (8H + 8M), size 2K*2K*1K – communication and computation breakdown, MV2-MIC-Opt vs. MV2-MIC – improvements of 76%, 58% and 55%]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters, IPDPS '14 (accepted)
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Internode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Comparison with Intel's MPI Library
MVAPICH2 and Intel MPI – Intra-MIC
[Chart: small-message latency (usec), 0 B–2 KB, Intel MPI vs. MVAPICH2-MIC]
[Chart: large-message latency (usec), 4 KB–4 MB]
[Chart: bandwidth (MBps), 1 B–1 MB]
[Chart: bi-directional bandwidth (MBps), 1 B–1 MB]

MVAPICH2 and Intel MPI – Intranode Host-MIC
[Chart: small-message latency (usec), 0 B–2 KB, Intel MPI vs. MVAPICH2-MIC]
[Chart: large-message latency (usec), 4 KB–4 MB]
[Chart: bandwidth (MBps), 1 B–1 MB]
[Chart: bi-directional bandwidth (MBps), 1 B–1 MB]
MVAPICH2 and Intel MPI – Internode MIC-MIC
[Chart: intra-socket P2P – large-message latency (usec, 8 KB–2 MB) and bandwidth (MBps), Intel MPI vs. MVAPICH2-MIC]
[Chart: inter-socket P2P – large-message latency (usec, 4 KB–4 MB) and bandwidth (MBps), Intel MPI vs. MVAPICH2-MIC]
S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH-PRISM: A Proxy-Based Communication Framework Using InfiniBand and SCIF for Intel MIC Clusters, SC '13
Status of MVAPICH2-MIC
• An initial version is installed on TACC Stampede for advanced users
• Based on the feedback, it is being enhanced
• It will be made available to the public in the near future
Hands-on Exercises
A PDF version of the exercises is available from http://www.cse.ohio-state.edu/~panda/vssce14/dk_karen_day3_exercises.pdf

Exercise #1
• Measure the latency between Device-to-Device, Host-to-Device, and Host-to-Host using the OMB micro-benchmarks
• Additional compilation flag: -D_ENABLE_CUDA_
• Running parameters: D D (D2D), H D (H2D), H H (H2H)
Exercise #2
• Implement basic point-to-point communication using MPI_Send/MPI_Recv across device buffers on each GPU node
  – Naïve Design
    • cudaMemcpy (D2H)
    • MPI_Send/Recv (H2H)
    • cudaMemcpy (H2D)
  – CUDA-Aware Design
    • MPI_Send/Recv (D2D)
Exercise #3
• Design a program to carry out vector addition using CUDA-aware MPI (a rough sketch follows). The basic steps are:
  1. Process 1 initializes vectors a and b in host buffers and sends them to device buffers on process 2
     • MPI_Send(a, …), MPI_Send(b, …)
  2. Process 2 then launches the kernel to add the vectors
     • __global__ void vecAdd(float *a, float *b, float *c, int vecsize)
  3. Process 2 sends the result vector c back to process 1
     • MPI_Send(c, …)
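One possible sketch of Exercise #3 is shown below; it is not the official solution, uses ranks 0 and 1 rather than 1 and 2, and assumes a CUDA-aware MPI with exactly two ranks.

    /* Vector addition with CUDA-aware MPI: rank 0 sends device vectors a and b
     * to rank 1, which runs vecAdd on the GPU and sends the result back. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void vecAdd(float *a, float *b, float *c, int vecsize)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < vecsize) c[i] = a[i] + b[i];
    }

    int main(int argc, char **argv)
    {
        const int N = 1 << 20;
        int rank;
        float *d_a, *d_b, *d_c;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_a, N * sizeof(float));
        cudaMalloc((void **)&d_b, N * sizeof(float));
        cudaMalloc((void **)&d_c, N * sizeof(float));

        if (rank == 0) {
            /* Initialize a and b on the host, copy to the device, then send
             * the device buffers directly (CUDA-aware MPI). */
            float *h = (float *)malloc(N * sizeof(float));
            for (int i = 0; i < N; i++) h[i] = 1.0f;
            cudaMemcpy(d_a, h, N * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h, N * sizeof(float), cudaMemcpyHostToDevice);
            MPI_Send(d_a, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
            MPI_Send(d_b, N, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(d_c, N, MPI_FLOAT, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            free(h);
        } else if (rank == 1) {
            MPI_Recv(d_a, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(d_b, N, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
            cudaDeviceSynchronize();
            MPI_Send(d_c, N, MPI_FLOAT, 0, 2, MPI_COMM_WORLD);
        }

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        MPI_Finalize();
        return 0;
    }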
Solutions and Guidelines
• Solutions for these exercises are available at: /nfs/02/w557091/mpi-gpu-exercises/
• See the README file inside the above folder for build and run instructions