MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at NVIDIA booth (SC 2018)
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Outline
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,950 organizations in 86 countries
– More than 505,000 (> 0.5 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘18 ranking)
• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 14th, 556,104 cores (Oakforest-PACS) in Japan
• 17th, 367,024 cores (Stampede2) at TACC
• 27th, 241,108-core (Pleiades) at NASA and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Partner in the upcoming TACC Frontera System
[Chart: cumulative number of downloads from the OSU site, Sep 2004 to Jun 2018, growing to more than 500,000, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, MV2 2.3, OSU INAM 0.9.3]
MVAPICH2 Release Timeline and Downloads
Architecture of MVAPICH2 Software Family
High Performance Parallel Programming Models:
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms:
• Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis
Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU):
• Transport Protocols: RC, XRC, UD, DC; Modern Features: UMR, ODP, SR-IOV, Multi-Rail
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
• Modern Features: MCDRAM*, NVLink*, CAPI* (* upcoming)
MVAPICH2 Software Family
High-Performance Parallel Programming Libraries:
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks:
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools:
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications
[Diagram: two nodes, each with two CPUs connected by QPI, GPUs attached as PCIe devices, and an IB HCA; memory buffers may reside on any of them.]
Communication paths:
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
and more . . .
• Connected as PCIe devices – flexibility but complexity
• For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, …
• Critical for runtimes to optimize data movement while hiding the complexity
MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(Data movement handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
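For illustration, a minimal CUDA-aware point-to-point sketch (hypothetical element count and peer ranks; assumes an MVAPICH2-GDR build with CUDA support and MV2_USE_CUDA=1 in the environment):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        int rank, n = 1 << 20;                 /* hypothetical element count */
        float *devbuf;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&devbuf, n * sizeof(float));   /* GPU-resident buffer */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer passed directly */
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }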
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
• MVAPICH2-2.3 with GDR support can be downloaded from
https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
• Mellanox OFED 3.2 or later
• NVIDIA Driver 367.48 or later
• NVIDIA CUDA Toolkit 7.5/8.0/9.0 or later
• Plugin for GPUDirect RDMA
http://www.mellanox.com/page/products_dyn?product_family=116
• Strongly recommended
• GDRCOPY module from NVIDIA
https://github.com/NVIDIA/gdrcopy
• Contact MVAPICH help list with any questions related to the package
Using MVAPICH2-GPUDirect Version
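As an illustration, a typical launch of an OSU micro-benchmark on GPU (device) buffers might look like the line below; the host names are placeholders and exact option defaults may differ across releases:
    mpirun_rsh -np 2 host1 host2 MV2_USE_CUDA=1 ./osu_latency D D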
• Released on 11/10/2018
• Major Features and Enhancements
– Based on MVAPICH2 2.3
– Support for CUDA 10.0
– Add support for Volta (V100) GPU
– Support for OpenPOWER with NVLink
– Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink
– Enhanced performance of GPU-based point-to-point communication
– Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
– Enhanced performance of MPI_Allreduce for GPU-resident data
– InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
• Basic support for IB-MCAST designs with GPUDirect RDMA
• Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
• Advanced reliability support for IB-MCAST designs
– Efficient broadcast and all-reduce designs for Deep Learning applications
– Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems
MVAPICH2-GDR 2.3 GA
Optimized MVAPICH2-GDR Design
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size; MV2 (no GDR) vs. MV2-GDR 2.3. GDR delivers ~1.85 us small-message latency (11X better), ~9X higher bandwidth, and ~10X higher bi-bandwidth.]
Platform: MVAPICH2-GDR-2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox Connect-X4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HoomdBlue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768
MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768
MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (HOOMD-blue)
[Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K-particle and 256K-particle runs; MV2 vs. MV2+GDR, with ~2X improvement in both cases.]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs); Default vs. Callback-based vs. Event-based designs.]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data
Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
• Multi-stream Communication for IPC
• CMA-based Intra-node Communication Support
• Support for OpenPower and NVLink
• Maximal overlap in MPI Datatype Processing
• Streaming Support with IB Multicast and GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Outline
Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1
• Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect
• Up to 30% higher D2D bandwidth on DGX-1 with NVLink
[Charts: point-to-point device-to-device (D-D) bandwidth vs. message size, 1-stream vs. 4-stream CUDA IPC design; up to 16% better on OpenPOWER + NVLink (128K-4M) and up to 30% better on DGX-1 with NVLink (16K-4M).]
Available since MVAPICH2-GDR-2.3a
CMA-based Intra-node Communication Support
[Charts: intra-node host-to-host (H2H) point-to-point latency and bandwidth vs. message size; MV2-GDR with and without CMA.]
• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H bandwidth
Platform: MVAPICH2-GDR-2.3, Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node – 28 cores, NVIDIA Tesla K80 GPU, Mellanox Connect-X4 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPU-Direct-RDMA
Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce & Alltoall, Nodes=1, PPN=20)
[Charts: MPI_Reduce and MPI_Alltoall latency for small (4 B - 4 KB) and large (8 KB - 1 MB) messages; MVAPICH2-GDR vs. SpectrumMPI-10.1.0.2 and OpenMPI-3.0.0, with annotated speedups ranging from 1.2X to 5.2X.]
Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively
Intra-node Point-to-Point Performance on OpenPower
[Charts: intra-socket small-message latency, large-message latency, bandwidth, and bi-directional bandwidth vs. message size; MVAPICH2-2.3 vs. SpectrumMPI-10.1.0.2 and OpenMPI-3.0.0. Small-message latency: 0.30 us.]
Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA
MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)
[Charts: intra-node latency (small and large messages), intra-node bandwidth (intra-socket over NVLink vs. inter-socket), inter-node latency (small and large messages), and inter-node bandwidth vs. message size.]
Intra-node Bandwidth: 33.2 GB/sec (NVLINK); Intra-node Latency: 13.8 us (without GPUDirect RDMA)
Inter-node Latency: 23 us (without GPUDirect RDMA); Inter-node Bandwidth: 6 GB/sec (FDR)
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and 4X-FDR InfiniBand inter-connect
Available since MVAPICH2-GDR 2.3a
Optimized All-Reduce with XPMEM on OpenPOWER
[Charts: MPI_Allreduce latency (16 KB - 2 MB) on one node (PPN=20) and two nodes (PPN=20); MVAPICH2-GDR-Next vs. SpectrumMPI-10.1.0 and OpenMPI-3.0.0, with annotated improvements from 34% up to 4X.]
• Optimized MPI All-Reduce design in MVAPICH2 – up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node
Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
• Multi-dimensional data
• Row based organization
• Contiguous on one dimension
• Non-contiguous on other dimensions
• Halo data exchange
• Duplicate the boundary
• Exchange the boundary in each iteration
Halo data exchange
Non-contiguous Data Exchange
MPI Datatype support in MVAPICH2
• Datatype support in MPI
– Operate on customized datatypes to improve productivity
– Enable MPI library to optimize non-contiguous data
At Sender: MPI_Type_vector (n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
…
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
• Inside MVAPICH2
– Use datatype-specific CUDA kernels to pack data in chunks
– Efficiently move data between nodes using RDMA
– In progress – currently optimizes vector and hindexed datatypes
– Transparent to the user
H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
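To make the fragment above concrete, here is a minimal sketch (hypothetical dimensions; the buffers may be GPU-resident when the library's CUDA support is enabled) that sends one strided column of a row-major 2-D array and receives it contiguously:

    #include <mpi.h>

    /* Send column 0 of a row-major n x n array as a strided vector. */
    void send_column(const double *s_buf, int n, int dest) {
        MPI_Datatype col;                          /* n blocks of 1 element, stride n */
        MPI_Type_vector(n, 1, n, MPI_DOUBLE, &col);
        MPI_Type_commit(&col);
        MPI_Send(s_buf, 1, col, dest, 0, MPI_COMM_WORLD);
        MPI_Type_free(&col);
    }

    /* Matching receive into a contiguous buffer of n doubles. */
    void recv_column(double *r_buf, int n, int src) {
        MPI_Recv(r_buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }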
MPI Datatype Processing (Computation Optimization )
• Comprehensive support
• Targeted kernels for regular datatypes - vector, subarray, indexed_block
• Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by MPI library
• Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
• Vector
• MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
• MV2_CUDA_KERNEL_VECTOR_YSIZE
• Subarray
• MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
• MV2_CUDA_KERNEL_SUBARR_XDIM
• MV2_CUDA_KERNEL_SUBARR_YDIM
• MV2_CUDA_KERNEL_SUBARR_ZDIM
• Indexed_block
• MV2_CUDA_KERNEL_IDXBLK_XDIM
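For illustration only, these knobs are normally set in the job environment; the values below are hypothetical placeholders, not recommended defaults:
    MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE=1024 MV2_CUDA_KERNEL_VECTOR_YSIZE=32 ./application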
MPI Datatype Processing (Communication Optimization)
Common Scenario (A, B, … contain non-contiguous MPI datatypes):
MPI_Isend(A, ..., datatype, ...);
MPI_Isend(B, ..., datatype, ...);
MPI_Isend(C, ..., datatype, ...);
MPI_Isend(D, ..., datatype, ...);
…
MPI_Waitall(…);
[Diagram: CPU/GPU timelines for the existing vs. proposed design. In the existing design, each Isend initiates its packing kernel and the CPU waits for the kernel (WFK) before starting the send, wasting computing resources on CPU and GPU. In the proposed design, the kernels for successive Isends are initiated back to back, the waits are overlapped with progress, and each send starts as soon as its kernel completes, so the proposed design finishes earlier.]
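A compilable version of the common scenario above (hypothetical counts and neighbor ranks), posting several non-contiguous sends that the runtime can overlap:

    #include <mpi.h>

    /* Post one non-blocking send per non-contiguous buffer, then wait for all. */
    void exchange(double *bufs[4], const int neighbor[4],
                  int n_blocks, int n_elements, int stride) {
        MPI_Datatype vtype;
        MPI_Request reqs[4];
        MPI_Type_vector(n_blocks, n_elements, stride, MPI_DOUBLE, &vtype);
        MPI_Type_commit(&vtype);
        for (int i = 0; i < 4; i++)
            MPI_Isend(bufs[i], 1, vtype, neighbor[i], 0, MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);   /* runtime can overlap packing kernels and sends */
        MPI_Type_free(&vtype);
    }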
• Streaming applications on HPC systems
1. Communication (MPI) – broadcast-type operations
2. Computation (CUDA) – multiple GPU nodes as workers
[Diagram: a data source performs real-time streaming to a sender, which broadcasts to multiple worker nodes (each with a CPU and GPUs) that provide HPC resources for real-time analytics.]
Data streaming-like broadcast operations
[Diagram: broadcast from a source node (CPU, GPU, IB HCA) through an IB switch to destinations 1…N; header and data flow as: 1. IB Gather + GDR Read at the source, 2. IB Hardware Multicast through the switch, 3. IB Scatter + GDR Write at each destination.]
• For GPU-resident data, using
– GPUDirect RDMA (GDR)
– InfiniBand Hardware Multicast (IB-MCAST)
• Overhead
– IB UD limit
– GDR limit
Hardware Multicast-based Broadcast
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters,” in HiPC 2014, Dec 2014.
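From the application's perspective these designs sit behind the standard broadcast call; a minimal sketch (hypothetical count and root) of broadcasting a GPU-resident frame to all workers:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Broadcast a GPU-resident buffer of 'count' floats from 'root' to all ranks. */
    void bcast_frame(float **d_frame, int count, int root, MPI_Comm comm) {
        if (*d_frame == NULL)
            cudaMalloc((void **)d_frame, count * sizeof(float));
        /* CUDA-aware MPI accepts the device pointer directly; the runtime may
           use IB-MCAST + GDR underneath when those designs are enabled. */
        MPI_Bcast(*d_frame, count, MPI_FLOAT, root, comm);
    }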
• Proposed Intra-node Topology-Aware Broadcast
– CUDA InterProcess Communication (IPC)
Broadcast on Multi-GPU systems
[Diagram: the source on Node 1 broadcasts through the IB switch in multicast steps to Node 1 … Node N; within each node, data reaches GPU 0, GPU 1, …, GPU N via cudaMemcpy (Device ↔ Device) over CUDA IPC.]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct. 26-28, 2016.
Available since
MVAPICH2-GDR 2.3a
Streaming Benchmark @ CSCS (88 GPUs)
[Charts: broadcast latency vs. message size for small (1 B - 16 KB) and large (32 KB - 4 MB) messages; MCAST-GDR-OPT vs. MCAST-GDR.]
• IB-MCAST + GDR + topology-aware IPC-based schemes
– Up to 58% and 79% latency reduction for small and large messages, respectively
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct. 26-28, 2016.
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Benefits of CUDA-Aware MPI with TensorFlow
• Benefits with CNTK
• Optimized Collectives for Deep Learning
• Co-designed OSU-Caffe
• Out-of-core DNN Training
• Conclusions
Outline
• Deep Learning frameworks are a different game
altogether
– Unusually large message sizes (order of megabytes)
– Most communication based on GPU buffers
• Existing State-of-the-art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance
• For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-
GDR) and the DL framework (Caffe) to achieve both?
– Efficient Overlap of Computation and Communication
– Efficient Large-Message Communication (Reductions)
– What application co-designs are needed to exploit
communication-runtime co-designs?
Deep Learning: New Challenges for MPI Runtimes
[Diagram: scale-up vs. scale-out performance of communication stacks (cuDNN, MKL-DNN, gRPC, Hadoop, MPI, NCCL1, NCCL2), with the desired solution combining both.]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
• MVAPICH2-GDR offers excellent
performance via advanced designs for
MPI_Allreduce.
• Up to 22% better performance on
Wilkes2 cluster (16 GPUs)
Exploiting CUDA-Aware MPI for TensorFlow (Horovod)
[Chart: TensorFlow training throughput in images/sec (higher is better) vs. number of GPUs (1-16, 4 GPUs/node) with Horovod; MVAPICH2 vs. MVAPICH2-GDR.]
MVAPICH2-GDR is up to 22% faster than MVAPICH2
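Horovod drives this through allreduce on GPU-resident gradient buffers; a minimal sketch of the equivalent MPI call (hypothetical gradient count):

    #include <mpi.h>

    /* Sum a GPU-resident gradient buffer across all ranks (Horovod-style).
       d_grad is a device pointer; CUDA-aware MPI handles the data movement. */
    void allreduce_gradients(float *d_grad, int count) {
        MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        /* Dividing by the number of ranks to average is typically done in a
           small CUDA kernel or inside the optimizer step. */
    }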
• @ RI2 cluster, 16 GPUs, 1 GPU/node
– Microsoft Cognitive Toolkit (CNTK) [https://github.com/Microsoft/CNTK]
Performance Benefits with CNTK Deep Learning Framework
[Charts: CNTK training time (s) on 8 and 16 GPU nodes (lower is better) for the AlexNet and VGG models; MV2-GDR-Knomial vs. MV2-GDR-Ring vs. MCAST-GDR-Opt, with 6%-24% improvements.]
C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP’17.
• Reduces up to 24% and 15% of latency for AlexNet and VGG models
• Higher improvement can be observed for larger system sizes
MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0
[Charts: allreduce latency vs. message size over small (4 B - 256 KB), medium, and large (512 KB - 4 MB) ranges; MVAPICH2 is ~2X better than Baidu and ~4X-30X better than OpenMPI across the ranges, while OpenMPI is ~5X slower than Baidu.]
*Available since MVAPICH2-GDR 2.3a
MVAPICH2-GDR vs. NCCL2 – Reduce Operation
• Optimized designs offer better/comparable performance for most cases
• MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs
[Charts: MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) latency; up to ~2.5X better for small messages (4 B - 64 KB) and up to ~5X better for large messages.]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand inter-connect
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Charts: MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) latency; up to ~3X better for small messages (4 B - 64 KB) and ~1.2X better for large messages.]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand inter-connect
• Caffe : A flexible and layered Deep Learning framework.
• Benefits and Weaknesses
– Multi-GPU Training within a single node
– Performance degradation for GPUs across different
sockets
– Limited Scale-out
• OSU-Caffe: MPI-based Parallel Training
– Enable Scale-up (within a node) and Scale-out (across
multi-GPU nodes)
– Scale-out on 64 GPUs for training CIFAR-10 network on
CIFAR-10 dataset
– Scale-out on 128 GPUs for training GoogLeNet network on
ImageNet dataset
OSU-Caffe: Scalable Deep Learning
[Chart: GoogLeNet (ImageNet) training time (seconds) on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); running default Caffe at these scales is marked as an invalid use case.]
OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/
Out-of-Core Deep Neural Network Training with Caffe
• Large DNNs cannot be trained on GPUs due to memory limitation!
– ResNet-50 is the state-of-the-art DNN architecture for Image
Recognition but current frameworks can only go up to a small batch
size of 45
– Next-generation models like Neural Machine Translation (NMT) are ridiculously large, consist of billions of parameters, and require even more memory
– Can we design Out-of-core DNN training support using new software
features in CUDA 8/9 and hardware mechanisms in Pascal/Volta
GPUs?
• General intuition is that managed allocations “will be” slow!
– The proposed framework called OC-Caffe (Out-of-Core Caffe) shows
the potential of managed memory designs that can provide
performance with negligible/no overhead.
– In addition to Out-of-core Training support, productivity can be
greatly enhanced in terms of DL framework design by using the new
Unified Memory features.
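A minimal sketch of the managed-memory idea behind such out-of-core designs (illustrative only; this is not the OC-Caffe implementation): allocating layer buffers with cudaMallocManaged lets the CUDA driver page data between host and GPU memory when the working set exceeds device memory.

    #include <cuda_runtime.h>

    /* Allocate a layer buffer that may be larger than free GPU memory.
       On Pascal/Volta with CUDA 8/9, the driver migrates pages on demand. */
    float *alloc_layer(size_t n_elems) {
        float *buf = NULL;
        cudaMallocManaged((void **)&buf, n_elems * sizeof(float), cudaMemAttachGlobal);
        /* Optional hint: prefer keeping this buffer resident on the current GPU. */
        int dev = 0;
        cudaGetDevice(&dev);
        cudaMemAdvise(buf, n_elems * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
        return buf;
    }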
[Chart: trainability (memory requirements) vs. batch size for AlexNet, GoogLeNet, and VGG; batch sizes beyond the P100 GPU memory limit (16 GB) require out-of-core training.]
A. Awan, C. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ‘18
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta
V100 GPU with CUDA9 and CUDNN7
• OC-Caffe allows efficient scale-up on DGX-1 system with Pascal P100 GPUs with CUDA9 and
CUDNN7
Performance Trends for OC-Caffe
[Charts: (1) Out-of-core (over-subscription) throughput in images/sec (higher is better) for caffe-gpu, oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt; caffe-gpu cannot run, intel-caffe-opt is N/A, and oc-caffe-opt is 80% better than intel-caffe. (2) Scale-up on DGX-1: images per second vs. number of GPUs (1-8) for Caffe and OC-Caffe.]
A. Awan, C. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ‘18
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Outline
• MVAPICH2-GDR Library provides optimized MPI communication on
InfiniBand and RoCE clusters with GPUs
• Supports both x86 and OpenPOWER (with NVLink) platforms
• Takes advantage of CUDA features like IPC and the GPUDirect RDMA family
• Allows flexible solutions for streaming applications with GPUs
• Provides optimized solutions (scale-up and scale-out) for High-Performance
Deep Learning
Conclusions
Join us at the OSU Booth
• Join the OSU Team at SC’18 at Booth #4404 and grab a free T-Shirt!
• More details about various events available at the following link
• http://mvapich.cse.ohio-state.edu/conference/735/talks/
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/