
MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at NVIDIA booth (SC 2018)


Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions


Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,950 organizations in 86 countries

– More than 505,000 (> 0.5 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘18 ranking)

• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China

• 14th, 556,104 cores (Oakforest-PACS) in Japan

• 17th, 367,024 cores (Stampede2) at TACC

• 27th, 241,108-core (Pleiades) at NASA and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

Partner in the upcoming TACC Frontera System


MVAPICH2 Release Timeline and Downloads
[Figure: cumulative number of downloads from the OSU site, Sep-04 through Jun-18, growing past 0.5 million; release milestones marked along the timeline include MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV 1.0, MV2 1.0, MV2 1.0.3, MV 1.1, MV2 1.4 through MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, MV2 2.3, and OSU INAM 0.9.3]


Architecture of MVAPICH2 Software Family
[Diagram] High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection and analysis
Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
Transport protocols: RC, XRC, UD, DC. Modern features: UMR, ODP, SR-IOV, multi-rail. Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM. Modern features: MCDRAM*, NVLink*, CAPI* (* upcoming)


MVAPICH2 Software Family
High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications


MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
[Figure: two nodes, each with CPUs connected over QPI, GPUs attached as PCIe devices, memory buffers on hosts and GPUs, and an IB adapter per node]
Communication paths:
1. Intra-GPU
2. Intra-socket GPU-GPU
3. Inter-socket GPU-GPU
4. Inter-node GPU-GPU
5. Intra-socket GPU-Host
6. Inter-socket GPU-Host
7. Inter-node GPU-Host
8. Inter-node GPU-GPU with IB adapter on remote socket
and more ...
• For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• Critical for runtimes to optimize data movement while hiding the complexity
• GPUs are connected as PCIe devices – flexibility, but complexity


GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
At Sender:
  MPI_Send(s_devbuf, size, …);
At Receiver:
  MPI_Recv(r_devbuf, size, …);
(handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
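A minimal sketch of what this looks like from the application side (illustrative, not from the slides; the buffer name, message size, and ranks are assumptions): pointers returned by cudaMalloc are passed directly to MPI_Send/MPI_Recv, and the library chooses the data-movement path internally.

  /* Illustrative CUDA-aware MPI point-to-point sketch */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv) {
      const int n = 1 << 20;                 /* 1M floats, illustrative size */
      int rank;
      float *devbuf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      cudaMalloc((void **)&devbuf, n * sizeof(float));   /* GPU buffer */

      if (rank == 0)                         /* device pointer handed to MPI */
          MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(devbuf);
      MPI_Finalize();
      return 0;
  }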


CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory

• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)

• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node

• Optimized and tuned collectives for GPU device buffers

• MPI datatype support for point-to-point and collective communication from GPU device buffers

• Unified memory


Using MVAPICH2-GPUDirect Version
• MVAPICH2-2.3 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 3.2 or later
  – NVIDIA Driver 367.48 or later
  – NVIDIA CUDA Toolkit 7.5/8.0/9.0 or later
  – Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
• Strongly recommended
  – GDRCOPY module from NVIDIA: https://github.com/NVIDIA/gdrcopy
• Contact the MVAPICH help list with any questions related to the package: [email protected]
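A quick usage sketch (illustrative, not from the slides): with the GDR package installed, GPU support is enabled at run time by setting MV2_USE_CUDA=1 on the launch line, for example with the mpirun_rsh launcher that ships with MVAPICH2 (host names and the OSU benchmark binary are placeholders):

  mpirun_rsh -np 2 host1 host2 MV2_USE_CUDA=1 ./osu_latency D D

Here the "D D" arguments ask the OSU latency benchmark to use device (GPU) buffers on both sides.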


MVAPICH2-GDR 2.3 GA
• Released on 11/10/2018
• Major Features and Enhancements
  – Based on MVAPICH2 2.3
  – Support for CUDA 10.0
  – Add support for Volta (V100) GPU
  – Support for OpenPOWER with NVLink
  – Efficient multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink
  – Enhanced performance of GPU-based point-to-point communication
  – Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
  – Enhanced performance of MPI_Allreduce for GPU-resident data
  – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
    • Basic support for IB-MCAST designs with GPUDirect RDMA
    • Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
    • Advanced reliability support for IB-MCAST designs
  – Efficient broadcast and all-reduce designs for Deep Learning applications
  – Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems


Optimized MVAPICH2-GDR Design
[Figure: OSU microbenchmark comparison of MV2 (no GDR) vs. MV2-GDR 2.3 for GPU-GPU inter-node latency (1.85 us, up to 11X better), bandwidth (up to 9X better), and bi-directional bandwidth (up to 10X better), message sizes from 1 byte to 4 KB/8 KB]
Platform: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA


Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
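For reference, these settings would appear on a launch line roughly as follows (host file, process count, and the HOOMD-blue invocation are placeholders; the MV2_* values are the ones listed above):

  mpirun_rsh -np 32 -hostfile hosts MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384 ./hoomd run_script.py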

[Figure: average time steps per second (TPS, higher is better) vs. number of processes (4-32) for 64K-particle and 256K-particle HOOMD-blue runs; MV2+GDR delivers about 2X the TPS of MV2 in both cases]


Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time with Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs)]

• 2X improvement on 32 GPUs
• 30% improvement on 96 GPUs (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16.

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application.
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/


Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
  – Multi-stream Communication for IPC
  – CMA-based Intra-node Communication Support
  – Support for OpenPOWER and NVLink
  – Maximal Overlap in MPI Datatype Processing
  – Streaming Support with IB Multicast and GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions


Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

• Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect

• Up to 30% higher D2D bandwidth on DGX-1 with NVLink

[Figure: point-to-point device-to-device bandwidth (MB/s) with 1-stream vs. 4-stream CUDA IPC designs; up to 16% better on OpenPOWER with NVLink (128K-4M messages) and up to 30% better on DGX-1 with NVLink (16K-4M messages)]

Available since MVAPICH2-GDR-2.3a


CMA-based Intra-node Communication Support
[Figure: intra-node point-to-point host-to-host (H2H) latency and bandwidth for MV2-GDR without CMA vs. with CMA; roughly 30% better in both metrics]
• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H bandwidth
Platform: MVAPICH2-GDR 2.3, Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node with 28 cores, NVIDIA Tesla K80 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA


Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce and Alltoall)
[Figure: intra-node (Nodes=1, PPN=20) latency of Reduce and Alltoall for MVAPICH2-GDR, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0 over small (4B-4K) and large (8K-1M) messages; callouts show MVAPICH2-GDR gains of 5.2X, 3.6X, 3.3X, 3.2X, 1.4X, 1.3X, and 1.2X]
Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively


Intra-node Point-to-Point Performance on OpenPOWER
Platform: two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA
[Figure: intra-socket small-message latency (0.30 us for MVAPICH2-2.3), large-message latency, bandwidth, and bi-directional bandwidth, comparing MVAPICH2-2.3, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0]


MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)
[Figure: intra-node and inter-node GPU-GPU latency and bandwidth, intra-socket (NVLink) vs. inter-socket, for message sizes from 1 byte to 4 MB]
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and 4X-FDR InfiniBand interconnect
• Intra-node bandwidth: 33.2 GB/sec (NVLink); intra-node latency: 13.8 us (without GPUDirect RDMA)
• Inter-node latency: 23 us (without GPUDirect RDMA); inter-node bandwidth: 6 GB/sec (FDR)
Available since MVAPICH2-GDR 2.3a


Optimized All-Reduce with XPMEM on OpenPOWER
[Figure: MPI_Allreduce latency (16K-2M messages) for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0 on one node (PPN=20) and two nodes (PPN=20); callouts show MVAPICH2 gains of 2X, 3X, 3.3X, 4X, 34%, and 48%]
Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node


Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
[Figure: halo regions exchanged between neighboring sub-domains]


MPI Datatype Support in MVAPICH2
• Datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable MPI library to optimize non-contiguous data

At Sender:
  MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
  MPI_Type_commit(&new_type);
  MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• Inside MVAPICH2
  – Use datatype-specific CUDA kernels to pack data in chunks
  – Efficiently move data between nodes using RDMA
  – In progress: currently optimizes vector and hindexed datatypes
  – Transparent to the user

H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
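A minimal end-to-end sketch of the pattern above (not from the slides; the matrix size, ranks, and the use of a GPU-resident buffer are illustrative assumptions): a strided column of a 2-D array in GPU memory is described with MPI_Type_vector and sent directly, letting the library pack it with its CUDA kernels.

  /* Illustrative: send one column of an N x N GPU-resident matrix */
  #include <mpi.h>
  #include <cuda_runtime.h>

  #define N 1024

  int main(int argc, char **argv) {
      int rank;
      double *d_mat;                  /* row-major N x N matrix on the GPU */
      MPI_Datatype column;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      cudaMalloc((void **)&d_mat, (size_t)N * N * sizeof(double));

      /* N blocks of 1 element with stride N: one matrix column */
      MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      if (rank == 0)
          MPI_Send(d_mat, 1, column, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(d_mat, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Type_free(&column);
      cudaFree(d_mat);
      MPI_Finalize();
      return 0;
  }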


MPI Datatype Processing (Computation Optimization)

• Comprehensive support

• Targeted kernels for regular datatypes - vector, subarray, indexed_block

• Generic kernels for all other irregular datatypes

• Separate non-blocking stream for kernels launched by MPI library

• Avoids stream conflicts with application kernels

• Flexible set of parameters for users to tune kernels

• Vector

• MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE

• MV2_CUDA_KERNEL_VECTOR_YSIZE

• Subarray

• MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE

• MV2_CUDA_KERNEL_SUBARR_XDIM

• MV2_CUDA_KERNEL_SUBARR_YDIM

• MV2_CUDA_KERNEL_SUBARR_ZDIM

• Indexed_block

• MV2_CUDA_KERNEL_IDXBLK_XDIM


MPI Datatype Processing (Communication Optimization)
[Figure: CPU/GPU timelines. In the existing design, each MPI_Isend initiates a packing kernel and the CPU waits for the kernel (WFK) before starting the send. In the proposed design, kernel initiation, WFK, and sends for Isend(1), Isend(2), and Isend(3) are overlapped, so the proposed design finishes earlier; the serialized version wastes computing resources on both CPU and GPU]
Common scenario (*A, B, ... contain non-contiguous MPI datatypes):
  MPI_Isend(A, ..., datatype, ...);
  MPI_Isend(B, ..., datatype, ...);
  MPI_Isend(C, ..., datatype, ...);
  MPI_Isend(D, ..., datatype, ...);
  ...
  MPI_Waitall(...);
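A runnable sketch of that common scenario (illustrative: the neighbor ranks, buffer count, and vector geometry are assumptions). Each MPI_Isend on a non-contiguous device buffer triggers a packing kernel inside the library, which the proposed design overlaps across the pending sends.

  /* Illustrative: several non-blocking transfers of a strided (vector) datatype */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv) {
      enum { NBUF = 4, NBLK = 512, STRIDE = 1024 };
      double *d_send[NBUF], *d_recv[NBUF];
      MPI_Request req[2 * NBUF];
      MPI_Datatype vec;
      int rank, size, next, prev, i;
      size_t bytes = (size_t)NBLK * STRIDE * sizeof(double);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      next = (rank + 1) % size;
      prev = (rank - 1 + size) % size;

      /* NBLK blocks of 1 element with stride STRIDE: a non-contiguous layout */
      MPI_Type_vector(NBLK, 1, STRIDE, MPI_DOUBLE, &vec);
      MPI_Type_commit(&vec);

      for (i = 0; i < NBUF; i++) {
          cudaMalloc((void **)&d_send[i], bytes);
          cudaMalloc((void **)&d_recv[i], bytes);
          MPI_Irecv(d_recv[i], 1, vec, prev, i, MPI_COMM_WORLD, &req[2 * i]);
          MPI_Isend(d_send[i], 1, vec, next, i, MPI_COMM_WORLD, &req[2 * i + 1]);
      }
      MPI_Waitall(2 * NBUF, req, MPI_STATUSES_IGNORE);

      for (i = 0; i < NBUF; i++) { cudaFree(d_send[i]); cudaFree(d_recv[i]); }
      MPI_Type_free(&vec);
      MPI_Finalize();
      return 0;
  }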


Streaming Applications
• Streaming applications on HPC systems
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA): multiple GPU nodes as workers
[Figure: a data source streams in real time to a sender, which uses HPC resources for real-time analytics by broadcasting the data to worker nodes, each with a CPU and multiple GPUs; data-streaming-like broadcast operations]


Hardware Multicast-based Broadcast
[Figure: a source node (IB HCA, CPU, GPU) connected through an IB switch to destinations 1..N; header and data follow three steps: 1. IB gather + GDR read, 2. IB hardware multicast, 3. IB scatter + GDR write]
• For GPU-resident data, using
  – GPUDirect RDMA (GDR)
  – InfiniBand Hardware Multicast (IB-MCAST)
• Overhead
  – IB UD limit
  – GDR limit

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec 2014.
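From the application's perspective this is still an ordinary MPI_Bcast on a device buffer; a minimal sketch (illustrative root rank, frame size, and loop standing in for a stream of frames):

  /* Illustrative: repeatedly broadcast a GPU-resident frame to all workers */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv) {
      const int frame_elems = 1 << 20;   /* illustrative frame size */
      const int num_frames  = 100;
      float *d_frame;
      int i;

      MPI_Init(&argc, &argv);
      cudaMalloc((void **)&d_frame, frame_elems * sizeof(float));

      for (i = 0; i < num_frames; i++) {
          /* rank 0 is the data source; the runtime may use IB-MCAST + GDR */
          MPI_Bcast(d_frame, frame_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);
          /* ... each worker launches CUDA compute on d_frame here ... */
      }

      cudaFree(d_frame);
      MPI_Finalize();
      return 0;
  }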


Broadcast on Multi-GPU Systems
• Proposed intra-node topology-aware broadcast
  – CUDA Inter-Process Communication (IPC)
[Figure: the source node broadcasts over the IB switch using multicast steps; within each multi-GPU node (GPU 0, GPU 1, ..., GPU N), data is forwarded with cudaMemcpy(Device ↔ Device)]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.

Available since MVAPICH2-GDR 2.3a


Streaming Benchmark @ CSCS (88 GPUs)
• IB-MCAST + GDR + topology-aware IPC-based schemes
  – Up to 58% and 79% latency reduction for small and large messages
[Figure: broadcast latency (us) of MCAST-GDR-OPT vs. MCAST-GDR for small messages (1B-16K, up to 58% lower) and large messages (32K-4M, up to 79% lower)]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.


Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
  – Benefits of CUDA-Aware MPI with TensorFlow
  – Benefits with CNTK
  – Optimized Collectives for Deep Learning
  – Co-designed OSU-Caffe
  – Out-of-core DNN Training
• Conclusions


Deep Learning: New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – NCCL2, CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?
[Figure: qualitative map of scale-up vs. scale-out performance placing cuDNN, MKL-DNN, NCCL1, NCCL2, MPI, gRPC, and Hadoop, with the co-designed solution targeting the "Desired" corner that combines both]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).


Exploiting CUDA-Aware MPI for TensorFlow (Horovod)
• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce
• Up to 22% better performance on the Wilkes2 cluster (16 GPUs)
[Figure: training throughput in images/sec (higher is better) on 1-16 GPUs (4 GPUs/node), MVAPICH2 vs. MVAPICH2-GDR; MVAPICH2-GDR is up to 22% faster than MVAPICH2]
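Under the MPI path, frameworks like Horovod ultimately issue large allreduce operations on GPU-resident gradient buffers; a minimal sketch of that call pattern at the MPI level (illustrative buffer size and datatype, not Horovod's actual code):

  /* Illustrative: allreduce of a large gradient buffer that lives on the GPU */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv) {
      const int n = 8 << 20;             /* ~8M floats, e.g. a fused gradient */
      float *d_grad;

      MPI_Init(&argc, &argv);
      cudaMalloc((void **)&d_grad, (size_t)n * sizeof(float));

      /* every rank contributes its gradients; the sum lands back on the GPU */
      MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

      cudaFree(d_grad);
      MPI_Finalize();
      return 0;
  }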


Performance Benefits with CNTK Deep Learning Framework
• @ RI2 cluster, 16 GPUs, 1 GPU/node
  – Microsoft Cognitive Toolkit (CNTK) [https://github.com/Microsoft/CNTK]
[Figure: training time in seconds (lower is better) on 8 and 16 GPU nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; improvements range from 6% to 24%]
• Reduces latency by up to 24% and 15% for the AlexNet and VGG models, respectively
• Higher improvement can be observed for larger system sizes

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP '17.


MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0
[Figure: allreduce latency vs. message size (4 bytes up to multiple megabytes) for MVAPICH2-GDR, Baidu-allreduce, and OpenMPI; callouts show MVAPICH2-GDR ~2X better than Baidu and ~4X, ~10X, and ~30X better than OpenMPI in different message ranges, with OpenMPI ~5X slower than Baidu at the largest sizes]
*Available since MVAPICH2-GDR 2.3a


MVAPICH2-GDR vs. NCCL2 – Reduce Operation
• Optimized designs offer better/comparable performance for most cases
• MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs
[Figure: reduce latency vs. message size; MVAPICH2-GDR is ~2.5X better than NCCL2 for small messages (4B-64K) and ~5X better for large messages]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect


MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Figure: allreduce latency vs. message size; MVAPICH2-GDR is ~3X better than NCCL2 for small messages (4B-64K) and ~1.2X better for large messages]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect


OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Figure: GoogLeNet (ImageNet) training time in seconds on 8-128 GPUs for Caffe vs. OSU-Caffe with batch sizes 1024 and 2048; plain Caffe is marked as an invalid use case at these scales]
OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/


Out-of-Core Deep Neural Network Training with Caffe
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 is the state-of-the-art DNN architecture for image recognition, but current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are ridiculously large, consist of billions of parameters, and require even more memory
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations "will be" slow
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs that can provide performance with negligible/no overhead
  – In addition to out-of-core training support, productivity can be greatly enhanced in DL framework design by using the new Unified Memory features
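A minimal sketch of the managed-allocation idea (not OC-Caffe code; the over-subscription factor and prefetch hint are illustrative assumptions): cudaMallocManaged lets a tensor larger than GPU memory be allocated once, with the driver paging it between host and device on demand on Pascal/Volta.

  /* Illustrative: allocate an over-subscribed tensor with Unified Memory */
  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(void) {
      size_t free_b, total_b;
      float *tensor;
      int dev;

      cudaMemGetInfo(&free_b, &total_b);

      /* ask for 1.5x the physical GPU memory; Pascal/Volta page on demand */
      size_t bytes = total_b + total_b / 2;
      if (cudaMallocManaged((void **)&tensor, bytes, cudaMemAttachGlobal) != cudaSuccess) {
          fprintf(stderr, "managed allocation failed\n");
          return 1;
      }

      /* optional hint: prefetch part of the working set to the current GPU */
      cudaGetDevice(&dev);
      cudaMemPrefetchAsync(tensor, total_b / 2, dev, 0);

      /* ... kernels may touch any part of 'tensor'; pages migrate on demand ... */

      cudaFree(tensor);
      return 0;
  }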

[Figure: trainability (memory requirements) of AlexNet, GoogLeNet, and VGG at batch sizes from 32 to 4096; configurations above the P100 GPU memory limit of 16 GB require out-of-core training]

A. Awan, C. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC '18.


Performance Trends for OC-Caffe
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
• OC-Caffe allows efficient scale-up on a DGX-1 system with Pascal P100 GPUs with CUDA 9 and cuDNN 7
[Figure: (left) out-of-core (over-subscription) throughput in images/sec for caffe-gpu (cannot run), oc-caffe-naive, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt (N/A); oc-caffe-opt is 80% better than intel-caffe. (right) scale-up on DGX-1: images per second on 1-8 GPUs for Caffe vs. OC-Caffe]

A. Awan, C. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC '18.


Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What’s new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions


Conclusions
• The MVAPICH2-GDR library provides optimized MPI communication on InfiniBand and RoCE clusters with GPUs
• Supports both x86 and OpenPOWER with NVLink
• Takes advantage of CUDA features like IPC and the GPUDirect RDMA family
• Allows flexible solutions for streaming applications with GPUs
• Provides optimized solutions (scale-up and scale-out) for High-Performance Deep Learning


Join us at the OSU Booth

• Join the OSU Team at SC’18 at Booth #4404 and grab a free T-Shirt!

• More details about various events available at the following link

• http://mvapich.cse.ohio-state.edu/conference/735/talks/


Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
[email protected]

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/