NCCL 2 - GTC On-Demand Featured Talks (on-demand.gputechconf.com/gtc/2017/presentation/s...)


TRANSCRIPT

Sylvain Jeaugey

NCCL 2.0

2

DEEP LEARNING ON GPUS: Making DL training times shorter

[Diagram: progression from multi-core CPU, to GPU (CUDA), to multi-GPU (NCCL 1), to multi-GPU multi-node (NCCL 2).]

Deeper neural networks, larger data sets… training is a very, very long operation!

3

NCCL: A multi-GPU communication library

[Diagram: within a system, GPUs communicate over PCIe and NVLink using GPU Direct P2P; to other systems, over Sockets (Ethernet) or Infiniband with GPU Direct RDMA.]

4

NCCL Architecture

[Diagram: Deep Learning frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK) sit on top of CUDA libraries (CUDNN, CUBLAS, NCCL), which run on NVIDIA GPUs.]

5

AGENDA

NCCL: History, Design

NCCL 2.0: New features, API changes

Performance

Future

6

HISTORY

Q4 2015: NCCL 1.x

Open-source research project on github, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations.

Limited to intra-node.

Q2 2017: NCCL 2.x and beyond

NVIDIA Library, multi-node support and improved API.

7

DESIGN: What is NCCL?

Optimized collective communication library between CUDA devices.

Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.

Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions, and NVLink usage.

Operates on CUDA pointers. Operations are tied to a CUDA stream.

Uses as few threads as possible to let other computation progress simultaneously.
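For illustration, a minimal sketch of what this looks like from a framework's side (the function name and buffer are illustrative, not from the slides; the communicator and stream are assumed to already exist; error checking is minimal):

    #include <nccl.h>
    #include <cuda_runtime.h>

    // Sum-reduce a device buffer of n floats across all ranks of an
    // existing communicator. The call only enqueues work on 'stream'
    // and returns; the reduction itself runs as a CUDA kernel.
    ncclResult_t allreduce_gradients(float* d_grad, size_t n,
                                     ncclComm_t comm, cudaStream_t stream) {
      // In-place allreduce on a CUDA pointer, tied to the given stream.
      ncclResult_t ret = ncclAllReduce(d_grad, d_grad, n, ncclFloat, ncclSum,
                                       comm, stream);
      if (ret != ncclSuccess) return ret;
      // Results are ready once the stream has drained.
      cudaStreamSynchronize(stream);
      return ncclSuccess;
    }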

8

DESIGN: Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI: 1 unidirectional ring
DGX-1: 4 unidirectional rings

11

DESIGN: Kernels

[Diagram: each GPU's reduction kernel receives data from the previous GPU in the ring through a FIFO, combines it with its local sendbuff, and forwards the result to the next GPU in the ring, writing into recvbuff.]
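To make the ring data flow concrete, here is a small host-side simulation of a chunked ring allreduce schedule (plain C, simulated ranks in one process; an illustration of the general technique, not NCCL's actual kernel code):

    #include <stdio.h>

    #define N 4        /* simulated ranks in the ring */
    #define CHUNK 2    /* elements per chunk */

    int main(void) {
      /* Each "rank" owns a buffer of N chunks; rank r contributes r+1. */
      float buf[N][N * CHUNK];
      for (int r = 0; r < N; r++)
        for (int i = 0; i < N * CHUNK; i++)
          buf[r][i] = (float)(r + 1);

      /* Phase 1, reduce-scatter: in step s, rank r forwards chunk (r-s) mod N
         to the next rank, which adds it to its own copy. After N-1 steps,
         rank r holds the fully reduced chunk (r+1) mod N. */
      for (int s = 0; s < N - 1; s++)
        for (int r = 0; r < N; r++) {
          int next = (r + 1) % N;
          int c = (r - s + N) % N;
          for (int i = 0; i < CHUNK; i++)
            buf[next][c * CHUNK + i] += buf[r][c * CHUNK + i];
        }

      /* Phase 2, allgather: in step s, rank r forwards chunk (r+1-s) mod N
         to the next rank, which copies it. After N-1 more steps every rank
         holds every reduced chunk. */
      for (int s = 0; s < N - 1; s++)
        for (int r = 0; r < N; r++) {
          int next = (r + 1) % N;
          int c = (r + 1 - s + N) % N;
          for (int i = 0; i < CHUNK; i++)
            buf[next][c * CHUNK + i] = buf[r][c * CHUNK + i];
        }

      /* Every element should now equal 1 + 2 + ... + N. */
      printf("rank 0, element 0: %.0f (expected %d)\n",
             buf[0][0], N * (N + 1) / 2);
      return 0;
    }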

12

NCCL 2.0

13

NCCL 2.0: Inter-node communication

Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.

Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.

[Diagram: rings extended across nodes over PCIe and Infiniband; on DGX-1, over NVLink and 4x Infiniband.]


15

NCCL 2.0: Processes, threads and GPUs

Supports any combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: n nodes; each node has two CPU sockets (CPU0, CPU1) with 4 GPUs per socket, i.e. GPUs 0-7 per node.]

16

NCCL 2.0

Example: 1 process per GPU.

[Diagram: the same n-node topology with one process per GPU: P0-P7 on node 0, P8-P15 on node 1, ..., P(8n-8) to P(8n-1) on node n-1.]

17

NCCL 2.0

Example: 1 process per socket, 1 thread per GPU.

[Diagram: the same topology with one process per socket (Process 0 to Process 2n-1), each running 4 threads (t0-t3), one per GPU.]

18

NCCL 2.0

Example: 1 process per node, 8 GPUs per process.

[Diagram: the same topology with one process per node (Process 0 to Process n-1), each driving all 8 GPUs.]
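A small sketch of the rank arithmetic implied by these three layouts (hypothetical helper functions, assuming 2 sockets and 8 GPUs per node as in the diagrams):

    #include <stdio.h>

    /* Global NCCL rank for the three launch modes shown above.
       Assumes n nodes, 2 sockets per node, 4 GPUs per socket (8 per node). */

    /* 1 process per GPU: the process index is the rank. */
    int rank_one_process_per_gpu(int node, int local_gpu) {
      return node * 8 + local_gpu;
    }

    /* 1 process per socket, 1 thread per GPU: 2n processes, 4 threads each. */
    int rank_one_process_per_socket(int node, int socket, int thread) {
      int process = node * 2 + socket;   /* Process 0 .. 2n-1 */
      return process * 4 + thread;       /* thread t0..t3 drives one GPU */
    }

    /* 1 process per node, 8 GPUs per process (single thread, group calls). */
    int rank_one_process_per_node(int node, int gpu_in_process) {
      return node * 8 + gpu_in_process;
    }

    int main(void) {
      /* e.g. node 1, socket 1, thread 2 -> global rank 14 */
      printf("%d\n", rank_one_process_per_socket(1, 1, 2));
      return 0;
    }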

19

NCCL 2.0 API: Group calls

NCCL 2.0 introduces new, mandatory verbs, ncclGroupStart/ncclGroupEnd, when managing multiple devices from a single thread.

NCCL 1.x:

    for (int i=0; i<ngpus; i++) {
      cudaSetDevice(devices[i]);
      ncclAllReduce(…, comms[i], streams[i]);
    }

NCCL 2.0:

    ncclGroupStart();
    for (int i=0; i<ngpus; i++) {
      ncclAllReduce(…, comms[i], streams[i]);
    }
    ncclGroupEnd();

[Diagram: a single process (Process 0) on one node driving all 8 GPUs (GPU 0-7) across the two CPU sockets.]
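A more complete sketch of the NCCL 2.0 pattern above, assuming one process driving ndev GPUs (at most 8 here) on a single node; the communicator and buffer setup are not on the slide and are added for context, and error checking is omitted:

    #include <nccl.h>
    #include <cuda_runtime.h>

    // Single process, ndev GPUs: one communicator per device, then one
    // allreduce per device issued inside a group (NCCL 2.0 semantics).
    void allreduce_single_process(int ndev, size_t count) {
      ncclComm_t comms[8];                      // assumes ndev <= 8
      cudaStream_t streams[8];
      float* sendbuff[8];
      float* recvbuff[8];
      int devs[8];

      for (int i = 0; i < ndev; i++) devs[i] = i;
      ncclCommInitAll(comms, ndev, devs);       // one communicator per GPU

      for (int i = 0; i < ndev; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
        cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
      }

      // Group the per-device calls so they are launched together.
      ncclGroupStart();
      for (int i = 0; i < ndev; i++)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < ndev; i++) {          // wait for completion
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
      }

      for (int i = 0; i < ndev; i++) {          // cleanup
        cudaSetDevice(devs[i]);
        cudaFree(sendbuff[i]);
        cudaFree(recvbuff[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
      }
    }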

20

NCCL 2.0 API: Integration with parallel environments

Inter-node communicator creation still uses the NCCL 1.x verbs: ncclGetUniqueId/ncclCommInitRank.

    if (rank == 0) ncclGetUniqueId(&id);
    My_Bcast(&id);
    ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread): combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd.

    if (rank == 0) ncclGetUniqueId(&id);
    My_Bcast(&id);
    ncclGroupStart();
    for (int i=0; i<ndev; i++) {
      cudaSetDevice(devices[i]);
      ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
    }
    ncclGroupEnd();

[Diagram: the two setups above on one node: left, one process per GPU (P0-P7); right, a single process (Process 0) driving all 8 GPUs.]
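For reference, an end-to-end sketch of the first pattern with one GPU per MPI rank; MPI_Bcast stands in for the slide's My_Bcast, the 8-GPUs-per-node assumption is ours, and error checking is omitted:

    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(int argc, char* argv[]) {
      int rank, nranks;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      // Rank 0 creates the NCCL unique id and broadcasts it to everyone.
      ncclUniqueId id;
      if (rank == 0) ncclGetUniqueId(&id);
      MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

      // One GPU per rank: pick the local device, then join the communicator.
      cudaSetDevice(rank % 8);                  // assumes 8 GPUs per node
      ncclComm_t comm;
      ncclCommInitRank(&comm, nranks, id, rank);

      // Allreduce a device buffer on a dedicated stream.
      size_t count = 1 << 20;
      float* d_buf;
      cudaMalloc((void**)&d_buf, count * sizeof(float));
      cudaMemset(d_buf, 0, count * sizeof(float));
      cudaStream_t stream;
      cudaStreamCreate(&stream);

      ncclAllReduce(d_buf, d_buf, count, ncclFloat, ncclSum, comm, stream);
      cudaStreamSynchronize(stream);

      cudaFree(d_buf);
      cudaStreamDestroy(stream);
      ncclCommDestroy(comm);
      MPI_Finalize();
      return 0;
    }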

21

NCCL 2.0 API: Others

Other small API adjustments over the NCCL 1.x API:

Counts are now of type size_t instead of int.

The allGather argument order has been fixed to match the other operations.

Additions/clarifications on datatypes:
  integral: int8 = char, uint8, int32 = int, uint32, int64, uint64
  floating point: float16 = half, float32 = float, float64 = double

Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
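As an illustration of the allGather change, a sketch assuming the NCCL 2.0 prototype (the function and buffer names are illustrative; the 1.x ordering in the comment is from memory and approximate):

    #include <nccl.h>
    #include <cuda_runtime.h>

    // NCCL 2.0 allGather: send/recv buffers first, per-rank count as size_t.
    // In NCCL 1.x, recvbuff came after the datatype and the count was an int.
    ncclResult_t gather_all(const float* d_send, float* d_recv, size_t count,
                            ncclComm_t comm, cudaStream_t stream) {
      // d_recv must hold count * nranks floats.
      return ncclAllGather(d_send, d_recv, count, ncclFloat, comm, stream);
    }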

22

PERFORMANCE

23

PERFORMANCE: Intra-node performance

[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) for four configurations: 4 GPUs across QPI, 4 GPUs across CPU, 4 GPUs on PCI, and DGX-1; y-axis 0-60 GB/s.]

24

PERFORMANCE: Inter-node performance

[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink); y-axis 0-45 GB/s.]

25

PERFORMANCE: Deep Learning - CNTK

[Chart: CNTK ResNet50 scaling in images/s on up to 32 GPUs, comparing Ideal, MPI and NCCL; y-axis 0-8000 images/s. Labeled data points include 217, 1645, 1684, 1744, 3281, 3360 and 6569 images/s.]

26

FUTURE

27

FUTURE: Top asked features

Additional communication primitives:
  point-to-point communication
  scatter (1 to N), gather (N to 1), alltoall (N to N)
  neighbor collectives (send/receive in multiple dimensions)

User-defined reduction operations; also, trying to better merge computation and communication.

Windows support.

Please let us know your needs! Connect with experts / NCCL session: Wed Apr 10, 4pm.
