
MULTI GPU PROGRAMMING (WITH MPI)

Massimo Bernaschi National Research Council of Italy

[email protected]

SOME OF THE SLIDES COME FROM A TUTORIAL PRESENTED DURING THE 2014 GTC CONFERENCE BY Jiri Kraus and Peter Messmer (NVIDIA)

Download the sample codes from: http://twin.iac.rm.cnr.it/cccsample.tgz
To unpack the codes: tar zxvf cccsample.tgz
Get http://twin.iac.rm.cnr.it/CUDA_C_QuickRef.pdf

WHAT YOU WILL LEARN
§ How to use more than one GPU for the same task
§ How to use MPI for inter-GPU communication with CUDA
§ What CUDA-aware MPI is
§ What the Multi Process Service (MPS) is and how to use it
§ How to use NVIDIA tools in an MPI environment
§ How to hide MPI communication times


MULTI-GPU MEMORY

ü GPUs do not share global memory
ü Starting with CUDA 4.0, one GPU can copy data from/to another GPU's memory directly, if the GPUs are connected to the same PCIe switch
ü Inter-GPU communication
ü Application code is responsible for copying/moving data between GPUs
ü Data travels across the PCIe bus, even when the GPUs are connected to the same PCIe switch!

SHARING DATA BETWEEN GPUS
§ Options
— Explicit copies via host
— Zero-copy shared host array
— Per-device arrays with peer-to-peer exchange transfers
— Peer-to-peer memory access

MULTI-GPU ENVIRONMENT

ü GPUs have consecutive integer IDs, starting with 0
ü Starting with CUDA 4.0, a host thread can maintain more than one GPU context at a time
ü cudaSetDevice changes the "active" GPU (and so the context)
ü Device 0 is chosen when cudaSetDevice is not called
ü GPUs are ordered by decreasing performance
ü Remember that multiple host threads can establish contexts with the same GPU
ü The driver handles time-sharing and resource partitioning unless the GPU is in exclusive mode

MULTI-GPU ENVIRONMENT: A SIMPLE APPROACH

// Run an independent kernel on each CUDA device
int numDevs = 0;
cudaGetDeviceCount(&numDevs);
...
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    kernel<<<blocks, threads>>>(args);
}

Exercise: compare the dot_simple_multiblock.cu and dot_multigpu.cu programs. Compile and run.
Attention! Add the -arch=sm_20 option: nvcc -o dot_multigpu dot_multigpu.cu -arch=sm_20

CUDA FEATURES USEFUL FOR MULTI-GPU

• Control multiple GPUs with a single CPU thread
— Simpler coding: no need for CPU multithreading
• Streams:
— Enable executing kernels and memcopies concurrently
— Up to 2 concurrent memcopies: to/from the GPU
• Peer-to-Peer (P2P) GPU memory copies
— Transfer data between GPUs using PCIe P2P support
— Done by GPU DMA hardware; the host CPU is not involved
— Data traverses PCIe links without touching CPU memory
— Disjoint GPU pairs can communicate simultaneously

PEER-TO-PEER GPU COMMUNICATION
• Up to CUDA 4.0, GPUs could not exchange data directly.

Example: read p2pcopy.cu, compile with nvcc -o p2pcopy p2pcopy.cu -arch=sm_30

• With CUDA >= 4.0, two GPUs connected to the same PCIe switch can exchange data directly.

cudaStreamCreate(&stream_on_gpu_0);
cudaSetDevice(0);
cudaDeviceEnablePeerAccess( 1, 0 );
cudaMalloc(&d_0, num_bytes); /* create array on GPU 0 */
cudaSetDevice(1);
cudaMalloc(&d_1, num_bytes); /* create array on GPU 1 */
cudaMemcpyPeerAsync( d_0, 0, d_1, 1, num_bytes, stream_on_gpu_0 ); /* copy d_1 from GPU 1 to d_0 on GPU 0: pull copy */

A fuller sketch, including the peer-access capability check, follows below.
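As a complement to the snippet above, this is a minimal, self-contained sketch (not taken from p2pcopy.cu) that checks peer capability with cudaDeviceCanAccessPeer before enabling access; the buffer size and names are illustrative.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int can01 = 0, can10 = 0;
    size_t num_bytes = 1 << 20;
    float *d_0 = NULL, *d_1 = NULL;

    /* Check P2P capability in both directions (GPU 0 <-> GPU 1) */
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    cudaSetDevice(0);
    cudaMalloc((void**)&d_0, num_bytes);          /* array on GPU 0 */
    if (can01) cudaDeviceEnablePeerAccess(1, 0);  /* let GPU 0 access GPU 1 */

    cudaSetDevice(1);
    cudaMalloc((void**)&d_1, num_bytes);          /* array on GPU 1 */
    if (can10) cudaDeviceEnablePeerAccess(0, 0);  /* let GPU 1 access GPU 0 */

    /* Pull copy: d_1 (GPU 1) -> d_0 (GPU 0); falls back to a copy staged
       through host memory if peer access is not available */
    cudaMemcpyPeer(d_0, 0, d_1, 1, num_bytes);

    cudaFree(d_1);
    cudaSetDevice(0);
    cudaFree(d_0);
    return 0;
}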

INTER-GPU COMMUNICATION WITH MPI •  Example: simpleMPI.c

–  Generate some random numbers on one node. –  Dispatch them to all nodes. –  Compute their square root on each node's GPU. –  Compute the average of the results by using MPI.

To compile:
$ nvcc -c simpleCUDAMPI.cu
$ mpic++ -o simpleMPI simpleMPI.c simpleCUDAMPI.o -L/usr/local/cuda/lib64/ -lcudart
(the mpic++ command goes on a single line)

To run:
$ mpirun -np 2 simpleMPI

MPI+CUDA

(Diagram: a cluster of nodes, Node 0 … Node n-1; in each node a GPU with its GDDR5 memory is attached over PCI-e to a CPU with system memory and a network card.)


MPI+CUDA

//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,n-1,tag,MPI_COMM_WORLD);
//MPI rank n-1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

MESSAGE PASSING INTERFACE - MPI
§ Standard to exchange data between processes via messages
— Defines an API to exchange messages
§ Point-to-point: e.g. MPI_Send, MPI_Recv
§ Collectives: e.g. MPI_Reduce
§ Multiple implementations (open source and commercial)
— Bindings for C/C++, Fortran, Python, …
— E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …


MPI – A MINIMAL PROGRAM #include <mpi.h>

int main(int argc, char *argv[]) {

int rank,size;

/* Initialize the MPI library */

MPI_Init(&argc,&argv);

/* Determine the calling process rank and total number of ranks */

MPI_Comm_rank(MPI_COMM_WORLD,&rank);

MPI_Comm_size(MPI_COMM_WORLD,&size);

/* Call MPI routines like MPI_Send, MPI_Recv, ... */

...

/* Shutdown MPI library */

MPI_Finalize();

return 0;

}

MPI – COMPILING AND LAUNCHING

$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>

(mpirun -np 4 starts four instances of myapp, one per rank.)

A SIMPLE EXAMPLE

Solving the Laplace equation with the Jacobi method

(Stencil: T(i,j) is updated from its four neighbours T(i-1,j), T(i+1,j), T(i,j-1), T(i,j+1).)

The Laplace equation in 2D (elliptic PDE)

The Jacobi method: each point is updated as the average of its nearest neighbours

EXAMPLE: JACOBI SOLVER – SINGLE GPU
While not converged
§ Do Jacobi step:
for (int i=1; i < n-1; i++)
    for (int j=1; j < m-1; j++)
        Tnew[i][j] = 0.25f*(T[i-1][j] + T[i+1][j]
                          + T[i][j-1] + T[i][j+1]);
§ Copy Tnew to T, or swap Tnew and T
§ Next iteration
(A sketch of the corresponding CUDA kernel follows below.)
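To make the Jacobi step concrete, here is a minimal sketch of a possible CUDA kernel for one sweep, written for this text rather than taken from the laplace samples; the row-major layout, block size and pointer names are assumptions.

// One Jacobi sweep over the interior of an n x m grid stored row-major.
__global__ void jacobi_step(const float *T, float *Tnew, int n, int m)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y; // row
    int j = blockIdx.x * blockDim.x + threadIdx.x; // column
    if (i > 0 && i < n-1 && j > 0 && j < m-1) {
        Tnew[i*m + j] = 0.25f * (T[(i-1)*m + j] + T[(i+1)*m + j]
                               + T[i*m + (j-1)] + T[i*m + (j+1)]);
    }
}

// Host-side launch (T_d and Tnew_d are device arrays of size n*m):
// dim3 block(16, 16);
// dim3 grid((m + block.x - 1) / block.x, (n + block.y - 1) / block.y);
// jacobi_step<<<grid, block>>>(T_d, Tnew_d, n, m);
// then swap the T_d and Tnew_d pointers before the next iteration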

Exercise LAPLACE CUDA (serial)

•  wget http://twin.iac.rm.cnr.it/eserlaplace.tgz

•  tar zxvf eserlaplace.tgz

•  cd LAPLACE_CNAF/SERIAL

•  C version in the laplace_serial.c, laplace_serial_cuda.c and cuda_tools.c files

•  CUDA version activated by using -DCUDA at compile time

(otherwise code within #ifdef CUDA is not compiled!)

•  Use the Makefile: make c_serial

make c_cuda

Exercise: complete the CUDA parts indicated by EXERCISE in laplace_serial_cuda.c

EXAMPLE: JACOBI SOLVER – MULTI GPU
While not converged
§ Do Jacobi step:
for (int i=1; i < n-1; i++)
    for (int j=1; j < m-1; j++)
        Tnew[i][j] = 0.25f*(T[i-1][j] + T[i+1][j]
                          + T[i][j-1] + T[i][j+1]);
§ Exchange halo with neighbors
§ Copy Tnew to T, or swap Tnew and T
§ Next iteration

EXAMPLE: JACOBI SOLVER
§ Solves the 2D Laplace equation on a rectangle:
Δu(x,y) = 0  for all (x,y) ∈ Ω \ δΩ
— Dirichlet boundary conditions (constant values on the boundary):
u(x,y) = f(x,y)  for all (x,y) ∈ δΩ
§ 2D domain decomposition with n x k domains

(Rank layout: Rank (0,0), Rank (0,1), …, Rank (0,n-1) on the first row, down to Rank (k-1,0), Rank (k-1,1), …, Rank (k-1,n-1) on the last row.)

EXAMPLE: JACOBI TOP/BOTTOM HALO UPDATE

MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Sendrecv(Tnew+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             Tnew+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(Plain C on CPU: exchange 1 sends the first row and receives the bottom halo, exchange 2 sends the last row and receives the top halo.)

EXAMPLE: JACOBI TOP/BOTTOM HALO UPDATE

MPI_Sendrecv(Tnew_d+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew_d+offset_bottom_bondary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Sendrecv(Tnew_d+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             Tnew_d+offset_top_bondary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(CUDA: here Tnew_d is a device pointer.) Depending on the CUDA (and MPI) version, the MPI_Sendrecv may work or fail. We may need to copy data from the GPU to the CPU to exchange them with other MPI tasks.

EXAMPLE: JACOBI – TOP/BOTTOM HALO UPDATE – WITHOUT CUDA-AWARE MPI

//send to bottom and receive from top – the opposite direction is omitted
cudaMemcpy(Tnew+1, Tnew_d+1, (m-2)*sizeof(double), cudaMemcpyDeviceToHost);

MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew+offset_bottom_bondary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

cudaMemcpy(Tnew_d, Tnew, (m-2)*sizeof(double), cudaMemcpyHostToDevice);

EXAMPLE: JACOBI – LEFT/RIGHT HALO UPDATE

//right neighbor omitted
cuda_copy_to_buffer<<<gs,bs,0,s>>>(to_left_d, u_new_d, n, m);
cudaStreamSynchronize(s);
MPI_Sendrecv( to_left_d, n-2, MPI_DOUBLE, l_nb, 0,
              from_left_d, n-2, MPI_DOUBLE, l_nb, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
cuda_copy_from_buffer<<<gs,bs,0,s>>>(u_new_d, from_left_d, n, m);

Jacobi method with MPI Domain Decomposition

T = T0
repeat until (deltaT_max < tol):
1. Update the halo regions (MPI_Sendrecv)
2. Tnew(i,j) = 0.25*(T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1))
3. deltaT(i,j) = abs(Tnew(i,j) - T(i,j))
4. deltaT_max_loc = max(deltaT)
5. compute deltaT_max (MPI_Allreduce)
6. T = Tnew

• The exercise/sample makes use of an MPI Cartesian decomposition
• Each MPI process is in charge of a subdomain and, before executing the update, needs to receive data on the boundaries of its subdomain
• For the evaluation of the global error it is also necessary to carry out a distributed reduction operation (MPI_Allreduce); a sketch of this step follows below
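A minimal sketch of steps 3-5 above (local maximum of the residual followed by a global MPI_MAX reduction); it assumes <math.h> and <mpi.h> are included and uses the names from the pseudocode rather than the actual sample sources.

double deltaT_max_loc = 0.0, deltaT_max = 0.0;

/* Steps 3-4: residual and its maximum over this rank's subdomain */
for (int i = 1; i < n-1; i++)
    for (int j = 1; j < m-1; j++) {
        double d = fabs(Tnew[i][j] - T[i][j]);
        if (d > deltaT_max_loc) deltaT_max_loc = d;
    }

/* Step 5: every rank obtains the same global maximum */
MPI_Allreduce(&deltaT_max_loc, &deltaT_max, 1, MPI_DOUBLE,
              MPI_MAX, MPI_COMM_WORLD);
/* the convergence test (deltaT_max < tol) is now consistent on all ranks */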

Exercise LAPLACE CUDA MPI

•  wget http://twin.iac.rm.cnr.it/eserlaplace.tgz

•  tar zxvf eserlaplace.tgz

•  cd LAPLACE/MPI

•  C version in the laplace_cuda_mpi.c and cuda_tools.c files

•  CUDA version activated by using -DCUDA at compile time

(otherwise code within #ifdef CUDA is not compiled!)

•  Use the Makefile:
make c_mpi
make c_mpi_cuda

Exercise: complete the CUDA parts indicated by EXERCISE. The copies of the MPI buffers correspond to kernels that must be implemented.

THE DETAILS

UNIFIED VIRTUAL ADDRESSING

(Diagram: without UVA, CPU system memory and GPU memory are separate address spaces, each with its own 0x0000-0xFFFF range; with UVA, CPU and GPU memory share a single virtual address space across PCI-e.)

UNIFIED VIRTUAL ADDRESSING

§ One address space for all CPU and GPU memory
— Determine the physical memory location from a pointer value (see the sketch below)
— Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
§ Supported on devices with compute capability >= 2.0, for 64-bit applications on Linux and on Windows in TCC mode
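A small sketch of how a library can classify a pointer under UVA with cudaPointerGetAttributes; the helper is illustrative and not part of the tutorial samples (on CUDA 10 and later the memoryType field is named type).

#include <cuda_runtime.h>
#include <stdio.h>

/* Report whether ptr refers to host or device memory (requires UVA). */
static void classify_pointer(const void *ptr)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        /* e.g. memory from plain malloc(): unknown to CUDA */
        printf("host memory not registered with CUDA\n");
        cudaGetLastError();               /* clear the sticky error */
        return;
    }
    if (attr.memoryType == cudaMemoryTypeDevice)
        printf("device memory on GPU %d\n", attr.device);
    else
        printf("host memory visible to CUDA\n");
}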

(Diagram: without CUDA-aware MPI, the data path is GPU buffer -> cudaMemcpy -> host buffer -> memcpy -> pinned fabric buffer.)

MPI+CUDA

With UVA and CUDA-aware MPI:
//MPI rank 0
MPI_Send(s_buf_d,size,…);
//MPI rank n-1
MPI_Recv(r_buf_d,size,…);

No UVA and regular MPI:
//MPI rank 0
cudaMemcpy(s_buf_h,s_buf_d,size,…);
MPI_Send(s_buf_h,size,…);
//MPI rank n-1
MPI_Recv(r_buf_h,size,…);
cudaMemcpy(r_buf_d,r_buf_h,size,…);

NVIDIA GPUDIRECT™
ACCELERATED COMMUNICATION WITH NETWORK & STORAGE DEVICES

(Diagram: two GPUs with their memories, the CPU, chipset, system memory and an InfiniBand adapter on the PCI-e bus; with this flavour of GPUDirect, the CUDA driver and the network driver share the same pinned buffer in system memory, removing one host-side copy.)

NVIDIA GPUDIRECT™
PEER TO PEER TRANSFERS

(Diagram: same node layout; with GPUDirect peer-to-peer, data moves directly between the memories of the two GPUs over PCI-e, without being staged in system memory.)

NVIDIA GPUDIRECT™ SUPPORT FOR RDMA

(Diagram: same node layout; with GPUDirect RDMA, the InfiniBand adapter reads from and writes to GPU memory directly over PCI-e, bypassing system memory entirely.)

CUDA-AWARE MPI

Example: MPI rank 0 calls MPI_Send on a GPU buffer, MPI rank 1 calls MPI_Recv on a GPU buffer
§ Shows how CUDA+MPI works in principle
— Depending on the MPI implementation, message size, system setup, … the situation might be different
§ Two GPUs in two nodes

CUDA-AWARE MPI

(Diagram: on each rank, data moves between the GPU buffer and a pinned CUDA buffer via PCI-e DMA, between the nodes via RDMA on the fabric, and through the host buffer via memcpy only when staging through the host is required.)

MPI GPU TO REMOTE GPU
GPUDIRECT SUPPORT FOR RDMA

MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

Timeline: a single MPI_Sendrecv covers the whole transfer; the device buffers are handed directly to MPI.

REGULAR MPI GPU TO REMOTE GPU

cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);

Timeline: memcpy D->H, MPI_Sendrecv and memcpy H->D run one after the other.

MPI GPU TO REMOTE GPU WITHOUT GPUDIRECT

MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

Timeline: from the application's point of view it is still a single MPI_Sendrecv on device buffers; internally the CUDA-aware MPI library stages the data through host memory in a pipeline.

More details on pipelines: S4158 - CUDA Streams: Best Practices and Common Pitfalls (Tuesday 03/27 210A)

PERFORMANCE RESULTS TWO NODES

(Bandwidth chart: BW in MB/s versus message size from 1 byte to 4 MB; OpenMPI 1.7.4, MLNX FDR IB (4X), Tesla K40; series: regular MPI, CUDA-aware MPI, CUDA-aware MPI with GPUDirect RDMA.)

Latency (1 byte): 19.04 us (regular MPI), 16.91 us (CUDA-aware MPI), 5.52 us (CUDA-aware MPI with GPUDirect RDMA)

PERFORMANCE RESULTS TWO NODES

(Latency chart, log scale: latency in us versus message size from 1 byte to 4 MB; same configuration as above; series: regular MPI, CUDA-aware MPI, CUDA-aware MPI with GPUDirect RDMA.)

MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS

GPU ACCELERATION OF LEGACY MPI APPLICATION
§ Typical legacy application
— MPI parallel
— Single or few threads per MPI rank (e.g. OpenMP)
§ Running with multiple MPI ranks per node
§ GPU acceleration in phases
— Proof-of-concept prototype, …
— Great speedup at kernel level
§ Application performance misses expectations

(Chart: runtime split into a serial part, a CPU-parallel part and a GPU-parallelizable part for N = 1, 2, 4, 8 ranks, comparing multicore CPU-only runs with GPU-accelerated runs; with Hyper-Q/MPS, available on K20 and K40, the GPU-accelerated configuration is shown for N = 1, 2, 4, 8 as well.)

PROCESSES SHARING GPU WITHOUT MPS: NO OVERLAP

(Diagram: processes A and B each hold their own context on the GPU; the GPU is used in a time-sliced fashion, with a context switch between work from process A and work from process B.)

PROCESSES SHARING GPU WITH MPS: MAXIMUM OVERLAP

(Diagram: processes A and B keep their own contexts but submit work through the MPS process, so kernels from process A and kernels from process B can run on the GPU concurrently.)

CASE STUDY: HYPER-Q/MPS FOR UMT
Enables overlap between copy and compute of different processes.
Sharing the GPU between multiple MPI ranks increases GPU utilization.

USING MPS
-  No application modifications necessary
-  Not limited to MPI applications
-  MPS control daemon
-  Spawns the MPS server upon CUDA application startup
-  Typical setup:
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
-  On Cray XK/XC systems:
export CRAY_CUDA_MPS=1

USING MPS ON MULTI-GPU SYSTEMS
-  The MPS server only supports a single GPU
-  Use one MPS server per GPU
-  Target a specific GPU by setting CUDA_VISIBLE_DEVICES
-  Adjust the pipe/log directory

export DEVICE=0
export CUDA_VISIBLE_DEVICES=${DEVICE}
export CUDA_MPS_PIPE_DIRECTORY=${HOME}/mps${DEVICE}/pipe
export CUDA_MPS_LOG_DIRECTORY=${HOME}/mps${DEVICE}/log
nvidia-cuda-mps-control -d

export DEVICE=1 …

-  More at http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html

MPS SUMMARY
§ Easy path to get GPU acceleration for legacy applications
§ Enables overlapping of memory copies and compute between different MPI ranks


DEBUGGING AND PROFILING

TOOLS FOR MPI+CUDA APPLICATIONS
§ Memory checking: cuda-memcheck
§ Debugging: cuda-gdb
§ Profiling: nvprof and the NVIDIA Visual Profiler

MEMORY CHECKING WITH CUDA-MEMCHECK
§ cuda-memcheck is a functional correctness checking suite similar to the valgrind memcheck tool
§ Can be used in an MPI environment:
mpirun -np 2 cuda-memcheck ./myapp <args>
§ Problem: the output of different processes is interleaved
— Use the save and log-file command line options and a launcher script:

#!/bin/bash
LOG=$1.$OMPI_COMM_WORLD_RANK
#LOG=$1.$MV2_COMM_WORLD_RANK
cuda-memcheck --save $LOG.memcheck $*

mpirun -np 2 cuda-memcheck-script.sh ./myapp <args>

MEMORY CHECKING WITH CUDA-MEMCHECK
Read the output files with cuda-memcheck --read

DEBUGGING MPI+CUDA APPLICATIONS: USING CUDA-GDB WITH MPI APPLICATIONS
§ You can use cuda-gdb just like gdb, with the same tricks
§ For smaller applications, just launch xterms and cuda-gdb:
> mpirun -np 4 xterm -e cuda-gdb --cuda-use-lockfile=0 ./myapp

DEBUGGING MPI+CUDA APPLICATIONS CUDA-GDB ATTACH

§ CUDA 5.0 and later have the ability to attach to a running process

if ( rank == 0 ) {

int i=0;

printf("rank %d: pid %d on %s ready for attach\n.", rank, getpid(),name);

while (0 == i) {

sleep(5);

}

}

> mpiexec -np 2 ./jacobi_mpi+cuda

Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and one Tesla M2070 for each process (2049 rows per process).

rank 0: pid 30034 on judge107 ready for attach

> ssh judge107

jkraus@judge107:~> cuda-gdb --pid 30034

DEBUGGING MPI+CUDA APPLICATIONS CUDA_DEVICE_WAITS_ON_EXCEPTION

DEBUGGING MPI+CUDA APPLICATIONS THIRD PARTY TOOLS

§ Allinea DDT debugger
§ TotalView

PROFILING MPI+CUDA APPLICATIONS USING NVPROF+NVVP

3 usage modes:
1. Embed the pid in the output filename:
mpirun -np 2 nvprof --output-profile profile.out.%p
2. Only save the textual output:
mpirun -np 2 nvprof --log-file profile.out.%p
3. Collect profile data for all processes that run on a node:
nvprof --profile-all-processes -o profile.out.%p

EXERCISE: profile your laplace_mpi_cuda code by using the second mode.


PROFILING MPI+CUDA APPLICATIONS THIRD PARTY TOOLS

§ Multiple parallel profiling tools are CUDA aware —  Score-P

—  Vampir

—  Tau

§ These tools are good for discovering MPI issues as well as basic CUDA performance inhibitors

ADVANCED MPI ON GPUS

BEST PRACTICE: USE NON-BLOCKING MPI

MPI_Request t_b_req[4];

MPI_Irecv(Tnew+offset_top_bondary,m-2,MPI_DOUBLE,t_nb,0,MPI_COMM_WORLD,t_b_req);

MPI_Irecv(Tnew+offset_bottom_bondary,m-2,MPI_DOUBLE,b_nb,1,MPI_COMM_WORLD,t_b_req+1);

MPI_Isend(Tnew+offset_last_row,m-2,MPI_DOUBLE,b_nb,0,MPI_COMM_WORLD,t_b_req+2);

MPI_Isend(Tnew+offset_first_row,m-2,MPI_DOUBLE,t_nb,1,MPI_COMM_WORLD,t_b_req+3);

MPI_Waitall(4, t_b_req, MPI_STATUSES_IGNORE);


MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew+offset_bottom_bondary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Sendrecv(Tnew+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             Tnew+offset_top_bondary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

The first listing is NON-BLOCKING, the second BLOCKING: non-blocking calls give MPI more opportunities to build efficient pipelines.

OPTIMIZED CODE WITH "OLD STYLE" MPI

• Pipelining at user level with non-blocking MPI and CUDA functions
• Sender:

for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_buf+j*block_sz, s_device+j*block_sz, …);
for (j = 0; j < pipeline_len; j++) {
    cudaError_t result = cudaErrorNotReady;
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_buf+j*block_sz, block_sz, MPI_CHAR, 1, 1, …);
}
MPI_Waitall(…);

GENERAL MULTI-GPU PROGRAMMING PATTERN

• The goal is to hide the communication cost
— Overlap it with computation
• So, at every time-step, each GPU should:
— Compute the parts (halos) to be sent to neighbors
— Compute the internal region (bulk)
— Exchange halos with neighbors (overlapped with the bulk computation)
• Linear scaling as long as the internal computation takes longer than the halo exchange
— Actually, the separate halo computation adds some overhead

OVERLAPPING COMMUNICATION AND COMPUTATION

(Timelines: without overlap, each iteration processes the whole domain and then runs MPI; with overlap, the boundary domain is processed first, the inner domain is processed while MPI runs, and only the dependency between the boundary kernel and MPI remains, giving a possible speedup.)

PAGE-LOCKED DATA TRANSFERS
ü cudaMallocHost() allows allocation of page-locked ("pinned") host memory
ü Same syntax as cudaMalloc(), but it works with CPU pointers and memory
ü Enables the highest cudaMemcpy performance
ü 3.2 GB/s on PCI-e x16 Gen1
ü 6.5 GB/s on PCI-e x16 Gen2
ü Use with caution!!!
ü Allocating too much page-locked memory can reduce overall system performance

See and test the bandwidth.cu sample; a small pinned-memory sketch follows below.
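A minimal sketch of the pinned-allocation pattern described above (not taken from bandwidth.cu); the size and names are illustrative.

float *h_pinned = NULL, *d_buf = NULL;
size_t size = 32 * 1024 * 1024;

cudaMallocHost((void**)&h_pinned, size);   /* page-locked host memory */
cudaMalloc((void**)&d_buf, size);

/* Pinned memory is required for truly asynchronous copies */
cudaMemcpyAsync(d_buf, h_pinned, size, cudaMemcpyHostToDevice, 0);
cudaDeviceSynchronize();

cudaFree(d_buf);
cudaFreeHost(h_pinned);                    /* pinned memory has its own free */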

OVERLAPPING DATA TRANSFERS AND COMPUTATION

ü The async copy and stream API allow overlap of H2D or D2H data transfers with computation
ü CPU computation can overlap data transfers on all CUDA-capable devices
ü Kernel computation can overlap data transfers on devices with "Concurrent copy and execution" (compute capability >= 1.1)

ü Stream = sequence of operations that execute in order on the GPU
ü Operations from different streams can be interleaved
ü The stream ID is used as an argument to async calls and kernel launches

ü An asynchronous host-device memory copy returns control immediately to the CPU
ü cudaMemcpyAsync(dst, src, size, dir, stream);
ü requires pinned host memory (allocated with cudaMallocHost)
ü Overlap CPU computation with the data transfer
ü 0 = default stream

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();

ASYNCHRONOUS DATA TRANSFERS


OVERLAPPING KERNEL AND DATA TRANSFER
ü Requires:
ü the kernel and the transfer to use different, non-zero streams
ü For the creation of a stream: cudaStreamCreate
ü Remember that a CUDA call to stream 0 blocks until all previous calls complete and cannot be overlapped
ü Example:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);   // overlapped with the copy

Exercise: read, compile and run simpleStreams.cu

OVERLAPPING COMMUNICATION AND COMPUTATION

process_boundary_and_pack<<<gs_b,bs_b,0,s1>>>(u_new_d,u_d,to_left_d,to_right_d,n,m);
process_inner_domain<<<gs_id,bs_id,0,s2>>>(u_new_d,u_d,to_left_d,to_right_d,n,m);

cudaStreamSynchronize(s1); //wait for boundary

MPI_Request req[8];
//Exchange halo with left, right, top and bottom neighbor
//using non-blocking MPI primitives (a sketch follows below)
MPI_Waitall(8, req, MPI_STATUSES_IGNORE);

unpack<<<gs_s,bs_s>>>(u_new_d, from_left_d, from_right_d, n, m);

cudaDeviceSynchronize(); //wait for iteration to finish
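A minimal sketch of what the elided exchange could look like with CUDA-aware MPI, using the packed device buffers from the listing above; the neighbor ranks (l_nb, r_nb, t_nb, b_nb), the counts and the top/bottom buffer names are assumptions made for illustration.

int nreq = 0;
/* post all receives first, then the matching sends; tags pair each
   send with the receive on the opposite side of the exchange */
MPI_Irecv(from_left_d,   n-2, MPI_DOUBLE, l_nb, 0, MPI_COMM_WORLD, &req[nreq++]);
MPI_Irecv(from_right_d,  n-2, MPI_DOUBLE, r_nb, 1, MPI_COMM_WORLD, &req[nreq++]);
MPI_Irecv(from_top_d,    m-2, MPI_DOUBLE, t_nb, 2, MPI_COMM_WORLD, &req[nreq++]);
MPI_Irecv(from_bottom_d, m-2, MPI_DOUBLE, b_nb, 3, MPI_COMM_WORLD, &req[nreq++]);
MPI_Isend(to_left_d,     n-2, MPI_DOUBLE, l_nb, 1, MPI_COMM_WORLD, &req[nreq++]);
MPI_Isend(to_right_d,    n-2, MPI_DOUBLE, r_nb, 0, MPI_COMM_WORLD, &req[nreq++]);
MPI_Isend(to_top_d,      m-2, MPI_DOUBLE, t_nb, 3, MPI_COMM_WORLD, &req[nreq++]);
MPI_Isend(to_bottom_d,   m-2, MPI_DOUBLE, b_nb, 2, MPI_COMM_WORLD, &req[nreq++]);
/* nreq == 8, matching the MPI_Waitall(8, req, ...) in the listing above */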

HANDLING MULTI GPU NODES

§ Multi GPU nodes and GPU-affinity: —  Use local rank:

int local_rank = //determine local rank

int num_devices = 0;

cudaGetDeviceCount(&num_devices);

cudaSetDevice(local_rank % num_devices);

—  Look at function assignDeviceToProcess in file assign_process_gpu.c (directory LAPLACE_CNAF/MPI)

—  Use exclusive process mode + cudaSetDevice(0)


HANDLING MULTI GPU NODES

§ How to determine local rank: —  Rely on process placement (with one rank per GPU)

int rank = 0;

MPI_Comm_rank(MPI_COMM_WORLD,&rank);

int num_devices = 0;

cudaGetDeviceCount(&num_devices); // num_devices == ranks per node

int local_rank = rank % num_devices;

—  Use the environment variables provided by the MPI launcher (a combined sketch follows below)
§  e.g. for OpenMPI:

int local_rank = atoi(getenv("OMPI_COMM_WORLD_LOCAL_RANK"));

§  e.g. for MVAPICH2:

int local_rank = atoi(getenv("MV2_COMM_WORLD_LOCAL_RANK"));


HOMEWORK

§  Implement the Jacobi method for the solution of the 2D Laplace equation by using CUDA streams and MPI non-blocking point-to-point primitives to overlap computation and communication.

§ Develop a 3D solution by using a decomposition along one axis (e.g., the Z direction)


CONCLUSIONS

§ Using MPI as an abstraction layer for multi-GPU programming allows multi-GPU programs to scale beyond a single node
— CUDA-aware MPI delivers ease of use, reduced network latency and increased bandwidth

§ All NVIDIA tools are usable and third party tools are available
§ Multiple CUDA-aware MPI implementations are available

—  OpenMPI, MVAPICH2, Cray, IBM Platform MPI

§ With CUDA streams it is possible to use the CPU as a GPU network-coprocessor.


OVERLAPPING COMMUNICATION AND COMPUTATION – TIPS AND TRICKS
§ CUDA-aware MPI might use the default stream
— Allocate streams with the non-blocking flag (cudaStreamNonBlocking)
§ In case of multiple kernels for boundary handling, the kernel processing the inner domain might sneak in
— Use a single stream or events for inter-stream dependencies via cudaStreamWaitEvent; this disables overlapping of the boundary and inner-domain kernels
— Use high-priority streams for the boundary-handling kernels; this allows overlapping of the boundary and inner-domain kernels
§ As of CUDA 6.0, GPUDirect P2P in multi-process mode can overlap; disable it for older releases
(A stream-creation sketch follows below.)
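A minimal sketch of the stream setup suggested above: the boundary and inner-domain streams are created with the non-blocking flag so they never implicitly synchronize with the default stream; variable names are illustrative.

cudaStream_t s_boundary, s_inner;

/* Non-blocking streams do not synchronize with stream 0, which a
   CUDA-aware MPI library might be using internally. */
cudaStreamCreateWithFlags(&s_boundary, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&s_inner,    cudaStreamNonBlocking);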


HIGH PRIORITY STREAMS

§  Improve scalability with high priority streams (cudaStreamCreateWithPriority)

(Timelines for a molecular-dynamics-style example: stream 1 computes the local forces while stream 2 exchanges the non-local atom positions, computes the non-local forces and exchanges the non-local forces; running stream 2 at high priority (HP) and stream 1 at low priority (LP) gives a possible speedup. A creation sketch follows below.)
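A minimal sketch of creating the low- and high-priority streams referred to above; the query of the valid priority range and the stream names are illustrative.

int prio_low, prio_high;
/* the "greatest" priority is the numerically lowest value */
cudaDeviceGetStreamPriorityRange(&prio_low, &prio_high);

cudaStream_t stream1, stream2;
/* Stream 1 (LP): bulk/local work; Stream 2 (HP): boundary/exchange work */
cudaStreamCreateWithPriority(&stream1, cudaStreamNonBlocking, prio_low);
cudaStreamCreateWithPriority(&stream2, cudaStreamNonBlocking, prio_high);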