
Page 1:

Advanced MPI Capabilities

Dhabaleswar K. (DK) Panda The Ohio State University

E-mail: [email protected] http://www.cse.ohio-state.edu/~panda

VSCSE Webinar (May 6-8, 2014)

Day 3

by

Karen Tomko Ohio Supercomputer Center

E-mail: [email protected] http://www.osc.edu/~ktomko

Page 2:

• Tuesday, May 6 – MPI-3 Additions to the MPI Spec
  – Updates to the MPI One-Sided Communication Model (RMA)
  – Non-Blocking Collectives
  – MPI Tools Interface
• Wednesday, May 7 – MPI/PGAS Hybrid Programming
  – MVAPICH2-X: Unified Runtime for MPI+PGAS
  – MPI+OpenSHMEM
  – MPI+UPC
• Thursday, May 8 – MPI for Many-core Processors
  – MVAPICH2-GPU: CUDA-aware MPI for NVIDIA GPUs
  – MVAPICH2-MIC Design for Clusters with InfiniBand and Intel Xeon Phi

2

Plans for Thursday

VSCSE-Day3

Page 3:

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous

• InfiniBand very popular in HPC clusters

• Accelerators/Coprocessors becoming common in high-end systems

• Pushing the envelope for Exascale computing

[Figure: Multi-core Processors; High-Performance Interconnects – InfiniBand (<1 µs latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); Top500 examples: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)]

3 VSCSE-Day3

Page 4:

The Multi-core Era

[Chart: performance share of cores/socket in the Top500 systems, 2003–2013]

VSCSE-Day3 4

Page 5:

Advent of Accelerators

[Chart: performance share of accelerators in the Top500 systems; highlighted shares of 18.1% and 16.2%]

VSCSE-Day3 5

Page 6:

State of the Art in Interconnects

[Chart: performance share of interconnects in the Top500 systems]

VSCSE-Day3 6

Page 7:

VSCSE-Day3 7

Parallel Programming Models Overview

[Diagram: three abstract machine models]
• Shared Memory Model (SHMEM, DSM): P1, P2, P3 share one memory
• Distributed Memory Model (MPI – Message Passing Interface): each of P1, P2, P3 has its own memory
• Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories plus a logical shared memory

• Programming models provide abstract machine models

• Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

• We concentrated on MPI during Day 1

• PGAS and Hybrid MPI+PGAS during Day 2

Page 8:

Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Diagram: software stack for multi-petaflop and exaflop systems]
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Middleware – Communication Library or Runtime for Programming Models: point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, fault tolerance
• Networking Technologies (InfiniBand, 40/100 GigE, Aries, BlueGene), Multi/Many-core Architectures, Accelerators (NVIDIA and MIC)
• Co-design opportunities and challenges across the various layers: performance, scalability, fault-resilience

VSCSE-Day3 8

Page 9:

9

Presentation Overview

VSCSE-Day3

• CUDA-Aware MPI (MVAPICH2-GPU)

• MVAPICH2-GPU with GPUDirect-RDMA (GDR)

• MVAPICH2 and OpenACC

• Programming Models for Intel MIC

• MVAPICH2-MIC Designs

Page 10:

GPU-Direct

• Collaboration between Mellanox and NVIDIA to converge on one memory-registration technique

• Both devices register a common host buffer – the GPU copies data to this buffer, and the network adapter can directly read from this buffer (or vice versa)

• Note that GPU-Direct does not allow you to bypass host memory

• The latest GPUDirect RDMA achieves this

VSCSE-Day3 10

Page 11:

Sample Code – Without MPI Integration

[Diagram: GPU and NIC attached to the CPU over PCIe; nodes connected through a switch]

At Sender:
    cudaMemcpy(sbuf, sdev, . . .);
    MPI_Send(sbuf, size, . . .);

At Receiver:
    MPI_Recv(rbuf, size, . . .);
    cudaMemcpy(rdev, rbuf, . . .);

• Naïve implementation with standard MPI and CUDA

• High Productivity and Poor Performance

11 VSCSE-Day3

Page 12:

Sample Code – User-Optimized Code

[Diagram: GPU and NIC attached to the CPU over PCIe; nodes connected through a switch]

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(sbuf + j * blksz, blksz, . . .);
    }
    MPI_Waitall(…);

• Pipelining at the user level with non-blocking MPI and CUDA interfaces

• Code shown for the sender side (and repeated at the receiver side)

• User-level copying may not match the internal MPI design

• High Performance and Poor Productivity

VSCSE-Day3 12

Page 13:

Can this be done within MPI Library?

• Support GPU to GPU communication through standard MPI interfaces

– e.g. enable MPI_Send, MPI_Recv from/to GPU memory

• Provide high performance without exposing low level details to the programmer

– Pipelined data transfer which automatically provides optimizations inside MPI library without user tuning

• A new design incorporated in MVAPICH2 to support this functionality

13 VSCSE-Day3

Page 14:

Sample Code – MVAPICH2-GPU

At Sender:
    MPI_Send(s_device, size, …);

At Receiver:
    MPI_Recv(r_device, size, …);

(Pipelining and staging are handled inside MVAPICH2; a fuller, self-contained sketch follows at the end of this slide.)

• MVAPICH2-GPU: standard MPI interfaces used

• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)

• Overlaps data movement from GPU with RDMA transfers

• High Performance and High Productivity • CUDA-Aware MPI

14 VSCSE-Day3

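To make the abbreviated snippet above concrete, here is a minimal, self-contained sketch of the same CUDA-aware pattern. It is not from the original slides: it assumes an MPI library with CUDA-aware support (such as the MVAPICH2-GPU design described here, built with --enable-cuda and typically run with MV2_USE_CUDA=1), and the buffer name and message size are illustrative.

    /* cuda_aware_pingpong.c – minimal sketch; assumes a CUDA-aware MPI library */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int N = 1 << 20;             /* illustrative message size (floats) */
        float *d_buf;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, N * sizeof(float));    /* device buffer */

        if (rank == 0) {
            /* Send directly from GPU memory; staging/pipelining happens inside the library */
            MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive directly into GPU memory */
            MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }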

Page 15:

MPI-Level Two-sided Communication

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send), for 4MB messages

• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend), for 4MB messages

[Chart: MPI-level two-sided latency (32 KB – 4 MB) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better]

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC ’11

15 VSCSE-Day3

Page 16:

MVAPICH2 1.8, 1.9, 2.0a–2.0rc1 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers

16 VSCSE-Day3

Page 17:

VSCSE-Day3 17

Applications-Level Evaluation

• LBM-CUDA (Courtesy: Carlos Rosale, TACC): Lattice Boltzmann Method for multiphase flows with large density ratios; one process/GPU per node, 16 nodes
• AWP-ODC (Courtesy: Yifeng Cui, SDSC): a seismic modeling code, Gordon Bell Prize finalist at SC 2010; 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node

[Charts: LBM-CUDA step time (MPI vs. MPI-GPU) for domain sizes 256x256x256 through 512x512x512, with improvements of 13.7%, 12.0%, 11.8% and 9.4%; AWP-ODC total execution time (MPI vs. MPI-GPU) with 1 GPU/process and 2 GPUs/processes per node, with improvements of 11.1% and 7.9%]

Page 18:

18

Presentation Overview

VSCSE-Day3

• CUDA-Aware MPI (MVAPICH2-GPU)

• MVAPICH2-GPU with GPUDirect-RDMA (GDR) – Two-sided Communication

– One-sided Communication

– MPI Datatype Processing

• MVAPICH2 and OpenACC

• Programming Models for Intel MIC

• MVAPICH2-MIC Designs

Page 19:

• Fastest possible communication between GPU and other PCI-E devices

• Network adapter can directly read data from GPU device memory

• Avoids copies through the host

• Allows for better asynchronous communication

VSCSE-Day3 19

GPU-Direct RDMA with CUDA 5/6

[Diagram: InfiniBand adapter reading GPU memory directly, avoiding the host; CPU, chipset, GPU memory and system memory shown]

MVAPICH2-GDR 2.0b is available for users

Page 20:

• Hybrid design using GPUDirect RDMA

– GPUDirect RDMA and Host-based pipelining

– Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge

• Support for communication using multi-rail

• Support for Mellanox Connect-IB and ConnectX VPI adapters

• Support for RoCE with Mellanox ConnectX VPI adapters

VSCSE-Day3 20

GPUDirect RDMA (GDR) with CUDA

[Diagram: IB adapter, system memory, CPU, chipset, GPU and GPU memory; P2P bandwidths – SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680 v2: P2P write 6.4 GB/s, P2P read 3.5 GB/s]

S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)

Page 21:

21

Performance of MVAPICH2 with GPUDirect-RDMA: Latency GPU-GPU Internode MPI Latency

[Charts: large-message (8 KB – 2 MB) and small-message (1 B – 4 KB) GPU-GPU internode MPI latency for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; GDR improves large-message latency by about 10% and small-message latency by about 67%, down to 5.49 µs]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA patch

VSCSE-Day3

Page 22:

22

Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth GPU-GPU Internode MPI Uni-Directional Bandwidth

[Charts: small-message (1 B – 4 KB) and large-message (8 KB – 2 MB) GPU-GPU internode MPI uni-directional bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; highlighted gains of about 5x with GDR and a peak of 9.8 GB/s]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA patch

VSCSE-Day3

Page 23:

23

Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth

Based on MVAPICH2-2.0b Intel Ivy Bridge (E5-2680 v2) node with 20 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Patch

[Charts: small-message (1 B – 4 KB) and large-message (8 KB – 2 MB) GPU-GPU internode MPI bi-directional bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; highlighted gains of about 4.3x and 19% with GDR and a peak of 19 GB/s]

VSCSE-Day3

Page 24:

24

Applications-level Benefits: AWP-ODC with MVAPICH2-GPU

Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Patch Two nodes, one GPU/node, one Process/GPU

[Chart: AWP-ODC execution time on Platform A and Platform B, MV2 vs. MV2-GDR; improvements of 11.6% and 15.4%]

Platform A: Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3

Platform B: Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB

• A widely-used seismic modeling application, Gordon Bell Finalist at SC 2010

• An initial version using MPI + CUDA for GPU clusters

• Takes advantage of CUDA-aware MPI, two nodes, 1 GPU/Node and 64x32x32 problem

• GPUDirect-RDMA delivers better performance with newer architecture

[Chart: computation and communication time breakdown on Platforms A and B, with and without GDR; highlighted improvements of 21% and 30%]

VSCSE-Day3

Page 25:

25

Continuous Enhancements for Improved Point-to-point Performance

Based on MVAPICH2-2.0b + enhancements Intel Ivy Bridge (E5-2630 v2) node with 12 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Patch

VSCSE-Day3

[Chart: GPU-GPU internode MPI small-message latency (1 B – 16 KB), MV2-2.0b-GDR vs. the improved design; about 31% improvement]

• Reduced synchronization overhead while avoiding expensive copies

Page 26:

26

Dynamic Tuning for Point-to-point Performance

Based on MVAPICH2-2.0b + enhancements Intel Ivy Bridge (E5-2630 v2) node with 12 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Patch

VSCSE-Day3

GPU-GPU Internode MPI Performance

[Charts: GPU-GPU internode latency (128 KB – 2 MB) and bi-directional bandwidth (1 B – 1 MB) under Opt_latency, Opt_bw and Opt_dynamic tuning]

Page 27:

27

Presentation Overview

VSCSE-Day3

• CUDA-Aware MPI (MVAPICH2-GPU)

• MVAPICH2-GPU with GPUDirect-RDMA (GDR) – Two-sided Communication

– One-sided Communication

– MPI Datatype Processing

• MVAPICH2 and OpenACC

• Programming Models for Intel MIC

• MVAPICH2-MIC Designs

Page 28:

28

One-sided communication

• Send/Recv semantics incur overheads
  – Distributed buffer information
  – Message matching
  – Additional copies or rendezvous exchange
• One-sided communication
  – Separates synchronization from communication
  – Direct mapping onto RDMA semantics
  – Lower overheads and better overlap

Table: Latency (half round trip, µs) for 4-byte messages on SandyBridge nodes with FDR Connect-IB

                  Host-Host   GPU-GPU
  IB send/recv       0.98       1.84
  MPI send/recv      1.25       6.95

VSCSE-Day3

Page 29:

29

MPI-3 RMA Support with GPUDirect RDMA

Based on MVAPICH2-2.0b + Extensions Intel Sandy Bridge (E5-2670) node with 16 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 5.5, Mellanox OFED 2.1 with GPUDirect-RDMA Plugin

MPI-3 RMA provides flexible synchronization and completion primitives
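As an illustration of these primitives (not from the original slides), the sketch below uses MPI-3 passive-target RMA with Put and Flush on a window backed by GPU memory. It assumes an MPI library whose RMA path accepts device buffers, as in the MVAPICH2 extensions evaluated here; the buffer size and target rank are illustrative.

    /* MPI-3 RMA Put + Flush between GPU buffers (hedged sketch) */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void rma_put_flush(int rank, size_t nbytes)
    {
        void *d_buf;
        MPI_Win win;

        cudaMalloc(&d_buf, nbytes);                         /* device buffer */
        MPI_Win_create(d_buf, nbytes, 1, MPI_INFO_NULL,     /* expose it as an RMA window */
                       MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(0, win);                           /* passive-target epoch */
        if (rank == 0) {
            /* Put into rank 1's GPU window, then flush to complete it at the target */
            MPI_Put(d_buf, (int)nbytes, MPI_BYTE, 1, 0, (int)nbytes, MPI_BYTE, win);
            MPI_Win_flush(1, win);
        }
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        cudaFree(d_buf);
    }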

[Charts: GPU-GPU small-message latency for Send-Recv, Put+ActiveSync and Put+Flush, and small-message rate for Send-Recv and Put+ActiveSync, 1 B – 4 KB; Put+Flush latency stays below 2 µs]

VSCSE-Day3

Page 30:

30

Communication Kernel Evaluation: 3Dstencil and Alltoall

Based on MVAPICH2-2.0b + Extensions Intel Sandy Bridge (E5-2670) node with 16 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 5.5, Mellanox OFED 2.1 with GPUDirect-RDMA Plugin

[Charts: latency of Send/Recv vs. one-sided (1 B – 4 KB) for a 3D stencil with 16 GPU nodes and for Alltoall with 16 GPU nodes; improvements of 42% and 50%]

VSCSE-Day3

Page 31:

31

Presentation Overview

VSCSE-Day3

• CUDA-Aware MPI (MVAPICH2-GPU)

• MVAPICH2-GPU with GPUDirect-RDMA (GDR) – Two-sided Communication

– One-sided Communication

– MPI Datatype Processing

• MVAPICH2 and OpenACC

• Programming Models for Intel MIC

• MVAPICH2-MIC Designs

Page 32:

Non-contiguous Data Exchange

32

• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains]

VSCSE-Day3

Page 33:

MPI Datatype Processing

• Comprehensive support
  – Targeted kernels for regular datatypes: vector, subarray, indexed_block
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune the kernels (a usage sketch follows at the end of this slide)
  – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM

33 VSCSE-Day3
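As an illustration (not from the original slides), the sketch below sends one non-contiguous column of a row-major grid that lives in GPU memory using MPI_Type_vector; a CUDA-aware MPI with the datatype support described above packs the column with a device kernel internally. The grid dimensions and peer rank are illustrative.

    /* Sending a non-contiguous GPU-resident column with an MPI vector datatype */
    #include <mpi.h>

    void send_gpu_column(double *d_grid /* device memory, NX x NY, row-major */,
                         int NX, int NY, int peer)
    {
        MPI_Datatype column;

        /* NX blocks of 1 element each, separated by a stride of NY elements */
        MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* d_grid is a device pointer; the library handles packing/unpacking */
        MPI_Send(d_grid, 1, column, peer, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column);
    }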

Page 34:

34

Application-Level Evaluation (LBMGPU-3D)

• LBM-CUDA (Courtesy: Carlos Rosale, TACC): Lattice Boltzmann Method for multiphase flows with large density ratios
• 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node

[Chart: 3D LBM-CUDA total execution time with 8, 16, 32 and 64 GPUs, MPI vs. MPI-GPU; improvements of 5.6%, 8.2%, 13.5% and 15.5%]

VSCSE-Day3

Page 35:

How can I get started with GDR experimentation?

• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements:
  – Mellanox OFED 2.1
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 5.5
  – Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Work in progress on optimizing collective and one-sided communication
• Contact the MVAPICH help list with any questions related to the package: [email protected]
• MVAPICH2-GDR-RC1 with additional optimizations coming soon!!

35 VSCSE-Day3

Page 36:

36

Presentation Overview

VSCSE-Day3

• CUDA-Aware MPI (MVAPICH2-GPU)

• MVAPICH2-GPU with GPUDirect-RDMA (GDR) – Two-sided Communication

– One-sided Communication

– MPI Datatype Processing

• MVAPICH2 and OpenACC

• Programming Models for Intel MIC

• MVAPICH2-MIC Designs

Page 37:

OpenACC

• OpenACC is gaining popularity
• Multiple sessions during GTC ’14
• A set of compiler directives (#pragma)
• Offloads specific loops or parallelizable sections of code onto accelerators

    #pragma acc region
    {
        for (i = 0; i < size; i++) {
            A[i] = B[i] + C[i];
        }
    }

• Routines to allocate/free memory on accelerators

    buffer = acc_malloc(MYBUFSIZE);
    acc_free(buffer);

• Supported for C, C++ and Fortran
• Large list of modifiers – copy, copyout, private, independent, etc.

37 VSCSE-Day3

Page 38:

Using MVAPICH2 with the new OpenACC 2.0

• acc_deviceptr gives the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler, once acc_deviceptr is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Delivers the same performance as with CUDA

    A = malloc(sizeof(int) * N);
    ......

    #pragma acc data copyin(A)
    {
        #pragma acc parallel for
        // compute for loop

        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    ......
    free(A);

38 VSCSE-Day3

Page 39:

39

Presentation Overview

VSCSE-Day3

• CUDA-Aware MPI (MVAPICH2-GPU)

• MVAPICH2-GPU with GPUDirect-RDMA (GDR) – Two-sided Communication

– One-sided Communication

– MPI Datatype Processing

• MVAPICH2 and OpenACC

• Programming Models for Intel MIC

• MVAPICH2-MIC Designs

Page 40:

VSCSE-Day3 40

InfiniBand + MIC systems

Courtesy: Scott McMillan, Intel Corporation, Presentation at TACC-Intel Highly Parallel Computing Symposium` 12 (http://www.tacc.utexas.edu/documents/13601/d9d58515-5c0a-429d-8a3f-85014e9e4dab)

Page 41:

VSCSE-Day3

Programming Models for Intel MIC Architecture

41

Courtesy: Scott McMillan, Intel Corporation, Presentation at TACC-Intel Highly Parallel Computing Symposium` 12 (http://www.tacc.utexas.edu/documents/13601/d9d58515-5c0a-429d-8a3f-85014e9e4dab)

Page 42:

VSCSE-Day3 42

Offload Model for Intel MIC-based Systems

• MPI processes on the host

• Intel MIC used through offload – OpenMP, Intel Cilk Plus, Intel TBB and Pthreads (a minimal offload sketch follows below)

• MPI communication – Intranode and Internode

Courtesy: Scott McMillan, Intel Corporation, Presentation at TACC-Intel Highly Parallel Computing Symposium` 12 (http://www.tacc.utexas.edu/documents/13601/d9d58515-5c0a-429d-8a3f-85014e9e4dab)
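A minimal sketch of the offload model described above (not from the original slides), using the Intel compiler's offload pragmas (LEO) of that era; OpenMP 4.0 target directives are an alternative. The array names, sizes and rank roles are illustrative, and MPI calls stay on the host.

    /* Offload-mode sketch: MPI on the host, computation offloaded to the Xeon Phi */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        float *a = (float *)malloc(N * sizeof(float));
        float *b = (float *)malloc(N * sizeof(float));
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < N; i++) a[i] = (float)i;

        /* Compute on the coprocessor: a is copied in, b is copied back out */
        #pragma offload target(mic) in(a:length(N)) out(b:length(N))
        {
            int j;
            for (j = 0; j < N; j++)
                b[j] = 2.0f * a[j];
        }

        /* MPI communication happens only between host processes in offload mode */
        if (rank == 0)
            MPI_Send(b, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(b, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        free(a); free(b);
        MPI_Finalize();
        return 0;
    }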

Page 43:

VSCSE-Day3 43

Many-core Hosted Model for Intel MIC-based Systems

• MPI processes on the Intel MIC

• MPI communication – IntraMIC, InterMIC (host-bypass/directly over the network)

Courtesy: Scott McMillan, Intel Corporation, Presentation at TACC-Intel Highly Parallel Computing Symposium` 12 (http://www.tacc.utexas.edu/documents/13601/d9d58515-5c0a-429d-8a3f-85014e9e4dab)

Page 44:

VSCSE-Day3 44

Symmetric Model for Intel MIC-based Systems

• MPI processes on the host and the Intel MIC

• MPI communication – Intranode, Internode, IntraMIC, InterMIC, Host-MIC

Courtesy: Scott McMillan, Intel Corporation, Presentation at TACC-Intel Highly Parallel Computing Symposium` 12 (http://www.tacc.utexas.edu/documents/13601/d9d58515-5c0a-429d-8a3f-85014e9e4dab)

Page 45:

45

Presentation Overview

VSCSE-Day3

• CUDA-Aware MPI (MVAPICH2-GPU)

• MVAPICH2-GPU with GPUDirect-RDMA (GDR)

• MVAPICH2 and OpenACC

• Programming Models for Intel MIC

• MVAPICH2-MIC Designs

Page 46:

Data Movement on Intel Xeon Phi Clusters

[Diagram: two nodes, each with two CPUs (QPI), MICs attached over PCIe, and an IB adapter; MPI processes on hosts and MICs]

Communication paths:
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB adapter on remote socket
and more . . .

• Connected as PCIe devices – flexibility but complexity

• Critical for runtimes to optimize data movement, hiding the complexity

VSCSE-Day3 46

Page 47:

47

Efficient MPI Communication on Xeon Phi Clusters

• Offload Mode

• Intranode Communication

• Coprocessor-only Mode

• Symmetric Mode

• Internode Communication

• Coprocessors-only Mode

• Symmetric Mode

• Comparison with Intel’s MPI Library

VSCSE-Day3

Page 48:

MPI Applications on MIC Clusters

[Diagram: execution modes on Xeon + Xeon Phi, from multi-core centric to many-core centric: Host-only (MPI program on the Xeon), Offload (MPI program on the Xeon with offloaded computation on the Xeon Phi), Symmetric (MPI programs on both), Coprocessor-only (MPI program on the Xeon Phi)]

• Offload mode: lower-overhead way to extract performance from the Xeon Phi

• MPI processes run only on the host

48 VSCSE-Day3

Page 49:

MPI Data Movement in Offload Mode

[Diagram: two nodes, each with two CPUs (QPI) and a MIC attached over PCIe, connected by InfiniBand; MPI processes on the hosts only]

Communication paths: 1. Intra-Socket, 2. Inter-Socket, 3. Inter-Node

49 VSCSE-Day3

• MPI communication only from the host

• Existing MPI library like MVAPICH2 1.9 or 2.0rc1 can be used

Page 50:

VSCSE-Day3 50

One-way Latency: MPI over IB with MVAPICH2

[Charts: small- and large-message one-way latency for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR; highlighted small-message latencies of 1.66, 1.56, 1.64, 1.82, 0.99, 1.09 and 1.12 µs]

DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch
FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-Dual FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-Dual FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch

Page 51:

VSCSE-Day3 51

Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional bandwidth (highlighted peaks of 3280, 3385, 1917, 1706, 6343, 12485 and 12810 MB/s) and bidirectional bandwidth (highlighted peaks of 3341, 3704, 4407, 11643, 6521, 21025 and 24727 MB/s) for the same set of adapters]

DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch
FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-Dual FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-Dual FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch

Page 52:

52

Efficient MPI Communication on Xeon Phi Clusters

• Offload Mode

• Intranode Communication

• Coprocessor-only Mode

• Symmetric Mode

• Internode Communication

• Coprocessors-only Mode

• Symmetric Mode

• Comparison with Intel’s MPI Library

VSCSE-Day3

Page 53:

MPI Applications on Xeon Phi Clusters – IntraNode

[Diagram: execution modes on Xeon + Xeon Phi: Host-only, Offload, Symmetric, Coprocessor-only]

• Xeon Phi as a many-core node

53 VSCSE-Day3

Page 54:

MPI Data Movement in Coprocessor-Only Mode

[Diagram: one node with two CPUs (QPI) and a Xeon Phi attached over PCIe; MPI processes run on the Xeon Phi cores (cores with private L2 caches and GDDR memory); intra-MIC communication]

54 VSCSE-Day3

Page 55:

MVAPICH2-MIC Channels for Intra-MIC Communication

• MVAPICH2 provides a hybrid of two channels
  – CH3-SHM
    • Derives from the design for shared-memory communication on the host
    • Tuned for the Xeon Phi architecture
  – CH3-SCIF (Hybrid)
    • SCIF is a lower-level API provided by MPSS
    • Provides user control of the DMA
    • Takes advantage of SCIF to improve performance

[Diagram: MVAPICH2 on the Xeon Phi with a CH3-SHM channel over POSIX shared memory and a CH3-SCIF (hybrid) channel over SCIF]

55 VSCSE-Day3

S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and D. K. Panda, Efficient Intra-node Communication on Intel-MIC Clusters, International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’13), May 2013.

S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium (TI-HPCS), April 2012 – Best Student Paper.

Page 56:

Experimental Setup

• Stampede @ TACC
  – Dual-socket oct-core Intel Sandy Bridge (E5-2680, 2.70 GHz)
  – 32 GB memory
  – SE10P (B0-KNC) MIC coprocessor
  – Mellanox IB FDR MT4099 HCA
  – CentOS release 6.3 (Final), MPSS 2.1.6720-13, Intel Composer_xe_2013.2.146
• Multi-MIC configuration
  – Dual-socket oct-core Intel Sandy Bridge (E5-2670, 2.70 GHz)
  – 32 GB memory
  – 5110P MIC coprocessor
  – Mellanox IB FDR MT4099 HCA
  – RHEL 6.3 (Santiago), MPSS 2.1.6720-13, Intel Compiler 13.4.183
• MVAPICH2 1.9, MVAPICH2-MIC based on 1.9, Intel MPI 4.1.1.036

56 VSCSE-Day3

Page 57:

Intra-MIC - Point-to-Point Communication

[Charts: Intra-MIC point-to-point performance, MV2 vs. MV2-MIC – osu_latency (small and large messages), osu_bw and osu_bibw; about 9137 MB/s unidirectional and 14030 MB/s bidirectional bandwidth]

57 VSCSE-Day3

Page 58:

Intra-MIC - Collective Communication – Scatter

[Charts: osu_scatter latency, MV2 vs. MV2-MIC, with 8 and 16 processes, for small (1 B – 4 KB) and large (8 KB – 512 KB) messages]

58 VSCSE-Day3

Page 59:

Intra-MIC - Collective Communication – Allgather

[Charts: osu_allgather latency, MV2 vs. MV2-MIC, with 8 and 16 processes, for small (1 B – 4 KB) and large (8 KB – 512 KB) messages]

59 VSCSE-Day3

Page 60:

Intra-MIC – 3DStencil Comm. Kernel and P3DFFT

• 3DStencil: near-neighbor communication – up to 6 neighbors – 64 KB message size
• P3DFFT, a popular library for 3D Fast Fourier Transforms
  – (MPI + OpenMP) version
  – Alltoall communication
  – The test performs a forward transform and a backward transform in each iteration

[Charts: 3DStencil communication-kernel time per step (64 KB) with 4, 8 and 16 processes on the MIC, and P3DFFT (512x512x512) total time per step with 2, 4 and 8 processes (16 threads/process), MV2 vs. MV2-MIC; improvements of up to 71.6% and 24.9%]

60 VSCSE-Day3

Page 61:

MPI Applications on MIC Clusters – IntraNode

[Diagram: execution modes on Xeon + Xeon Phi: Host-only, Offload, Symmetric, Coprocessor-only; Symmetric mode highlighted]

• MPI processes running on the MIC and the host

61 VSCSE-Day3

Page 62:

MPI Data Movement in Symmetric Mode – IntraNode

[Diagram: one node with two CPUs (QPI) and a MIC over PCIe; MPI processes on both the host and the MIC]

Communication paths: 1. Intra-MIC, 2. Intra-Socket MIC-Host, 3. Inter-Socket MIC-Host

62 VSCSE-Day3

Page 63:

MVAPICH2-MIC Channels for MIC-Host Communication

• IB Channel (OFA-IB-CH3)
  – High-performance InfiniBand channel in MVAPICH2
  – Uses IB verbs – can use both DirectIB and IB-SCIF via interface selection (MV2_IBA_HCA)
  – DirectIB provides a low-latency path
  – Limited by P2P bandwidth on Sandy Bridge nodes
• CH3-SCIF (Hybrid)
  – SCIF can be used for communication between the Xeon Phi and the host
  – DirectIB is used for low latency

[Diagram: MVAPICH2 on the Xeon Phi and on the host, each with an OFA-IB-CH3 channel (IB verbs, mlx4_0) to the IB HCA and a CH3-SCIF (hybrid) channel over SCIF/PCI-E; on SNB E5-2670: P2P read / IB read from the Xeon Phi 962.86 MB/s, P2P write / IB write to the Xeon Phi 5280 MB/s]

63 VSCSE-Day3

Page 64:

Intra-Socket Host-MIC - Point-to-Point Communication

[Charts: Intra-socket Host-MIC point-to-point performance, MV2 vs. MV2-MIC – osu_latency (small and large messages), osu_bw and osu_bibw; bandwidth reaches about 6651 MB/s, limited by IB write bandwidth (and by IB read/write bandwidth for the bidirectional test); results are similar for inter-socket Host-MIC]

64 VSCSE-Day3

Page 65:

Intra-Node Symmetric Mode - Collective Performance

[Charts: osu_scatter and osu_allgather latency (small and large messages) with 16 processes on the host and 16 processes on the MIC, MV2 vs. MV2-MIC]

65 VSCSE-Day3

Page 66:

IntraNode Symmetric Mode – 3DStencil Comm. Kernel and P3DFFT

• Running across the MIC and the host

• Benefits from shared-memory tuning and the hybrid SCIF+IB design

• Explore the best combination of MPI processes and threads

[Charts: 3DStencil communication-kernel time per step (64 KB) with 4+4, 8+8 and 16+16 host+MIC processes, and P3DFFT (512x512x512) time per step with 2H-2M, 2H-4M and 2H-8M processes (8 threads/process), MV2 vs. MV2-MIC; improvements of up to 67.4% and 19.7%]

66 VSCSE-Day3

Page 67:

67

Efficient MPI Communication on Xeon Phi Clusters

• Offload Mode

• Intranode Communication

• Coprocessor-only Mode

• Symmetric Mode

• Internode Communication

• Coprocessors-only Mode

• Symmetric Mode

• Comparison with Intel’s MPI Library

VSCSE-Day3

Page 68:

MPI Applications on MIC Clusters – InterNode

[Diagram: execution modes on Xeon + Xeon Phi: Host-only, Offload, Symmetric, Coprocessor-only]

68 VSCSE-Day3

Page 69:

MPI Data Movement in Coprocessor-Only Mode

[Diagram: two nodes, each with two CPUs (QPI) and a MIC over PCIe, connected by InfiniBand; MPI processes on the MICs only]

Communication path: 1. MIC-RemoteMIC

69 VSCSE-Day3

Page 70:

MVAPICH2 Channels for MIC-RemoteMIC Communication

• CH3-IB channel
  – High-performance InfiniBand channel in MVAPICH2
  – Implemented using IB verbs – DirectIB
  – Currently does not support advanced features like hardware multicast, etc.
  – Limited by the P2P bandwidth offered on the Sandy Bridge platform

[Diagram: MVAPICH2 on each Xeon Phi communicating through OFA-IB-CH3 / IB verbs (mlx4_0) and the IB HCA]

70 VSCSE-Day3

Page 71:

Host Proxy-based Designs in MVAPICH2-MIC

• The DirectIB channel is limited by P2P read bandwidth

• MVAPICH2-MIC uses a hybrid DirectIB + host proxy-based approach to work around this

[Diagram, SNB E5-2670: P2P read / IB read from the Xeon Phi 962.86 MB/s; P2P write / IB write to the Xeon Phi 5280 MB/s; IB read from host 6977 MB/s; Xeon Phi-to-host 6296 MB/s]

71 VSCSE-Day3

Page 72:

MIC-RemoteMIC Point-to-Point Communication

[Charts: MIC-RemoteMIC point-to-point performance, MV2 vs. MV2-MIC – osu_latency (small and large messages), osu_bw and osu_bibw; about 5236 MB/s unidirectional and 8451 MB/s bidirectional bandwidth]

72 VSCSE-Day3

Page 73:

InterNode Coprocessor-only Mode - Collective Communication

[Charts: osu_scatter and osu_allgather latency (small and large messages) with 8 MICs and 8 processes/MIC, MV2 vs. MV2-MIC]

73 VSCSE-Day3

Page 74:

InterNode - Coprocessor-only Mode - P3DFFT Performance

74 VSCSE-Day3

• MV2-INTRA-OPT includes the intra-node tuning and optimizations

• Comparison shows benefit of proxy

[Charts: 3DStencil communication-kernel time per step (64 KB) with 4, 8 and 16 processes per MIC on 32 MICs, and P3DFFT (2Kx2Kx2K) execution time with 4 and 8 processes per MIC (16 threads/process) on 64 MICs, MV2-INTRA-OPT vs. MV2-MIC; improvements of up to 65.6% and 24.6%]

Page 75:

75

MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload Mode

• Intranode Communication

• Coprocessor-only Mode

• Symmetric Mode

• Internode Communication

• Coprocessors-only Mode

• Symmetric Mode

• Comparison with Intel’s MPI Library

VSCSE-Day3

Page 76:

MPI Applications on MIC Clusters – InterNode

[Diagram: execution modes on Xeon + Xeon Phi: Host-only, Offload, Symmetric, Coprocessor-only]

76 VSCSE-Day3

Page 77:

Direct-IB Communication

• IB-verbs based communication from the Xeon Phi

• No porting effort for MPI libraries with IB support

• Data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer transfers

[Diagram: IB FDR HCA (MT4099) attached to a two-socket node with a Xeon Phi on PCIe, on SandyBridge (E5-2670) and IvyBridge (E5-2680 v2) platforms. Measured Direct-IB bandwidths as a fraction of the 6397 MB/s FDR peak – IB write to Xeon Phi (P2P write): 5280 MB/s (83%) and 6396 MB/s (100%) intra-socket, 1075 MB/s (17%) and 1179 MB/s (19%) inter-socket; IB read from Xeon Phi (P2P read): 962 MB/s (15%) and 3421 MB/s (54%) intra-socket, 370 MB/s (6%) and 247 MB/s (4%) inter-socket]

77 VSCSE-Day3

Page 78:

MIC-Remote-MIC P2P Communication

78 VSCSE-Day3

[Charts: MIC-Remote-MIC large-message latency and bandwidth for intra-socket P2P (peak of about 5236 MB/s) and inter-socket P2P (peak of about 5594 MB/s)]

Page 79:

Inter-Node Symmetric Mode Collective Communication

[Charts: osu_scatter and osu_allgather latency (small and large messages) on 8 nodes with 8 processes/host and 8 processes/MIC, MV2 vs. MV2-MIC]

79 VSCSE-Day3

Page 80:

InterNode – Symmetric Mode - P3DFFT Performance

80 VSCSE-Day3

• Explore different threading levels for the host and the Xeon Phi

[Charts: 3DStencil communication-kernel time per step with 4+4 and 8+8 host+MIC processes (32 nodes), and P3DFFT (512x512x4K) communication time per step with 2+2 and 2+8 host+MIC processes (8 threads/process, 8 nodes), MV2-INTRA-OPT vs. MV2-MIC; improvements of up to 50.2% and 16.2%]

Page 81:

81

Optimized MPI Collectives for MIC Clusters (Broadcast)

VSCSE-Day3

K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda, Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters, HotI ’13

[Charts: 64-node MPI_Bcast latency (16H + 60M) for small and large messages, and Windjammer execution time on 4 nodes (16H + 16M) and 8 nodes (8H + 8M), MV2-MIC vs. MV2-MIC-Opt]

• Up to 50–80% improvement in MPI_Bcast latency with as many as 4864 MPI processes (Stampede)

• 12% and 16% improvement in the performance of Windjammer with 128 processes

Page 82:

82

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

VSCSE-Day3

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14 (accepted)

[Charts: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), 32-node Alltoall large-message latency (8H + 8M), and P3DFFT communication/computation time on 32 nodes (8H + 8M, size 2Kx2Kx1K), MV2-MIC vs. MV2-MIC-Opt; improvements of 76%, 58% and 55%]

Page 83:

83

MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload Mode

• Intranode Communication

• Coprocessor-only Mode

• Symmetric Mode

• Internode Communication

• Coprocessors-only Mode

• Symmetric Mode

• Comparison with Intel’s MPI Library

VSCSE-Day3

Page 84:

84

MVAPICH2 and Intel MPI – Intra-MIC

[Charts: Intra-MIC comparison of Intel MPI and MVAPICH2-MIC – small- and large-message latency, bandwidth and bi-directional bandwidth]

VSCSE-Day3

Page 85:

85

MVAPICH2 and Intel MPI – Intranode – Host-MIC

[Charts: Intranode Host-MIC comparison of Intel MPI and MVAPICH2-MIC – small- and large-message latency, bandwidth and bi-directional bandwidth]

VSCSE-Day3

Page 86:

86

MVAPICH2 and Intel MPI – Internode MIC-MIC

[Charts: Internode MIC-MIC comparison of Intel MPI and MVAPICH2-MIC – large-message latency and bandwidth for intra-socket P2P and inter-socket P2P]

S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH-PRISM: A Proxy-Based Communication Framework Using InfiniBand and SCIF for Intel MIC Clusters, SC ’13

VSCSE-Day3

Page 87:

• An initial version is installed on TACC-Stampede for advanced users

• Based on the feedback, it is being enhanced

• Will be made available to the public in the near future

VSCSE-Day3 87

Status of MVAPICH2-MIC

Page 88:

VSCSE-Day3 88

Hands-on Exercises

PDF version of exercises available from http://www.cse.ohio-state.edu/~panda/vssce14/dk_karen_day3_exercises.pdf

Page 89:

• Measure the latency between Device-to-Device, Host-to-Device, and Host-to-Host using the OMB micro-benchmarks.

• Additional compilation flag: -D_ENABLE_CUDA_

• Run-time parameters: D D (D2D), H D (H2D), H H (H2H)

VSCSE-Day3 89

Exercise #1

Page 90:

• Implement the basic point-to-point communication using MPI_Send/MPI_Recv across device buffers on each GPU node (a hedged sketch of both variants follows below)
  – Naïve design:
    • cudaMemcpy (D2H)
    • MPI_Send/Recv (H2H)
    • cudaMemcpy (H2D)
  – CUDA-Aware design:
    • MPI_Send/Recv (D2D)

VSCSE-Day3 90

Exercise #2
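A hedged sketch of the two designs described above (not part of the original exercise handout): the naïve variant stages through a host buffer, while the CUDA-aware variant passes the device pointers directly, as in the MVAPICH2-GPU example earlier in the deck. Function and buffer names are illustrative.

    /* Exercise #2 sketch: naive (staged) vs. CUDA-aware point-to-point */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Naive design: cudaMemcpy(D2H) -> MPI_Send/Recv(H2H) -> cudaMemcpy(H2D) */
    void naive_exchange(float *d_buf, int n, int rank)
    {
        float *h_buf = (float *)malloc(n * sizeof(float));   /* host staging buffer */

        if (rank == 0) {
            cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        }
        free(h_buf);
    }

    /* CUDA-aware design: MPI_Send/Recv(D2D) on device pointers */
    void cuda_aware_exchange(float *d_buf, int n, int rank)
    {
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }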

Page 91:

• Design a program to carry out vector addition using CUDA-aware MPI (a hedged sketch follows below). Basic steps:
  1. Process 1 initializes vectors a and b in host buffers and sends them to device buffers on process 2: MPI_Send(a, …), MPI_Send(b, …)
  2. Process 2 then launches the kernel to add the vectors: __global__ void vecAdd(float *a, float *b, float *c, int vecsize)
  3. Process 2 sends the result vector c back to process 1: MPI_Send(c, …)

VSCSE-Day3 91

Exercise #3
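A hedged sketch of one possible solution (not the official one distributed with the exercises): it follows the steps above with rank 0 as "process 1" and rank 1 as "process 2", and assumes a CUDA-aware MPI library. N and the kernel launch configuration are illustrative.

    /* Exercise #3 sketch: vector addition with CUDA-aware MPI (compile with nvcc + MPI) */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N (1 << 20)

    __global__ void vecAdd(float *a, float *b, float *c, int vecsize)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < vecsize)
            c[i] = a[i] + b[i];
    }

    int main(int argc, char **argv)
    {
        int rank, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                       /* "Process 1": host buffers */
            float *a = (float *)malloc(N * sizeof(float));
            float *b = (float *)malloc(N * sizeof(float));
            float *c = (float *)malloc(N * sizeof(float));
            for (i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
            MPI_Send(a, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
            MPI_Send(b, N, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(c, N, MPI_FLOAT, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            free(a); free(b); free(c);
        } else if (rank == 1) {                /* "Process 2": device buffers */
            float *d_a, *d_b, *d_c;
            cudaMalloc((void **)&d_a, N * sizeof(float));
            cudaMalloc((void **)&d_b, N * sizeof(float));
            cudaMalloc((void **)&d_c, N * sizeof(float));
            MPI_Recv(d_a, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(d_b, N, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
            cudaDeviceSynchronize();
            MPI_Send(d_c, N, MPI_FLOAT, 0, 2, MPI_COMM_WORLD);
            cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        }

        MPI_Finalize();
        return 0;
    }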

Page 92:

VSCSE-Day3 92

Solutions and Guidelines

• Solutions for these exercises are available at: /nfs/02/w557091/mpi-gpu-exercises/

• See the README file inside the above folder for build and run instructions