Designing High Performance and Energy-Efficient MPI Collectives for Next Generation Clusters
TRANSCRIPT
![Page 1: Designing High Performance and Energy- Efficient MPI ...sc15.supercomputing.org/.../drs118s2-file6.pdf · Designing High Performance and Energy-Efficient MPI Collectives for Next](https://reader033.vdocuments.net/reader033/viewer/2022043014/5fb0bc3a3b78366c885d3816/html5/thumbnails/1.jpg)
Designing High Performance and Energy-
Efficient MPI Collectives for Next Generation
Clusters
Akshay Venkatesh, 5th year Ph.D student
Advisor : DK Panda
Network-based Computing Lab, OSU
• Introduction
• Problem Statement
• Challenges
• Contributions and Results
• Future work
• Conclusions
Presentation Outline
• The end of Dennard scaling* has yielded a manyfold increase in parallelism on processing chips and placed emphasis on the power/energy conservation of systems
• The high-performance computing domain has seen increased use of accelerators/co-processors such as NVIDIA GPUs and Intel MICs
• Scientific applications routinely use this specialized hardware to accelerate compute phases, owing to its >= 1 teraflop/device capability at a comparatively low power footprint
• MPI/PGAS serve as the de facto programming models to amalgamate the capacities of several such distributed heterogeneous nodes
* Dennard scaling: the regime in which power density remains constant as transistors shrink
Introduction
• Introduction
• Problem Statement
• Challenges
• Contributions and Results
• Future work
• Conclusions
Presentation Outline
• With the diversification of compute platforms, it is important to ensure that the compute and communication phases of long-running applications are efficient
• => Execution time and energy usage are two dimensions that demand attention
• NVIDIA GPUs and Intel MICs (available as PCIe devices) introduce differential compute and memory costs
• MPI collectives such as Broadcast, Alltoall, and Allgather can contribute a significant fraction of total application execution time and energy
• Minimizing latency, increasing overlap, and minimizing the energy of MPI collectives require rethinking the underlying algorithms
Problem Statement
[Diagram: Sandy Bridge node with CPU, PCIe device (GPU/MIC), and NIC; path bandwidths of 7, 7, 6.3, 6.3, 5.2, and 0.9 GB/s illustrate the differential cost paths.]
• Popular algorithms such as Bruck's allgather/alltoall, recursive doubling, and ring algorithms assume uniform path costs => repeated use of non-optimal paths and steps on heterogeneous systems
• Existing runtimes that support communication operations from GPU buffers do not exploit novel mechanisms such as GPUDirect RDMA in throughput-critical scenarios
• Methods to hide latency (critical for GPUs) are unavailable in the form of non-blocking GPU collectives
• Rules for applying energy-efficiency levers during MPI calls in an application-oblivious manner that works for both irregular and regular communication patterns do not exist
Problem Statement (continued…)
• Introduction
• Problem Statement
• Challenges
• Contributions and Results
• Future work
• Conclusions
Presentation Outline
• Can variations of popular collective algorithms be proposed that are better suited to platforms with heterogeneous communication cost paths and compute capacities?
• Can new heuristics reduce collective communication cost on heterogeneous clusters?
• Can direct GPU memory access mechanisms such as NVIDIA GPUDirect RDMA be coupled with existing paradigms such as the hardware multicast feature for throughput-oriented applications?
• Can direct GPU memory access mechanisms such as GPUDirect RDMA and associated CUDA features be combined with network offload methods such as CORE-Direct to realize efficient non-blocking GPU collectives with good overlap and latency?
• Can a set of generic rules be proposed for point-to-point and collective routines such that energy savings are made only at relevant calls, with negligible performance degradation?
• Can these rules ensure energy savings in an application-oblivious manner, and not just for well-balanced applications?
Challenges
• Introduction
• Problem Statement
• Challenges
• Contributions and Results
• Future work
• Conclusions
Presentation Outline
Contributions Outline
[Stack diagram: distributed scientific applications (PSDNS, HPL, Graph500, Lulesh, mini-apps, Sweep3D, streaming-class) run on programming models for communication (MPI, PGAS) and for computation. The communication layer exposes collectives (Bcast, Allgather, Alltoall), point-to-point operations (send, recv), and RMA ops (Put, Get, Fence, Flush), built from algorithms (k-nomial, Bruck's, pairwise, ring) over eager protocols (send-recv, RDMA fast path) and rendezvous protocols (RDMA-read, RDMA-write). A MUX selects among network-centric DMA ops (IB RC, UD, Mcast, offload), PCIe-centric DMA ops (CUDA, SCIF), and CPU-centric ops (load, store).]
[Stack diagram repeated, annotated: the choice of algorithms, protocols, and DMA mechanisms dictates execution time and energy usage.]
[Stack diagram repeated, annotated: these algorithm, protocol, and DMA layers are the focus of the contributions.]
• Delegation mechanisms for dense collectives
• Path-cost aware collective adaptations
• Combining GPUDirect RDMA and hardware
multicast for streaming apps
• Combining GPUDirect RDMA and CORE-Direct
for non-blocking GPU collectives
• Application-oblivious Energy-Aware MPI (EAM)
runtime
Contributions Outline
Delegation Mechanisms
[Diagram: default pairwise alltoall over two nodes, with host ranks 0H/2H and MIC ranks 1M/3M exchanging in three steps, alongside a Sandy Bridge node diagram showing a general-purpose CPU, a PCIe device (MIC/GPU), and a NIC with path bandwidths of 7, 7, 6.3, 6.3, 5.2, and 0.9 GB/s.]
Pairwise algorithm: used for large-message alltoall operations
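The pairwise-exchange schedule referenced above can be sketched as a minimal simulation (for illustration only, not the actual runtime code; `pairwise_schedule` is a name chosen here):

```python
def pairwise_schedule(nprocs):
    """Pairwise-exchange alltoall: in step s (1 <= s < nprocs), rank r
    exchanges its block with partner r XOR s. Assumes a power-of-two
    process count, which the XOR pairing requires."""
    assert nprocs > 1 and nprocs & (nprocs - 1) == 0
    return [[(r, r ^ s) for r in range(nprocs)] for s in range(1, nprocs)]
```

Over nprocs − 1 steps every rank meets every other rank exactly once; on a heterogeneous node, some of those exchanges inevitably originate from the slow PCIe-device paths, which is what motivates delegation.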
Delegation Mechanisms
[Diagrams: two three-step alltoall schedules across Node 1 and Node 2, each with host ranks 0H/2H and MIC ranks 1M/3M. Left: default pairwise alltoall, in which MIC ranks send inter-node messages directly over their slow outgoing paths. Right: selective-rerouting pairwise alltoall (delegated), in which inter-node messages from MIC ranks are rerouted through the local host process.]
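A toy cost model makes the rerouting decision concrete. The per-GB costs below are derived from the bandwidths on the node diagram (the 0.9 GB/s outgoing read path from the PCIe device is the bottleneck); the function name and the 1 GB message size are illustrative:

```python
# Per-GB transfer costs (seconds) from the node-diagram bandwidths.
COST = {
    ("mic", "remote"): 1 / 0.9,   # direct send out of the PCIe device
    ("mic", "host"):   1 / 6.3,   # stage to the local host process
    ("host", "remote"): 1 / 7.0,  # host sends over the NIC
}

def send_cost(src, delegate=False):
    """Cost of moving 1 GB from `src` to a remote node, optionally
    rerouting through the local host process (delegation)."""
    if src == "host":
        return COST[("host", "remote")]
    if delegate:
        return COST[("mic", "host")] + COST[("host", "remote")]
    return COST[("mic", "remote")]
```

Even though delegation adds an extra hop, the two fast hops together cost far less than one traversal of the slow path, which is why selective rerouting wins for MIC-resident ranks.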
• A similar delegation approach is applicable to other important collectives (Allgather, Allreduce, Bcast, and Gather)
• Results
Contributions Outline
• Delegation mechanisms for dense collectives
• Path-cost aware collective adaptations
• Combining GPUDirect RDMA and hardware
multicast for streaming apps
• Combining GPUDirect RDMA and CORE-Direct
for non-blocking GPU collectives
• Application-oblivious Energy-Aware MPI (EAM)
runtime
Contributions Outline
Path-reordering
[Diagram: an 8-process ring spanning Node 1 and Node 2, with host ranks 0H, 1H, 4H, 5H and MIC ranks 2M, 3M, 6M, 7M.]
Default ring algorithm
• The cost of the ring is dictated by the slowest sub-path in the ring
• All outgoing paths from the PCIe device are the slowest, owing to read performance
• Total cost = (n − 1) × Tslowest
Path-reordering
[Diagram: the same 8-process ring with virtual ranks reassigned so that host processes sit at the node borders.]
Reordered ring algorithm
• The goal is to ensure that each node has host processes lined up as its border ranks
• If there is at least one host process per node, virtual ranks can be assigned so that no MIC process sits at a border
• Slow paths still exist, but Tnewslowest < Tslowest
• Total cost = (n − 1) × Tnewslowest
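The virtual-rank assignment can be sketched as a small helper (an illustrative sketch, not the runtime's code; `reorder_ring` is a hypothetical name):

```python
def reorder_ring(nodes):
    """Given per-node process lists of ('H' | 'M', rank) tuples, return a
    virtual ring order in which every inter-node send originates from a
    host process -- the slow paths are reads out of the PCIe device."""
    order = []
    for procs in nodes:
        hosts = [p for p in procs if p[0] == "H"]
        mics = [p for p in procs if p[0] == "M"]
        assert hosts, "assumes at least one host process per node"
        order += mics + hosts  # a host rank ends each node's segment
    return order
```

With a host rank at the end of each node's segment, the inter-node hop of the ring never reads out of a PCIe device, so the slowest remaining sub-path is strictly faster than in the default order.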
Default recursive doubling algorithm
[Diagram: two nodes, each with two host (H) and two MIC (M) processes, exchanging over three steps. Step 1: message size = m. Step 2: message size = 2m. Step 3: message size = 4m.]
Schedule-reordered recursive doubling algorithm
[Diagram: the same three-step exchange with the pairing schedule reordered. Step 1: message size = m. Step 2: 2m. Step 3: 4m.]
• Ensures that the largest transfers don't occur on the slowest paths
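The doubling step sizes above follow mechanically from the textbook recursive-doubling allgather schedule, sketched here (`rd_steps` is a name chosen for illustration):

```python
def rd_steps(nprocs, m):
    """Recursive-doubling allgather for a power-of-two process count:
    in step k, rank r pairs with rank r XOR 2**k, and the exchanged
    message doubles to m * 2**k bytes."""
    assert nprocs > 1 and nprocs & (nprocs - 1) == 0
    steps, k = [], 0
    while (1 << k) < nprocs:
        steps.append({"partner_xor": 1 << k, "bytes": m << k})
        k += 1
    return steps
```

Because the final step carries the largest message, schedule reordering assigns virtual ranks so that this step pairs host processes rather than crossing a PCIe-device path.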
Results of delegation schemes and
adaptations
• Delegation mechanisms for dense collectives
• Path-cost aware collective adaptations
• Combining GPUDirect RDMA and hardware
multicast for streaming apps
• Combining GPUDirect RDMA and CORE-Direct
for non-blocking GPU collectives
• Application-oblivious Energy-Aware MPI (EAM)
runtime
Contributions Outline
GDR+Mcast for throughput-oriented
applications
• Existing schemes that broadcast GPU data using hardware multicast did not exploit novel direct GPU memory access mechanisms like GPUDirect RDMA (GDR)
• This leaves performance opportunities unexploited and is detrimental to throughput-oriented streaming applications
• However, combining GDR with UD-based multicast is challenging
GDR+Mcast for throughput-oriented
applications
• We propose a scheme that leverages the scatter-gather list abstraction to specify host and GPU memory regions, solving the problem of addressing UD packet header data and GPU payloads
• A 50% reduction in latency is observed compared with the host-staged approach, with consistent scaling
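The scatter-gather idea can be modeled abstractly as below. Real code would build `ibv_sge` entries over a GPUDirect-registered memory region and post them in a UD multicast work request; this Python model only illustrates the two-entry gather list and the single-MTU constraint, and all names are illustrative:

```python
def build_ud_mcast_send(header: bytes, gpu_payload: bytes, mtu: int = 4096):
    """Model one UD multicast send whose two-entry gather list stitches a
    host-resident header to a GPU-resident payload, so the payload never
    has to be staged through host memory."""
    sgl = [("host", len(header)), ("gpu", len(gpu_payload))]
    wire = header + gpu_payload  # what the NIC gathers onto the wire
    assert len(wire) <= mtu, "a UD datagram must fit in a single MTU"
    return sgl, wire
```

The gather list is what lets the NIC combine a CPU-built header with GPU data in a single datagram, eliminating the staging copy that the host-staged approach pays on every packet.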
• Delegation mechanisms for dense collectives
• Path-cost aware collective adaptations
• Combining GPUDirect RDMA and hardware
multicast for streaming apps
• Combining GPUDirect RDMA and CORE-Direct
for non-blocking GPU collectives
• Application-oblivious Energy-Aware MPI (EAM)
runtime
Contributions Outline
Default orchestration of non-blocking GPU Collectives
Combining CORE-Direct and GPUDirect RDMA for non-blocking GPU collectives
Combining CORE-Direct and GPUDirect RDMA for non-blocking GPU collectives
• We propose schemes that leverage CORE-Direct network offload technology and GPUDirect RDMA, along with CUDA's callback mechanism, to realize non-blocking GPU collectives
• For dense collectives such as Iallgather and Ialltoall, the proposed methods achieve close to 100% overlap in the large-message range and exhibit favorable latency compared with their blocking counterparts
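The overlap figure quoted above is conventionally computed from three timings; here is that standard metric as a small helper (the function name is ours):

```python
def overlap_fraction(t_comm, t_compute, t_total):
    """Fraction of communication hidden behind computation:
    t_comm    -- time of the blocking collective alone
    t_compute -- time of the compute phase alone
    t_total   -- time of the non-blocking collective overlapped
                 with the compute phase."""
    hidden = t_comm + t_compute - t_total
    return max(0.0, min(1.0, hidden / t_comm))
```

When the overlapped run takes no longer than the compute phase alone, all communication was hidden and the metric reads 1.0 (100% overlap).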
• Delegation mechanisms for dense collectives
• Path-cost aware collective adaptations
• Combining GPUDirect RDMA and hardware
multicast for streaming apps
• Combining GPUDirect RDMA and CORE-Direct
for non-blocking GPU collectives
• Application-oblivious Energy-Aware MPI (EAM)
runtime
Contributions Outline
Contributions Outline
• State-of-the-art approaches treat MPI as a black box and adopt aggressive power-saving mechanisms, which leads to degraded communication performance
• We propose rules that rely on intimate knowledge of the underlying MPI point-to-point and collective protocols, in addition to communication-time prediction models such as LogGP
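A minimal sketch of such a rule, assuming illustrative LogGP parameters and lever transition costs (none of the constants below come from the talk, and the function names are ours):

```python
def predicted_comm_time(m, L=2e-6, o=1e-6, G=1e-9):
    """LogGP-style estimate for an m-byte transfer: latency, two
    per-message overheads, and a per-byte gap term."""
    return L + 2 * o + (m - 1) * G

def should_apply_lever(m, lever_overhead=50e-6, margin=1.1):
    """Apply an energy lever (e.g. DVFS or a CPU idle state) during a
    blocking wait only if the predicted communication time comfortably
    amortizes the lever's round-trip transition cost."""
    return predicted_comm_time(m) > margin * lever_overhead
```

The point of tying the rule to the protocol and the prediction model is that the lever fires only where the wait is long enough to pay for it, which is how performance degradation stays within the user-allowed bound.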
Contributions Outline
• Rules for applying the appropriate energy levers to send and receive operations that use the RGET (RDMA-read rendezvous) protocol are shown.
Contributions Outline
• Up to 40% improvement in the energy usage of Graph500
• 10 application benchmarks showed no more than the user-allowed 5% degradation in overall performance
• The proposed approach works for both irregular and regular communication patterns
• Introduction
• Problem Statement
• Challenges
• Contributions and Results
• Future work and Conclusions
Presentation Outline
• This work proposes methods to reduce the latency (on heterogeneous clusters) and energy usage (on homogeneous clusters) of time-consuming collective operations in heavily used MPI applications
• Results show the methods are scalable and improve application execution time
• Future directions include formulating energy rules for RMA operations on both homogeneous and heterogeneous clusters, as well as designing novel asynchronous transfer mechanisms with NVIDIA's GPU offload technologies
Future Work and Conclusions