![Page 1: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/1.jpg)
Cliff Woolley, Sr. Manager, Developer Technology Software, NVIDIA
NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS
![Page 2: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/2.jpg)
2
BACKGROUND
Efficiency of parallel computation tasks
• Amount of exposed parallelism
• Amount of work assigned to each processor
Expense of communications among tasks
• Amount of communication
• Degree of overlap of communication with computation
What limits the scalability of parallel applications?
![Page 3: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/3.jpg)
3
COMMON COMMUNICATION PATTERNS
![Page 4: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/4.jpg)
4
COMMUNICATION AMONG TASKS
Point-to-point communication
• Single sender, single receiver
• Relatively easy to implement efficiently
Collective communication
• Multiple senders and/or receivers
• Patterns include broadcast, scatter, gather, reduce, all-to-all, …
• Difficult to implement efficiently
What are common communication patterns?
![Page 5: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/5.jpg)
5
POINT-TO-POINT COMMUNICATION
Most common pattern in HPC, where communication is usually to nearest neighbors
Single-sender, single-receiver per instance
2D Decomposition
Boundary
exchanges
![Page 6: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/6.jpg)
7
COLLECTIVE COMMUNICATIONMultiple senders and/or receivers
![Page 7: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/7.jpg)
8
BROADCASTOne sender, multiple receivers
GPU0 GPU1 GPU2 GPU3
A
GPU0 GPU1 GPU2 GPU3
A A A A
broadcast
![Page 8: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/8.jpg)
9
SCATTEROne sender; data is distributed among multiple receivers
scatter
GPU0 GPU1 GPU2 GPU3
A B C D
GPU0 GPU1 GPU2 GPU3
A
B
C
D
![Page 9: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/9.jpg)
10
GATHERMultiple senders, one receiver
GPU0 GPU1 GPU2 GPU3
A
B
C
D
GPU0 GPU1 GPU2 GPU3
A B C D
gather
![Page 10: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/10.jpg)
11
ALL-GATHERGather messages from all; deliver gathered data to all participants
GPU0 GPU1 GPU2 GPU3
A B C D
GPU0 GPU1 GPU2 GPU3
A A A A
B B B B
C C C C
D D D D
all-gather
![Page 11: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/11.jpg)
12
REDUCECombine data from all senders; deliver the result to one receiver
GPU0 GPU1 GPU2 GPU3
A B C D
reduce
GPU0 GPU1 GPU2 GPU3
A
+
B
+
C
+
D
![Page 12: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/12.jpg)
13
ALL-REDUCECombine data from all senders; deliver the result to all participants
GPU0 GPU1 GPU2 GPU3
A B C D
all-reduce
GPU0 GPU1 GPU2 GPU3
A
+
B
+
C
+
D
A
+
B
+
C
+
D
A
+
B
+
C
+
D
A
+
B
+
C
+
D
![Page 13: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/13.jpg)
14
REDUCE-SCATTERCombine data from all senders; distribute result across participants
reduce-scatter
GPU0 GPU1 GPU2 GPU3
A0+B0+
C0+D0
A1+B1+
C1+D1
A2+B2+
C2+D2
A3+B3+
C3+D3
GPU0 GPU1 GPU2 GPU3
A0 B0 C0 D0
A1 B1 C1 D1
A2 B2 C2 D2
A3 B3 C3 D3
![Page 14: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/14.jpg)
15
ALL-TO-ALLScatter/Gather distinct messages from each participant to every other
GPU0 GPU1 GPU2 GPU3
A0 B0 C0 D0
A1 B1 C1 D1
A2 B2 C2 D2
A3 B3 C3 D3
all-to-all
GPU0 GPU1 GPU2 GPU3
A0 A1 A2 A3
B0 B1 B2 B3
C0 C1 C2 C3
D0 D1 D2 D3
![Page 15: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/15.jpg)
16
THE CHALLENGE OF COLLECTIVES
![Page 16: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/16.jpg)
17
THE CHALLENGE OF COLLECTIVES
Having multiple senders and/or receivers compounds communication inefficiencies
• For small transfers, latencies dominate; more participants increase latency
• For large transfers, bandwidth is key; bottlenecks are easily exposed
• May require topology-aware implementation for high performance
• Collectives are often blocking/non-overlapped
Collectives are often avoided because they are expensive. Why?
![Page 17: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/17.jpg)
18
THE CHALLENGE OF COLLECTIVES
Collectives are central to scalability in a variety of key applications:
• Deep Learning (All-reduce, broadcast, gather)
• Parallel FFT (Transposition is all-to-all)
• Molecular Dynamics (All-reduce)
• Graph Analytics (All-to-all)
• …
If collectives are so expensive, do they actually get used? YES!
![Page 18: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/18.jpg)
19
THE CHALLENGE OF COLLECTIVES
Scaling requires efficient communication algorithms and careful implementation
Communication algorithms are topology-dependent
Topologies can be complex – not every system is a fat tree
Most collectives amenable to bandwidth-optimal implementation on rings, andmany topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan]
Many implementations seen in the wild are suboptimal
![Page 19: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/19.jpg)
20
RING-BASED COLLECTIVES: A PRIMER
![Page 20: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/20.jpg)
21
BROADCASTwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 21: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/21.jpg)
22
BROADCAST
Step 1: ∆𝑡 = 𝑁/𝐵
𝑁: bytes to broadcast
𝐵: bandwidth of each link
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 22: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/22.jpg)
23
BROADCAST
Step 1: ∆𝑡 = 𝑁/𝐵
Step 2: ∆𝑡 = 𝑁/𝐵
𝑁: bytes to broadcast
𝐵: bandwidth of each link
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 23: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/23.jpg)
24
BROADCAST
Step 1: ∆𝑡 = 𝑁/𝐵
Step 2: ∆𝑡 = 𝑁/𝐵
Step 3: ∆𝑡 = 𝑁/𝐵
𝑁: bytes to broadcast
𝐵: bandwidth of each link
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 24: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/24.jpg)
25
BROADCAST
Step 1: ∆𝑡 = 𝑁/𝐵
Step 2: ∆𝑡 = 𝑁/𝐵
Step 3: ∆𝑡 = 𝑁/𝐵
Total time: 𝑘 − 1 𝑁/𝐵
𝑁: bytes to broadcast
𝐵: bandwidth of each link
𝑘: number of GPUs
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 25: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/25.jpg)
26
BROADCASTwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 26: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/26.jpg)
27
BROADCAST
Split data into 𝑆 messages
Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 27: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/27.jpg)
28
BROADCAST
Split data into 𝑆 messages
Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 28: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/28.jpg)
29
BROADCAST
Split data into 𝑆 messages
Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 29: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/29.jpg)
30
BROADCAST
Split data into 𝑆 messages
Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 4: ∆𝑡 = 𝑁/(𝑆𝐵)
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 30: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/30.jpg)
31
BROADCAST
Split data into 𝑆 messages
Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 4: ∆𝑡 = 𝑁/(𝑆𝐵)
...
Total time: S𝑁/(𝑆𝐵) + (𝑘 − 2) 𝑁 (𝑆𝐵)= 𝑁(𝑆 + 𝑘 − 2)/(𝑆𝐵) → 𝑁/𝐵
with unidirectional ring
GPU0 GPU1 GPU2 GPU3
![Page 31: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/31.jpg)
32
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 0
![Page 32: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/32.jpg)
33
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 1
![Page 33: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/33.jpg)
34
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 2
![Page 34: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/34.jpg)
35
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 3
![Page 35: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/35.jpg)
36
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 4
![Page 36: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/36.jpg)
37
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 5
![Page 37: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/37.jpg)
38
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 6
![Page 38: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/38.jpg)
39
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 1Step: 7
![Page 39: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/39.jpg)
40
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 2Step: 0
![Page 40: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/40.jpg)
41
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 2Step: 1
![Page 41: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/41.jpg)
42
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
Chunk: 2Step: 2
![Page 42: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/42.jpg)
43
ALL-REDUCEwith unidirectional ring
GPU0 GPU1 GPU2 GPU3
done
![Page 43: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/43.jpg)
44
RING-BASED COLLECTIVESA primer
GPU0 GPU1
CPU
4-GPU-PCIe
GPU2 GPU3
Switch
PCIe Gen3 x16 ~12 GB/s
![Page 44: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/44.jpg)
45
RING-BASED COLLECTIVESA primer
GPU0 GPU1
CPU
4-GPU-PCIe
GPU2 GPU3
Switch
PCIe Gen3 x16 ~12 GB/s
![Page 45: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/45.jpg)
46
RING-BASED COLLECTIVES…apply to lots of possible topologies
GPU0 GPU1
CPU
GPU2 GPU3
Switch
GPU4 GPU5
CPU
GPU6 GPU7
Switch
PCIe Gen3 x16 ~12 GB/s
SMP Connection (e.g., QPI)
![Page 46: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/46.jpg)
47
RING-BASED COLLECTIVES…apply to lots of possible topologies
GPU2
CPU
4-GPU-FC
GPU1
GPU3
GPU0
Switch
32 GB/s
GPU2
CPU
4-GPU-Ring
GPU1
GPU3
GPU0
Switch
32 GB/s
NVLink~16 GB/s
PCIe Gen3 x16 ~12 GB/s
![Page 47: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/47.jpg)
48
RING-BASED COLLECTIVES…apply to lots of possible topologies
GPU2
CPU
4-GPU-FC
GPU1
GPU3
GPU0
Switch
GPU2
CPU
4-GPU-Ring
GPU1
GPU3
GPU0
Switch
NVLink~16 GB/s
PCIe Gen3 x16 ~12 GB/s
![Page 48: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/48.jpg)
49
INTRODUCING NCCL (“NICKEL”):ACCELERATED COLLECTIVES
FOR MULTI-GPU SYSTEMS
![Page 49: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/49.jpg)
50
INTRODUCING NCCL
GOAL:
• Build a research library of accelerated collectives that is easily integrated and topology-aware so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI’s collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top to handle inter-node
Accelerating multi-GPU collective communications
![Page 50: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/50.jpg)
51
NCCL FEATURES AND FUTURES
Collectives
• Broadcast
• All-Gather
• Reduce
• All-Reduce
• Reduce-Scatter
• Scatter
• Gather
• All-To-All
• Neighborhood
Key Features
• Single-node, up to 8 GPUs
• Host-side API
• Asynchronous/non-blocking interface
• Multi-thread, multi-process support
• In-place and out-of-place operation
• Integration with MPI
• Topology Detection
• NVLink & PCIe/QPI* support
(Green = Currently available)
![Page 51: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/51.jpg)
52
NCCL IMPLEMENTATION
Implemented as monolithic CUDA C++ kernels combining the following:
• GPUDirect P2P Direct Access
• Three primitive operations: Copy, Reduce, ReduceAndCopy
• Intra-kernel synchronization between GPUs
• One CUDA thread block per ring-direction
![Page 52: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/52.jpg)
53
NCCL EXAMPLE
#include <nccl.h>
ncclComm_t comm[4];
ncclCommInitAll(comm, 4, {0, 1, 2, 3});
foreach g in (GPUs) { // or foreach thread
cudaSetDevice(g);
double *d_send, *d_recv;
// allocate d_send, d_recv; fill d_send with data
ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
// consume d_recv
}
All-reduce
![Page 53: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/53.jpg)
54
NCCL PERFORMANCEBandwidth at different problem sizes (4 Maxwell GPUs)
All-Gather
All-Reduce
Reduce-Scatter
Broadcast
![Page 54: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/54.jpg)
55
AVAILABLE NOWgithub.com/NVIDIA/nccl
![Page 55: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/55.jpg)
56
THANKS TO MY COLLABORATORS
Nathan Luehr
Jonas Lippuner
Przemek Tredak
Sylvain Jeaugey
Natalia Gimelshein
Simon Layton
This research is funded in part by the U.S. DOE DesignForward program
![Page 56: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2](https://reader035.vdocuments.net/reader035/viewer/2022062605/5fcbeb123750615d566486cd/html5/thumbnails/56.jpg)