![Page 1: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/1.jpg)
iSwitch: Accelerating Distributed Reinforcement
Learning with In-Switch Computing
Jian Huang
Youjie Li Iou-Jen Liu Yifan Yuan
Deming Chen Alexander Schwing
University of Illinois at Urbana-Champaign
![Page 3: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/3.jpg)
AI Applications are Increasingly Operating in Dynamic Environments
Autonomous Driving, Games, Robotics
Reinforcement Learning Empowers AI Applications to Take Real-Time Intelligent Actions
![Page 14: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/14.jpg)
What is Reinforcement Learning?
[Diagram: the Agent sends an Action to the Environment, which returns a Reward and the Next State. Training computes a Gradient that updates the agent's Model.]
Train a Typical RL Agent on a Single GPU = 8 Days*
RL Requires Distributed Training for Improved Performance
*Mnih, ICML’16
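The loop in the diagram can be sketched in a few lines. This is a toy illustration only: `ToyEnvironment`, the single-weight "model", and the analytic gradient are hypothetical and are not one of the RL algorithms benchmarked in the talk, which estimate gradients from sampled rewards.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: the reward is highest when action == 2 * state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.uniform(-1.0, 1.0)

    def step(self, action):
        reward = -(action - 2.0 * self.state) ** 2
        self.state = self.rng.uniform(-1.0, 1.0)  # the next state
        return reward, self.state


def train(iterations=2000, lr=0.1):
    env = ToyEnvironment()
    weight = 0.0  # the "model": action = weight * state
    state = env.state
    for _ in range(iterations):
        action = weight * state                 # agent acts on the current state
        reward, next_state = env.step(action)   # environment returns reward + next state
        # Gradient of the reward with respect to the weight (analytic in this toy).
        gradient = -2.0 * (action - 2.0 * state) * state
        weight += lr * gradient                 # training: gradient ascent on the reward
        state = next_state
    return weight
```

Calling `train()` drives the weight toward 2.0, the value that maximizes the toy reward.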
![Page 21: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/21.jpg)
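Centralized Distributed RL Training: Parameter-Server Based
[Diagram: Workers connect through a Switch to a Parameter Server. The Model goes from the parameter server to the workers; each worker sends its Gradient back; the parameter server performs Sum and Update Weight.]
Central Bottleneck at the Parameter Server
Multiple Network Hops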
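A minimal sketch of the server-side "Sum" and "Update Weight" steps, assuming plain gradient descent. The function name, the list-of-floats representation, and the learning rate are illustrative assumptions, not details from the slides.

```python
def parameter_server_step(weights, worker_gradients, lr=0.1):
    """Sum the gradients from all workers, then apply one weight update."""
    # "Sum": element-wise sum across workers.
    summed = [sum(parts) for parts in zip(*worker_gradients)]
    # "Update Weight": one gradient-descent step (assumed update rule).
    return [w - lr * g for w, g in zip(weights, summed)]
```

Every worker's gradient has to reach this one function before any worker can receive the new model, which is the central bottleneck the slide points out.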
![Page 27: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/27.jpg)
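Decentralized Distributed RL Training: AllReduce Based
[Diagram: Workers connected through a Switch form a Ring-AllReduce. Each worker holds a Model and passes its Gradient to the next worker, which adds it to its own (Sum). Once the Aggregated Gradient is complete, it is passed around the ring until every worker is Full. Aggregation Complete!]
Excessive Network Hops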
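The ring exchange can be simulated in a single process. The sketch below uses the common chunked formulation (a reduce-scatter phase followed by an all-gather phase); the slides do not say which variant was measured, so treat the chunking as an assumption. Gradients are plain Python lists.

```python
def ring_allreduce(worker_gradients):
    """Simulate ring all-reduce; returns (per-worker results, number of ring steps)."""
    n = len(worker_gradients)
    length = len(worker_gradients[0])
    bounds = [c * length // n for c in range(n + 1)]  # chunk c = indices bounds[c]..bounds[c+1]
    data = [list(g) for g in worker_gradients]
    steps = 0

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    for phase in ("reduce", "gather"):
        offset = 0 if phase == "reduce" else 1
        for s in range(n - 1):
            # All workers send at the same time, so snapshot the messages first.
            messages = []
            for i in range(n):
                c = (i + offset - s) % n
                messages.append((c, [data[i][k] for k in chunk(c)]))
            for i, (c, values) in enumerate(messages):
                dst = (i + 1) % n  # right-hand neighbour on the ring
                for k, v in zip(chunk(c), values):
                    if phase == "reduce":
                        data[dst][k] += v   # accumulate the partial sum
                    else:
                        data[dst][k] = v    # copy the finished chunk
            steps += 1
    return data, steps
```

With N workers the simulation takes 2(N - 1) ring steps, and each step is one worker-to-worker transfer through the switch.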
![Page 30: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/30.jpg)
Network Communication is the Bottleneck in Distributed RL Training
Centralized Design (Workers, Switch, Parameter Server; Gradient exchanged through the switch): Network Hops = 4
Decentralized Design (Workers, Switch, Ring-AllReduce): Network Hops = 4N - 4
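One way to read the two hop counts is sketched below. It assumes N is the number of workers, that every transfer between two machines crosses the switch and so costs two hops, and that Ring-AllReduce takes 2(N - 1) transfer steps. These are interpretations; the slide gives only the totals.

```python
HOPS_PER_TRANSFER = 2  # sender -> switch, switch -> receiver (assumed)


def parameter_server_hops():
    # Gradient up to the server plus updated model back down: two transfers.
    return 2 * HOPS_PER_TRANSFER


def ring_allreduce_hops(num_workers):
    # 2(N - 1) ring steps, each one a transfer through the switch.
    return 2 * (num_workers - 1) * HOPS_PER_TRANSFER
```

Under these assumptions the centralized count stays at 4 regardless of cluster size, while the ring count grows linearly as 4N - 4.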
![Page 34: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/34.jpg)
The Unique Characteristic of Distributed RL Training: Latency Critical

| RL Benchmark | DQN-Atari | A2C-Atari | PPO-MuJoCo | DDPG-MuJoCo |
| --- | --- | --- | --- | --- |
| Gradient Size | 6 MB | 3 MB | 40 KB | 158 KB |
| Training Iterations | 200 M | 2 M | 0.2 M | 3 M |

| DNN Benchmark | AlexNet-ImageNet | ResNet50-ImageNet | VGG16-ImageNet | MLP-MNIST |
| --- | --- | --- | --- | --- |
| Gradient Size | 250 MB | 100 MB | 525 MB | 4 MB |
| Training Iterations | 320 K | 600 K | 370 K | 10 K |

RL compared with DNN: 88x Smaller Gradient Size, 158x More Iterations
Distributed RL Training is Latency Critical
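The 158x figure is consistent with comparing the arithmetic means of the iteration counts in the two tables. Whether the slide used exactly this averaging is an assumption. The gradient-size ratio is left out because the slide does not say how it was averaged.

```python
# Training iterations copied from the two tables above.
rl_iterations = {
    "DQN-Atari": 200e6,
    "A2C-Atari": 2e6,
    "PPO-MuJoCo": 0.2e6,
    "DDPG-MuJoCo": 3e6,
}
dnn_iterations = {
    "AlexNet-ImageNet": 320e3,
    "ResNet50-ImageNet": 600e3,
    "VGG16-ImageNet": 370e3,
    "MLP-MNIST": 10e3,
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Ratio of mean RL iterations to mean DNN iterations (assumed averaging method).
iteration_ratio = mean(rl_iterations.values()) / mean(dnn_iterations.values())
```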
![Page 37: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/37.jpg)
Quantifying the Network Overhead in Distributed RL Training
[Figure: two stacked-bar charts, one for Parameter Server and one for AllReduce, showing the share of training time (0% to 100%) spent on Local Computation (Compute) versus Grad Aggregation (Network) for DQN, A2C, PPO, and DDPG.]
Gradient Aggregation over the Network Dominates the Training Time (50% to 83%)
![Page 41: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/41.jpg)
In-Switch Acceleration: A New Distributed Computing Paradigm
[Diagram: a Programmable Switch combined with an Aggregation Accelerator.]
Performance: Reduce End-to-End Network Latency
Programmability: Hardware-Algorithm Co-Design
Scalability: Scale Training at Rack Scale
![Page 44: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/44.jpg)
Challenges of In-Switch Acceleration
Limited On-Chip Resources
No Impact on Regular Switch Functions
Scale with More Switches and Nodes
![Page 61: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/61.jpg)
Basics of Programmable Switch
Control Plane: Forwarding Control, System Configuration, …
Data Plane: Packet Forwarding from the Input Port to the Output Ports; each packet consists of a Header and Data
[Diagram: data plane pipeline. Four Receivers with receive queues (RxQ) feed the Input Arbiter, followed by Output Port Lookup and Packet Process, then four transmit queues (TxQ) with Transmitters.]
![Page 76: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/76.jpg)
Integrating Aggregation Accelerator into the Programmable Switch
[Diagram: the data plane pipeline (Receivers and RxQs, Input Arbiter, Output Port Lookup, Packet Process, TxQs and Transmitters) with an Accelerator attached at the Input Arbiter. The Input Arbiter checks each packet Header: Regular Traffic continues through Output Port Lookup and Packet Process, the core of the regular switch functions, while Gradient Traffic goes to the Accelerator.]
Hardware Acceleration Isolated From Regular Switch Function
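The traffic split at the Input Arbiter can be sketched in software as a header check. The `Packet` type, the `tag` field, and the `GRADIENT_TAG` value are hypothetical; the slides do not show the real header layout, and the actual arbiter is hardware in the switch data plane.

```python
from dataclasses import dataclass

GRADIENT_TAG = 0x1  # hypothetical header value that marks gradient packets


@dataclass
class Packet:
    tag: int        # hypothetical header field inspected by the arbiter
    payload: bytes


def input_arbiter(packets):
    """Split traffic: regular packets keep the normal path, gradient packets go to the accelerator."""
    regular_path, accelerator_path = [], []
    for packet in packets:
        if packet.tag == GRADIENT_TAG:
            accelerator_path.append(packet)
        else:
            regular_path.append(packet)
    return regular_path, accelerator_path
```

Because the check happens before Output Port Lookup, regular packets never touch the accelerator, which is the isolation property the slide states.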
![Page 90: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/90.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Header
Seg
Idx
Payload In-Switch Accelerator
Pkt i
Seg i
![Page 91: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/91.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload In-Switch Accelerator
Pkt i
Seg i
![Page 92: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/92.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload In-Switch Accelerator
Pkt i
Seg i
![Page 93: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/93.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload In-Switch Accelerator
Pkt i
Seg i
![Page 94: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/94.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
![Page 95: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/95.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
![Page 96: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/96.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
![Page 97: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/97.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
![Page 98: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/98.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload
Counter Module
In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
![Page 99: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/99.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload
Counter Module
In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
Pkt i
![Page 100: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/100.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload
Counter Module
In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
Pkt i
![Page 101: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/101.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload
Counter Module
In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
![Page 102: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/102.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload
Counter Module
In-Switch Accelerator
Slicer
Elements
Pkt i
Seg i
Threshold
![Page 103: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/103.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload
Counter Module
In-Switch Accelerator
Output
Module
Slicer
Elements
Pkt i
Seg i
Threshold
![Page 104: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/104.jpg)
Separator
13
Developing Light-Weight Accelerator for Aggregation
Seg 0 Seg 1 … Seg i … Seg NGradient Vector
Parser
Buffer ModuleHeader
Seg
Idx
Payload
Counter Module
Pkt i
In-Switch Accelerator
Output
Module
Slicer
Elements
Pkt i
Seg i
Threshold
![Page 105: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/105.jpg)
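13
Developing Light-Weight Accelerator for Aggregation
[Figure: In-Switch Accelerator pipeline. The Gradient Vector is split into segments Seg 0, Seg 1, …, Seg i, …, Seg N, and Seg i travels in packet Pkt i. Inside the accelerator, the Separator splits Pkt i into Header and Payload; the Parser reads the segment index (Seg Idx) from the Header; the Slicer cuts the Payload into Elements; the Buffer Module accumulates them; the Counter Module checks the count against a Threshold; the Output Module emits the aggregated Pkt i.]
Accelerator Resource Consumption:
extra 18.6% of LUT, 17.3% of FF, and 17 DSP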
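The buffer, counter, and output stages named on this slide can be modeled in software. The following is a minimal sketch, not the hardware design: the class name `AggregationSketch`, the method `on_packet`, and the use of Python lists for segment elements are assumptions made for illustration. It only shows the idea of summing each segment until a threshold number of packets has arrived.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AggregationSketch:
    """Software stand-in (hypothetical) for the buffer, counter, and output modules."""
    threshold: int  # H: number of packets to sum per segment
    buffers: Dict[int, List[float]] = field(default_factory=dict)
    counters: Dict[int, int] = field(default_factory=dict)

    def on_packet(self, seg_idx: int, elements: List[float]) -> Optional[List[float]]:
        # Buffer module: add the incoming elements to the running sum.
        running = self.buffers.get(seg_idx, [0.0] * len(elements))
        if len(running) != len(elements):
            raise ValueError("segment length changed between packets")
        running = [a + b for a, b in zip(running, elements)]
        # Counter module: count contributions for this segment.
        count = self.counters.get(seg_idx, 0) + 1
        if count < self.threshold:
            self.buffers[seg_idx] = running
            self.counters[seg_idx] = count
            return None
        # Output module: threshold reached, emit the sum and clear state.
        self.buffers.pop(seg_idx, None)
        self.counters.pop(seg_idx, None)
        return running
```

With `threshold=3`, the first two packets for a segment return `None` and the third returns the element-wise sum. State for that segment is then cleared so the next training iteration starts from zero.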
![Page 108: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/108.jpg)
14
Aggregating Gradient at Packet-Level for Improved Parallelism
[Figure: timelines for Conventional Vector-Level Aggregation and Packet-Level Aggregation in Our iSwitch, each showing the Sum and Result stages.]
Further Reduce Aggregation Time
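The difference between the two schemes can be illustrated with a small streaming sketch. It rests on assumptions not stated in the slides: packets are modeled as `(seg_idx, payload)` tuples, and the helper names `segment` and `packet_level_aggregate` are hypothetical. The point is that a segment's sum becomes available as soon as every worker's packet for that segment has arrived, without waiting for whole gradient vectors.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def segment(vec: List[float], seg_len: int) -> List[List[float]]:
    """Split a gradient vector into fixed-length segments, one per packet."""
    return [vec[i:i + seg_len] for i in range(0, len(vec), seg_len)]


def packet_level_aggregate(
    arrivals: Iterable[Tuple[int, List[float]]], num_workers: int
) -> Iterator[Tuple[int, List[float]]]:
    """Yield (seg_idx, sum) as soon as all workers' packets for a segment arrive."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for seg_idx, payload in arrivals:
        prev = sums.get(seg_idx, [0.0] * len(payload))
        sums[seg_idx] = [a + b for a, b in zip(prev, payload)]
        counts[seg_idx] = counts.get(seg_idx, 0) + 1
        if counts[seg_idx] == num_workers:
            counts.pop(seg_idx)
            yield seg_idx, sums.pop(seg_idx)
```

Because the function is a generator, the result for segment 0 can be consumed before any packet of segment 1 has been read. A vector-level scheme would have to wait for every packet of every worker before producing the first result.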
![Page 109: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/109.jpg)
15
Extending Network Protocol for In-Switch Computing
Regular Packet:
ETH IP UDP Application Data
![Page 115: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/115.jpg)
15
Extending Network Protocol for In-Switch Computing
Data Packet of iSwitch: ETH | IP | UDP | Seg | Gradient (Seg and Gradient occupy the Application Data; the Type-of-Service Field of the IP header is highlighted)
Control Packet of iSwitch: ETH | IP | UDP | Action | Value (optional) (Action and Value occupy the Application Data)

| Action | Description |
| --- | --- |
| Join | Join the training job |
| Leave | Leave the training job |
| Reset | Clear the accelerator on the switch |
| SetH | Set aggregation threshold H on switch |
| FBcast | Force broadcast a segment on switch |
| Help | Request a lost data packet for a worker |
| Ack | Confirm the success of some actions |

iSwitch extension will NOT affect regular network functions
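The two payload layouts can be sketched with Python's `struct` module. The slides give only the field names, so everything about widths and encodings here is an assumption made for illustration: a 4-byte segment index, 32-bit float gradient elements, a 1-byte action code, a 4-byte optional value, and the numeric codes assigned to each action.

```python
import struct
from enum import IntEnum
from typing import List, Optional, Tuple


class Action(IntEnum):
    """Control actions from the slide; the numeric codes are hypothetical."""
    JOIN = 0
    LEAVE = 1
    RESET = 2
    SETH = 3
    FBCAST = 4
    HELP = 5
    ACK = 6


def pack_data(seg_idx: int, gradient: List[float]) -> bytes:
    """UDP payload of a data packet: Seg followed by Gradient elements."""
    return struct.pack(f"!I{len(gradient)}f", seg_idx, *gradient)


def unpack_data(payload: bytes) -> Tuple[int, List[float]]:
    """Inverse of pack_data: recover the segment index and the elements."""
    (seg_idx,) = struct.unpack_from("!I", payload)
    n = (len(payload) - 4) // 4
    return seg_idx, list(struct.unpack_from(f"!{n}f", payload, 4))


def pack_control(action: Action, value: Optional[int] = None) -> bytes:
    """UDP payload of a control packet: Action plus an optional Value."""
    if value is None:
        return struct.pack("!B", action)
    return struct.pack("!BI", action, value)
```

For example, `pack_control(Action.SETH, 4)` would encode a request to set the aggregation threshold H to 4. Because both layouts live inside the UDP payload, a switch without the extension would forward them like any other UDP packet.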
![Page 121: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/121.jpg)
16
Supporting Different (Sync & Async) Training Execution Modes
Synchronous Distributed Training: In-Switch Acceleration Directly Applies
Asynchronous Distributed Training: workers Keep Computing while the Aggregation Accelerator in the Programmable Switch keeps Aggregating
HW/Algo Co-Design For Improved Parallelism
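Why overlapping computation with aggregation helps can be shown with a back-of-the-envelope timing model. This model is not taken from the talk: it assumes a fixed per-iteration compute time and aggregation time, and an idealized two-stage pipeline in the asynchronous case. The function names are hypothetical.

```python
def sync_time(iterations: int, t_compute: float, t_aggregate: float) -> float:
    """Synchronous mode: each iteration computes, then waits for aggregation."""
    return iterations * (t_compute + t_aggregate)


def overlapped_time(iterations: int, t_compute: float, t_aggregate: float) -> float:
    """Idealized overlap: after the first iteration fills the pipeline,
    each further iteration costs only the slower of the two stages."""
    if iterations == 0:
        return 0.0
    return t_compute + t_aggregate + (iterations - 1) * max(t_compute, t_aggregate)
```

With 10 iterations, 3 time units of compute, and 1 of aggregation, the synchronous model gives 40 units and the overlapped model gives 31. Real asynchronous training also changes convergence behavior, which this model does not capture.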
![Page 122: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/122.jpg)
17
Scaling In-Switch Computing in Rack-Scale Data Centers
The Typical Network Architecture at Data Center
[Figure: three-tier tree with Core Switches at the top, "Aggregate" Switches below them, Top-of-Rack Switches below those, and Racks of Servers at the bottom.]
![Page 132: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/132.jpg)
17
Scaling In-Switch Computing in Rack-Scale Data Centers
The Hierarchical Aggregation of iSwitch
[Figure: the same tree of Racks of Servers, Top-of-Rack Switches, "Aggregate" Switches, and Core Switches. Gradient packets (Grad Pkt) from the servers are merged at each switch level on the way up, and the aggregated Grad Pkt is then sent back down to all servers.]
No Additional Cost or Topology Change for Scaling In-Switch Computing
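Hierarchical aggregation can be sketched as a two-level reduction. This is a simplified model under stated assumptions: one upper-level switch, gradients as Python lists, and hypothetical helper names `elementwise_sum` and `hierarchical_aggregate`. It shows that summing per rack first and then across racks gives the same result as a flat sum, while each Top-of-Rack switch sends only one partial sum upward.

```python
from typing import List


def elementwise_sum(vectors: List[List[float]]) -> List[float]:
    """Sum equally sized vectors element by element."""
    return [sum(column) for column in zip(*vectors)]


def hierarchical_aggregate(racks: List[List[List[float]]]) -> List[float]:
    """racks[r][w] is the gradient of worker w in rack r."""
    # Level 1: each Top-of-Rack switch sums the gradients of its own servers.
    partial_sums = [elementwise_sum(rack) for rack in racks]
    # Level 2: an upper-level switch sums the per-rack partial sums.
    return elementwise_sum(partial_sums)
```

For two racks of two workers each, the upper-level switch receives two packets per segment instead of four. The saving on the uplinks grows with the number of servers per rack.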
![Page 133: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/133.jpg)
18
In-Switch Computing Implementation
NetFPGA-SUME Board
GPU Cluster
RL Training Benchmarks: DQN, A2C, PPO, DDPG
![Page 140: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/140.jpg)
19
Reducing the End-to-End Training Time with iSwitch
[Figure: Average Episode Reward (y-axis, -25 to 25) versus Training Time (min) of DQN (x-axis, 0 to 2000) for Parameter Server (PS), AllReduce (AR), and iSwitch (iSW). Annotations on the plot: 3.7x Speedup, 1.9x.]
![Page 144: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/144.jpg)
20
Performance Breakdown for Each Training Iteration
[Figure: stacked bars of Training Time (Norm), y-axis 0 to 1.2, for PS, AR, and iSW on each of DQN, A2C, PPO, and DDPG. Components: Agent Action, Environment, Buffer Sampling, Memory Alloc, Forward Pass, Backward Pass, GPU Copy, Grad Aggregation, Weight Update, Others.]
![Page 147: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/147.jpg)
21
Improved Training Scalability with In-Switch Computing
[Figure, left panel: Synchronous Training of PPO. Speedup (1 to 3) versus Number of Worker Nodes (4, 6, 9, 12) for PS, AR, iSW, and Ideal.]
[Figure, right panel: Asynchronous Training of PPO. Speedup (1 to 3) versus Number of Worker Nodes (4, 6, 9, 12) for PS, iSW, and Ideal.]
Close-to Linear Speedup for Both Training Modes
![Page 148: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/148.jpg)
22
In-Switch Computing Summary
Programmable Switch
Aggregation Accelerator
3.7x Speedup for Both Sync/Async Training
Scales at Rack-Scale Clusters
![Page 149: iSwitch: Accelerating Distributed Reinforcement Learning](https://reader036.vdocuments.net/reader036/viewer/2022081412/629c2d7e21e70812ac17d753/html5/thumbnails/149.jpg)
Thanks!
Jian Huang
Youjie Li
Iou-Jen Liu Yifan Yuan
Deming Chen Alexander Schwing
University of Illinois at Urbana-Champaign