iswitch: accelerating distributed reinforcement learning

Post on 05-Jun-2022

iSwitch: Accelerating Distributed Reinforcement Learning with In-Switch Computing

Jian Huang, Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing

University of Illinois at Urbana-Champaign

2

AI Applications are Increasingly Operating in Dynamic Environments

Autonomous Driving, Games, Robotics

Reinforcement Learning Empowers AI Applications to Take Real-Time Intelligent Actions

3

What is Reinforcement Learning?

[Figure: the Agent, which holds the Model, sends an Action to the Environment; the Environment returns a Reward and the Next State; Training produces a Gradient that updates the Model]

Train a Typical RL Agent on a Single GPU = 8 Days*

*Mnih, ICML’16

RL Requires Distributed Training for Improved Performance
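The loop in the diagram can be sketched in a few lines of Python. This is only a toy illustration: `ToyEnvironment`, `ToyAgent`, the target position 5, and the update rule are all hypothetical, and the update is a crude stand-in for a gradient step, not a real policy-gradient estimator.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D environment: the state is an integer position."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # The action is -1 or +1; the reward grows as we approach position 5.
        self.state += action
        reward = -abs(5 - self.state)
        return self.state, reward


class ToyAgent:
    """Hypothetical agent whose 'model' is a single preference weight."""

    def __init__(self):
        self.weight = 0.0  # the Model

    def act(self, state, rng):
        # Move right with a probability that grows with the weight.
        p_right = min(max(0.5 + self.weight, 0.05), 0.95)
        return 1 if rng.random() < p_right else -1

    def train(self, action, reward_change, lr=0.05):
        # Gradient-like signal: reinforce actions that improved the reward.
        gradient = action * reward_change
        self.weight += lr * gradient
        return gradient


def run_episode(steps=50, seed=0):
    rng = random.Random(seed)
    env, agent = ToyEnvironment(), ToyAgent()
    state, last_reward = env.state, -5
    for _ in range(steps):
        action = agent.act(state, rng)              # Agent -> Action
        state, reward = env.step(action)            # Environment -> Next State, Reward
        agent.train(action, reward - last_reward)   # Training -> Gradient -> Model
        last_reward = reward
    return agent.weight, state
```

Every iteration of this loop ends in a model update, which is why the iteration count matters so much once the update has to cross a network.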

4

Centralized Distributed RL Training: Parameter-Server Based

[Figure: Workers send their Gradients through the Switch to the Parameter Server; the Parameter Server performs Sum and Update Weight, then sends the Model back through the Switch to the Workers]

Multiple Network Hops

Parameter Server: Central Bottleneck
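A minimal sketch of the Sum and Update Weight step on the parameter server, assuming plain gradient descent with a hypothetical learning rate `lr` (the slides only say "Sum" and "Update Weight", so the exact update rule is an assumption):

```python
def parameter_server_step(weights, worker_gradients, lr=0.1):
    """One synchronous step: sum the worker gradients, then update the weights."""
    # Sum: add the gradients element-wise across all workers.
    summed = [sum(column) for column in zip(*worker_gradients)]
    # Update Weight: assumed to be a plain gradient-descent step.
    new_weights = [w - lr * g for w, g in zip(weights, summed)]
    # The returned list plays the role of the Model sent back to every worker.
    return new_weights


weights = [1.0, 2.0]
grads = [[0.5, 1.0], [1.5, 1.0], [1.0, 2.0], [1.0, 0.0]]  # four workers
new_weights = parameter_server_step(weights, grads, lr=0.25)
print(new_weights)  # [0.0, 1.0]
```

Every gradient has to reach this one function before any worker can continue, which is the central bottleneck the slide points at.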

5

Decentralized Distributed RL Training: AllReduce Based

[Figure: Workers connected through the Switch run Ring-AllReduce. Each worker holds a Model and passes Gradient segments around the ring, building partial Sums until every worker holds the Full Aggregated Gradient. Aggregation Complete!]

Excessive Network Hops
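The ring pattern can be simulated in plain Python. This is a sketch of the textbook Ring-AllReduce (a reduce-scatter phase followed by an all-gather phase), not code from the talk; each "chunk" is a single number to keep the example short.

```python
def ring_allreduce(worker_chunks):
    """Simulate Ring-AllReduce.

    worker_chunks[r][c] is chunk c held by worker r, with N workers and N chunks.
    Returns the final chunks on every worker and the number of ring steps taken.
    """
    n = len(worker_chunks)
    data = [list(chunks) for chunks in worker_chunks]
    steps = 0

    # Reduce-scatter: after N - 1 steps, worker r holds the full sum of chunk (r + 1) % N.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, c, value in sends:
            data[(r + 1) % n][c] += value   # the neighbor adds the partial sum
        steps += 1

    # All-gather: circulate the finished chunks until every worker has all of them.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, value in sends:
            data[(r + 1) % n][c] = value    # the neighbor overwrites with the full sum
        steps += 1

    return data, steps


result, steps = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
print(steps)   # 4
```

The step count is 2(N - 1), so it grows with the number of workers even though no single node is a bottleneck.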

6

Network Communication is the Bottleneck in Distributed RL Training

Centralized Design (Parameter Server, Switch, Workers): Network Hops = 4

Decentralized Design (Ring-AllReduce, Switch, Workers): Network Hops = 4N - 4
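One way to read the two hop counts, written as a small Python sketch. The decomposition in the comments is an assumption that happens to reproduce the slide's numbers (every transfer crosses the switch, so it costs two hops); the slide itself gives only the totals.

```python
def parameter_server_hops():
    # Gradient: worker -> switch -> parameter server (2 hops).
    # Model:    parameter server -> switch -> worker (2 hops).
    return 2 + 2


def ring_allreduce_hops(num_workers):
    # Ring-AllReduce takes 2 * (N - 1) steps, and each step is
    # worker -> switch -> neighbor (2 hops).
    return 2 * (num_workers - 1) * 2


for n in (4, 8, 12):
    print(n, parameter_server_hops(), ring_allreduce_hops(n))
```

For N = 4 workers this gives 4 hops versus 12 hops, matching 4 and 4N - 4.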

7

The Unique Characteristic of Distributed RL Training: Latency Critical

RL Benchmark          DQN-Atari          A2C-Atari           PPO-MuJoCo       DDPG-MuJoCo
Gradient Size         6 MB               3 MB                40 KB            158 KB
Training Iterations   200 M              2 M                 0.2 M            3 M

DNN Benchmark         AlexNet-ImageNet   ResNet50-ImageNet   VGG16-ImageNet   MLP-MNIST
Gradient Size         250 MB             100 MB              525 MB           4 MB
Training Iterations   320 K              600 K               370 K            10 K

88x Smaller Gradient Size

158x More Iterations

Distributed RL Training is Latency Critical
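As a quick arithmetic check, the 158x figure appears consistent with the ratio of the mean iteration counts in the two tables. The averaging method is an assumption; the slide does not say how the ratio was computed. The rounded gradient sizes in the tables do not reproduce 88x exactly under the same averaging, so that figure is left as quoted.

```python
def mean(values):
    return sum(values) / len(values)


# Training iterations copied from the RL and DNN benchmark tables.
rl_iterations = {
    "DQN-Atari": 200e6,
    "A2C-Atari": 2e6,
    "PPO-MuJoCo": 0.2e6,
    "DDPG-MuJoCo": 3e6,
}
dnn_iterations = {
    "AlexNet-ImageNet": 320e3,
    "ResNet50-ImageNet": 600e3,
    "VGG16-ImageNet": 370e3,
    "MLP-MNIST": 10e3,
}

iteration_ratio = mean(list(rl_iterations.values())) / mean(list(dnn_iterations.values()))
print(round(iteration_ratio))  # 158
```

Many iterations with small gradients means that per-message latency, rather than bandwidth, dominates the communication cost.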

8

Quantifying the Network Overhead in Distributed RL Training

[Figure: two stacked-bar charts, one for Parameter Server and one for AllReduce, showing the share of training time (0% to 100%) spent on Local Computation (Compute) and Grad Aggregation (Network) for DQN, A2C, PPO, and DDPG]

Gradient Aggregation over the Network Dominates the Training Time (50~83%)

9

In-Switch Acceleration: A New Distributed Computing Paradigm

Programmable Switch + Aggregation Accelerator

Performance: Reduce End-to-End Network Latency

Programmability: Hardware-Algorithm Co-Design

Scalability: Scale Training at Rack Scale

10

Challenges of In-Switch Acceleration

Limited On-Chip Resources

No Impact on Regular Switch Functions

Scale with More Switches and Nodes

11

Basics of Programmable Switch

Control Plane: Forwarding Control, System Configuration

Data Plane: Packet Forwarding

[Figure: data-plane pipeline. Packets (Header + Data) arrive at the Input Ports, pass through the Receivers and RxQs, the Input Arbiter, Output Port Lookup and Packet Process, then the TxQs and Transmitters to the Output Ports]

12

Integrating Aggregation Accelerator into the Programmable Switch

[Figure: the data-plane pipeline (Receivers, RxQs, Input Arbiter, Output Port Lookup, Packet Process, TxQs, Transmitters) is the Core of Regular Functions, with an Accelerator added alongside it. The Input Arbiter checks each packet Header: Regular Traffic follows the regular pipeline, and Gradient Traffic goes to the Accelerator]

Hardware Acceleration Isolated From Regular Switch Function

13

Developing Light-Weight Accelerator for Aggregation

Gradient Vector: Seg 0, Seg 1, …, Seg i, …, Seg N

Pkt i carries Seg i

[Figure: In-Switch Accelerator. A Separator splits Pkt i into Header and Payload; a Parser reads the Seg Idx from the Header; a Slicer cuts the Payload into Elements; the Buffer Module accumulates the Elements; the Counter Module compares the packet count with a Threshold; the Output Module emits the aggregated Pkt i]

Accelerator Resource Consumption: extra 18.6% of LUT, 17.3% of FF, and 17 DSP
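A software sketch of the behavior the accelerator diagram suggests, assuming the Buffer Module keeps one running sum per segment and the Counter Module releases a segment once a threshold number of packets has arrived. The class and method names are hypothetical, and the real design is hardware on the switch, not Python.

```python
class AggregationAccelerator:
    """Toy model of per-segment aggregation with a packet-count threshold."""

    def __init__(self, threshold):
        self.threshold = threshold  # packets to aggregate per segment
        self.buffers = {}           # Buffer Module: seg index -> running sums
        self.counters = {}          # Counter Module: seg index -> packets seen

    def on_packet(self, seg_idx, elements):
        """Accumulate one gradient packet.

        Returns the summed segment once `threshold` packets for that
        segment have arrived, otherwise None.
        """
        if seg_idx not in self.buffers:
            self.buffers[seg_idx] = [0.0] * len(elements)
            self.counters[seg_idx] = 0
        buf = self.buffers[seg_idx]
        for i, value in enumerate(elements):
            buf[i] += value
        self.counters[seg_idx] += 1
        if self.counters[seg_idx] == self.threshold:
            # Output Module: emit the aggregated segment and free its slot.
            del self.counters[seg_idx]
            return self.buffers.pop(seg_idx)
        return None


acc = AggregationAccelerator(threshold=2)
print(acc.on_packet(0, [1.0, 2.0]))  # None, still waiting for the second packet
print(acc.on_packet(0, [3.0, 4.0]))  # [4.0, 6.0]
```

Because each segment is tracked on its own, a finished segment can leave the switch without waiting for the rest of the gradient vector.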

14

Aggregating Gradient at Packet-Level for Improved Parallelism

Conventional Vector-Level Aggregation

Packet-Level Aggregation in Our iSwitch

[Figure: timelines ending in Sum and Result for the two schemes; packet-level aggregation further reduces the aggregation time]

15

Extending Network Protocol for In-Switch Computing

Regular Packet: ETH | IP | UDP | Application Data

Data Packet of iSwitch: ETH | IP | UDP | Seg | Gradient (uses the Type-of-Service Field of the IP header)

Control Packet of iSwitch: ETH | IP | UDP | Action | Value (optional)

Action   Description
Join     Join the training job
Leave    Leave the training job
Reset    Clear the accelerator on the switch
SetH     Set aggregation threshold H on switch
FBcast   Force broadcast a segment on switch
Help     Request a lost data packet for a worker
Ack      Confirm the success of some actions

iSwitch extension will NOT affect regular network functions
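A sketch of how the application-level payloads could be encoded in Python. Everything concrete here is an assumption made for illustration: the ToS value `ISWITCH_TOS`, the numeric action codes, and the field widths (32-bit segment index, 32-bit floats, 8-bit action, 32-bit value) are hypothetical, since the slides name the fields but not their encoding.

```python
import struct
from enum import IntEnum

ISWITCH_TOS = 0x04  # hypothetical Type-of-Service value marking iSwitch packets


class Action(IntEnum):
    # The names follow the action table; the numeric codes are made up.
    JOIN = 1
    LEAVE = 2
    RESET = 3
    SETH = 4
    FBCAST = 5
    HELP = 6
    ACK = 7


def pack_data_payload(seg_idx, gradient):
    """Seg | Gradient: a segment index followed by float32 gradient elements."""
    return struct.pack(f"!I{len(gradient)}f", seg_idx, *gradient)


def unpack_data_payload(payload):
    (seg_idx,) = struct.unpack_from("!I", payload, 0)
    count = (len(payload) - 4) // 4
    gradient = list(struct.unpack_from(f"!{count}f", payload, 4))
    return seg_idx, gradient


def pack_control_payload(action, value=None):
    """Action | Value (optional)."""
    if value is None:
        return struct.pack("!B", action)
    return struct.pack("!BI", action, value)


def is_iswitch_packet(tos):
    # Packets with any other ToS value would take the regular forwarding path.
    return tos == ISWITCH_TOS


payload = pack_data_payload(7, [0.5, -1.25])
print(unpack_data_payload(payload))  # (7, [0.5, -1.25])
```

Keeping the extension inside the UDP payload and an existing IP header field is what lets ordinary traffic pass through untouched.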

16

Supporting Different (Sync & Async) Training Execution Modes

Synchronous Distributed Training (Programmable Switch with Aggregation Accelerator): In-Switch Acceleration Directly Applies

Asynchronous Distributed Training (Programmable Switch with Aggregation Accelerator): Keep Computing, Keep Aggregating

HW/Algo Co-Design For Improved Parallelism

17

Scaling In-Switch Computing in Rack-Scale Data Centers

The Typical Network Architecture at Data Center: Racks of Servers, Top-of-Rack Switches, “Aggregate” Switches, Core Switches

The Hierarchical Aggregation of iSwitch

[Figure: Grad Pkts from the servers are merged level by level on the way up (eight, then four, then two, then one), and the aggregated Grad Pkts are then sent back down to all servers]

No Additional Cost or Topology Change for Scaling In-Switch Computing
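The level-by-level merging can be sketched in Python. This is a simplified model under stated assumptions: every switch has the same hypothetical fan-in, each switch sums whatever packets reach it, and the function names are made up for the example.

```python
def sum_packets(packets):
    """Element-wise sum of the gradient packets arriving at one switch."""
    return [sum(column) for column in zip(*packets)]


def hierarchical_aggregate(server_packets, fan_in=2):
    """Aggregate level by level until one packet remains.

    Returns the fully aggregated packet and the number of switch levels used.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = server_packets
    levels = 0
    while len(level) > 1:
        # Group the packets by the switch they arrive at, then sum per switch.
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [sum_packets(group) for group in groups]
        levels += 1
    return level[0], levels


packets = [[i, 2 * i] for i in range(1, 9)]  # eight servers, two elements each
total, levels = hierarchical_aggregate(packets)
print(total, levels)  # [36, 72] 3
```

With eight servers and a fan-in of two, the three levels correspond to the top-of-rack, aggregate, and core tiers.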

18

In-Switch Computing Implementation: NetFPGA-SUME Board, GPU Cluster

RL Training Benchmarks: DQN, A2C, PPO, DDPG

19

Reducing the End-to-End Training Time with iSwitch

[Figure: Average Episode Reward (-25 to 25) versus Training Time (min) of DQN (0 to 2000) for Parameter Server (PS), AllReduce (AR), and iSwitch (iSW); annotations: 3.7x Speedup, 1.9x]

20

Performance Breakdown for Each Training Iteration

[Figure: stacked bars of normalized Training Time (0 to 1.2) for PS, AR, and iSW on DQN, A2C, PPO, and DDPG. Components: Agent Action, Environment, Buffer Sampling, Memory Alloc, Forward Pass, Backward Pass, GPU Copy, Grad Aggregation, Weight Update, Others]

21

Improved Training Scalability with In-Switch Computing

[Figure: Speedup (1 to 3) versus Number of Worker Nodes (4, 6, 9, 12). Synchronous Training of PPO: PS, AR, iSW, Ideal. Asynchronous Training of PPO: PS, iSW, Ideal]

Close-to-Linear Speedup for Both Training Modes

22

Summary

Programmable Switch + Aggregation Accelerator = In-Switch Computing

3.7x Speedup for Both Sync/Async Training

Scales at Rack-Scale Clusters

Thanks!

Jian Huang, Youjie Li (li238@Illinois.edu), Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing

University of Illinois at Urbana-Champaign
