BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

Songtao Wang 1,2, Dan Li 1, Yang Cheng 1, Jinkun Geng 1, Yanshu Wang 1, Shuai Wang 1, Shutao Xia 2 and Jianping Wu 1

1 Department of Computer Science and Technology, Tsinghua University; 2 Graduate School at Shenzhen, Tsinghua University

Introduction

Recent breakthroughs in machine learning (ML) theory have opened new opportunities in many fields, while placing heavy demands on computation and data. GPUs and distributed clusters are now widely used to accelerate this heavy workload, which in turn poses new challenges for networking and for distributed synchronization algorithms. In distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a new gradient synchronization algorithm with higher network performance and lower network cost than the current practice. BML runs on a BCube network instead of the traditional Fat-Tree topology. The BML algorithm is designed such that, compared with the parameter server (PS) algorithm on a Fat-Tree network connecting the same number of server machines, BML theoretically achieves 1/k of the gradient synchronization time (GST) while using only k/5 of the switches (the typical value of k is 2∼4). Experiments with the LeNet-5 and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4%.

DML Background and Motivation

Figure 1: Gradient synchronization algorithms. (a) PS algorithm; (b) Ring AllReduce algorithm.

Gradient Synchronization Algorithm: There are two representative algorithms for synchronizing gradients among machines, namely the PS algorithm and the Ring AllReduce algorithm. The PS algorithm is shown in Fig. 1(a), in which a logical parameter server (composed of a number of physical servers) interacts with every worker for parameter update. Each worker pushes its local gradients to the parameter server after training in an iteration, and pulls the aggregated gradients from the parameter server before starting the next iteration. With more workers in DML, the amount of traffic exchanged with the parameter server also increases. The Ring AllReduce algorithm is widely used in HPC, as shown in Fig. 1(b), where all the machines are organized as a logical ring and the algorithm runs in two stages. In the scatter stage, it takes N-1 steps for each machine to aggregate 1/N of the gradients; in the gather stage, it takes N-1 more steps for each machine to obtain a complete set of the updated gradients.
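As a concrete illustration of the two stages described above, here is a minimal NumPy simulation of Ring AllReduce (our own sketch, not code from the poster): each of the N simulated machines reduces one chunk per step during the scatter stage and then circulates the reduced chunks during the gather stage.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate Ring AllReduce over a list of N equal-length gradient vectors.

    Returns a list in which every machine holds the element-wise sum.
    """
    N = len(grads)
    chunks = [np.array_split(g.astype(float), N) for g in grads]

    # Scatter stage: N-1 steps.  At each step every machine forwards one chunk
    # to its right neighbour and adds the chunk received from its left
    # neighbour, so machine i ends up owning the fully reduced chunk (i+1) % N.
    for step in range(N - 1):
        sends = [chunks[i][(i - step) % N].copy() for i in range(N)]
        for i in range(N):
            c = (i - 1 - step) % N
            chunks[i][c] = chunks[i][c] + sends[(i - 1) % N]

    # Gather stage: N-1 more steps.  Fully reduced chunks are circulated around
    # the ring until every machine holds the complete set of updated gradients.
    for step in range(N - 1):
        sends = [chunks[i][(i + 1 - step) % N].copy() for i in range(N)]
        for i in range(N):
            chunks[i][(i - step) % N] = sends[(i - 1) % N]

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]   # N = 4 machines
    out = ring_allreduce(grads)
    assert all(np.allclose(o, sum(grads)) for o in out)
```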

Figure 2: A Fat-Tree network with 16 servers.

Physical Network Topology: Fat-Tree is the current practice of physical network topology in most commercial data centers, as shown in Fig. 2. Running PS on Fat-Tree offers flexibility in placing the parameter servers and workers. To fully utilize the hardware resources, we can place one parameter server and one worker in each physical server, i.e., the P2P placement. We can also run the Ring AllReduce algorithm on a Fat-Tree network. Both PS (P2P) and Ring AllReduce on Fat-Tree consume 2*(N-1)/N * T_F of GST in each iteration.
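For reference, the per-iteration cost stated above can be written down directly; the short sketch below (ours, not from the poster) computes the Fat-Tree GST from this formula together with the 1/k GST that BML claims in the introduction, treating T_F simply as a given time unit.

```python
def fat_tree_gst(N, T_F):
    # Per-iteration GST of PS (P2P) or Ring AllReduce on Fat-Tree,
    # using the 2*(N-1)/N * T_F expression quoted above.
    return 2.0 * (N - 1) / N * T_F

def bml_gst_claimed(N, T_F, k=2):
    # BML's claimed GST: 1/k of the Fat-Tree PS value, where k is the number
    # of BCube levels / NIC ports per server (typically 2 to 4).
    return fat_tree_gst(N, T_F) / k

if __name__ == "__main__":
    for N in (9, 16, 64):
        # As N grows, the Fat-Tree GST approaches 2 * T_F per iteration.
        print(N, fat_tree_gst(N, T_F=1.0), bml_gst_claimed(N, T_F=1.0, k=2))
```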

Motivation of BML Design: Fat-Tree does not match the traffic pattern of DML training well. When running the PS algorithm, each synchronization flow needs to traverse multiple hops of switches before being aggregated, which not only hurts gradient synchronization performance but also wastes bandwidth and link resources. We therefore seek a new gradient synchronization algorithm on an alternative network topology that achieves a lower GST at lower hardware cost. The new network topology and the synchronization algorithm atop it can be used to build a server cluster purposely for DML training.

BML Design

BML takes full advantage of the BCube network topology to achieve high performance in distributed training over multiple nodes. On top of BCube we design a new gradient synchronization algorithm, namely BML. The algorithm runs in an aggregation stage followed by a broadcast stage, each with one step per BCube level; two parallel synchronization threads traverse the levels in opposite orders so that both the level-0 and the level-1 links are used in every step, as illustrated in Fig. 3.
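To illustrate the per-thread data movement suggested by Fig. 3 (a reduce-scatter across each BCube level during aggregation, then an allgather across the levels in reverse order during broadcast), here is a small single-thread NumPy simulation on an n x n grid of servers, where row-mates share a level-0 switch and column-mates share a level-1 switch. This is our own simplification of the pattern, not the authors' two-thread implementation.

```python
import numpy as np

def bml_like_sync(G):
    """Single-thread sketch of the aggregate-then-broadcast pattern in Fig. 3.

    G has shape (n, n, n, n): G[i, j] is the local gradient of server (i, j),
    reshaped as [piece, sub-piece].  Returns a dict mapping each server to its
    final gradient state.
    """
    n = G.shape[0]
    G = G.astype(float)

    # Aggregation step 1 (level-0 links): reduce-scatter within each row, so
    # server (i, j) owns piece j summed over row i (cf. Fig. 3(b)).
    a1 = {(i, j): G[i, :, j, :].sum(axis=0)
          for i in range(n) for j in range(n)}

    # Aggregation step 2 (level-1 links): reduce-scatter within each column, so
    # server (i, j) owns sub-piece i of piece j summed over ALL servers (Fig. 3(c)).
    a2 = {(i, j): sum(a1[(r, j)][i] for r in range(n))
          for i in range(n) for j in range(n)}

    # Broadcast step 1 (level-1 links): allgather within each column, so every
    # server in column j recovers the whole globally reduced piece j (Fig. 3(d)).
    b1 = {(i, j): np.array([a2[(s, j)] for s in range(n)])
          for i in range(n) for j in range(n)}

    # Broadcast step 2 (level-0 links): allgather within each row, so every
    # server holds all pieces, i.e. the full aggregated gradients (Fig. 3(e)).
    return {(i, j): np.stack([b1[(i, p)] for p in range(n)])
            for i in range(n) for j in range(n)}

if __name__ == "__main__":
    n = 3                                    # 3 x 3 grid, like the 9-server testbed
    G = np.random.default_rng(0).standard_normal((n, n, n, n))
    out = bml_like_sync(G)
    expected = G.sum(axis=(0, 1))            # element-wise sum of all local gradients
    assert all(np.allclose(v, expected) for v in out.values())
```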

Figure 3: The BML gradient synchronization algorithm on the 9-server BCube testbed (servers S[0,0] to S[2,2], each running two synchronization threads over the level-0 and level-1 links). (a) Initial states of the gradient pieces; (b) states after the first step of the aggregation stage; (c) states after the second step of the aggregation stage; (d) states after the first step of the broadcast stage; (e) states after the second step of the broadcast stage.

Implementation and Evaluation

We implement BML in TensorFlow v1.4 and run it on a BCube(3,2) testbed with 9 servers (each equipped with two K40c GPUs, 64 GB DRAM and an E5-2620 CPU) and multiple 40GE switches. To compare BML with Fat-Tree, we also build a Fat-Tree network with the same number of GPU servers and run the P2P-based PS algorithm for gradient synchronization on it, where each server acts as both a parameter server and a worker. RoCE (RDMA over Converged Ethernet) is used as the transport protocol in all these networks. We run two representative public deep learning benchmarks, namely LeNet-5 and VGG-19, on each network. To measure the training speed of DML, we vary the sub-minibatch size per training server across rounds of experiments, fix the number of training iterations in each round at 1000, and measure the job completion time (JCT) of the benchmark (Fig. 4).
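A measurement driver of this kind can be sketched as follows (hypothetical; `launch_training_job` is a stand-in for starting the distributed TensorFlow job on the testbed and is not part of the poster):

```python
import time

NUM_ITERATIONS = 1000                         # fixed per round, as described above
SUB_MINIBATCH_SIZES = [128, 256, 512, 1024]   # LeNet-5 sweep shown in Fig. 4(a)

def launch_training_job(sub_minibatch_size, iterations):
    """Placeholder: start the distributed training job and block until it finishes."""
    raise NotImplementedError

def measure_jct():
    jct = {}
    for bs in SUB_MINIBATCH_SIZES:
        start = time.time()
        launch_training_job(bs, NUM_ITERATIONS)
        jct[bs] = time.time() - start         # job completion time (seconds)
    return jct
```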

Figure 4: Job completion time (JCT) under different sub-minibatch sizes. (a) LeNet-5 (sub-minibatch sizes 128 to 1024); (b) VGG-19 (sub-minibatch sizes 4 to 64). Both panels compare BML with Fat-Tree; JCT is measured in seconds.

In this experiment, we find that BML reduces the job completion time by 0%∼50% when training LeNet-5 and by 29.2%∼52.1% when training VGG-19.

Conclusion

In this work we design BML, a new DML gradient synchronization algorithm with higher performance and lower cost. BML runs on the BCube topology instead of the Fat-Tree network commonly used in current data centers. Compared with the PS algorithm running on a Fat-Tree network connecting the same number of servers, BML achieves 1/k of the GST while using only k/5 of the switches. Experiments with typical deep learning benchmarks on TensorFlow also validate the performance gains of BML.

NIPS, Dec. 2018, Montreal, Canada