BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

Songtao Wang 1,2, Dan Li 1, Yang Cheng 1, Jinkun Geng 1, Yanshu Wang 1, Shuai Wang 1, Shutao Xia 2 and Jianping Wu 1

1 Department of Computer Science and Technology, Tsinghua University; 2 Graduate School at Shenzhen, Tsinghua University

Introduction

Recent breakthroughs in machine learning (ML) theory have opened new opportunities in many fields, while placing heavy demands on computation and data. GPUs and distributed clusters are now widely used to accelerate this heavy workload, which in turn poses new challenges for networking and for distributed synchronization algorithms. In distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a new gradient synchronization algorithm with higher network performance and lower network cost than the current practice. BML runs on a BCube network instead of the traditional Fat-Tree topology. The BML algorithm is designed such that, compared with the parameter server (PS) algorithm on a Fat-Tree network connecting the same number of server machines, BML theoretically achieves 1/k of the gradient synchronization time (GST) while using only k/5 of the switches (the typical value of k is 2∼4). Experiments with the LeNet-5 and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4%.

DML Background and Motivation

Figure 1: Gradient synchronization algorithms. (a) PS algorithm; (b) Ring AllReduce algorithm.

Gradient Synchronization Algorithm: There are two representative algorithms for synchronizing gradients among machines, namely the PS algorithm and the Ring AllReduce algorithm. The PS algorithm is shown in Fig. 1(a), in which a logical parameter server (composed of a number of physical servers) interacts with every worker for parameter update. Each worker pushes its local gradients to the parameter server after training in an iteration, and pulls the aggregated gradients from the parameter server before starting the next iteration. With more workers in DML, the amount of traffic exchanged with the parameter server also increases. The Ring AllReduce algorithm is widely used in HPC, as shown in Fig. 1(b), where all the machines are organized as a logical ring and the algorithm runs in two stages. In the scatter stage, it takes N-1 steps for each machine to aggregate 1/N of the gradients; in the gather stage, it takes N-1 more steps for each machine to obtain a complete set of the updated gradients.
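As a concrete illustration of the two stages described above, here is a minimal NumPy simulation of Ring AllReduce (our own sketch, not code from the poster): each of the N simulated machines reduces one chunk per step during the scatter stage and then circulates the reduced chunks during the gather stage.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate Ring AllReduce over a list of N equal-length gradient vectors.

    Returns a list in which every machine holds the element-wise sum.
    """
    N = len(grads)
    chunks = [np.array_split(g.astype(float), N) for g in grads]

    # Scatter stage: N-1 steps.  At each step every machine forwards one chunk
    # to its right neighbour and adds the chunk received from its left
    # neighbour, so machine i ends up owning the fully reduced chunk (i+1) % N.
    for step in range(N - 1):
        sends = [chunks[i][(i - step) % N].copy() for i in range(N)]
        for i in range(N):
            c = (i - 1 - step) % N
            chunks[i][c] = chunks[i][c] + sends[(i - 1) % N]

    # Gather stage: N-1 more steps.  Fully reduced chunks are circulated around
    # the ring until every machine holds the complete set of updated gradients.
    for step in range(N - 1):
        sends = [chunks[i][(i + 1 - step) % N].copy() for i in range(N)]
        for i in range(N):
            chunks[i][(i - step) % N] = sends[(i - 1) % N]

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]   # N = 4 machines
    out = ring_allreduce(grads)
    assert all(np.allclose(o, sum(grads)) for o in out)
```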

Figure 2: A Fat-Tree network with 16 servers.

Physical Network Topology: Fat-Tree is the current practice of physical network topology in most commercial data centers, as shown in Fig. 2. Running PS on Fat-Tree offers flexibility in placing the parameter servers and workers. To fully utilize the hardware resources, we can place one parameter server and one worker in each physical server, i.e., the P2P placement. We can also run the Ring AllReduce algorithm on a Fat-Tree network. Both PS (P2P) and Ring AllReduce on Fat-Tree consume 2*(N-1)/N * T_F of GST in each iteration.
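For reference, the per-iteration cost stated above can be written down directly; the short sketch below (ours, not from the poster) computes the Fat-Tree GST from this formula together with the 1/k GST that BML claims in the introduction, treating T_F simply as a given time unit.

```python
def fat_tree_gst(N, T_F):
    # Per-iteration GST of PS (P2P) or Ring AllReduce on Fat-Tree,
    # using the 2*(N-1)/N * T_F expression quoted above.
    return 2.0 * (N - 1) / N * T_F

def bml_gst_claimed(N, T_F, k=2):
    # BML's claimed GST: 1/k of the Fat-Tree PS value, where k is the number
    # of BCube levels / NIC ports per server (typically 2 to 4).
    return fat_tree_gst(N, T_F) / k

if __name__ == "__main__":
    for N in (9, 16, 64):
        # As N grows, the Fat-Tree GST approaches 2 * T_F per iteration.
        print(N, fat_tree_gst(N, T_F=1.0), bml_gst_claimed(N, T_F=1.0, k=2))
```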

Motivation of BML Design: Fat-Tree does not match the traffic pattern of DML training well. When running the PS algorithm, each synchronization flow needs to traverse multiple hops of switches before being aggregated, which not only hurts gradient synchronization performance but also wastes bandwidth and link resources. We therefore seek a new gradient synchronization algorithm on an alternative network topology that achieves a lower GST at lower hardware cost. The new network topology and the synchronization algorithm atop it can be used to build a server cluster purposely for DML training.

BML Design

BML takes full advantage of the BCube network topology to achieve high performance in distributed training over multiple nodes. On top of BCube we design a new gradient synchronization algorithm, namely BML. The algorithm runs in an aggregation stage followed by a broadcast stage, each with one step per BCube level; two parallel synchronization threads traverse the levels in opposite orders so that both the level-0 and the level-1 links are used in every step, as illustrated in Fig. 3.
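To illustrate the per-thread data movement suggested by Fig. 3 (a reduce-scatter across each BCube level during aggregation, then an allgather across the levels in reverse order during broadcast), here is a small single-thread NumPy simulation on an n x n grid of servers, where row-mates share a level-0 switch and column-mates share a level-1 switch. This is our own simplification of the pattern, not the authors' two-thread implementation.

```python
import numpy as np

def bml_like_sync(G):
    """Single-thread sketch of the aggregate-then-broadcast pattern in Fig. 3.

    G has shape (n, n, n, n): G[i, j] is the local gradient of server (i, j),
    reshaped as [piece, sub-piece].  Returns a dict mapping each server to its
    final gradient state.
    """
    n = G.shape[0]
    G = G.astype(float)

    # Aggregation step 1 (level-0 links): reduce-scatter within each row, so
    # server (i, j) owns piece j summed over row i (cf. Fig. 3(b)).
    a1 = {(i, j): G[i, :, j, :].sum(axis=0)
          for i in range(n) for j in range(n)}

    # Aggregation step 2 (level-1 links): reduce-scatter within each column, so
    # server (i, j) owns sub-piece i of piece j summed over ALL servers (Fig. 3(c)).
    a2 = {(i, j): sum(a1[(r, j)][i] for r in range(n))
          for i in range(n) for j in range(n)}

    # Broadcast step 1 (level-1 links): allgather within each column, so every
    # server in column j recovers the whole globally reduced piece j (Fig. 3(d)).
    b1 = {(i, j): np.array([a2[(s, j)] for s in range(n)])
          for i in range(n) for j in range(n)}

    # Broadcast step 2 (level-0 links): allgather within each row, so every
    # server holds all pieces, i.e. the full aggregated gradients (Fig. 3(e)).
    return {(i, j): np.stack([b1[(i, p)] for p in range(n)])
            for i in range(n) for j in range(n)}

if __name__ == "__main__":
    n = 3                                    # 3 x 3 grid, like the 9-server testbed
    G = np.random.default_rng(0).standard_normal((n, n, n, n))
    out = bml_like_sync(G)
    expected = G.sum(axis=(0, 1))            # element-wise sum of all local gradients
    assert all(np.allclose(v, expected) for v in out.values())
```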

Figure 3: The BML gradient synchronization algorithm on the 9-server BCube testbed (servers S[0,0] to S[2,2], each running two synchronization threads over the level-0 and level-1 links). (a) Initial states of the gradient pieces; (b) states after the first step of the aggregation stage; (c) states after the second step of the aggregation stage; (d) states after the first step of the broadcast stage; (e) states after the second step of the broadcast stage.

Implementation and Evaluation

We implement BML in TensorFlow v1.4 and run it on a BCube(3,2) testbed with 9 servers (each equipped with two K40c GPUs, 64 GB DRAM and an E5-2620 CPU) and multiple 40GE switches. To compare BML with Fat-Tree, we also build a Fat-Tree network with the same number of GPU servers and run the P2P-based PS algorithm for gradient synchronization on it, where each server acts as both a parameter server and a worker. RoCE (RDMA over Converged Ethernet) is used as the transport protocol in all these networks. We run two representative public deep learning benchmarks, namely LeNet-5 and VGG-19, on each network. To measure the training speed of DML, we vary the sub-minibatch size per training server across rounds of experiments, fix the number of training iterations in each round at 1000, and measure the job completion time (JCT) of the benchmark (Fig. 4).
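A measurement driver of this kind can be sketched as follows (hypothetical; `launch_training_job` is a stand-in for starting the distributed TensorFlow job on the testbed and is not part of the poster):

```python
import time

NUM_ITERATIONS = 1000                         # fixed per round, as described above
SUB_MINIBATCH_SIZES = [128, 256, 512, 1024]   # LeNet-5 sweep shown in Fig. 4(a)

def launch_training_job(sub_minibatch_size, iterations):
    """Placeholder: start the distributed training job and block until it finishes."""
    raise NotImplementedError

def measure_jct():
    jct = {}
    for bs in SUB_MINIBATCH_SIZES:
        start = time.time()
        launch_training_job(bs, NUM_ITERATIONS)
        jct[bs] = time.time() - start         # job completion time (seconds)
    return jct
```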

Figure 4: Job completion time (JCT) under different sub-minibatch sizes. (a) LeNet-5 (sub-minibatch sizes 128 to 1024); (b) VGG-19 (sub-minibatch sizes 4 to 64). Both panels compare BML with Fat-Tree; JCT is measured in seconds.

In this experiment, we find that BML reduces the job completion time by 0%∼50% when training LeNet-5 and by 29.2%∼52.1% when training VGG-19.

Conclusion

In this work we design BML, a new DML gradient synchronization algorithm with higher performance and lower cost. BML runs on the BCube topology instead of the Fat-Tree network commonly used in current data centers. Compared with the PS algorithm running on a Fat-Tree network connecting the same number of servers, BML achieves 1/k of the GST while using only k/5 of the switches. Experiments with typical deep learning benchmarks on TensorFlow also validate the performance gains of BML.

NIPS, Dec. 2018, Montreal, Canada