

NeurIPS 2019 poster: mlg.eng.cam.ac.uk/adrian/LSGD_Poster_NeurIPS2019.pdf

Leader Stochastic Gradient Descent (LSGD) for Distributed Training of Deep Learning Models

Yunfei Teng*, Wenbo Gao*, Francois Chalus, Anna Choromanska, Donald Goldfarb, Adrian Weller

Motivation

We consider distributed optimization under communication constraints for training deep learning models. Our method differs from the state-of-the-art parameter-averaging scheme EASGD [1] in a number of ways:

1. our objective formulation does not change the location of stationary points compared to the original optimization problem;

2. we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters);

3. our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes);

4. our approach is more communication-efficient since it broadcasts only the parameters of the leader rather than those of all workers.

Multi-Leader Setting

We propose a multi-leader setting well-aligned with the hardware architecture:

• a single computational node is formed by local workers ⇒ local leader
• a group of nodes ⇒ global leader

Figure: Trajectories of variables (x, y) during optimization.

Figure: The value of the objective function L(x, y) for each worker during training on a heavily non-convex problem.

Figure: Low-rank matrix completion problems solved with EAGD and LGD.

LSGD updates

Objective function:

min_{x^{1,1}, x^{1,2}, …, x^{n,l}} L(x^{1,1}, x^{1,2}, …, x^{n,l})
:= min_{x^{1,1}, …, x^{n,l}} Σ_{j=1}^{n} Σ_{i=1}^{l} [ E[f(x^{j,i}; ξ^{j,i})] + (λ/2) ‖x^{j,i} − x̃^{j}‖² + (λ_G/2) ‖x^{j,i} − x̃‖² ]   (1)

Parameter update:

x^{j,i}_{t+1} = x^{j,i}_t − η g^{j,i}_t(x^{j,i}_t)   [gradient descent]
              − λ (x^{j,i}_t − x̃^{j}_t)             [local pulling]
              − λ_G (x^{j,i}_t − x̃_t)               [global pulling]   (2)

• ξ^{j,i} are the data samples drawn from the data distribution P
• n is the number of groups and l is the number of workers in each group
• x^{j,1}, x^{j,2}, …, x^{j,l} are the parameters of the workers in the j-th group
• x̃^{j} is the local leader and x̃ is the global leader (the best worker among local leaders)
• λ and λ_G are the hyperparameters that denote the strength of the forces pulling the workers to their local and global leader, respectively
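As a concrete illustration, the single-worker update (2) can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the gradient oracle, the coefficient values, and the toy quadratic below are assumptions for illustration only.

```python
import numpy as np

def lsgd_step(x, grad, x_local, x_global, eta=0.1, lam=0.1, lam_g=0.05):
    """One LSGD update (2): gradient descent plus local and global pulling."""
    return (x
            - eta * grad(x)              # gradient descent
            - lam * (x - x_local)        # local pulling toward x̃^j
            - lam_g * (x - x_global))    # global pulling toward x̃

# Toy usage: f(x) = ||x||^2 / 2, so grad(x) = x, and both leaders sit at 0.
x = np.array([1.0, -2.0])
x_new = lsgd_step(x, grad=lambda v: v, x_local=np.zeros(2), x_global=np.zeros(2))
# Each coordinate contracts by the factor 1 - eta - lam - lam_g = 0.75.
```

Note that the three forces are additive; in the actual algorithm the two pulling terms are applied only every τ or τ_G iterations rather than on every step.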

Algorithm: LSGD (Asynchronous)

Input: pulling coefficients λ, λ_G; learning rate η; local/global communication periods τ, τ_G
Initialize: randomly initialize x^{1,1}, x^{1,2}, …, x^{n,l};
set iteration counters t_{j,i} = 0;
set x̃^{j}_0 = argmin_{x^{j,1},…,x^{j,l}} E[f(x^{j,i}; ξ^{j,i}_0)], x̃_0 = argmin_{x^{1,1},…,x^{n,l}} E[f(x^{j,i}; ξ^{j,i}_0)]

repeat
  for all j = 1, 2, …, n, i = 1, 2, …, l do   ▷ do in parallel for each worker
    draw random sample ξ^{j,i}_{t_{j,i}}
    x^{j,i} ← x^{j,i} − η g^{j,i}_t(x^{j,i})
    t_{j,i} = t_{j,i} + 1
    if nlτ divides Σ_{j=1}^{n} Σ_{i=1}^{l} t_{j,i} then
      x̃^{j} = argmin_{x^{j,1},…,x^{j,l}} E[f(x^{j,i}; ξ^{j,i}_{t_{j,i}})]   ▷ determine the local best workers
      x^{j,i} ← x^{j,i} − λ (x^{j,i} − x̃^{j})   ▷ pull to the local best workers
    end if
    if nlτ_G divides Σ_{j=1}^{n} Σ_{i=1}^{l} t_{j,i} then
      x̃ = argmin_{x^{1,1},…,x^{n,l}} E[f(x^{j,i}; ξ^{j,i}_{t_{j,i}})]   ▷ determine the global best worker
      x^{j,i} ← x^{j,i} − λ_G (x^{j,i} − x̃)   ▷ pull to the global best worker
    end if
  end for
until termination
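The loop above can be exercised end to end on a toy problem. Below is a simplified synchronous sketch (shared iteration counter, noiseless loss evaluations, an assumed quadratic objective and toy constants), not the asynchronous implementation from the poster:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, d = 2, 3, 4                      # groups, workers per group, dimension
target = np.ones(d)                    # assumed toy objective: f(x) = ||x - 1||^2 / 2
f = lambda v: 0.5 * np.sum((v - target) ** 2)
grad = lambda v: v - target

eta, lam, lam_g = 0.1, 0.1, 0.05       # assumed toy coefficients
tau, tau_g = 4, 8                      # local/global communication periods
x = rng.normal(size=(n, l, d))         # x[j, i] = parameters of worker (j, i)

for t in range(1, 101):
    x -= eta * grad(x)                 # gradient steps for all workers
    if t % tau == 0:                   # local pulling toward each group's best worker
        for j in range(n):
            leader = x[j, np.argmin([f(x[j, i]) for i in range(l)])].copy()
            x[j] -= lam * (x[j] - leader)
    if t % tau_g == 0:                 # global pulling toward the overall best worker
        j_best, i_best = min(((j, i) for j in range(n) for i in range(l)),
                             key=lambda ji: f(x[ji[0], ji[1]]))
        leader_g = x[j_best, i_best].copy()
        x -= lam_g * (x - leader_g)

best = min(f(x[j, i]) for j in range(n) for i in range(l))
```

Since the toy objective is convex, every worker contracts toward the minimizer and the pulling steps only accelerate this; `best` ends up essentially at the optimal value 0.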

Stationary Points of LSGD

The L(S)GD loss (1) has the property that, at any of its stationary points, the leader variable is a stationary point of the underlying objective function. This is not the case for E(A)SGD.

Theorem: Let Ω_i be the set of points (x_1, …, x_p) where x_i is the unique minimizer among (x_1, …, x_p). If x* = (w_1, …, w_p) ∈ Ω_i is a stationary point of the LSGD objective function, then ∇f(w_i) = 0.
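The mechanism is visible directly in update (2): for the leader itself, x^{j,i} = x̃^{j} = x̃, so both pulling terms vanish and the leader's own update is pure gradient descent, which is stationary only where ∇f = 0. A minimal numeric check (the quadratic and coefficients are assumptions for illustration):

```python
import numpy as np

eta, lam, lam_g = 0.1, 0.1, 0.05        # assumed toy coefficients
grad_f = lambda v: v - 1.0               # gradient of f(v) = ||v - 1||^2 / 2

x_leader = np.array([1.0, 1.0])          # leader sitting at the minimizer of f
pull = lam * (x_leader - x_leader) + lam_g * (x_leader - x_leader)
step = eta * grad_f(x_leader) + pull     # total displacement in update (2)
```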

Convergence Rates

Under the assumption of strong convexity of f, L(S)GD recovers the same convergence rate as SGD.

Theorem: For a sufficiently small learning rate η, the LSGD step satisfies

E f(x₊) − f(x*) ≤ (1 − mη)(f(x) − f(x*)) − ηλ(f(x) − f(z)) + (η²M/2) σ².

Here z denotes the leader point. Since f(z) ≤ f(x_i) for each worker x_i, there is an additional reduction ηλ(f(x) − f(z)) induced by the leader. We can then obtain the standard convergence rate:

Theorem: If η decreases at the rate η_k = Θ(1/k), then E f(x_k) − f(x*) ≤ O(1/k).

This complexity matches other distributed algorithms such as Elastic Averaging SGD and Hogwild!.

On large-scale problems, we may not be able to evaluate f exactly to determine the leader, and instead use estimates of f. Suppose that we have unbiased estimates of f(x_i) with uniformly bounded variance σ_f², and select the worker with the lowest realized estimate to become the leader. This increases the variance of the limit by no more than O(λ√n σ_f), where n is the number of workers.

Theorem: Suppose the leader is stochastic. Then

lim sup_{k→∞} E f(x_k) − f(x*) ≤ (1/2) ηκσ² + (4/m) λ√p σ_f.

If η, λ decrease at the rates η_k = Θ(1/k), λ_k = Θ(1/k), then E f(x_k) − f(x*) ≤ O(1/k).
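The Θ(1/k) schedules can be checked numerically. The sketch below runs the LSGD step for a single worker on a 1-D strongly convex quadratic with noisy gradients, idealizing the leader z as the exact optimum; all constants are assumed toy values, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def run(K, sigma=0.5):
    """LSGD on f(x) = x^2/2 with eta_k = 1/k, lambda_k = 0.5/k schedules."""
    x, z = 5.0, 0.0                       # worker start; leader fixed at the optimum
    for k in range(1, K + 1):
        eta, lam = 1.0 / k, 0.5 / k
        g = x + rng.normal(scale=sigma)   # noisy gradient of f(x) = x^2/2
        x = x - eta * g - lam * (x - z)   # gradient step plus leader pulling
    return 0.5 * x * x                    # f(x_K) - f(x*), since f* = 0

# Averaged over repetitions, the error should shrink roughly like 1/K.
err_1k = np.mean([run(1000) for _ in range(200)])
err_4k = np.mean([run(4000) for _ in range(200)])
```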

Improvements in Step Direction

When the landscape is locally convex, we expect the new leader term to bring the step direction closer to the global minimizer. This can be shown quantitatively for quadratic problems.

Theorem: Let f be a convex quadratic. If either λ is small or the angle at x between the gradient and the Newton step is large, then at least half of the candidate leaders {z : f(z) ≤ f(x)} will bring the step direction closer to the Newton direction d_N, in the sense that

∠(d_N, −∇f(x) + λ(z − x)) ≤ ∠(d_N, −∇f(x)).
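This angle condition is easy to probe numerically. The sketch below uses an assumed ill-conditioned quadratic, rejection-samples candidate leaders from {z : f(z) ≤ f(x)}, and counts what fraction bring the step direction closer to the Newton direction:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.diag([1.0, 25.0])                   # assumed ill-conditioned quadratic
f = lambda v: 0.5 * v @ A @ v
x = np.array([1.0, 1.0])
g = A @ x                                  # gradient of f at x
d_newton = -np.linalg.solve(A, g)          # Newton direction at x

def angle(u, v):
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

lam = 0.1
base = angle(d_newton, -g)                 # angle of the plain gradient step
better, total = 0, 0
while total < 1000:                        # rejection-sample {z : f(z) <= f(x)}
    z = rng.uniform(-1.5, 1.5, size=2)
    if f(z) <= f(x):
        total += 1
        if angle(d_newton, -g + lam * (z - x)) <= base:
            better += 1
frac = better / total
```

On this instance the gradient is nearly orthogonal to the Newton step, so well over half of the sampled candidates improve the angle, consistent with the theorem's guarantee of at least half.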

Empirical Results

Figure: CNN7 on CIFAR-10 with 4 workers (on the left) and 16 workers (on the right). Test error for the center variable versus wall-clock time.

Figure: ResNet20 on CIFAR-10 with 4 workers (on the left) and 16 workers (on the right). Test error for the center variable versus wall-clock time.

Figure: VGG16 on CIFAR-10 (on the left) and ResNet50 on ImageNet (on the right) with 4 workers. Test error for the center variable versus wall-clock time.

Figure: ResNet20 on CIFAR-10. The identity of the worker that is recognized as the leader (i.e. rank) versus iterations (on the left) and the number of times each worker was the leader (on the right).

References

[1] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In NIPS, 2015.
