
Machine Learning on the Cheap: An Optimized Strategy to Exploit Spot Instances

Xiaoxi Zhang

Carnegie Mellon University

[email protected]

Jianyu Wang

Carnegie Mellon University

[email protected]

Fangjing Wu

Carnegie Mellon University

[email protected]

Gauri Joshi

Carnegie Mellon University

[email protected]

Carlee Joe-Wong

Carnegie Mellon University

[email protected]

ABSTRACT

Stochastic gradient descent (SGD) is the backbone of most state-of-the-art machine learning problems. Due to the massive size of the neural network models and training datasets used today, it is imperative to distribute the task of computing gradients across multiple computing nodes. Yet even running distributed SGD algorithms can be prohibitively expensive, as these jobs require large amounts of computing resources and long training times. We propose cost-effective strategies that exploit spot instances on Amazon EC2 for users to run SGD algorithms to train large-scale machine learning models. To do so, we quantify the resulting tradeoffs between cost, training error, and job completion time, accounting for job interruptions on spot instances and the effect of the resulting variation in the number of active nodes on the training performance. We use this analysis to optimize the spot instance bids. Experimental results show that our optimized strategies achieve good training performance at substantially lower costs than heuristic strategies.

ACM Reference Format:
Xiaoxi Zhang, Jianyu Wang, Fangjing Wu, Gauri Joshi, and Carlee Joe-Wong. 2017. Machine Learning on the Cheap: An Optimized Strategy to Exploit Spot Instances. In Proceedings of ACM Long Conference Name conference (SHORTNAME'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.475/123_4

1 INTRODUCTION

Stochastic gradient descent (SGD) is the core algorithm used by

most state-of-the-art machine learning problems today [9, 15, 27].

Yet as ever more complex models are trained on ever larger amounts

of data, most SGD implementations have been forced to distribute

the task of computing gradients across multiple “worker” nodes,

thus reducing the computational burden on any single node while

speeding up the model training through parallelization. Currently,

even such distributed training jobs require high-performance com-

puting infrastructure such as GPUs to train accurate models in a

reasonable amount of time. However, purchasing GPUs outright is

expensive and requires intensive setup and maintenance. Renting


such machines as on-demand instances from services like Amazon

EC2 can reduce the setup costs, but may still be prohibitively ex-

pensive since distributed training jobs can take hours or even days

to complete.

One common way to save money on cloud instances is to utilize

Amazon EC2 spot instances, which have lower prices but may

experience interruptions [6, 31, 35]. Spot instances allow users to

specify a maximum price (i.e., a “bid”) that they are willing to pay:

if the (dynamically adjusted) spot price set by Amazon exceeds

this bid, then the user’s job is interrupted and resumes once the

spot price falls below the bid. By intelligently specifying their bids,

users can control the cost of running their jobs, saving up to 90%

compared to typical on-demand instances [35]. Distributed SGD

algorithms may deploy each worker on a spot instance to save

money. However, setting the bids is then difficult: simply trying

to minimize the cost with a low bid can lead to more frequent

interruptions, and thus a model with greater training error and

longer training time. To the best of our knowledge, this work is the first to quantify the tradeoffs between these cost, error, and completion time objectives and derive the resulting optimal bidding strategies for running distributed SGD algorithms on spot instances, validating them by running distributed SGD jobs on Amazon EC2.

1.1 Use Cases

Spot instances are a common way to run distributed machine learn-

ing jobs at lower cost over interruptible devices, but not the only

one. We detail two other example scenarios below and outline how

our analysis can benefit users in these scenarios as well.

Distributed machine learning jobs could be run on volunteer

computing platforms, where users make computing cycles on their

laptops or computers available to others when they would otherwise

be idle [7, 12]. Such platforms have been used for computationally

intensive workloads such as searching for prime numbers [3] or

extraterrestrial life [29], and some startups are already building

platforms that use spare GPU resources to run large-scale machine

learning jobs [23]. Users on such platforms will naturally experi-

ence fluctuations in available resources as machines become idle.

Introducing a mechanism in which users could pay for access to

more idle capacity would induce cost, error, and completion time

tradeoffs similar to those experienced on spot instances. Our analy-

sis could then help users and compute providers navigate them.

The Internet-of-Things points towards another set of applica-

tions where machine learning jobs might be run on devices that

could be interrupted. Applications such as multi-user virtual or


augmented reality generally require models to be rapidly trained

on data collected at multiple devices in the same geographical area.

Devices could then act as computing nodes running distributed

SGD, with a nearby edge server [28] aggregating their gradient

computations. However, the number of devices present can vary

as users move in and out of the area of interest (e.g., a museum

exhibition [8] or location-based game like Pokemon Go [22]). Our

analysis of the error and completion time when the number of active

devices varies over time due to spot instance interruptions can help

developers understand the expected performance of distributed

SGD-based learning algorithms in these applications.

1.2 Challenges and Contributions

In the course of achieving our goals of quantifying the cost, error,

and completion time tradeoffs and deriving the optimal bidding

strategies, we encounter several research challenges. It is unclear,

for instance, how the optimal bids should be determined: higher

bids can allow us to minimize interruptions and maintain a larger

average number of active workers to compute the gradients, but the

marginal decrease in error lessens as we maintain more and more

workers by increasing the bids. Successfully navigating this tradeoff

becomes even more complex when different workers can submit

different bids, as we must then co-optimize the bids to control the

number of active workers at different times. Even worse, the future

spot prices and runtimes for each worker’s gradient computations

are unknown, forcing us to consider stochastic performance models.

Straggling [14] and synchronization bottlenecks in the system, for

instance, can lead to varied worker runtimes. We elaborate on our

contributions and solutions to these challenges below.

Monotonicity of cost and completion time w.r.t. uniform bids (Section 5). We first consider a scenario in which all workers

place the same bid for spot instances. While this simplifies the

optimal bidding strategy by reducing it to a single decision (the bid

price for all workers), finding the optimal bid is still challenging.

Our first result is that for any runtime and spot price distributions,

the expected cost (Lemma 1) and completion time (Lemma 2) of training a model to a given error threshold monotonically increase and decrease with a higher bid, respectively. Higher bids increase the expected price that the user must pay, while reducing the number of interruptions and thus shortening the overall job completion time.

We then derive the cost-minimizing bid subject to completion time

and error constraints (Theorem 3 and Corollary 1).

Quantifying the training error with dynamic numbers of

workers (Section 6). We next consider a scenario in which differ-

ent workers may submit different bids: allowing some workers to

submit lower bids, for instance, ensures that more workers compute

gradients, reducing the training error, when the spot price is low

and workers are inexpensive. In doing so, however, we encounter

a new challenge: prior analyses of distributed SGD algorithms do

not consider the possibility that the number of active workers will

change over time. Since workers will submit different bids, they

may be active at different times, requiring us to derive new bounds

on the training error under these dynamic conditions. By exploiting

the correlation in worker activity due to the bidding process, we

show that the error can be upper-bounded in terms of the expected

reciprocal of the number of workers at any given time, given that

at least one worker is active (Lemma 5).

Deriving the optimal bidding strategies (Section 6). We next

find the optimal bidding strategies when two groups of workers

can submit different bids (Theorem 4 and Corollary 2). Using our

new error bounds from Lemma 5, we find the bids that minimize

the expected cost subject to constraints on the final training error

and expected completion time. By focusing on two distinct bids, we

illuminate the roles that higher and lower bids play in achieving

each performance metric. We show that the error is only affected by

the number of active workers given that some workers are active,

which is controlled by the difference between the lower and higher

bid prices: some workers are active whenever the spot price is be-

low the higher bid, and the lower bid controls whether additional

workers are active. Thus, the higher bid also controls the waiting

time of the job (when no workers are active), while the lower bid

controls the running time of each iteration.

We finally examine two variants of the standard distributed SGD

algorithm, in which an SGD iteration need not wait for all active

workers to complete. In this case, we show that it is optimal for all

workers to bid the same price when their runtimes are exponen-

tially distributed (Theorems 5 and 6); having more workers when

the spot price is low does not save any cost. Under more realistic

distributions, however, a relatively large difference between the bid

prices for different workers can save cost (Corollary 3).

Experimental validation on Amazon EC2 (Section 7). We

validate our results by running distributed SGD jobs analyzing

the MNIST [5] and CIFAR-10 [20] datasets on Amazon EC2. We

simulate uniform and Gaussian spot price distributions, and show

that our derived optimal bid prices can reduce users’ cost by 62%

while meeting the same error and completion time requirements,

compared with bidding a high price to minimize interruptions as

suggested in [31]. Moreover, we show that adding workers later in

the job and re-optimizing the bids according to the realized error

and training time at the beginning of the job further reduces the

cost, leading to a better cost/completion time/error tradeoff.

We first give an overview of related work in Section 2 and then

introduce some preliminaries on distributed SGD and our problem

formulation in Sections 3 and 4 respectively. After deriving our

above contributions in Sections 5–7, we conclude in Section 8. All

proofs may be found in our online technical report [2].

2 RELATED WORK

Our work is broadly related to prior works on algorithms and

resource allocation for distributed machine learning, as well as

exploiting spot instances to efficiently run computational jobs.

Distributed machine learning. Distributed machine learning

generally assumes that multiple workers send local computation

results to be aggregated at a central server, which then sends them

updated parameter values. The SGD algorithm [27], in which work-

ers compute the gradients of a given objective function with respect

to model parameters, is particularly popular. To avoid computing

the gradient over the entire training dataset, SGD processes stochas-

tic samples (usually a mini-batch [25]) chosen from data residing

at each worker in each iteration. Recent work has attempted to

limit device-server communication to reduce the training time of


SGD and related models [18, 19, 24, 30], while others analyze the

effect of the mini-batch size [25] or learning rate [10, 14] on SGD

algorithms’ training accuracy and convergence. Bottou et al. [10]

provide a comprehensive analysis of the convergence of training

error in SGD algorithms, but they do not consider the running time

of each iteration. Dutta et al. [14] analyze the trade-off between

the training error and the actual training time (wall-clock time) of

distributed SGD, accounting for stochastic runtimes for the gradient

computation at different workers. Our work is similar in spirit but

focuses on spot instances, which introduces cost as another perfor-

mance metric. We also go beyond [10, 14] to derive error bounds

when the number of active workers changes in different iterations,

as may occur with some of our proposed bidding strategies.

Scheduling improvement for distributed ML. Large-scale distributed machine learning training jobs can benefit from efficient

resource schedulers, e.g., [4, 32] optimize the resources allocated to

jobs so as to accelerate job training and reduce the communication

overhead. In [26], dynamic resource allocation is proposed for deep

learning training jobs with a goal of minimizing job completion time

and makespan, while [34] designs a dynamic resource scheduler

that maximizes the collective quality of the trained models. By

controlling the resources dedicated to each job, these schedulers

substantially improve the average job quality and delay compared

to existing fair resource schedulers. In contrast, we provide bidding

strategies for users to minimize the expected costs of running their

jobs on spot instances subject to deadline and error constraints.

Optimized bidding strategies in spot markets. Designing bidding strategies for cloud users has been extensively studied. The

authors of [35] design optimal bids to minimize the cost of com-

pleting jobs with a pre-determined execution time and no deadline.

Other works derive different cost-aware bidding strategies that

consider jobs’ deadline constraints [33] or jointly optimize the use

of spot and on-demand instances [16]. However, these frameworks

cannot handle the dependencies between workers that exist in dis-

tributed SGD. Another line of work attempts to instead optimize

the markets in which users bid for spot instances. Sharma et al. [31]

advocate using a bid price equal to that of an on-demand instance

with the same resource configuration, migrating to VM instances in

other spot markets as needed to handle interruptions. The result-

ing migration overhead, however, requires complex checkpointing

and migration strategies due to SGD’s substantial communication

dependencies between workers, realizing limited savings [21].

3 DISTRIBUTED SGD PRIMER

Most state-of-the-art machine learning systems employ stochastic gradient descent (SGD) to train a neural network model so as to minimize an objective function G : ℝ^d → ℝ as defined in (1) below. We use the vector w to denote the model parameters, that is, the neural network weights that are trained.

$$G(w) \triangleq \frac{1}{|S|} \sum_{s=1}^{|S|} l(h(x_s, w), y_s) \qquad (1)$$

Here, G(w) is the empirical average of the loss l(h(x_s, w), y_s) that compares our model's prediction h(x_s, w) to the true output y_s, for each training sample (x_s, y_s). In (1), S is the training dataset with a total of |S| training samples. To minimize G(w), the SGD method

[Figure 1: Gradient computations on three workers for two iterations in fully, K=2-, and K=2-batch-synchronous SGD. The x-axis indicates time, and lighter colored arrows indicate workers cancelled by the parameter server (PS).]

iteratively updates the prediction parameters w in the opposite

direction of the objective function’s gradient in each iteration.

Computing the gradient of G over the entire dataset in each

iteration can be prohibitively expensive. Thus, SGD is generally

implemented using a mini-batch approach, where the gradients are

computed over a small, randomly chosen subset of data samples. Let

α j and Sj represent the step size and the mini-batch respectively,

at iteration j. The update rule in the jth iteration is given by:

$$w_{j+1} = w_j - \frac{\alpha_j}{|S_j|} \sum_{s \in S_j} \frac{\partial\, l(h(x_s, w_j), y_s)}{\partial w_j} \qquad (2)$$
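To make the update rule (2) concrete, the following is a minimal Python sketch of one mini-batch SGD step; it is our own illustration rather than the paper's implementation, and grad_loss is a hypothetical user-supplied function returning the per-sample gradient of l(h(x, w), y) with respect to w.

```python
import numpy as np

def sgd_step(w, batch_x, batch_y, grad_loss, step_size):
    """One mini-batch SGD update following (2): average the per-sample
    gradients over the mini-batch and take a step of size step_size."""
    grads = [grad_loss(x, y, w) for x, y in zip(batch_x, batch_y)]
    return w - step_size * np.mean(grads, axis=0)
```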

To further speed up the training, gradients can be computed

in parallel by distributed workers and aggregated by a parameter

server at the end of each iteration. In this work, we consider a central

parameter server and n worker nodes. Each worker has access to

a subset of the data, and in each iteration each worker fetches the

current parameters wj from the parameter server, computes the gra-

dients of l(h(xs ,wj ),ys ) over one mini-batch of its data, and pushes

them to the parameter server. Based on the gradient aggregation

method used by the server, there are two distributed SGD frame-

works widely adopted today: Synchronous SGD and Asynchronous SGD. We focus on synchronous policies in this paper since asyn-

chronous gradient aggregation causes staleness in the gradients

returned by workers, which can give inferior SGD convergence. We

consider three variants of synchronous SGD, as shown in Figure 1.

1. Fully-synchronous SGD: In each iteration j, the parameter

server waits for all workers to finish processing their mini-batches

and push the computed gradients, before sending the updated parameters wj+1 to all workers for the next iteration.

2. N-Synchronous SGD: In fully-synchronous SGD, waiting for the slowest worker to finish its gradient computation can become a bottleneck. To overcome this bottleneck, in N-synchronous SGD the

parameter server waits for the first N out of n workers to push their

computed gradients. It updates the parameters to wj+1 according

to these N gradients, and pushes wj+1 to all workers, canceling the

outstanding gradient computations at slow workers.

3. N-Batch-synchronous SGD: To further reduce synchronization delays, upon finishing a gradient computation, each worker

continues to process another mini-batch using the same parameters

wj until N mini-batches are finished collectively by the workers. In

contrast to N -synchronous SGD, the parameter server waits for the

first N mini-batches to be finished rather than the first N workers.

Once the parameter server receives N gradient updates, it cancels

the remaining gradient computations, and updates w at all workers.
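The three variants differ only in when the parameter server stops waiting for gradients. The sketch below (our own simplification, not the authors' system) computes the per-iteration runtime R for each variant from sampled per-mini-batch runtimes; sample_runtime is a hypothetical sampler supplied by the caller, and straggler cancellation is abstracted away.

```python
import numpy as np

def iteration_time_fully_sync(runtimes):
    # Fully-synchronous: wait for every worker's single mini-batch.
    return np.max(runtimes)

def iteration_time_n_sync(runtimes, N):
    # N-synchronous: wait for the N fastest workers, cancel the rest.
    return np.sort(runtimes)[N - 1]

def iteration_time_n_batch_sync(sample_runtime, n, N):
    # N-batch-synchronous: workers keep starting new mini-batches with the
    # same parameters until N mini-batches have finished in total.
    next_finish = np.array([sample_runtime() for _ in range(n)])
    t, done = 0.0, 0
    while done < N:
        i = int(np.argmin(next_finish))        # next mini-batch to complete
        t = next_finish[i]
        done += 1
        next_finish[i] = t + sample_runtime()  # that worker starts another batch
    return t

# Example with exponential per-mini-batch runtimes (rate 2) and 8 workers:
rng = np.random.default_rng(0)
runtimes = rng.exponential(scale=0.5, size=8)
R_full, R_nsync = iteration_time_fully_sync(runtimes), iteration_time_n_sync(runtimes, N=6)
R_batch = iteration_time_n_batch_sync(lambda: rng.exponential(scale=0.5), n=8, N=6)
```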


4 PROBLEM FORMULATION

In order to save cost, one can run distributed SGD on spot instances purchased from Amazon EC2 by deploying each worker on one spot

instance. We adopt three metrics to evaluate the performance of

such strategies: 1) the error ϕ in the trained model, 2) the completion time of the job τ (i.e., the wall-clock time from when the job is initiated to when it completes), and 3) the total cost of renting spot instances for the workers. Specifically, we construct an optimization problem

that attempts to minimize the total expected cost, subject to the

constraints that the expected training error and completion time

cannot exceed given thresholds. We summarize this optimization

problem in (3)-(5), where ϵ and θ denote the maximum allowed error and the job completion time, respectively.

minimize: Expected total cost E[C]   (3)
s.t.: Expected training error E[ϕ] ≤ ϵ,   (4)
      Expected completion time E[τ] ≤ θ.   (5)

Alternative formulations might instead minimize the expected train-

ing error or the expected completion time, subject to a cost budget

constraint. Our strategy of optimizing the bid price and analysis of

the training error can easily be adapted to these alternatives.

To solve (3)-(5), we first derive the mathematical expressions for

the three performance metrics.

Cost of running SGD on spot instances. To evaluate the cost, we consider that the n workers are deployed on the same type of

spot instance. Let pt denote the spot price of each VM instance

at time slot t. We assume pt is drawn i.i.d. from a distribution with lower bound $\underline{p}$ and upper bound $\bar{p}$, following prior works on optimal bidding in spot markets [35]. Let f(·) and F(·) denote the probability density function (PDF) and the cumulative distribution function (CDF) of the random variable pt.

Let bi represent the bid price for worker i ∈ {1, · · · , n}. Thus,

worker i will be running at time t if and only if bi ≥ pt , and we

formalize the expected total cost in (3) as:

$$\text{Expected cost} := \mathbb{E}[C] = \mathbb{E}_{p_t}\left[\sum_{t=1}^{\tau} p_t \sum_{i=1}^{n} \mathbb{1}(p_t \le b_i)\right] \qquad (6)$$

Further, the feasible range of the bid price should be:

$$\underline{p} \le b_i \le \bar{p}, \quad \forall i = 1, \cdots, n \qquad (7)$$

According to Amazon’s policy [6], bi is determined upon the job

submission without knowing the future spot prices and will be

fixed for the job’s lifetime. Although the user can effectively change

the bid price by terminating the original request and re-bidding for

a new VM, doing so induces significant migration overhead. Thus,

we assume that users employ persistent spot requests: a worker

with a persistent request will be resumed once the spot price falls

below its bid price, exiting the system once its job completes.
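As a sanity check on (6), the expected cost and completion time under a single bid can be estimated by Monte Carlo simulation. The sketch below uses a deliberately simplified work model of our own: the job needs a fixed number of time slots in which the workers are active, and sample_price is a hypothetical sampler of the i.i.d. spot price.

```python
import numpy as np

def simulate_cost_and_time(bid, n, active_slots_needed, sample_price, trials=2000):
    """Monte Carlo estimates of E[C] in (6) and E[tau] for one bid placed by all n workers."""
    costs, times = [], []
    for _ in range(trials):
        cost, active, t = 0.0, 0, 0
        while active < active_slots_needed:
            p = sample_price()
            t += 1
            if p <= bid:          # all n workers run and each is charged the spot price
                cost += n * p
                active += 1
        costs.append(cost)
        times.append(t)
    return np.mean(costs), np.mean(times)

# Example: uniform spot prices on [0.2, 1.0], 8 workers, bid 0.6, 100 active slots needed.
rng = np.random.default_rng(1)
est_cost, est_time = simulate_cost_and_time(
    bid=0.6, n=8, active_slots_needed=100, sample_price=lambda: rng.uniform(0.2, 1.0))
```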

Bounds on training error. The constraint (4) is satisfied when SGD is run for a sufficient number of iterations J. In order to com-

pute the minimum required J , we need to analyze the convergence

of the expected error E[ϕ] = E[G(wJ ) −G∗]. For this, we employ

two convergence results from [10], which require three assump-

tions on the objective function G:

Assumption 1 (Lipschitz-continuous objective gradients). The objective function G : ℝ^d → ℝ is continuously differentiable and its gradient ∇G is L-Lipschitz continuous (for L > 0), i.e.,

$$\|\nabla G(w) - \nabla G(w')\|_2 \le L\, \|w - w'\|_2, \quad \forall w, w' \in \mathbb{R}^d \qquad (8)$$

Assumption 2 (First and second moments). Let $\mathbb{E}_{S_j}[\nabla G(w_j, S_j)]$ represent the expected gradient at iteration j for a mini-batch $S_j$ of the training data. Then there exist scalars $\mu_G \ge 1$ such that

$$\nabla G(w_j)^T\, \mathbb{E}_{S_j}[\nabla G(w_j, S_j)] \ge \|\nabla G(w_j)\|_2^2 \quad \text{and} \quad \big\|\mathbb{E}_{S_j}[\nabla G(w_j, S_j)]\big\|_2 \le \mu_G\, \|\nabla G(w_j)\|_2, \qquad (9)$$

and scalars $M, M_V \ge 0$ and $M_G = M_V + \mu_G^2 \ge 0$ such that

$$\mathbb{E}_{S_j}\left[\|\nabla G(w_j, S_j)\|_2^2\right] \le M + M_G\, \|\nabla G(w_j)\|_2^2. \qquad (10)$$

Assumption 3 (Strong convexity). The objective function G(·) : ℝ^d → ℝ is c-strongly convex, i.e., there exists c > 0 such that

$$G(w') \ge G(w) + \nabla G(w)^T (w' - w) + \frac{c}{2}\, \|w' - w\|_2^2, \quad \forall (w', w) \qquad (11)$$

Theorem 1 (Theorem 4.6 in [10]). Suppose that the objective function G(·) satisfies Assumptions 1, 2, and 3. If the SGD algorithm runs with a fixed step size α satisfying $0 < \alpha \le \frac{1}{L M_G}$ for a total of J iterations and waits for N ≤ n mini-batches in each iteration, we have

$$\mathbb{E}\left[G(w_J) - G^*\right] \le \frac{\alpha L M}{2cN} + (1 - \alpha c)^{J-1}\left(G(w_1) - \frac{\alpha L M}{2Nc}\right). \qquad (12)$$

Our framework also applies to general non-convex objective

functions, for which we can apply the following convergence result.

Theorem 2 (Theorem 4.8 in [10]). Suppose that the objective function G(·) satisfies Assumptions 1 and 2. If the SGD algorithm is run with a fixed step size α satisfying $0 < \alpha \le \frac{1}{L M_G}$ for a total of J iterations and waits for N ≤ n workers in each iteration, we have

$$\mathbb{E}\left[\frac{1}{J} \sum_{j=1}^{J} \|\nabla G(w_j)\|_2^2\right] \le \frac{\alpha L M}{N} + \frac{2\, G(w_1)}{N \alpha J}. \qquad (13)$$

Expected completion time. Constraint (5) imposes a deadline

on the job’s completion time. This time depends on two factors:

the bid and spot prices, which control the number of time slots

when workers will be active; and the time taken by the workers

to compute mini-batch gradients in each iteration. To model the

latter, we define the random variable Xi as the running time of each worker i processing one mini-batch. Assume that X1, X2, · · · , Xn are i.i.d., with CDF FX(·) and PDF fX(·). These gradient computation delays and the gradient aggregation method (fully, N-, or N-batch-synchronous) used by the parameter server affect R, which we define as the running time of each iteration when at least

one worker is active. We derive the resulting expected completion

time for different bidding strategies below.

5 IDENTICAL BIDS FOR WORKERS

We first consider the simplest case where we optimize a single

bid bi = b for each worker i ∈ {1, 2, . . . ,n}, and solve the cost

minimization problem (3)-(5). We first derive a general result that

holds for each SGD variant and all possible distributions of worker

running times (Theorem 3) and then derive specific bid prices for

fully, N -, and N -batch-synchronous SGD in Corollary 1.


5.1 Translating Error to Number of Iterations

In examining the expected training error, a simple observation is

that processing more mini-batches leads to a smaller training error,

given other parameters such as the step size and the uniform data

size of each mini-batch. Therefore, we ask: how many mini-batches

should we process to meet the error requirement ϵ?

Constant number of active workers when the job is running. With the same bid price for all workers, we observe that at

any time, either all workers are active or no workers are active.

Thus, the number of active workers at any time, given the job is

running (i.e., some workers are active), is independent of the

bid price. Therefore, in each iteration for which the job is running,

n mini-batches are processed for fully-synchronous SGD, while

N < n mini-batches are processed for N -synchronous and N -batch-

synchronous SGDs.

The required number of iterations J is independent of the bid price. We de-

fine J as the required number of iterations for the job, not counting

iterations where the bid price is below the spot price and all workers

are inactive. Since the number of mini-batches processed in each

iteration is constant, deriving the required number of mini-batches

to process is equivalent to deriving the value of J that can guarantee an expected training error no larger than ϵ. To derive this value of J, we suppose we are given a function ϕ̂ that upper-bounds the expected training error as a function of J. As long as ϕ̂ is monotonically non-increasing, i.e., more iterations (a larger J) lead to a smaller error upper bound, we can derive the required J from the inverse function of ϕ̂, denoted ϕ̂⁻¹. Following [10], we can obtain the error upper bound ϕ̂ from Theorems 1 and 2 as described below.

Error bound ϕ̂. We choose the representation for the error ϕ and its upper bound ϕ̂ according to the convexity of the objective function G(·). If G(·) is strongly convex, ϕ (resp. ϕ̂) represents the expected gap between the objective value under our derived parameters and under the optimal parameters. If G(·) is not strongly convex, we abuse the term "error" and define ϕ̂ as the average norm of the gradients of G(·). A value of ϕ̂ near zero then indicates that our derived parameters approach a local optimum of the aggregate loss function G(·) [11]. In each case, we respectively use Theorems 1 and 2 to set

ϕ̂ = right-hand side of (12) ≤ ϵ,  or  ϕ̂ = right-hand side of (13) ≤ ϵ.

Mapping error bound to number of iterations. We define ϕ̂⁻¹(ϵ, N) as the minimum number of iterations that the job needs to process to achieve an error no larger than ϵ in expectation, given N ≤ n mini-batches processed per iteration, and we set J ≥ ϕ̂⁻¹(ϵ, N). For fully-synchronous SGD, we have N = n, and ϕ̂⁻¹(ϵ, N) is the same for N-synchronous and N-batch-synchronous SGD, since they both wait for N mini-batches in each iteration. Note that depending on the tightness of the error upper bound ϕ̂, ϕ̂⁻¹(ϵ, N) may over-estimate the required number of iterations. Nevertheless, running more than ϕ̂⁻¹(ϵ, N) iterations guarantees that the expected error is no larger than ϵ; running fewer iterations may still meet this error requirement but does not give a theoretical guarantee. Thus, in order to ensure the feasibility of the cost minimization problem defined in (3)–(7), we need at least ϕ̂⁻¹(ϵ, N) iterations, with N mini-batches processed in each iteration, for each of our three variants of synchronous SGD.
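For the strongly convex case, ϕ̂⁻¹(ϵ, N) can be obtained by solving the right-hand side of (12) for J. A small sketch follows; all constants (L, M, c, α, G(w1)) are illustrative assumptions that would have to be estimated in practice.

```python
import math

def required_iterations(eps, N, alpha, L, M, c, G_w1):
    """Smallest J with
       alpha*L*M/(2*c*N) + (1 - alpha*c)**(J-1) * (G_w1 - alpha*L*M/(2*N*c)) <= eps,
    i.e., an inversion of the bound (12). Returns None if eps is below the noise floor."""
    floor = alpha * L * M / (2.0 * c * N)
    if eps <= floor or G_w1 <= floor:
        return None
    ratio = (eps - floor) / (G_w1 - floor)
    return math.ceil(1 + math.log(ratio) / math.log(1.0 - alpha * c))

# Illustrative constants only:
J = required_iterations(eps=0.05, N=8, alpha=0.01, L=10.0, M=1.0, c=0.5, G_w1=2.0)
```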

[Figure 2: Increasing the bid price from b to b + ∆ always leads to a larger expected cost (blue region compared to the hatched region). The area of the hatched region equals the term $b - \int_{\underline{p}}^{b} \frac{F(p)}{F(b)}\,dp$ in the expected cost (14), and the area of the blue region equals $b + \Delta - \int_{\underline{p}}^{b+\Delta} \frac{F(p)}{F(b+\Delta)}\,dp$, where ∆ > 0.]

5.2 Optimizing the Bid Price

Given that the upper bound of the training error ϕ can be translated

into having a sufficient number of iterations J , we now optimize

the bid price b with the number of iterations J ≥ ˆϕ−1(ϵ,N ), given

that N mini-batches are processed in each iteration. The derivation

of the optimal bid price, shown in Theorem 3, is based on two

observations that are summarized in Lemmas 1 and 2.

Lemma 1 (Monotonicity of expected cost). Using one bid price for all workers, the expected cost of finishing a fully, N-, or N-batch-synchronous SGD job is given by

$$\mathbb{E}[C] = J\, n\, \mathbb{E}[R]\left(b - \int_{\underline{p}}^{b} \frac{F(p)}{F(b)}\, dp\right), \qquad (14)$$

which is monotonically non-decreasing with the bid price b, given any number of iterations J.

Equation (14) removes the completion time variable τ in the

summation operation of the expected cost in (6). Instead, we relate

the expected cost to the spot price distribution and the total number

of iterations. Figure 2 illustrates that the expected cost under bid

price b (or b + ∆) is equal to the area of the hatched (or blue) region scaled by the same factor JnE[R]. The area of the region, and thus

the expected cost, is non-decreasing with the bid price.

The expected completion time is also monotonic with b:

Lemma 2. Using one bid price for all workers, the expected completion time of running a synchronous SGD job (any of the three variants) for any given number of iterations J is given by

$$\mathbb{E}[\tau] = J\, \mathbb{E}[R] / F(b), \qquad (15)$$

which is monotonically non-increasing with the bid price b.

Given that both E[τ ] and E[C] increase with J , and we have

J ≥ ˆϕ−1(ϵ,N ) to meet the error requirement, we should have a

total of J = ˆϕ−1(ϵ,N ) iterations. Moreover, since the expected


completion time decreases with the bid price, the optimal bid price

should be the one that makes the expected completion time equal

to the deadline. Taking the above, we have the following theorem.

Theorem 3 (Optimal Uniform Bid). Suppose the job performs a synchronous SGD in which the parameter server waits for N ≤ n mini-batches in each iteration, the deadline of the job is θ, the spot price is i.i.d. over time, and the running time of processing each mini-batch by each worker is i.i.d. over all mini-batches and over all workers. Then if E[R] denotes the expected runtime of each iteration, the optimal bid price for all workers equals $F^{-1}\left(\frac{\hat{\phi}^{-1}(\epsilon, N)\, \mathbb{E}[R]}{\theta}\right)$.
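Once F, E[R], ϕ̂⁻¹(ϵ, N), and θ are known, Theorem 3 reduces the design to a single inverse-CDF evaluation. A sketch under an assumed uniform spot-price distribution on [p_low, p_high] (so that F⁻¹ is explicit); the numbers in the example are illustrative only.

```python
def optimal_uniform_bid(J_required, expected_R, theta, p_low, p_high):
    """Theorem 3: b* = F^{-1}(J_required * E[R] / theta), where J_required plays the
    role of phi_hat^{-1}(eps, N) and F is assumed uniform on [p_low, p_high]."""
    target = J_required * expected_R / theta    # required probability F(b*) of being active
    target = min(max(target, 0.0), 1.0)         # keep the bid inside [p_low, p_high]
    return p_low + target * (p_high - p_low)    # inverse CDF of the uniform distribution

# Example: 800 iterations, 30 s per iteration, 10-hour deadline, prices on [0.2, 1.0]:
b_star = optimal_uniform_bid(J_required=800, expected_R=30.0, theta=36000.0,
                             p_low=0.2, p_high=1.0)     # F(b*) = 2/3, b* ~ 0.73
```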

Theorem 3 provides a general form of the optimal bid price, given

the number of mini-batches processed per iteration, N , and other

parameters. In order to derive the specific value of the optimal b, we need to obtain E[R]. Let fR(·) denote the probability density function of R. Below, we give fR(·) for N-synchronous SGD (N = n for fully-synchronous SGD), given fX(·), the distribution of the running time of processing each mini-batch. For N-batch-synchronous SGD, finding fR(·) with an arbitrary distribution fX(·) is intractable, but we show an existing result in Lemma 4 when fX(·) is the PDF of an exponential distribution exp(λ).

Lemma 3. Suppose the job is performing an N-synchronous SGD or a fully-synchronous SGD (N = n). For any 1 ≤ N ≤ n, we have:

$$f_R(x) = \frac{n!}{(N-1)!\,(n-N)!}\, (F_X(x))^{N-1} (1 - F_X(x))^{n-N} f_X(x) \qquad (16)$$

Lemma 4 (Lemma 5 in [14]). Suppose the job is performing an N-batch-synchronous SGD and the Xi's (i = 1, · · · , n) are i.i.d. drawn from an exponential distribution exp(λ). For any 1 ≤ N ≤ n,

$$\mathbb{E}[R] = N\, \mathbb{E}[X_i] / n = N / (\lambda n) \qquad (17)$$
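E[R] for N-synchronous (and fully-synchronous) SGD can be computed numerically from the order-statistic density in (16), while for N-batch-synchronous SGD with exponential runtimes Lemma 4 gives E[R] = N/(λn) directly. A sketch using SciPy; the shifted-exponential runtime distribution in the example is an assumption for illustration.

```python
import math
from scipy import integrate, stats

def expected_iteration_time_n_sync(n, N, runtime_dist):
    """E[R] for N-synchronous SGD via the order-statistic PDF in (16);
    runtime_dist is any frozen scipy.stats distribution for the per-mini-batch time X."""
    coeff = math.factorial(n) / (math.factorial(N - 1) * math.factorial(n - N))
    def integrand(x):
        Fx = runtime_dist.cdf(x)
        return x * coeff * Fx ** (N - 1) * (1.0 - Fx) ** (n - N) * runtime_dist.pdf(x)
    lo, hi = runtime_dist.support()
    value, _ = integrate.quad(integrand, lo, hi)
    return value

def expected_iteration_time_n_batch_sync(n, N, lam):
    # Lemma 4: exponential(lam) runtimes give E[R] = N / (lam * n).
    return N / (lam * n)

# Example: shifted-exponential mini-batch times X = 0.5 + Exp(rate 2), 8 workers, wait for 6.
dist = stats.expon(loc=0.5, scale=0.5)
ER_nsync = expected_iteration_time_n_sync(n=8, N=6, runtime_dist=dist)
```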

Based on Lemmas 3 and 4, we derive the optimal bid price when

the running time per mini-batch follows a shifted exponential dis-

tribution (generalizing the exponential distribution).

Corollary 1 ((Shifted) Exponential Running Time Distribution). Suppose the job performs a synchronous SGD for ϕ̂⁻¹(ϵ, N) (N ≤ n) iterations, the spot price is i.i.d. over time, and the running time of processing each mini-batch is i.i.d. over all mini-batches and over all workers, with a PDF $f_X(x) = \lambda \exp(-\lambda(x - \alpha)),\ x \ge \alpha,\ \alpha \ge 0$. The optimal bid price for each SGD variant equals $F^{-1}(z)$, where:

$$z = \begin{cases} \dfrac{\hat{\phi}^{-1}(\epsilon, n)}{\theta}\left(\dfrac{1}{\lambda}\log n + \alpha\right) & \text{fully-synchronous} \\[2ex] \dfrac{\hat{\phi}^{-1}(\epsilon, N)}{\theta}\left(\dfrac{1}{\lambda}\log\dfrac{n}{n-N} + \alpha\right) & N\text{-synchronous},\ N < n \\[2ex] \dfrac{\hat{\phi}^{-1}(\epsilon, N)}{\theta}\left(\dfrac{N}{n}\left(\dfrac{1}{\lambda} + \alpha\right)\right) & N\text{-batch-synchronous},\ N \le n. \end{cases}$$

6 OPTIMAL HETEROGENEOUS BIDS

In this section, we propose a bidding strategy with two distinct

bid prices for two different groups of workers. This strategy is

motivated by the observation that bidding lower prices for some

workers yields a larger number of active workers when the spot

price is relatively low, which improves the training error but will

not cost much. We assume a strongly convex objective function

G(·) (Assumption 3) in this section.

Time-variant number of active workers. The largest difference between using uniform bids as in Section 5 and this bidding strategy is that the number of active workers can vary over time, depending on the spot price in each iteration and the bid prices for all workers. Thus, we first seek to understand how the time-variant

number of active workers affects the final training error, completion

time, and cost. Based on this analysis, we can proceed to derive the

bids that solve our optimization problem (3)–(5).

Formally, we place bids of b1 for workers 1, · · · , n1 and b2 (< b1) for workers n1 + 1, · · · , n. We define the random variable $y(p_t, \vec{b})$ as the number of active workers at time t, which is a function of the spot price pt and the bid prices $\vec{b} = (b_1, b_2)$. At any given time, with probability F(b1) − F(b2), $y(\cdot, \vec{b}) = n_1$ workers are active; with probability F(b2), all $y(\cdot, \vec{b}) = n$ workers are active.
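With only two bid levels, the conditional distribution of y is simple enough that the quantity E[1/y(b⃗) | y(b⃗) > 0] used below (Lemma 5) has a closed form. A small sketch, with the spot-price CDF F supplied by the caller; the uniform CDF in the example is an assumption.

```python
def harmonic_term_two_bids(F, b1, b2, n1, n):
    """E[1 / y(b) | y(b) > 0] when n1 workers bid b1 >= b2 and the other
    n - n1 workers bid b2; F is the spot-price CDF."""
    p_some = F(b1)   # probability that the job is running at all
    p_all = F(b2)    # probability that all n workers are active
    if p_some == 0.0:
        raise ValueError("b1 lies below the support of the spot price")
    return ((p_some - p_all) / n1 + p_all / n) / p_some

# Example with a uniform spot price on [0.2, 1.0]:
F_uniform = lambda p: min(max((p - 0.2) / 0.8, 0.0), 1.0)
val = harmonic_term_two_bids(F_uniform, b1=0.8, b2=0.4, n1=4, n=8)   # ~0.208
```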

Runtime and error bound w.r.t. bid price. We make two observations when we have different bids for different workers: (1) the running time of each iteration is a function of the bid prices $\vec{b}$, as $\vec{b}$ controls the number of active workers, which affects the running time of each iteration. Therefore, we rewrite R, the running time per iteration, as $R(\vec{b})$. (2) The number of iterations required to ensure a final error below ϵ depends on both $\vec{b}$ and ϵ for fully-synchronous SGD, as different numbers of mini-batches can be processed in different iterations, depending on $\vec{b}$. Thus, we cannot define J ≥ ϕ̂⁻¹(ϵ, N) from Theorem 1 as we did for uniform bids. We instead define $\hat{\phi}(\vec{b})$ as the upper bound of the error under $\vec{b}$ for a fixed number of iterations J (dropping ϵ for simplicity). We derive $\hat{\phi}(\vec{b})$ in the course of proving Lemma 5 and show that $\vec{b}$ and J can be co-optimized in Corollary 2.

Assumption: static spot price within each iteration. If workers change their statuses (i.e., become active or inactive) due to spot

price changes within each iteration, the distributions of the cost

and of the completion time will become much more complex. For

instance, suppose the parameter server runs 4-synchronous SGD,

waiting for 4 workers each iteration. Then if 5 workers are active in

the beginning of a given iteration but only 3 of them remain active

in the middle, the parameter server will still wait for the fourth

worker to finish this iteration, i.e., until the spot price decreases

below this worker’s bid price. Therefore, to simplify the model,

we assume that each worker maintains the same (in)active status

through each iteration. In practice, the spot price only changes

once per hour [1], compared to a runtime of several minutes per

iteration, and thus the number of active workers will usually not

change during an iteration. If each worker has an independent active/inactive distribution, this assumption can be removed, as we

can independently consider the random idle time of each individual

worker. We leave the detailed study of spot price changes within

an iteration to our future work.

6.1 Fully-synchronous SGD

We first consider fully-synchronous SGD, where the parameter

server waits for all active workers in each iteration. We initially

assume that n1, the number of workers in the first group, and J, the required number of iterations, are fixed; thus, we optimize the trade-off between the expected cost, expected completion time, and the training error using only the bid prices $\vec{b}$. After deriving the closed-form optimal solutions of b1 and b2 in Theorem 4, we discuss co-optimizing n1 and J with the bid price vector $\vec{b}$.


The expected cost minimization problem is given as follows.

$$\underset{\vec{b}}{\text{minimize}}: \quad J \int_{\underline{p}}^{b_1} \mathbb{E}\big[R(\vec{b}) \,\big|\, p\big]\; y(p, \vec{b})\; p\, \frac{f(p)}{F(b_1)}\, dp \qquad (18)$$

subject to:

$$\frac{J}{F(b_1)} \int_{\underline{p}}^{b_1} \mathbb{E}\big[R(\vec{b}) \,\big|\, p\big]\, \frac{f(p)}{F(b_1)}\, dp \le \theta \qquad (19)$$

$$\hat{\phi}(\vec{b}) \le \epsilon \quad \text{(error constraint)} \qquad (20)$$

$$\bar{p} \ge b_1 \ge b_2 \ge \underline{p} \qquad (21)$$

Letting $f_{R|p}(x)$ denote the PDF of $R(\vec{b})$ conditioned on a given spot price $p \in [\underline{p}, b_1]$, we have:

$$f_{R|p}(x) = \begin{cases} n\, (F_X(x))^{n-1} f_X(x), & \text{if } \underline{p} \le p \le b_2 \\ n_1\, (F_X(x))^{n_1 - 1} f_X(x), & \text{if } b_2 < p \le b_1 \end{cases} \qquad (22)$$

We give a more concrete form of E[C] and E[τ ] in [2]. We now

examine the constraint (20), showing that the error is less than the

threshold ϵ when we control the number of active workers $y(\vec{b})$:

Lemma 5. Suppose the objective function G(·) satisfies Assumptions 1, 2, and 3. Given a number of iterations J that can guarantee $\frac{1}{n} < Q(\epsilon) \le \frac{1}{n_1}$; parameters L, M, and c from Assumptions 1–3; and a fixed step size α, the training error is at most ϵ if we have:

$$\mathbb{E}\left[\frac{1}{y(\vec{b})} \,\middle|\, y(\vec{b}) > 0\right] \le Q(\epsilon) := \frac{2c\left(\epsilon - (1 - \alpha c)^J\, \mathbb{E}[G(w_1)]\right)}{\alpha L M \left(1 - (1 - \alpha c)^J\right)} \qquad (23)$$

Lemma 5 connects the distribution of the spot price and our

bid prices to the training error through some average form of the

number of active workers, i.e., $\mathbb{E}\big[\frac{1}{y(\vec{b})} \,\big|\, y(\vec{b}) > 0\big]$. Intuitively, the

harmonic expectation of the number of workers should be large

enough to limit the error. To derive the bound Q(ϵ) in practice, we

need to estimate parameters c and M; we show that our results are

robust to mis-estimation in Section 7. Using this result, we now

derive the optimal bid prices:

Theorem 4 (Optimal Two Bids with a Fixed J). Suppose the objective function G(·) satisfies Assumptions 1, 2, and 3. Let $\bar{R}(x)$ represent the expected running time per iteration with x workers active in the entire iteration. Let $b_1^*$ and $b_2^*$ denote the optimal bid prices for the two groups of workers, workers {1, · · · , n1} and {n1 + 1, · · · , n}. Given a number of iterations J that can guarantee $\frac{1}{n} < Q(\epsilon) \le \frac{1}{n_1}$, a fixed step size α, and a reasonable deadline ($\theta \ge J\, \bar{R}(n)$), we have:

$$F(b_1^*) = \frac{J}{\theta}\left(\left(\bar{R}(n) - \bar{R}(n_1)\right) \frac{\frac{1}{n_1} - Q(\epsilon)}{\frac{1}{n_1} - \frac{1}{n}} + \bar{R}(n_1)\right), \qquad F(b_2^*) = \frac{\frac{1}{n_1} - Q(\epsilon)}{\frac{1}{n_1} - \frac{1}{n}} \times F(b_1^*), \qquad (24)$$

for any i.i.d. spot price and any i.i.d. running time per mini-batch, i.e., F(·) and $\bar{R}(n)$ (or $\bar{R}(n_1)$) do not change during the training process.
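Given Q(ϵ) and estimates of the expected per-iteration runtimes R̄(n) and R̄(n1) (which could, for instance, be computed with the order-statistic sketch in Section 5), the two quantiles in (24) are direct to evaluate. A sketch under an assumed uniform spot-price distribution so that F⁻¹ is explicit; the numbers are illustrative.

```python
def optimal_two_bids(J, theta, Q_eps, R_bar_n, R_bar_n1, n, n1, p_low, p_high):
    """Evaluate (24) and map the resulting quantiles back to bid prices through the
    inverse CDF of a uniform spot price on [p_low, p_high]."""
    gamma = (1.0 / n1 - Q_eps) / (1.0 / n1 - 1.0 / n)               # = F(b2*) / F(b1*), from the error bound
    F_b1 = (J / theta) * ((R_bar_n - R_bar_n1) * gamma + R_bar_n1)  # meets the deadline with equality
    F_b2 = gamma * F_b1
    inv_F = lambda q: p_low + q * (p_high - p_low)
    return inv_F(F_b1), inv_F(F_b2)

# Illustrative numbers; Q_eps must lie in (1/n, 1/n1]:
b1_star, b2_star = optimal_two_bids(J=800, theta=60000.0, Q_eps=0.2,
                                    R_bar_n=35.0, R_bar_n1=25.0,
                                    n=8, n1=4, p_low=0.2, p_high=1.0)
```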

Figure 3 illustrates the proof of Theorem 4. We first replace $\vec{b}$ in the cost minimization problem (18) by $F(b_1)$ and $\gamma = F(b_2)/F(b_1)$. We

then show that the expected cost, completion time, and error are

monotonic w.r.t. F(b1) and γ. The error is not a function of F(b1), the probability that at least some workers are active, as it depends only on the number of active workers given that some workers are active, which is controlled by the relative difference between F(b1) and

F (b2): γ . We can thus choose γ as small as possible to satisfy the

error constraint while minimizing cost, and then choose F (b1) to

satisfy the completion time constraint. Intuitively, F (b1) should

be high enough to guarantee that some workers are active often

enough that the job completes before the deadline θ .

Co-optimizing n1 and $\vec{b}$. Note that (24) gives the optimal bid prices as a function of any given n1, the number of workers with the higher bid price. If n1 is not a known input but a variable to be co-optimized with $\vec{b}$, we can write n1 and $b_2^*$ in terms of $F(b_1^*)$ according to (24) and plug them into (18)–(21) to solve for $b_1^*$ first, and then derive $b_2^*$ and the optimal n1.

Co-optimizing J and $\vec{b}$. Taking J as an optimization variable may allow us to further reduce the job's cost. For instance, allowing the job to run for more iterations, i.e., increasing J, increases Q(ϵ) in Lemma 5's error constraint (23). We can then increase $\mathbb{E}\big[\frac{1}{y(\vec{b})} \,\big|\, y(\vec{b}) > 0\big]$ by submitting lower bids b2, making it less likely that workers n1 + 1, . . . , n will be active, while still satisfying (23). A lower b2 may decrease the expected cost by making workers less expensive, though this may be offset by the increased number of iterations. To co-optimize J, we show it is a function of $\vec{b}$ and ϵ:

Corollary 2 (Relationship of J and $\vec{b}$). To guarantee a training error no greater than ϵ, the minimum J is a function of our bid prices:

$$J = \log_{(1 - \alpha c \mu)} H(\vec{b}, \epsilon), \qquad H(\vec{b}, \epsilon) = \frac{\epsilon - \frac{\alpha L M}{2c\mu}\, \mathbb{E}\big[\frac{1}{y(\vec{b})} \,\big|\, y(\vec{b}) > 0\big]}{\mathbb{E}[G(w_1)] - \frac{\alpha L M}{2c\mu}\, \mathbb{E}\big[\frac{1}{y(\vec{b})} \,\big|\, y(\vec{b}) > 0\big]} \qquad (25)$$

We use Corollary 2 to give the main idea of co-optimizing J and $\vec{b}$; we omit the details for reasons of space. The first step is to replace J in (18) and (19) by (25). The error constraint (20) is already guaranteed by (25) and can be removed. We then solve for the remaining optimization variables, the bids $\vec{b}$.
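Combining Corollary 2 with the closed-form harmonic term from the two-bid case gives J directly from the bids. A sketch with illustrative constants; as before, L, M, c, µ, α, and E[G(w1)] are assumptions that must be estimated in practice.

```python
import math

def iterations_for_bids(harmonic_term, eps, alpha, L, M, c, mu, G_w1):
    """Corollary 2: minimum J as a function of
    harmonic_term = E[1 / y(b) | y(b) > 0] induced by the chosen bids."""
    K = alpha * L * M / (2.0 * c * mu)
    numerator, denominator = eps - K * harmonic_term, G_w1 - K * harmonic_term
    if numerator <= 0 or denominator <= 0:
        return None   # the error target is unreachable for these bids
    H = numerator / denominator
    return math.ceil(math.log(H) / math.log(1.0 - alpha * c * mu))

# harmonic_term could come from the harmonic_term_two_bids sketch above.
J = iterations_for_bids(harmonic_term=0.2, eps=0.05, alpha=0.01,
                        L=10.0, M=1.0, c=0.5, mu=1.0, G_w1=2.0)
```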

6.2 N-synchronous SGD

We now consider N-synchronous SGD, where in each iteration,

the parameter server waits for N (< n) workers. We suppose that

N workers have the same bid price b1, and that the bid prices for

workers N + 1, · · · ,n are b2, · · · ,bn−N+1, respectively with b1 ≥

b2 ≥ . . . ≥ bn−N+1. Thus, N ≤ y(pt, $\vec{b}$) ≤ n when $\underline{p}$ ≤ pt ≤ b1.

We use this strategy to guarantee that at least N workers are

active when the job is running (pt ≤ b1), avoiding cases where

fewer than N workers are active at the start of the iteration and

the algorithm must wait to start the next iteration until the spot

price falls sufficiently far that more workers become active. Since

we allow a distinct bid price for each remaining worker, the above

model is harder to analyze than the two bid prices for two groups

of workers adopted in Section 6.1. To derive the closed-form of the

optimal bid prices, we simplify the problem by assuming that the

running time per mini-batch is randomly drawn from an exponen-

tial distribution, as assumed in [13, 14]. Surprisingly, we can prove

that the optimal bid prices should be the same for all workers:


[Figure 3: Monotonicity of each performance metric w.r.t. our new variables F(b1) and γ = F(b2)/F(b1), for any given spot price and runtime distributions. Panels: (a) cost vs. F(b1), (b) completion time vs. F(b1), (c) cost vs. γ, (d) completion time vs. γ, (e) error vs. γ. As a larger γ leads to a smaller expected error (Fig. 3e) but a larger expected cost (Fig. 3c) and completion time (Fig. 3d), and the expected error is only controlled by γ, the optimal γ should be the smallest possible γ, i.e., the one that yields error = ϵ. The optimal F(b1) should be the one that yields a completion time equal to the deadline under the optimal γ (Fig. 3b).]

Theorem 5 (N-Sync SGD with Exponential Runtimes). Suppose the running time per mini-batch is i.i.d. drawn from an exponential distribution exp(λ) over all time and all workers, and the number of iterations is J = ϕ̂⁻¹(ϵ, N). Then the optimal bid prices are:

$$b_i = b_j = F^{-1}\left(\frac{\hat{\phi}^{-1}(\epsilon, N)\, \log\frac{n}{n-N}}{\lambda \theta}\right) \quad \text{for all } i, j.$$

Intuitively, since N -synchronous SGD always processes N mini-

batches per iteration, having more workers active when the spot

price is low does not change the error. It does, however, increase

the cost due to paying for all workers’ computing usage. If the

mini-batch runtimes are exponentially distributed, this is not offset

by the lower iteration runtime due to canceling stragglers. It is

then optimal to have the same number of active workers in each

iteration, i.e., uniform bids.
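For exponential runtimes, the common bid of Theorem 5 is again a single inverse-CDF evaluation; a sketch under an assumed uniform spot-price distribution, with illustrative parameters.

```python
import math

def optimal_n_sync_bid_exponential(J_required, n, N, lam, theta, p_low, p_high):
    """Theorem 5: with Exp(lam) mini-batch runtimes every worker bids
    F^{-1}( J_required * log(n / (n - N)) / (lam * theta) );
    F is assumed uniform on [p_low, p_high]."""
    quantile = J_required * math.log(n / (n - N)) / (lam * theta)
    quantile = min(max(quantile, 0.0), 1.0)
    return p_low + quantile * (p_high - p_low)

# Example: 800 iterations, 8 workers, wait for 6, rate 2 per minute, 1000-minute deadline:
b_common = optimal_n_sync_bid_exponential(J_required=800, n=8, N=6, lam=2.0,
                                          theta=1000.0, p_low=0.2, p_high=1.0)
```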

6.3 N-batch-synchronous SGD

We finally consider N-batch-synchronous SGD, where the parame-

ters are updated and a new iteration begins once N mini-batches

are processed. We consider a more general bidding model where

each worker is allowed to have a distinct bid price, denoting the bid prices for workers 1, · · · , n as b1 ≥ b2 ≥ · · · ≥ bn.

As in Section 6.2, we assume that the wall-clock time of finishing each mini-batch for all workers is an i.i.d. random variable drawn from an exponential distribution with parameter λ. Unlike N-synchronous SGD, for N-batch-synchronous SGD, workers can continuously process mini-batches in each iteration, until a total of N mini-batches are finished. Therefore, the arrival rate of mini-batch completions at each time equals $y(p_t, \vec{b})\lambda$. We then observe that the product of the number of active workers $y(p_t, \vec{b})$ and the running time $\frac{1}{y(p_t, \vec{b})\lambda}$ will be $\frac{1}{\lambda}$, a constant. This product can be regarded as resource-time, which is the total amount of resource units in each iteration that we need to pay for. Since it is a constant, the expected cost of each iteration will be determined only by the spot price distribution, the largest bid price b1, and λ. This result then implies that the optimal bid prices are the same for all workers:

Theorem 6 (N-Batch-Sync SGD with Exponential Runtime). Given a deadline θ and a maximum error threshold ϵ, suppose the running time per mini-batch is i.i.d. drawn from an exponential distribution exp(λ) over all time and all workers. Then the cost-minimizing bid prices equal $F^{-1}\left(\frac{N\, \hat{\phi}^{-1}(\epsilon, N)}{n \lambda \theta}\right)$ for each worker.

Theorem 6 indicates that a smaller bid price comes with a larger

number of workers (n), a larger deadline (θ ), and a smaller number

of required mini-batches ($N\, \hat{\phi}^{-1}(\epsilon, N)$) to achieve a smaller-than-ϵ error. Further, it also implies that as we increase the number of workers n, the expected cost keeps decreasing until it becomes $\frac{N\, \hat{\phi}^{-1}(\epsilon, N)}{\lambda} \times \underline{p}$. This non-intuitive conclusion follows from the exponentially distributed running times per mini-batch, under which the expected running time per iteration converges to zero as n → ∞.

In practice, however, the time to process N mini-batches should not

converge to zero, due to unavoidable delays such as fetching the

parameters. We thus assume a more realistic shifted exponential

distribution for the running time per mini-batch and re-derive the

optimal bids. Our expected running time per iteration is then:

$$\mathbb{E}[R] = \mathbb{E}\left[\frac{N}{\lambda\, y(p_t, \vec{b})} + R_{\min} \,\middle|\, y(p_t, \vec{b}) > 0\right] \qquad (26)$$

Deriving the closed-form optimal bids is then intractable under

general spot price distributions, but we can derive the following:

Corollary 3 (N-Batch-Sync SGD with Shifted Exponential Runtime). Suppose that the spot price follows an i.i.d. uniform distribution with PDF $f(p) = \frac{1}{\bar{p} - \underline{p}}$, and the user chooses bid prices in an arithmetic-progression form: $b_i = \underline{p} + \alpha(n - i + 1),\ \forall 1 \le i \le n$, where α ≥ 0. Then the parameter α for the optimal bid prices equals:

$$\alpha = \max\left\{ \frac{\hat{\phi}^{-1}(\epsilon, N)\,(\bar{p} - \underline{p})}{n\theta} \left(\frac{N}{n\lambda} \sum_{j=1}^{n} \frac{1}{j} + R_{\min}\right),\ \ \underline{p}\, \sqrt{\frac{n-1}{n^2 N \lambda R_{\min} \sum_{j=1}^{n} (n - j + 1)^2}} \right\} \qquad (27)$$

Intuitively, the bid prices should be spread out enough to guarantee that the job completes before the deadline. A larger gap between consecutive bid prices implies a larger bid price for each worker, which leads to a smaller expected completion time.
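To illustrate how the arithmetic-progression bids are formed, the sketch below builds the bid ladder $b_i = \underline{p} + \alpha(n - i + 1)$ using only the deadline-feasibility term of Eq. (27) (the first argument of the max); the helper names and parameter values are illustrative assumptions, not our experimental settings.

import math

def deadline_alpha(J, n, N, lam, R_min, theta, p_min, p_max):
    """Bid spacing alpha from the first (deadline-feasibility) term of Eq. (27)."""
    harmonic = sum(1.0 / j for j in range(1, n + 1))
    iter_time = N * harmonic / (n * lam) + R_min   # expected time per iteration
    return J * (p_max - p_min) * iter_time / (n * theta)

def bid_ladder(alpha, n, p_min):
    """Arithmetic-progression bids b_1 >= ... >= b_n with spacing alpha."""
    return [p_min + alpha * (n - i + 1) for i in range(1, n + 1)]

# Hypothetical example with 8 workers and a 10-hour deadline.
a = deadline_alpha(J=1000, n=8, N=16, lam=1.0, R_min=0.5,
                   theta=36000, p_min=0.2, p_max=1.0)
print(round(a, 4), [round(b, 3) for b in bid_ladder(a, 8, p_min=0.2)])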

7 PERFORMANCE EVALUATION

We evaluate our bidding strategies for fully synchronous SGD from Sections 5 and 6 on two image classification benchmark datasets:


MNIST and CIFAR-10, using a three-layer convolutional neural network [5] and ResNet-50 [17], respectively. We run the MNIST experiments on Amazon EC2 and the CIFAR-10 ones on a local cluster. We run J = 1000 iterations to train on the MNIST dataset and J = 5000 iterations on the CIFAR-10 dataset. We set the deadline θ to twice the estimated runtime of using 8 workers to process J iterations without interruptions, i.e., the empirical average runtime per mini-batch multiplied by J; and we set the error threshold ϵ = 0.98, corresponding to a 98% target test accuracy. In our technical report [2], we show that our runtimes per mini-batch approximately follow a Gaussian distribution.

Since we do not have perfect knowledge of all the parameters required to derive J (Section 5.1) and Q(ϵ) (cf. Theorem 4), we choose J based on experience and generate a value uniformly at random from [1/n, 1/n_1] to use as Q(ϵ), demonstrating the robustness of our optimized strategies to such mis-estimations. We consider two distributions to simulate i.i.d. spot prices: a uniform distribution on [0.2, 1] and a Gaussian distribution with mean 0.6 and variance 0.175. We draw the spot price when each iteration starts and re-draw it every 4 seconds while the job is interrupted.
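A minimal sketch of this spot-price simulation protocol is shown below; treating 0.175 as the variance (standard deviation √0.175) and the function names are our assumptions for illustration.

import random

def draw_price(dist):
    """One i.i.d. spot-price sample under the two simulated distributions."""
    if dist == "uniform":
        return random.uniform(0.2, 1.0)
    # Gaussian with mean 0.6 and variance 0.175 (standard deviation sqrt(0.175)).
    return random.gauss(0.6, 0.175 ** 0.5)

def wait_until_running(bid, dist, redraw_period=4.0):
    """Re-draw the spot price every redraw_period seconds while the price
    exceeds the bid, returning the time spent interrupted before resuming."""
    waited, price = 0.0, draw_price(dist)
    while price > bid:
        waited += redraw_period
        price = draw_price(dist)
    return waited

print(wait_until_running(bid=0.6, dist="uniform"))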

Our strategies and benchmark. We compare three strategies using our optimal bid prices against a benchmark No-interruptions strategy that sets the bid price to 1.1 (larger than the maximum spot price) so as to guarantee that the job is always running, as recommended by Amazon [6, 31]. We aim to evaluate the cost reduction and our ability to meet the accuracy and deadline constraints using our strategic bids, compared to this aggressive benchmark.

We evaluate the bidding strategies with both the optimal single bid price for all workers derived in Theorem 3 (Optimal-one-bid) and the optimal bid prices for two groups of workers derived in Theorem 4 (Optimal-two-bids). To further reduce the expected total cost of running each job while guaranteeing a high training/test accuracy, we propose a Dynamic strategy, which updates the optimal two bid prices when we increase the total number of workers. More specifically, we initially launch four workers and apply our optimal two bid prices, one to each group of two workers. After completing 80% of the total number of iterations, we add four more workers and re-compute the optimal two bid prices in (24): we assign n_1 = 4 and n = 8 as the new numbers of workers in each group, subtract the consumed time from the original deadline θ, and use the remaining number of iterations as the new value of J. This strategy is an initial attempt to improve Optimal-two-bids, which does not optimize the number of workers. One could further divide the training into multiple stages, e.g., each corresponding to 100 iterations, and recompute the optimal two bid prices using (24) with the updated parameters in each stage. Frequent re-optimization would likely induce intolerable interruption overheads, but infrequent re-optimization may reduce the cost with tolerable overhead.
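A high-level sketch of this dynamic re-bidding loop follows; optimal_two_bids (standing in for the two-group bid computation of Eq. (24)) and run_stage (a stand-in for launching a training stage and reporting its elapsed time) are hypothetical helpers, not our actual implementation.

def dynamic_strategy(optimal_two_bids, run_stage, J, theta):
    """Sketch of the Dynamic strategy: start with four workers and two bid
    groups, then add four workers and re-bid after 80% of the iterations.

    optimal_two_bids(n1, n, J, theta) -> (b_group1, b_group2)  # hypothetical, Eq. (24)
    run_stage(bids, iters)            -> elapsed_seconds       # hypothetical training stub
    """
    # Stage 1: four workers, two bid prices, two workers per bid group.
    stage1_iters = int(0.8 * J)
    bids = optimal_two_bids(n1=2, n=4, J=J, theta=theta)
    elapsed = run_stage(bids, iters=stage1_iters)

    # Stage 2: add four workers and re-optimize with the remaining budget.
    remaining_iters = J - stage1_iters
    bids = optimal_two_bids(n1=4, n=8, J=remaining_iters, theta=theta - elapsed)
    run_stage(bids, iters=remaining_iters)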

Superiority of our strategies. Figures 4 and 5 compare the performance of our strategies on MNIST and CIFAR-10, respectively. Figures 4a, 4b, 5a, and 5b show the achieved test accuracy vs. cost as we run both jobs under uniform and Gaussian spot prices. We observe that, compared to the Optimal-one-bid and Optimal-two-bids strategies, our dynamic strategy leads to a lower cost and the No-interruptions benchmark to a higher cost for any given accuracy. Figures 4c, 4d, 5c, and 5d show the cumulative cost as we run the jobs. After 1000 iterations, the No-interruptions benchmark accumulates nearly twice the cost of our dynamic strategy.

[Figure 4 shows six panels for MNIST: (a,b) test accuracy vs. cost, (c,d) cost vs. wall-clock time, and (e,f) test accuracy vs. wall-clock time, each under the uniform and Gaussian spot price distributions, for the No-interruptions, Optimal-one-bid, Optimal-two-bids, and Dynamic strategies.]

Figure 4: Under both spot price distributions, (a,b) the dynamic strategy achieves the highest test accuracy for any given cost. The markers on the curves in (c,d) show the cost when achieving a 98% test accuracy; No-interruptions, Optimal-one-bid, and Optimal-two-bids incur 134%, 82%, and 46% higher costs under the uniform spot price distribution, and 103%, 101%, and 43% higher costs under the Gaussian spot price distribution, at this accuracy than the dynamic strategy, respectively. All strategies (e,f) converge rapidly to similar (>98%) accuracies over time.

The markers indicate the points where we achieve 98% accuracy; while the No-interruptions benchmark achieves this accuracy much faster, it costs nearly three times as much as our dynamic strategy and twice as much as our Optimal-two-bids strategy on both datasets. We observe that the cost reduction of the dynamic strategy at 98% accuracy is greater for CIFAR-10 than for MNIST, likely due to the greater number of iterations required to achieve a high accuracy. Figures 4e, 4f, 5e, and 5f verify that all strategies yield qualitatively similar accuracy over time, though the No-interruptions benchmark converges faster.

8 CONCLUSION

In this paper, we consider the use of multiple, unreliably available workers that run distributed SGD algorithms to train machine learning models.


[Figure 5 shows six panels for CIFAR-10: (a,b) training accuracy vs. cost, (c,d) cost vs. wall-clock time, and (e,f) training accuracy vs. wall-clock time, each under the uniform and Gaussian spot price distributions, for the No-interruptions, Optimal-one-bid, Optimal-two-bids, and Dynamic strategies.]

Figure 5: The results on the CIFAR-10 dataset are similar to those on the MNIST dataset (Figure 4). Figures 5c and 5d show that the dynamic strategy realizes a greater reduction of the completion time than in Figures 4c and 4d.

We focus in particular on Amazon EC2 spot instances, which allow users to reduce the cost of running jobs at the expense of a longer training time to achieve the same model accuracy. Spot instances let users choose the amount they are willing to pay for computing resources, and thus how to trade off a higher cost (incurred by placing higher bids) against a longer completion time or higher training error. We quantify these tradeoffs and derive new bounds on the error of a model that is trained with different numbers of workers at each iteration. We finally use this tradeoff characterization to derive optimized bidding strategies for users on spot instances. We validate these strategies by training neural network models on the MNIST and CIFAR-10 image datasets and demonstrating their cost savings compared to heuristic bidding strategies.

REFERENCES
[1] How spot instances work. https://docs.aws.amazon.com/aws-technical-content/latest/cost-optimization-leveraging-ec2-spot-instances/how-spot-instances-work.html.
[2] Machine learning on the cheap: An optimized strategy to exploit spot instances. http://andrew.cmu.edu/user/cjoewong/Spot_ML.pdf.
[3] Great Internet Mersenne Prime Search. https://www.mersenne.org/, 2018.
[4] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proceedings of USENIX OSDI (2016).
[5] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS (2012).
[6] Amazon EC2. Amazon EC2 spot instances. https://aws.amazon.com/ec2/spot/, 2018.
[7] Anderson, D. P., and Fedak, G. The computational and storage potential of volunteer computing. In Proc. of IEEE CCGRID (2006), vol. 1, IEEE, pp. 73–80.
[8] Billock, J. Five augmented reality experiences that bring museum exhibits to life. Smithsonian Magazine (June 2017).
[9] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT (2010).
[10] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review 60, 2 (2018), 223–311.
[11] Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press (2014).
[12] Carbunar, B., and Tripunitara, M. V. Payments for outsourced computations. IEEE Transactions on Parallel and Distributed Systems 23, 2 (2012), 313–320.
[13] Dutta, S., Cadambe, V., and Grover, P. Short-Dot: Computing large linear transforms distributedly using coded short dot products. In Proc. of Advances in Neural Information Processing Systems (NIPS) (2016).
[14] Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In Proceedings of AISTATS (2018).
[15] Dean, J., et al. Large scale distributed deep networks. In International Conference on Neural Information Processing Systems (NIPS) (2012), vol. 1, pp. 1223–1231.
[16] Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and Gibbons, P. B. Proteus: Agile ML elasticity through tiered reliability in dynamic resource markets. In Proc. of European Conference on Computer Systems (2017).
[17] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.
[18] Kamp, M., Adilova, L., Sicking, J., Hüger, F., Schlicht, P., Wirtz, T., and Wrobel, S. Efficient decentralized deep learning by dynamic model averaging. arXiv preprint arXiv:1807.03210 (2018).
[19] Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency.
[20] Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html.
[21] Lee, K., and Son, M. DeepSpotCloud: Leveraging cross-region GPU spot instances for deep learning. In Proceedings of IEEE CLOUD (2017), IEEE, pp. 98–105.
[22] Levy, N. How Pokémon Go evolved to remain popular after the initial craze cooled. GeekWire (Oct. 2018). https://www.geekwire.com/2018/pokemon-go-evolved-remain-popular-initial-craze-cooled/.
[23] Lynley, M. Snark AI looks to help companies get on-demand access to idle GPUs. TechCrunch, https://techcrunch.com/2018/07/25/snark-ai-looks-to-help-companies-get-on-demand-access-to-idle-gpus/, 2018.
[24] McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS (2017).
[25] Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research 13, 1 (2012), 165–202.
[26] Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proc. of EuroSys (2018).
[27] Robbins, H., and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407.
[28] Satyanarayanan, M. The emergence of edge computing. Computer 50, 1 (2017), 30–39.
[29] Search for Extraterrestrial Intelligence. SETI@Home, 2018.
[30] Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning (2014), pp. 1000–1008.
[31] Sharma, P., Irwin, D., and Shenoy, P. How not to bid the cloud. In Proc. USENIX Conference on Hot Topics in Cloud Computing (HotCloud) (2016).
[32] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proc. of NIPS Workshop on Machine Learning Systems (LearningSys) (2016).
[33] Zafer, M., Song, Y., and Lee, K.-W. Optimal bids for spot VMs in a cloud for deadline constrained jobs. In Proc. of International Conference on Cloud Computing (2012).
[34] Zhang, H., Stafman, L., Or, A., and Freedman, M. J. SLAQ: Quality-driven scheduling for distributed machine learning. In Proceedings of SoCC (2017).
[35] Zheng, L., Joe-Wong, C., Tan, C. W., Chiang, M., and Wang, X. How to bid the cloud. In Proc. ACM SIGCOMM (2015).
