Machine Learning on the Cheap: An Optimized Strategy to Exploit Spot Instances
Xiaoxi Zhang
Carnegie Mellon University
Jianyu Wang
Carnegie Mellon University
Fangjing Wu
Carnegie Mellon University
Gauri Joshi
Carnegie Mellon University
Carlee Joe-Wong
Carnegie Mellon University
ABSTRACT
Stochastic gradient descent (SGD) forms the backbone of most state-of-the-art machine learning methods. Due to the massive size of
the neural network models and training datasets used today, it is
imperative to distribute the task of computing gradients across
multiple computing nodes. Yet even running distributed SGD algo-
rithms can be prohibitively expensive, as these jobs require large
amounts of computing resources and large training times. We pro-
pose cost-effective strategies that exploit spot instances on Amazon
EC2 for users to run SGD algorithms to train large-scale machine
learning models. To do so, we quantify the resulting tradeoffs be-
tween cost, training error, and job completion time, accounting
for job interruptions on spot instances and the resulting variation
in the number of active nodes on the training performance. We
use this analysis to optimize the spot instance bids. Experimental
results show that our optimized strategies achieve good training
performance at substantially lower costs than heuristic strategies.
1 INTRODUCTION
Stochastic gradient descent (SGD) is the core algorithm behind most state-of-the-art machine learning methods today [9, 15, 27].
Yet as ever more complex models are trained on ever larger amounts
of data, most SGD implementations have been forced to distribute
the task of computing gradients across multiple “worker” nodes,
thus reducing the computational burden on any single node while
speeding up the model training through parallelization. Currently,
even such distributed training jobs require high-performance com-
puting infrastructure such as GPUs to train accurate models in a
reasonable amount of time. However, purchasing GPUs outright is
expensive and requires intensive setup and maintenance. Renting
such machines as on-demand instances from services like Amazon
EC2 can reduce the setup costs, but may still be prohibitively ex-
pensive since distributed training jobs can take hours or even days
to complete.
One common way to save money on cloud instances is to utilize
Amazon EC2 spot instances, which have lower prices but may
experience interruptions [6, 31, 35]. Spot instances allow users to
specify a maximum price (i.e., a “bid”) that they are willing to pay:
if the (dynamically adjusted) spot price set by Amazon exceeds
this bid, then the user’s job is interrupted and resumes once the
spot price falls below the bid. By intelligently specifying their bids,
users can control the cost of running their jobs, saving up to 90%
compared to typical on-demand instances [35]. Distributed SGD
algorithms may deploy each worker on a spot instance to save
money. However, setting the bids is then difficult: simply trying
to minimize the cost with a low bid can lead to more frequent
interruptions, and thus a model with greater training error and
longer training time. To the best of our knowledge, this work is the first to quantify the tradeoffs between these cost, error, and completion time objectives and derive the resulting optimal bidding strategies for running distributed SGD algorithms on spot instances, validating them by running distributed SGD jobs on Amazon EC2.
1.1 Use Cases
Spot instances are a common way to run distributed machine learn-
ing jobs at lower cost over interruptible devices, but not the only
one. We detail two other example scenarios below and outline how
our analysis can benefit users in these scenarios as well.
Distributed machine learning jobs could be run on volunteer
computing platforms, where users make computing cycles on their
laptops or computers available to others when they would otherwise
be idle [7, 12]. Such platforms have been used for computationally
intensive workloads such as searching for prime numbers [3] or
extraterrestrial life [29], and some startups are already building
platforms that use spare GPU resources to run large-scale machine
learning jobs [23]. Users on such platforms will naturally experi-
ence fluctuations in available resources as machines become idle.
Introducing a mechanism in which users could pay for access to
more idle capacity would induce cost, error, and completion time
tradeoffs similar to those experienced on spot instances. Our analy-
sis could then help users and compute providers navigate them.
The Internet-of-Things points towards another set of applica-
tions where machine learning jobs might be run on devices that
could be interrupted. Applications such as multi-user virtual or
augmented reality generally require models to be rapidly trained
on data collected at multiple devices in the same geographical area.
Devices could then act as computing nodes running distributed
SGD, with a nearby edge server [28] aggregating their gradient
computations. However, the number of devices present can vary
as users move in and out of the area of interest (e.g., a museum
exhibition [8] or location-based game like Pokemon Go [22]). Our
analysis of the error and completion time when the number of active
devices varies over time due to spot instance interruptions can help
developers understand the expected performance of distributed
SGD-based learning algorithms in these applications.
1.2 Challenges and Contributions
In the course of achieving our goals of quantifying the cost, error
and completion time tradeoffs and deriving the optimal bidding
strategies, we encounter several research challenges. It is unclear,
for instance, how the optimal bids should be determined: higher
bids can allow us to minimize interruptions and maintain a larger
average number of active workers to compute the gradients, but the
marginal decrease in error lessens as we maintain more and more
workers by increasing the bids. Successfully navigating this tradeoff
becomes even more complex when different workers can submit
different bids, as we must then co-optimize the bids to control the
number of active workers at different times. Even worse, the future
spot prices and runtimes for each worker’s gradient computations
are unknown, forcing us to consider stochastic performance models.
Straggling [14] and synchronization bottlenecks in the system, for
instance, can lead to varied worker runtimes. We elaborate on our
contributions and solutions to these challenges below.
Monotonicity of cost and completion time w.r.t. uniform bids (Section 5). We first consider a scenario in which all workers
place the same bid for spot instances. While this simplifies the
optimal bidding strategy by reducing it to a single decision (the bid
price for all workers), finding the optimal bid is still challenging.
Our first result is that for any runtime and spot price distributions,
the expected cost (Lemma 1) and completion time (Lemma 2) of training a model to a given error threshold monotonically increase
and decrease, respectively, with a higher bid. Higher bids increase
the expected price that the user must pay, while reducing the
number of interruptions and thus the overall job completion time.
We then derive the cost-minimizing bid subject to completion time
and error constraints (Theorem 3 and Corollary 1).
Quantifying the training error with dynamic numbers of
workers (Section 6). We next consider a scenario in which differ-
ent workers may submit different bids: allowing some workers to
submit lower bids, for instance, ensures that more workers compute
gradients, reducing the training error, when the spot price is low
and workers are inexpensive. In doing so, however, we encounter
a new challenge: prior analyses of distributed SGD algorithms do
not consider the possibility that the number of active workers will
change over time. Since workers will submit different bids, they
may be active at different times, requiring us to derive new bounds
on the training error under these dynamic conditions. By exploiting
the correlation in worker activity due to the bidding process, we
show that the error can be upper-bounded in terms of the expected
reciprocal of the number of workers at any given time, given that
at least one worker is active (Lemma 5).
Deriving the optimal bidding strategies (Section 6). We next
find the optimal bidding strategies when two groups of workers
can submit different bids (Theorem 4 and Corollary 2). Using our
new error bounds from Lemma 5, we find the bids that minimize
the expected cost subject to constraints on the final training error
and expected completion time. By focusing on two distinct bids, we
illuminate the roles that higher and lower bids play in achieving
each performance metric. We show that the error is only affected by
the number of active workers given that some workers are active,
which is controlled by the difference between the lower and higher
bid prices: some workers are active whenever the spot price is be-
low the higher bid, and the lower bid controls whether additional
workers are active. Thus, the higher bid also controls the waiting
time of the job (when no workers are active), while the lower bid
controls the running time of each iteration.
We finally examine two variants of the standard distributed SGD
algorithm, in which an SGD iteration need not wait for all active
workers to complete. In this case, we show that it is optimal for all
workers to bid the same price when their runtimes are exponen-
tially distributed (Theorems 5 and 6); having more workers when
the spot price is low does not save any cost. Under more realistic
distributions, however, a relatively large difference between the bid
prices for different workers can save cost (Corollary 3).
Experimental validation on Amazon EC2 (Section 7). We
validate our results by running distributed SGD jobs analyzing
the MNIST [5] and CIFAR-10 [20] datasets on Amazon EC2. We
simulate uniform and Gaussian spot price distributions, and show
that our derived optimal bid prices can reduce users’ cost by 62%
while meeting the same error and completion time requirements,
compared with bidding a high price to minimize interruptions as
suggested in [31]. Moreover, we show that adding workers later in
the job and re-optimizing the bids according to the realized error
and training time at the beginning of the job further reduces the
cost, leading to a better cost/completion time/error tradeoff.
We first give an overview of related work in Section 2 and then
introduce some preliminaries on distributed SGD and our problem
formulation in Sections 3 and 4 respectively. After deriving our
above contributions in Sections 5–7, we conclude in Section 8. All
proofs may be found in our online technical report [2].
2 RELATED WORK
Our work is broadly related to prior works on algorithms and
resource allocation for distributed machine learning, as well as
exploiting spot instances to efficiently run computational jobs.
Distributed machine learning. Distributed machine learning
generally assumes that multiple workers send local computation
results to be aggregated at a central server, which then sends them
updated parameter values. The SGD algorithm [27], in which work-
ers compute the gradients of a given objective function with respect
to model parameters, is particularly popular. To avoid computing
the gradient over the entire training dataset, SGD processes stochas-
tic samples (usually a mini-batch [25]) chosen from data residing
at each worker in each iteration. Recent work has attempted to
limit device-server communication to reduce the training time of
SGD and related models [18, 19, 24, 30], while others analyze the
effect of the mini-batch size [25] or learning rate [10, 14] on SGD
algorithms’ training accuracy and convergence. Bottou et al. [10]
provide a comprehensive analysis of the convergence of training
error in SGD algorithms, but they do not consider the running time
of each iteration. Dutta et al. [14] analyze the trade-off between
the training error and the actual training time (wall-clock time) of
distributed SGD, accounting for stochastic runtimes for the gradient
computation at different workers. Our work is similar in spirit but
focuses on spot instances, which introduces cost as another perfor-
mance metric. We also go beyond [10, 14] to derive error bounds
when the number of active workers changes in different iterations,
as may occur with some of our proposed bidding strategies.
Scheduling improvement for distributed ML. Large-scale distributed machine learning training jobs can benefit from efficient
resource schedulers, e.g., [4, 32] optimize the resources allocated to
jobs so as to accelerate job training and reduce the communication
overhead. In [26], dynamic resource allocation is proposed for deep
learning training jobs with a goal of minimizing job completion time
and makespan, while [34] designs a dynamic resource scheduler
that maximizes the collective quality of the trained models. By
controlling the resources dedicated to each job, these schedulers
substantially improve the average job quality and delay compared
to existing fair resource schedulers. In contrast, we provide bidding
strategies for users to minimize the expected costs of running their
jobs on spot instances subject to deadline and error constraints.
Optimized bidding strategies in spot markets. Designing bidding strategies for cloud users has been extensively studied. The
authors of [35] design optimal bids to minimize the cost of com-
pleting jobs with a pre-determined execution time and no deadline.
Other works derive different cost-aware bidding strategies that
consider jobs’ deadline constraints [33] or jointly optimize the use
of spot and on-demand instances [16]. However, these frameworks
cannot handle the dependencies between workers that exist in dis-
tributed SGD. Another line of work attempts to instead optimize
the markets in which users bid for spot instances. Sharma et al. [31]
advocate using a bid price equal to that of an on-demand instance
with the same resource configuration, migrating to VM instances in
other spot markets as needed to handle interruptions. The result-
ing migration overhead, however, requires complex checkpointing
and migration strategies due to SGD’s substantial communication
dependencies between workers, realizing limited savings [21].
3 DISTRIBUTED SGD PRIMER
Most state-of-the-art machine learning systems employ Stochastic
Gradient Descent (SGD) to train a neural network model so as to
minimize an objective function G : R^d → R as defined in (1) below.
We use the vector w to denote the model parameters, that is, the
neural network weights that are trained.
$$G(w) \triangleq \frac{1}{|S|} \sum_{s=1}^{|S|} l(h(x_s, w), y_s) \tag{1}$$
Here, G(w) is the empirical average of the loss l(h(x_s, w), y_s) that compares our model's prediction h(x_s, w) to the true output y_s, for each training sample (x_s, y_s). In (1), S is the training dataset with a total of |S| training samples. To minimize G(w), the SGD method iteratively updates the parameters w in the direction opposite to the objective function's gradient in each iteration.

Figure 1: Gradient computations on three workers for two iterations in fully, N = 2-, and N = 2-batch-synchronous SGD. The x-axis indicates time, and lighter colored arrows indicate workers cancelled by the parameter server (PS).
Computing the gradient of G over the entire dataset in each
iteration can be prohibitively expensive. Thus, SGD is generally
implemented using a mini-batch approach, where the gradients are
computed over a small, randomly chosen subset of data samples. Let
α_j and S_j represent the step size and the mini-batch, respectively, at iteration j. The update rule in the jth iteration is given by:
$$w_{j+1} = w_j - \frac{\alpha_j}{|S_j|} \sum_{s \in S_j} \frac{\partial l(h(x_s, w_j), y_s)}{\partial w_j} \tag{2}$$
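To make the update rule (2) concrete, the sketch below (ours, not the authors' code) applies it to a linear least-squares model in NumPy; the model h, the loss l, and all constants are illustrative assumptions.

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, alpha):
    """One mini-batch update following (2), for the illustrative choice of a
    linear model h(x, w) = x @ w with squared loss l(h, y) = 0.5 * (h - y)**2."""
    residual = X_batch @ w - y_batch            # h(x_s, w) - y_s per sample
    grad = X_batch.T @ residual / len(y_batch)  # mini-batch average gradient
    return w - alpha * grad                     # step opposite the gradient

# Toy usage: 1000 samples, 5 features, mini-batches of size 32.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(5.0)
y = X @ w_true + 0.01 * rng.normal(size=1000)
w = np.zeros(5)
for j in range(200):
    idx = rng.choice(1000, size=32, replace=False)  # random mini-batch S_j
    w = sgd_step(w, X[idx], y[idx], alpha=0.1)
```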
To further speed up the training, gradients can be computed
in parallel by distributed workers and aggregated by a parameter
server at the end of each iteration. In this work, we consider a central
parameter server and n worker nodes. Each worker has access to
a subset of the data, and in each iteration each worker fetches the
current parameters w_j from the parameter server, computes the gra-
dients of l(h(x_s, w_j), y_s) over one mini-batch of its data, and pushes
them to the parameter server. Based on the gradient aggregation
method used by the server, there are two distributed SGD frame-
works widely adopted today: Synchronous SGD and Asynchronous SGD. We focus on synchronous policies in this paper since asyn-
chronous gradient aggregation causes staleness in the gradients
returned by workers, which can give inferior SGD convergence. We
consider three variants of synchronous SGD, as shown in Figure 1.
1. Fully-synchronous SGD: In each iteration j, the parameter
server waits for all workers to finish processing their mini-batches
and push the computed gradients, before updating the parameters
wj+1 to all workers for the next iteration.
2. N-Synchronous SGD: In fully synchronous SGD, waiting for the slowest worker to finish its gradient computation can become a
bottleneck. To overcome this bottleneck, in N-synchronous SGD the
parameter server waits for the first N out of n workers to push their
computed gradients. It updates the parameters to wj+1 according
to these N gradients, and pushes w_{j+1} to all workers, canceling the
outstanding gradient computations at slow workers.
3. N-Batch-synchronous SGD: To further reduce synchronization delays, upon finishing a gradient computation, each worker
continues to process another mini-batch using the same parameters
wj until N mini-batches are finished collectively by the workers. In
contrast to N-synchronous SGD, the parameter server waits for the
first N mini-batches to be finished rather than the first N workers.
Once the parameter server receives N gradient updates, it cancels
the remaining gradient computations, and updates w at all workers.
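To contrast the three variants, a minimal sketch (ours, not the paper's implementation) samples one iteration's runtime under each aggregation rule, assuming hypothetical i.i.d. exponential mini-batch runtimes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, N = 8, 1.0, 4                         # workers, exp(lam) rate, batch target

def iteration_runtime(variant):
    """Sample one iteration's runtime under each aggregation rule, with
    hypothetical i.i.d. exp(lam) per-mini-batch runtimes X_i."""
    x = rng.exponential(1.0 / lam, size=n)    # X_1, ..., X_n
    if variant == "fully-sync":
        return x.max()                        # wait for the slowest of n workers
    if variant == "N-sync":
        return np.sort(x)[N - 1]              # wait for the N-th fastest worker
    # N-batch-sync: while all n workers stay busy, completions form a pooled
    # Poisson process of rate n*lam; the iteration ends at the N-th completion.
    return rng.exponential(1.0 / (n * lam), size=N).sum()

for v in ("fully-sync", "N-sync", "N-batch-sync"):
    mean = np.mean([iteration_runtime(v) for _ in range(100000)])
    print(v, round(mean, 3))                  # compare average iteration runtimes
```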
4 PROBLEM FORMULATION
In order to save cost, one can run distributed SGD on spot instances
purchased from Amazon EC2 by deploying each worker on one spot
instance. We adopt three metrics to evaluate the performance of
such strategies: 1) the error ϕ in the trained model, 2) the completion time τ of the job (i.e., the wall-clock time from when the job is initiated to when it completes), and 3) the total cost of renting spot instances for the workers. Specifically, we construct an optimization problem
that attempts to minimize the total expected cost, subject to the
constraints that the expected training error and completion time
cannot exceed given thresholds. We summarize this optimization
problem in (3)-(5), where ϵ and θ denote the maximum allowed error and job completion time, respectively.
minimize: Expected total cost E[C]   (3)
s.t.: Expected training error E[ϕ] ≤ ϵ,   (4)
      Expected completion time E[τ] ≤ θ,   (5)
Alternative formulations might instead minimize the expected train-
ing error or the expected completion time, subject to a cost budget
constraint. Our strategy of optimizing the bid price and analysis of
the training error can easily be adapted to these alternatives.
To solve (3)-(5), we first derive the mathematical expressions for
the three performance metrics.
Cost of running SGD on spot instances. To evaluate the cost, we consider that the n workers are deployed on the same type of
spot instance. Let pt denote the spot price of each VM instance
at time slot t. We assume p_t is drawn i.i.d. from a distribution with lower bound p̲ and upper bound p̄, following prior works on optimal bidding in spot markets [35]. Let f(·) and F(·) denote the probability density function (PDF) and the cumulative distribution function (CDF) of the random variable p_t.
Let b_i represent the bid price for worker i ∈ {1, · · · , n}. Thus,
worker i will be running at time t if and only if bi ≥ pt , and we
formalize the expected total cost in (3) as:
$$\text{Expected cost} := E[C] = \mathbb{E}_{p_t}\left[\sum_{t=1}^{\tau} p_t \sum_{i=1}^{n} \mathbb{1}(p_t \le b_i)\right] \tag{6}$$
Further, the feasible range of the bid price should be:
$$\underline{p} \le b_i \le \bar{p}, \quad \forall i = 1, \cdots, n \tag{7}$$
According to Amazon’s policy [6], bi is determined upon the job
submission without knowing the future spot prices and will be
fixed for the job’s lifetime. Although the user can effectively change
the bid price by terminating the original request and re-bidding for
a new VM, doing so induces significant migration overhead. Thus,
we assume that users employ persistent spot requests: a worker
with a persistent request will be resumed once the spot price falls
below its bid price, exiting the system once its job completes.
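As a sanity check on (6), a small Monte Carlo sketch (ours; the uniform price distribution and the bid values are illustrative assumptions) estimates the expected cost that a single time slot contributes:

```python
import numpy as np

rng = np.random.default_rng(2)
p_lo, p_hi, n = 0.2, 1.0, 8                   # assumed price support and workers
bids = np.full(n, 0.6)                        # example bid vector b_i

def expected_slot_cost(bids, trials=100000):
    """Monte Carlo estimate of one slot's contribution to (6),
    E[ p_t * sum_i 1(p_t <= b_i) ], with p_t ~ Uniform[p_lo, p_hi]
    (the uniform support is our illustrative assumption)."""
    p = rng.uniform(p_lo, p_hi, size=trials)  # spot price draws
    active = p[:, None] <= bids[None, :]      # worker i runs iff b_i >= p_t
    return (p[:, None] * active).sum(axis=1).mean()

print(expected_slot_cost(bids))
```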
Bounds on training error. The constraint (4) is satisfied when SGD is run for a sufficient number of iterations J. In order to com-
pute the minimum required J , we need to analyze the convergence
of the expected error E[ϕ] = E[G(w_J) − G*]. For this, we employ
two convergence results from [10], which require three assump-
tions on the objective function G:
Assumption 1 (Lipschitz-continuous objective gradients). The objective function G : R^d → R is continuously differentiable and its gradient ∇G is L-Lipschitz continuous (for L > 0), i.e.,
$$\|\nabla G(w) - \nabla G(w')\|_2 \le L \|w - w'\|_2, \quad \forall w, w' \in \mathbb{R}^d \tag{8}$$
Assumption 2 (First and second moments). Let $\mathbb{E}_{S_j}[\nabla G(w_j, S_j)]$ represent the expected gradient at iteration j for a mini-batch S_j of the training data. Then there exists a scalar μ_G ≥ 1 such that
$$\nabla G(w_j)^T \mathbb{E}_{S_j}\big[\nabla G(w_j, S_j)\big] \ge \|\nabla G(w_j)\|_2^2 \quad \text{and} \quad \big\|\mathbb{E}_{S_j}\big[\nabla G(w_j, S_j)\big]\big\|_2 \le \mu_G \|\nabla G(w_j)\|_2 \tag{9}$$
and scalars M, M_V ≥ 0 and M_G = M_V + μ_G^2 ≥ 0 such that
$$\mathbb{E}_{S_j}\big[\|\nabla G(w_j, S_j)\|_2^2\big] \le M + M_G \|\nabla G(w_j)\|_2^2. \tag{10}$$
Assumption 3 (Strong convexity). The objective function G(·) : R^d → R is c-strongly convex, i.e., there exists c > 0 such that
$$G(w') \ge G(w) + \nabla G(w)^T (w' - w) + \frac{c}{2}\|w' - w\|_2^2, \quad \forall (w', w) \tag{11}$$
Theorem 1 (Theorem 4.6 in [10]). Suppose that the objective function G(·) satisfies Assumptions 1, 2, and 3. If the SGD algorithm runs with a fixed step size α satisfying 0 < α ≤ 1/(L M_G) for a total of J iterations and waits for N ≤ n mini-batches in each iteration, we have
$$\mathbb{E}\big[G(w_J) - G^*\big] \le \frac{\alpha L M}{2cN} + (1 - \alpha c)^{J-1}\left(G(w_1) - \frac{\alpha L M}{2Nc}\right). \tag{12}$$
Our framework also applies to general non-convex objective
functions, for which we can apply the following convergence result.
Theorem 2 (Theorem 4.8 in [10]). Suppose that the objective function G(·) satisfies Assumptions 1 and 2. If the SGD algorithm is run with a fixed step size α satisfying 0 < α ≤ 1/(L M_G) for a total of J iterations and waits for N ≤ n workers in each iteration, we have
$$\mathbb{E}\left[\frac{1}{J}\sum_{j=1}^{J} \|\nabla G(w_j)\|_2^2\right] \le \frac{\alpha L M}{N} + \frac{2 G(w_1)}{N \alpha J}. \tag{13}$$
Expected completion time. Constraint (5) imposes a deadline
on the job’s completion time. This time depends on two factors:
the bid and spot prices, which control the number of time slots
when workers will be active; and the time taken by the workers
to compute mini-batch gradients in each iteration. To model the
latter, we define the random variable X_i as the running time of each worker i processing one mini-batch. Assume that X_1, X_2, · · · , X_n are i.i.d., with CDF F_X(·) and PDF f_X(·). These gradient computation delays and the gradient aggregation method (fully, N-, or N-batch-synchronous) used by the parameter server affect R, which we define as the running time of each iteration when at least
one worker is active. We derive the resulting expected completion
time for different bidding strategies below.
5 IDENTICAL BIDS FOR WORKERS
We first consider the simplest case where we optimize a single
bid bi = b for each worker i ∈ {1, 2, . . . ,n}, and solve the cost
minimization problem (3)-(5). We first derive a general result that
holds for each SGD variant and all possible distributions of worker
running times (Theorem 3) and then derive specific bid prices for
fully, N-, and N-batch-synchronous SGD in Corollary 1.
5.1 Translating Error to Number of Iterations
In examining the expected training error, a simple observation is
that processing more mini-batches leads to a smaller training error,
given other parameters such as the step size and the uniform data
size of each mini-batch. Therefore, we ask: how many mini-batches should we process to meet the error requirement ϵ?
Constant number of active workers when the job is run-
ning. With the same bid price for all workers, we observe that at
any time, either all workers are active or no workers are active.
Thus, the number of active workers at any time, given the job is
running (i.e., some workers are active), is independent of the
bid price. Therefore, in each iteration for which the job is running,
n mini-batches are processed for fully-synchronous SGD, while
N < n mini-batches are processed for N-synchronous and N-batch-
synchronous SGDs.
Required iterations J is independent of bid price. We de-
fine J as the required number of iterations for the job, not counting
iterations where the bid price is below the spot price and all workers
are inactive. Since the number of mini-batches processed in each
iteration is constant, deriving the required number of mini-batches
to process is equivalent to deriving the value of J that can guarantee an expected training error no larger than ϵ. To derive this value of J, we suppose we are given a function ϕ̂ that upper-bounds the expected training error as a function of J. As long as ϕ̂ is monotonically non-increasing, i.e., more iterations (a larger J) lead to a smaller error upper bound, we can derive the required J from the inverse function of ϕ̂, denoted ϕ̂^{-1}. Following [10], we can obtain the error upper bound ϕ̂ from Theorems 1 and 2 as described below.
Error bound ϕ̂. We choose the representation for the error ϕ and its upper bound ϕ̂ according to the convexity of the objective function G(·). If G(·) is strongly convex, ϕ (resp. ϕ̂) represents the expected gap between the objective value under our derived parameters and under the optimal parameters. If G(·) is not strongly convex, we abuse the term "error" and define ϕ̂ as the average norm of the gradients of G(·). A value of ϕ̂ near zero then indicates that our derived parameters approach a local optimum of the aggregate loss function G(·) [11]. In each case, we respectively use Theorems 1 and 2 to set
ϕ̂ = right-hand side of (12) ≤ ϵ, or ϕ̂ = right-hand side of (13) ≤ ϵ.
Mapping error bound to number of iterations. We define ϕ̂^{-1}(ϵ, N) as the minimum number of iterations that the job needs to process to achieve an error no larger than ϵ in expectation, given N ≤ n mini-batches processed per iteration, and we set J ≥ ϕ̂^{-1}(ϵ, N). For fully-synchronous SGD, we have N = n; ϕ̂^{-1}(ϵ, N) is the same for N-synchronous and N-batch-synchronous SGD, since they both wait for N mini-batches in each iteration. Note that depending on the tightness of the error upper bound ϕ̂, ϕ̂^{-1}(ϵ, N) may over-estimate the required number of iterations. Nevertheless, running more than ϕ̂^{-1}(ϵ, N) iterations guarantees that the expected error is no larger than ϵ; running fewer iterations may still meet this error requirement but does not give a theoretical guarantee. Thus, in order to ensure the feasibility of the cost minimization problem defined in (3)–(7), we need at least ϕ̂^{-1}(ϵ, N) iterations, with N mini-batches processed in each iteration, for each of our three variants of synchronous SGD.
Figure 2: Increasing the bid price from b to b + Δ always leads to a larger expected cost (blue region compared to the hatched region, plotted on the CDF F(p) over the spot price p). The area of the hatched region equals the term $b - \int_{\underline{p}}^{b} \frac{F(p)}{F(b)}\,dp$ in the expected cost (14), and the area of the blue region equals $b + \Delta - \int_{\underline{p}}^{b+\Delta} \frac{F(p)}{F(b+\Delta)}\,dp$, where Δ > 0. Each area, scaled by Jn E[R], gives the corresponding expected cost.
5.2 Optimizing the Bid Price
Given that the upper bound of the training error ϕ can be translated
into having a sufficient number of iterations J , we now optimize
the bid price b with the number of iterations J ≥ ϕ̂^{-1}(ϵ, N), given
that N mini-batches are processed in each iteration. The derivation
of the optimal bid price, shown in Theorem 3, is based on two
observations that are summarized in Lemmas 1 and 2.
Lemma 1 (Monotonicity of expected cost). Using one bid price for all workers, the expected cost of finishing a fully, N-, or N-batch-synchronous SGD job is given by
$$E[C] = Jn\,\mathbb{E}[R]\left(b - \int_{\underline{p}}^{b} \frac{F(p)}{F(b)}\,dp\right), \tag{14}$$
which is monotonically non-decreasing with the bid price b, given any number of iterations J.
Equation (14) removes the completion time variable τ from the summation in the expected cost (6). Instead, we relate
the expected cost to the spot price distribution and the total number
of iterations. Figure 2 illustrates that the expected cost under bid
price b (or b + Δ) is equal to the area of the hatched (or blue) region scaled by the same factor Jn E[R]. The area of the region, and thus
the expected cost, is non-decreasing with the bid price.
The expected completion time is also monotonic with b:
Lemma 2. Using one bid price for all workers, the expected completion time of running a synchronous SGD job (any of the three variants) for any given number of iterations J is given by
$$E[\tau] = J\,\mathbb{E}[R]/F(b), \tag{15}$$
which is monotonically non-increasing with the bid price b.
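Both formulas are easy to evaluate numerically; the sketch below (ours, assuming a uniform spot price distribution and an assumed E[R]) confirms the two monotonicity directions:

```python
import numpy as np

p_lo, p_hi = 0.2, 1.0                          # assumed uniform price support
F = lambda p: (p - p_lo) / (p_hi - p_lo)       # CDF of Uniform[p_lo, p_hi]
J, n, ER = 1000, 8, 2.0                        # iterations, workers, E[R] (assumed)

def expected_cost(b, grid=100000):
    """E[C] from (14): J*n*E[R] * (b - int_{p_lo}^{b} F(p)/F(b) dp)."""
    p = np.linspace(p_lo, b, grid)
    integral = (F(p) / F(b)).mean() * (b - p_lo)   # Riemann approximation
    return J * n * ER * (b - integral)

def expected_time(b):
    """E[tau] from (15): J * E[R] / F(b)."""
    return J * ER / F(b)

for b in (0.4, 0.6, 0.8):                      # cost rises, completion time falls
    print(b, round(expected_cost(b), 1), round(expected_time(b), 1))
```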
Given that both E[τ] and E[C] increase with J, and we have J ≥ ϕ̂^{-1}(ϵ, N) to meet the error requirement, we should run a total of exactly J = ϕ̂^{-1}(ϵ, N) iterations. Moreover, since the expected
completion time decreases with the bid price, the optimal bid price
should be the one that makes the expected completion time equal
to the deadline. Combining the above, we have the following theorem.
Theorem 3 (Optimal Uniform Bid). Suppose the job performs a synchronous SGD in which the parameter server waits for N ≤ n mini-batches in each iteration, the deadline of the job is θ, the spot price is i.i.d. over time, and the running time of processing each mini-batch by each worker is i.i.d. over all mini-batches and over all workers. Then if E[R] denotes the expected runtime of each iteration, the optimal bid price for all workers equals
$$F^{-1}\left(\frac{\hat\phi^{-1}(\epsilon, N)\,\mathbb{E}[R]}{\theta}\right).$$
Theorem 3 provides a general form of the optimal bid price, given the number of mini-batches processed per iteration, N, and other parameters. In order to derive the specific value of the optimal b, we need to obtain E[R]. Let f_R(·) denote the probability density function of R. Below, we give f_R(·) for N-synchronous SGD (N = n for fully-synchronous SGD), given f_X(·), the distribution of the running time of processing each mini-batch. For N-batch-synchronous SGD, finding f_R(·) with an arbitrary distribution f_X(·) is intractable, but we state an existing result in Lemma 4 for the case where f_X(·) is the PDF of an exponential distribution exp(λ).
Lemma 3. Suppose the job is performing an N-synchronous SGD or a fully-synchronous SGD (N = n). For any 1 ≤ N ≤ n, we have:
$$f_R(x) = \frac{n!}{(N-1)!\,(n-N)!}\,\big(F_X(x)\big)^{N-1}\big(1 - F_X(x)\big)^{n-N} f_X(x) \tag{16}$$
Lemma 4 (Lemma 5 in [14]). Suppose the job is performing an N-batch-synchronous SGD and the X_i (i = 1, · · · , n) are i.i.d. draws from an exponential distribution exp(λ). For any 1 ≤ N ≤ n,
$$\mathbb{E}[R] = N\,\mathbb{E}[X_i]/n = N/(\lambda n) \tag{17}$$
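A quick Monte Carlo check (ours) of both lemmas under exp(λ) mini-batch runtimes; the expected N-th order statistic implied by Lemma 3 for exponentials is (H_n − H_{n−N})/λ, with H_k the k-th harmonic number:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, lam, trials = 8, 4, 2.0, 200000

# Lemma 3 (N-sync): R is the N-th order statistic of n i.i.d. exp(lam) draws.
x = rng.exponential(1.0 / lam, size=(trials, n))
H = lambda k: sum(1.0 / j for j in range(1, k + 1))
print(np.sort(x, axis=1)[:, N - 1].mean(), (H(n) - H(n - N)) / lam)

# Lemma 4 (N-batch-sync): E[R] = N / (lam * n), the N-th event time of a
# pooled Poisson process with rate n*lam.
r_batch = rng.exponential(1.0 / (n * lam), size=(trials, N)).sum(axis=1)
print(r_batch.mean(), N / (lam * n))
```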
Based on Lemmas 3 and 4, we derive the optimal bid price when
the running time per mini-batch follows a shifted exponential dis-
tribution (generalizing the exponential distribution).
Corollary 1 ((Shifted) Exponential Running Time Distribution). Suppose the job performs a synchronous SGD for ϕ̂^{-1}(ϵ, N) (N ≤ n) iterations, the spot price is i.i.d. over time, and the running time of processing each mini-batch is i.i.d. over all mini-batches and over all workers, with PDF f_X(x) = λ exp(−λ(x − α)), x ≥ α, α ≥ 0. The optimal bid price for each SGD variant equals F^{-1}(z), where:
$$z = \begin{cases} \frac{\hat\phi^{-1}(\epsilon, n)}{\theta}\left(\frac{\log n}{\lambda} + \alpha\right) & \text{fully-synchronous} \\ \frac{\hat\phi^{-1}(\epsilon, N)}{\theta}\left(\frac{1}{\lambda}\log\frac{n}{n-N} + \alpha\right) & N\text{-synchronous}, \; N < n \\ \frac{\hat\phi^{-1}(\epsilon, N)}{\theta}\,\frac{N}{n}\left(\frac{1}{\lambda} + \alpha\right) & N\text{-batch-synchronous}, \; N \le n. \end{cases}$$
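With a uniform price distribution (an assumption that makes F^{-1} affine), Corollary 1 reduces to a few lines; the per-variant E[R] expressions below follow our reading of the corollary and are illustrative:

```python
import math

p_lo, p_hi = 0.2, 1.0                            # assumed uniform price support
F_inv = lambda z: p_lo + z * (p_hi - p_lo)       # inverse CDF of Uniform[p_lo, p_hi]

def optimal_bid(variant, J_req, theta, n, N, lam, a):
    """Evaluate z from Corollary 1 for shifted-exponential runtimes (shift a),
    then return b* = F^{-1}(z); J_req stands in for phi_hat^{-1}(eps, N)."""
    if variant == "fully-sync":
        ER = math.log(n) / lam + a               # expected max of n shifted exps
    elif variant == "N-sync":
        ER = math.log(n / (n - N)) / lam + a     # expected N-th order statistic
    else:                                        # N-batch-sync
        ER = (N / n) * (1.0 / lam + a)
    z = J_req * ER / theta                       # Theorem 3 with this E[R]
    return F_inv(min(z, 1.0))                    # never bid above p_hi

for v in ("fully-sync", "N-sync", "N-batch-sync"):
    print(v, optimal_bid(v, J_req=1000, theta=4000, n=8, N=4, lam=2.0, a=0.1))
```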
6 OPTIMAL HETEROGENEOUS BIDS
In this section, we propose a bidding strategy with two distinct
bid prices for two different groups of workers. This strategy is
motivated by the observation that bidding lower prices for some
workers yields a larger number of active workers when the spot
price is relatively low, which reduces the training error but will
not cost much. We assume a strongly convex objective function
G(·) (Assumption 3) in this section.
Time-variant number of active workers. The largest difference between using uniform bids as in Section 5 and this bidding strategy is that the number of active workers can vary over time, depending on the spot price in each iteration and the bid prices for all workers. Thus, we first seek to understand how the time-variant
number of active workers affects the final training error, completion
time, and cost. Based on this analysis, we can proceed to derive the
bids that solve our optimization problem (3)–(5).
Formally, we place bids of b_1 for workers 1, · · · , n_1 and b_2 (< b_1) for workers n_1 + 1, · · · , n. We define the random variable y(p_t, b⃗) as the number of active workers at time t, which is a function of the spot price p_t and the bid prices b⃗ = (b_1, b_2). At any given time, with probability F(b_1) − F(b_2), y(·, b⃗) = n_1 workers are active; with probability F(b_2), all y(·, b⃗) = n workers are active.
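Concretely, under a uniform price distribution (our illustrative assumption), these probabilities and the conditional harmonic expectation used later in this section take only a few lines:

```python
p_lo, p_hi = 0.2, 1.0                           # assumed uniform price support
F = lambda p: (p - p_lo) / (p_hi - p_lo)        # CDF of Uniform[p_lo, p_hi]

def worker_count_stats(b1, b2, n1, n):
    """Return P(y = n1), P(y = n), and E[1/y | y > 0] for two bids b1 > b2;
    the harmonic expectation reappears in Lemma 5 below."""
    prob_n1 = F(b1) - F(b2)                     # only the n1 high bidders active
    prob_n = F(b2)                              # all n workers active
    prob_active = F(b1)                         # y > 0 iff p_t <= b1
    harmonic = (prob_n1 / n1 + prob_n / n) / prob_active
    return prob_n1, prob_n, harmonic

print(worker_count_stats(b1=0.8, b2=0.5, n1=4, n=8))
```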
Runtime and error bound w.r.t. bid price. We make two observations when we have different bids for different workers: (1) The running time of each iteration is a function of the bid prices b⃗, as b⃗ controls the number of active workers, which affects the running time of each iteration. Therefore, we rewrite R, the running time per iteration, as R(b⃗). (2) The number of iterations required to ensure a final error below ϵ depends on both b⃗ and ϵ for fully synchronous SGD, as different numbers of mini-batches can be processed in different iterations, depending on b⃗. Thus, we cannot define J ≥ ϕ̂^{-1}(ϵ, N) from Theorem 1 as we did for uniform bids. We instead define ϕ̂(b⃗) as the upper bound of the error under b⃗ for a fixed number of iterations J (dropping ϵ for simplicity). We derive ϕ̂(b⃗) in the course of proving Lemma 5 and show that b⃗ and J can be co-optimized in Corollary 2.
Static spot price within each iteration assumption. If workers change their statuses (i.e., become active or inactive) due to spot
price changes within each iteration, the distributions of the cost
and of the completion time will become much more complex. For
instance, suppose the parameter server runs 4-synchronous SGD,
waiting for 4 workers each iteration. Then if 5 workers are active in
the beginning of a given iteration but only 3 of them remain active
in the middle, the parameter server will still wait for the fourth
worker to finish this iteration, i.e., until the spot price decreases
below this worker’s bid price. Therefore, to simplify the model,
we assume that each worker maintains the same (in)active status
through each iteration. In practice, the spot price only changes
once per hour [1], compared to a runtime of several minutes per
iteration, and thus the number of active workers will usually not
change during an iteration. If each worker has an independent ac-tive/inactive distribution, this assumption can be removed as we
can independently consider the random idle time of each individual
worker. We leave the detailed study of spot price changes within
an iteration to our future work.
6.1 Fully-synchronous SGD
We first consider fully-synchronous SGD, where the parameter
server waits for all active workers in each iteration. We initially
assume that n_1, the number of workers in each group, and J, the required number of iterations, are fixed; thus, we optimize the
trade-off between the expected cost, expected completion time, and
the training error using only the bid prices b⃗. After deriving the
closed-form optimal solutions of b1 and b2 in Theorem 4, we discuss
co-optimizing n_1 and J with the bid price vector b⃗.
The expected cost minimization problem is given as follows.
$$\min_{\vec{b}}\;\; J \int_{\underline{p}}^{b_1} \mathbb{E}\big[R(\vec{b}) \,\big|\, p\big]\; y(p, \vec{b})\; p\; \frac{f(p)}{F(b_1)}\; dp \tag{18}$$
subject to:
$$\frac{J}{F(b_1)} \int_{\underline{p}}^{b_1} \mathbb{E}\big[R(\vec{b}) \,\big|\, p\big]\; \frac{f(p)}{F(b_1)}\; dp \le \theta \tag{19}$$
$$\hat\phi(\vec{b}) \le \epsilon \quad \text{(Error constraint)} \tag{20}$$
$$\bar{p} \ge b_1 \ge b_2 \ge \underline{p} \tag{21}$$
Letting f_{R|p}(x) denote the PDF of R(b⃗) conditioned on a given spot price p ∈ [p̲, b_1], we have:
$$f_{R|p}(x) = \begin{cases} n\,(F_X(x))^{n-1} f_X(x), & \text{if } \underline{p} \le p \le b_2 \\ n_1\,(F_X(x))^{n_1-1} f_X(x), & \text{if } b_2 < p \le b_1 \end{cases} \tag{22}$$
We give a more concrete form of E[C] and E[τ ] in [2]. We now
examine the constraint (20), showing that the error is less than the
threshold ϵ when we control the number of active workers y(b⃗):
Lemma 5. Suppose the objective function G(·) satisfies Assumptions 1, 2, and 3. Given a number of iterations J that can guarantee 1/n < Q(ϵ) ≤ 1/n_1; parameters L, M, and c from Assumptions 1–3; and a fixed step size α, the training error is at most ϵ if we have:
$$\mathbb{E}\left[\frac{1}{y(\vec{b})} \,\middle|\, y(\vec{b}) > 0\right] \le Q(\epsilon) := \frac{2c\left(\epsilon - (1 - \alpha c)^J\, \mathbb{E}[G(w_1)]\right)}{\alpha L M\left(1 - (1 - \alpha c)^J\right)} \tag{23}$$
Lemma 5 connects the distribution of the spot price and our bid prices to the training error through an average form of the number of active workers, i.e., E[1/y(b⃗) | y(b⃗) > 0]. Intuitively, the harmonic mean of the number of workers should be large enough to limit the error. To derive the bound Q(ϵ) in practice, we need to estimate the parameters c and M; we show that our results are robust to mis-estimation in Section 7. Using this result, we now derive the optimal bid prices:
Theorem 4 (Optimal Two Bids with a Fixed J). Suppose the objective function G(·) satisfies Assumptions 1, 2, and 3. Let R(x) represent the expected running time per iteration with x workers active for the entire iteration. Let b*_1 and b*_2 denote the optimal bid prices for the two groups of workers, {1, · · · , n_1} and {n_1 + 1, · · · , n}. Given a number of iterations J that can guarantee 1/n < Q(ϵ) ≤ 1/n_1, a fixed step size α, and a reasonable deadline (θ ≥ J R(n)), we have:
$$F(b_1^*) = \frac{J}{\theta}\left(\big(R(n) - R(n_1)\big)\,\frac{\frac{1}{n_1} - Q(\epsilon)}{\frac{1}{n_1} - \frac{1}{n}} + R(n_1)\right)$$
$$F(b_2^*) = \frac{\frac{1}{n_1} - Q(\epsilon)}{\frac{1}{n_1} - \frac{1}{n}} \times F(b_1^*), \tag{24}$$
for any i.i.d. spot price and any i.i.d. running time per mini-batch, i.e., F(·) and R(n) (or R(n_1)) do not change during the training process.
Figure 3 illustrates the proof of Theorem 4. We first replace b⃗ in the cost minimization problem (18) by F(b_1) and γ = F(b_2)/F(b_1). We then show that the expected cost, completion time, and error are monotonic w.r.t. F(b_1) and γ. The error is not a function of F(b_1), the probability that at least some workers are active, as it depends only on the number of active workers given that some workers are active, which is controlled by the relative difference between F(b_1) and F(b_2), i.e., by γ. We can thus choose γ as small as possible to satisfy the error constraint while minimizing cost, and then choose F(b_1) to satisfy the completion time constraint. Intuitively, F(b_1) should be high enough to guarantee that some workers are active often enough that the job completes before the deadline θ.
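Under a uniform price distribution (assumption), (24) is immediate to evaluate; the sketch below uses illustrative per-iteration runtimes R(n) and R(n_1):

```python
p_lo, p_hi = 0.2, 1.0                             # assumed uniform price support
F_inv = lambda z: p_lo + z * (p_hi - p_lo)        # inverse CDF

def optimal_two_bids(J, theta, n, n1, Q, R_n, R_n1):
    """Evaluate (24): gamma = (1/n1 - Q)/(1/n1 - 1/n) fixes the ratio
    F(b2*)/F(b1*), and the deadline then pins down F(b1*)."""
    gamma = (1.0 / n1 - Q) / (1.0 / n1 - 1.0 / n)
    F_b1 = (J / theta) * ((R_n - R_n1) * gamma + R_n1)
    return F_inv(F_b1), F_inv(gamma * F_b1)

# Q(eps) must lie in (1/n, 1/n1]; for fully-sync SGD, R(n) >= R(n1) since an
# iteration waits for the slowest of more workers. All values are illustrative.
print(optimal_two_bids(J=1000, theta=8000, n=8, n1=4, Q=0.2, R_n=3.0, R_n1=2.0))
```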
Co-optimizing n_1 and b⃗. Note that (24) gives the optimal bid prices as functions of any given n_1, the number of workers with the higher bid price. If n_1 is not a known input but a variable to be co-optimized with b⃗, we can write n_1 and b*_2 in terms of F(b*_1) according to (24) and plug them into (18)-(21) to solve for b*_1 first, and then derive b*_2 and the optimal n_1.
Co-optimizing J and b⃗. Taking J as an optimization variable may allow us to further reduce the job's cost. For instance, allowing the job to run for more iterations, i.e., increasing J, increases Q(ϵ) in Lemma 5's error constraint (23). We can then increase E[1/y(b⃗) | y(b⃗) > 0] by submitting lower bids b_2, making it less likely that workers n_1 + 1, . . . , n will be active, while still satisfying (23). A lower b_2 may decrease the expected cost by making workers less expensive, though this may be offset by the increased number of iterations. To co-optimize J, we show it is a function of b⃗ and ϵ:
Corollary 2 (Relationship of J and b⃗). To guarantee a training error no greater than ϵ, the minimum J is a function of our bid prices:
$$J = \log_{(1-\alpha c \mu)} H(\vec{b}, \epsilon), \qquad H(\vec{b}, \epsilon) = \frac{\epsilon - \frac{\alpha L M}{2c\mu}\, \mathbb{E}\left[\frac{1}{y(\vec{b})} \,\middle|\, y(\vec{b}) > 0\right]}{\mathbb{E}[G(w_1)] - \frac{\alpha L M}{2c\mu}\, \mathbb{E}\left[\frac{1}{y(\vec{b})} \,\middle|\, y(\vec{b}) > 0\right]} \tag{25}$$
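A sketch (ours) of evaluating (25); μ is the moment constant from the convergence analysis, set to 1 here as an assumption, and G1 stands in for E[G(w_1)]:

```python
import math

def min_iterations(eps, harmonic, alpha, L, M, c, mu, G1):
    """Evaluate (25): J = log_{1 - alpha*c*mu} H(b, eps), where `harmonic`
    is E[1/y(b) | y(b) > 0] for the chosen bids."""
    floor = (alpha * L * M / (2 * c * mu)) * harmonic   # error floor term
    H = (eps - floor) / (G1 - floor)
    if not 0.0 < H < 1.0:
        raise ValueError("infeasible: eps is too small for these bids")
    return math.ceil(math.log(H) / math.log(1 - alpha * c * mu))

# Illustrative constants; `harmonic` could come from worker_count_stats above.
print(min_iterations(eps=0.05, harmonic=0.1875, alpha=0.01, L=10.0,
                     M=1.0, c=1.0, mu=1.0, G1=2.0))
```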
We use Corollary 2 to sketch the main idea of co-optimizing J and b⃗; we omit the details for reasons of space. The first step is to replace J in (18) and (19) by (25). The error constraint (20) is already guaranteed by (25) and can be removed. We then solve for the remaining optimization variables, the bids b⃗.
6.2 N-synchronous SGD
We now consider N-synchronous SGD, where in each iteration,
the parameter server waits for N (< n) workers. We suppose that
N workers have the same bid price b1, and that the bid prices for
workers N + 1, · · · , n are b_2, · · · , b_{n−N+1}, respectively, with b_1 ≥ b_2 ≥ · · · ≥ b_{n−N+1}. Thus, N ≤ y(p_t, b⃗) ≤ n when p̲ ≤ p_t ≤ b_1.
We use this strategy to guarantee that at least N workers are
active when the job is running (pt ≤ b1), avoiding cases where
fewer than N workers are active at the start of the iteration and
the algorithm must wait to start the next iteration until the spot
price falls sufficiently far that more workers become active. Since
we allow a distinct bid price for each remaining worker, the above
model is harder to analyze than the two bid prices for two groups
of workers adopted in Section 6.1. To derive the closed-form of the
optimal bid prices, we simplify the problem by assuming that the
running time per mini-batch is randomly drawn from an exponen-
tial distribution, as assumed in [13, 14]. Surprisingly, we can prove
that the optimal bid prices should be the same for all workers:
Figure 3: Monotonicity of each performance metric w.r.t. our new variables F(b_1) and γ = F(b_2)/F(b_1), for any given spot price and runtime distributions: (a) cost increases with F(b_1) (given γ); (b) completion time decreases with F(b_1) (given γ); (c) cost increases with γ (given F(b_1)); (d) completion time increases with γ (given F(b_1)); (e) error decreases with γ. As a larger γ leads to a smaller expected error (Fig. 3e) but a larger expected cost (Fig. 3c) and completion time (Fig. 3d), and the expected error is only controlled by γ, the optimal γ is the smallest γ that yields error = ϵ. The optimal F(b_1) is then the one that makes the completion time equal to the deadline under the optimal γ (Fig. 3b).
Theorem 5 (N-Sync SGD with Exponential Runtimes). Suppose the running time per mini-batch is i.i.d. drawn from an exponential distribution exp(λ) over all time and all workers, and the number of iterations is J = ϕ̂^{-1}(ϵ, N). Then the optimal bid prices are:
$$b_i = b_j = F^{-1}\left(\frac{\hat\phi^{-1}(\epsilon, N)\,\log\frac{n}{n-N}}{\lambda\theta}\right), \quad \forall i, j.$$
Intuitively, since N-synchronous SGD always processes N mini-
batches per iteration, having more workers active when the spot
price is low does not change the error. It does, however, increase
the cost due to paying for all workers’ computing usage. If the
mini-batch runtimes are exponentially distributed, this is not offset
by the lower iteration runtime due to canceling stragglers. It is
then optimal to have the same number of active workers in each
iteration, i.e., uniform bids.
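For a uniform price distribution (assumption), the Theorem 5 bid is a one-liner:

```python
import math

p_lo, p_hi = 0.2, 1.0                            # assumed uniform price support
F_inv = lambda z: p_lo + z * (p_hi - p_lo)

def nsync_uniform_bid(J_req, n, N, lam, theta):
    """Theorem 5 bid: F^{-1}( J_req * log(n / (n - N)) / (lam * theta) ),
    with J_req standing in for phi_hat^{-1}(eps, N)."""
    z = J_req * math.log(n / (n - N)) / (lam * theta)
    return F_inv(min(z, 1.0))                    # never bid above p_hi

print(nsync_uniform_bid(J_req=1000, n=8, N=4, lam=2.0, theta=2000))
```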
6.3 N-batch-synchronous SGD
We finally consider N-batch-synchronous SGD, where the parame-
ters are updated and a new iteration begins once N mini-batches
are processed. We consider a more general bidding model where
each worker is allowed to have a distinct bid price, denoting the
bid prices for workers 1, · · · , n as b_1 ≥ b_2 ≥ · · · ≥ b_n.
As in Section 6.2, we assume that the wall-clock time of finish-
ing each mini-batch for all workers is an i.i.d. random variable
drawn from an exponential distribution with parameter λ. Unlike N-synchronous SGD, for N-batch-synchronous SGD, workers can
continuously process mini-batches in each iteration, until a total
of N mini-batches are finished. Therefore, the arrival rate of mini-
batch completion at each time equals y(p_t, b⃗)λ. We then observe that the product of the number of active workers y(p_t, b⃗) and the expected running time 1/(y(p_t, b⃗)λ) per mini-batch completion will be 1/λ, a constant. This product can be regarded as resource-time, the total amount of resource units in each iteration that we need to pay for. Since it is a constant, the expected cost of each iteration is determined only by the spot price distribution, the largest bid price b_1, and λ. This result then implies that the optimal bid prices are the same for all workers:
Theorem 6 (N-Batch-Sync SGD with Exponential Runtimes). Given a deadline θ and maximum error threshold ϵ, suppose the running time per mini-batch is i.i.d. drawn from an exponential distribution exp(λ) over all time and all workers. Then the cost-minimizing bid price equals
$$F^{-1}\left(\frac{N\,\hat\phi^{-1}(\epsilon, N)}{n\lambda\theta}\right)$$
for each worker.
Theorem 6 indicates that a smaller bid price suffices with a larger number of workers (n), a longer deadline (θ), and a smaller number of required mini-batches (N ϕ̂^{-1}(ϵ, N)) to achieve a smaller-than-ϵ error. Further, it also implies that as we increase the number of workers n, the expected cost keeps decreasing until it reaches $\frac{N \hat\phi^{-1}(\epsilon, N)}{\lambda} \times \underline{p}$. This non-intuitive conclusion follows from the exponentially distributed running times per mini-batch, under which the expected running time per iteration converges to zero as n → ∞. In practice, however, the time to process N mini-batches should not converge to zero, due to unavoidable delays such as fetching the parameters. We thus assume a more realistic shifted exponential distribution for the running time per mini-batch and re-derive the optimal bids. Our expected running time per iteration is then:
$$\mathbb{E}[R] = \mathbb{E}\left[N\Big/\big(\lambda\, y(p_t, \vec{b})\big) + R_{\min} \,\Big|\, y(p_t, \vec{b}) > 0\right] \tag{26}$$
Deriving the closed-form optimal bids is then intractable under
general spot price distributions, but we can derive the following:
Corollary 3 (N-Batch-Sync SGD with Shifted Exponential Runtime). Suppose that the spot price follows an i.i.d. uniform distribution with PDF $f(p) = \frac{1}{\bar{p} - \underline{p}}$, and the user chooses bid prices in arithmetic progression: $b_i = \underline{p} + \alpha(n - i + 1)$, ∀1 ≤ i ≤ n, where α ≥ 0. Then the parameter α for the optimal bid prices equals:
$$\alpha = \max\left\{\frac{\hat\phi^{-1}(\epsilon, N)(\bar{p} - \underline{p})}{n\theta}\left(\frac{N}{n\lambda}\sum_{j=1}^{n}\frac{1}{j} + R_{\min}\right),\;\; \underline{p}\,\sqrt{\frac{n-1}{n^2 N \lambda R_{\min} \sum_{j=1}^{n}(n-j+1)^2}}\right\} \tag{27}$$
Intuitively, the bid prices should be spread out enough to guarantee that the job completes before the deadline. A larger distance
between consecutive bid prices implies a larger bid price for each
worker, which leads to a smaller completion time in expectation.
7 PERFORMANCE EVALUATION
We evaluate our bidding strategies for fully synchronous SGD from
Sections 5 and 6 on two image classification benchmark datasets:
MNIST and CIFAR-10, using a three-layer convolutional neural
network [5] and ResNet-50 [17], respectively. We run the MNIST
experiments on Amazon EC2 and the CIFAR-10 ones on a local clus-
ter. We run J = 1000 iterations to train on the MNIST dataset and
J = 5000 iterations for the CIFAR-10 dataset. We set the deadline
(θ ) to be twice the estimated runtime of using 8 workers to process
J iterations without interruptions, i.e., the empirical average of the
runtime per mini-batch multiplied by J ; and the error threshold
ϵ = 0.98. In our technical report [2], we show that our runtimes
per mini-batch approximately follow a Gaussian distribution.
Since we do not have perfect knowledge of all the required pa-
rameters to derive J (Section 5.1) and Q(ϵ) (cf. Theorem 4), we
choose J based on experience and uniformly at random generate a
value from [1/n, 1/n_1] to use as Q(ϵ), demonstrating the robustness of
our optimized strategies to mis-estimations. We consider two dis-
tributions to simulate i.i.d. spot prices: a uniform distribution on the range [0.2, 1] and a Gaussian distribution with mean and variance
equal to 0.6 and 0.175; we draw the spot price when each iteration
starts and re-draw it every 4 seconds after the job is interrupted.
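For concreteness, a sketch of this price simulation (ours; clipping the Gaussian to the uniform support is our assumption, since out-of-range draws are not otherwise specified):

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_price(dist):
    """One i.i.d. spot price draw: uniform on [0.2, 1], or Gaussian with mean
    0.6 and variance 0.175, clipped to [0.2, 1] (clipping is our illustrative
    choice to keep prices within the assumed support)."""
    if dist == "uniform":
        return rng.uniform(0.2, 1.0)
    return float(np.clip(rng.normal(0.6, np.sqrt(0.175)), 0.2, 1.0))

def wait_for_resume(bid, dist, redraw_s=4.0):
    """After an interruption, re-draw the price every 4 seconds until it falls
    back below the bid; return the simulated waiting time in seconds."""
    waited = 0.0
    while draw_price(dist) > bid:
        waited += redraw_s
    return waited

print(wait_for_resume(bid=0.5, dist="gaussian"))
```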
Our strategies and benchmark. We compare three strategies
with our optimal bid prices against a benchmark No-interruptions strategy that chooses the bid price to be 1.1 (larger than the maxi-
mum spot price) so as to guarantee the job is always running, as
recommended by Amazon [6, 31]. We aim to evaluate the cost reduc-
tion and our ability to meet the accuracy and deadline constraints
using our strategic bids, compared to this aggressive benchmark.
We evaluate the bidding strategies with both the optimal sin-
gle bid price for all workers derived in Theorem 3 (Optimal-one-bid) and the optimal bid prices for two groups of workers derived
in Theorem 4 (Optimal-two-bids). To further minimize the ex-
pected total cost of running each job while guaranteeing a high
training/test accuracy, we propose a Dynamic strategy, which dynamically updates the optimal two bid prices when we increase
the total number of workers. More specifically, we initially launch
four workers and apply our optimal two bid prices, each for two
workers. After completing 80% of the total number of iterations, we
add four more workers and re-compute the optimal two bid prices
in (24): we assign n1 = 4,n = 8 as the new numbers of workers in
each group, subtract the consumed time from the original deadline
θ, and use the remaining number of iterations as the new value of J. This strategy is an initial attempt to improve Optimal-two-bids, which does not optimize the number of workers. One could further
divide the training into multiple stages, e.g., each corresponding to
100 iterations, and recompute the optimal two bid prices using (24)
with the updated parameters in each stage. Frequent re-optimizing
will likely induce intolerable interruption overheads, but infrequent
optimization may reduce the cost with tolerable overhead.
Superiority of our strategies. Figures 4 and 5 compare the
performance of our strategies on MNIST and CIFAR-10, respectively.
Figures 4a, 4b, 5a, and 5b show the achieved test accuracy vs. cost as
we run both jobs for uniform and Gaussian spot prices. We observe
that our dynamic strategy leads to a lower cost, and the no-interruptions benchmark to a higher cost, for any given accuracy, compared to the
optimal-one- and two-bid strategies. In Figures 4c, 4d, 5c, and 5d, we
indicate the cumulative cost as we run the jobs. After 1000 iterations,
the no interruptions benchmark accumulates nearly twice the cost
of our dynamic strategy. The markers indicate the points where we achieve 98% accuracy; while the no-interruptions benchmark achieves this accuracy much faster, it costs nearly three times as much as our dynamic strategy and twice as much as our optimal-two-bids strategy on both datasets. We observe that the cost reduction for the dynamic strategy at 98% accuracy is greater for CIFAR-10 than for the MNIST dataset, likely due to the greater number of iterations required to achieve a high accuracy. Figures 4e, 4f, 5e, and 5f verify that all strategies yield qualitatively similar accuracy over time, though the no-interruptions benchmark converges faster.

Figure 4: Results on MNIST: (a) accuracy vs. cost, uniform spot price distribution; (b) accuracy vs. cost, Gaussian spot price distribution; (c) cost vs. time, uniform spot price distribution; (d) cost vs. time, Gaussian spot price distribution; (e) accuracy vs. time, uniform spot price distribution; (f) accuracy vs. time, Gaussian spot price distribution. Under both spot price distributions, (a,b) the dynamic strategy achieves the highest test accuracy under any given cost. The markers on the curves in (c,d) show the cost when achieving a 98% test accuracy; No-interruptions, Optimal-one-bid, and Optimal-two-bids incur 134%, 82%, and 46% higher costs under the uniform spot price distribution, and 103%, 101%, and 43% higher costs under the Gaussian spot price distribution, at this accuracy, than the dynamic strategy, respectively. All strategies (e,f) converge rapidly to similar (> 98%) accuracies over time.
Figure 5: Results on CIFAR-10: (a) accuracy vs. cost, uniform spot price distribution; (b) accuracy vs. cost, Gaussian spot price distribution; (c) cost vs. time, uniform spot price distribution; (d) cost vs. time, Gaussian spot price distribution; (e) accuracy vs. time, uniform spot price distribution; (f) accuracy vs. time, Gaussian spot price distribution. The results on the CIFAR-10 dataset are similar to those on the MNIST dataset (Figure 4). Figures 5c and 5d show that the dynamic strategy realizes a greater reduction of the completion time compared to Figs. 4c and 4d.

8 CONCLUSION
In this paper, we consider the use of multiple, unreliably available workers that run distributed SGD algorithms to train machine learning models. We focus in particular on Amazon EC2 spot instances,
which allow users to reduce the cost of running jobs at the expense
of a longer job training time to achieve the same model accuracy.
Spot instances allow users to choose the amount they are willing to
pay for computing resources, thus allowing them to choose how to
trade off a higher cost (achieved by placing higher bids) against a longer completion time or higher training error. We quantify
these tradeoffs and derive new bounds on the error of a model that
is trained with different numbers of workers at each iteration. We
finally use this tradeoff characterization to derive optimized bidding
strategies for users on spot instances. We validate these strategies
by training neural network models on the MNIST and CIFAR-10
image datasets and demonstrating their cost savings compared to
heuristic bidding strategies.
REFERENCES
[1] How spot instances work. https://docs.aws.amazon.com/aws-technical-content/latest/cost-optimization-leveraging-ec2-spot-instances/how-spot-instances-work.html.
[2] Machine learning on the cheap: An optimized strategy to exploit spot instances. Technical report. http://andrew.cmu.edu/user/cjoewong/Spot_ML.pdf.
[3] Great Internet Mersenne Prime Search. https://www.mersenne.org/, 2018.
[4] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proceedings of USENIX OSDI (2016).
[5] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS (2012).
[6] Amazon EC2. Amazon EC2 spot instances. https://aws.amazon.com/ec2/spot/, 2018.
[7] Anderson, D. P., and Fedak, G. The computational and storage potential of volunteer computing. In Proc. of IEEE CCGRID (2006), vol. 1, IEEE, pp. 73–80.
[8] Billock, J. Five augmented reality experiences that bring museum exhibits to life. Smithsonian Magazine (June 2017).
[9] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT (2010).
[10] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review 60, 2 (2018), 223–311.
[11] Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press (2004).
[12] Carbunar, B., and Tripunitara, M. V. Payments for outsourced computations. IEEE Transactions on Parallel and Distributed Systems 23, 2 (2012), 313–320.
[13] Dutta, S., Cadambe, V., and Grover, P. Short-Dot: Computing large linear transforms distributedly using coded short dot products. In Proc. of Advances in Neural Information Processing Systems (NIPS) (2016).
[14] Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In Proceedings of AISTATS (2018).
[15] Dean, J., et al. Large scale distributed deep networks. In International Conference on Neural Information Processing Systems (NIPS) (2012), vol. 1, pp. 1223–1231.
[16] Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and Gibbons, P. B. Proteus: Agile ML elasticity through tiered reliability in dynamic resource markets. In Proc. of European Conference on Computer Systems (2017).
[17] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.
[18] Kamp, M., Adilova, L., Sicking, J., Hüger, F., Schlicht, P., Wirtz, T., and Wrobel, S. Efficient decentralized deep learning by dynamic model averaging. arXiv preprint arXiv:1807.03210 (2018).
[19] Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[20] Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html.
[21] Lee, K., and Son, M. DeepSpotCloud: Leveraging cross-region GPU spot instances for deep learning. In Proceedings of IEEE CLOUD (2017), IEEE, pp. 98–105.
[22] Levy, N. How Pokémon Go evolved to remain popular after the initial craze cooled. GeekWire (Oct. 2018). https://www.geekwire.com/2018/pokemon-go-evolved-remain-popular-initial-craze-cooled/.
[23] Lynley, M. Snark AI looks to help companies get on-demand access to idle GPUs. TechCrunch, https://techcrunch.com/2018/07/25/snark-ai-looks-to-help-companies-get-on-demand-access-to-idle-gpus/, 2018.
[24] McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS (2017).
[25] Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research 13, 1 (2012), 165–202.
[26] Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proc. of EuroSys (2018).
[27] Robbins, H., and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407.
[28] Satyanarayanan, M. The emergence of edge computing. Computer 50, 1 (2017), 30–39.
[29] Search for Extraterrestrial Intelligence. SETI@home, 2018.
[30] Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning (2014), pp. 1000–1008.
[31] Sharma, P., Irwin, D., and Shenoy, P. How not to bid the cloud. In Proc. USENIX Conference on Hot Topics in Cloud Computing (HotCloud) (2016).
[32] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proc. of NIPS Workshop on Machine Learning Systems (LearningSys) (2016).
[33] Zafer, M., Song, Y., and Lee, K.-W. Optimal bids for spot VMs in a cloud for deadline constrained jobs. In Proc. of International Conference on Cloud Computing (2012).
[34] Zhang, H., Stafman, L., Or, A., and Freedman, M. J. SLAQ: Quality-driven scheduling for distributed machine learning. In Proceedings of SoCC (2017).
[35] Zheng, L., Joe-Wong, C., Tan, C. W., Chiang, M., and Wang, X. How to bid the cloud. In Proc. ACM SIGCOMM (2015).