
THEMIS: Fair and Efficient GPU Cluster Scheduling

Kshiteej Mahajan, University of Wisconsin - Madison
Arjun Balasubramanian, University of Wisconsin - Madison
Arjun Singhvi, University of Wisconsin - Madison
Shivaram Venkataraman, University of Wisconsin - Madison
Aditya Akella, University of Wisconsin - Madison
Amar Phanishayee, Microsoft Research
Shuchi Chawla, University of Wisconsin - Madison

Abstract: Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads. We find that established cluster scheduling disciplines are a poor fit because of ML workloads' unique attributes: ML jobs have long-running tasks that need to be gang-scheduled, and their performance is sensitive to tasks' relative placement.

We propose THEMIS, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, THEMIS uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off efficiency for fairness in the short term, but ensuring finish-time fairness in the long term. Our evaluation on a production trace shows that THEMIS can improve fairness by more than 2.25X and is 5% to 250% more cluster efficient in comparison to state-of-the-art schedulers.

1 Introduction

With the widespread success of machine learning (ML) for tasks such as object detection, speech recognition, and machine translation, a number of enterprises are now incorporating ML models into their products. Training individual ML models is time- and resource-intensive, with each training job typically executing in parallel on a number of GPUs.

With different groups in the same organization training ML models, it is beneficial to consolidate GPU resources into a shared cluster. Similar to existing clusters used for large-scale data analytics, shared GPU clusters for ML have a number of operational advantages, e.g., reduced development overheads, lower costs for maintaining GPUs, etc. However, today, there are no ML workload-specific mechanisms to share a GPU cluster in a fair manner.

Our conversations with cluster operators indicate that fairness is crucial; specifically, that sharing an ML cluster becomes attractive to users only if they have the appropriate sharing incentive. That is, if there are a total of N users sharing a cluster C, every user's performance should be no worse than using a private cluster of size C/N. Absent such incentive, users are either forced to sacrifice performance and suffer long wait times for getting their ML jobs scheduled, or abandon shared clusters and deploy their own expensive hardware.

Providing sharing incentive through fair scheduling mechanisms has been widely studied in prior cluster scheduling frameworks, e.g., Quincy [18], DRF [8], and Carbyne [11]. However, these techniques were designed for big data workloads, and while they are used widely to manage GPU clusters today, they are far from effective.

The key reason is that ML workloads have unique characteristics that make existing “fair” allocation schemes actually unfair. First, unlike batch analytics workloads, ML jobs have long-running tasks that need to be scheduled together, i.e., gang-scheduled. Second, each task in a job often runs for a number of iterations while synchronizing model updates at the end of each iteration. This frequent communication means that jobs are placement-sensitive, i.e., placing all the tasks for a job on the same machine or the same rack can lead to significant speedups. Equally importantly, as we show, ML jobs differ in their relative placement-sensitivity (Section 3.1.2).

In Section 3, we show that having long-running tasks means that established schemes such as DRF – which aims to equally allocate the GPUs released upon task completions – can arbitrarily violate sharing incentive. We show that even if GPU resources were released/reallocated on fine time-scales [13], placement sensitivity means that jobs with the same aggregate resources could have widely different performance, violating sharing incentive. Finally, heterogeneity in placement sensitivity means that existing scheduling schemes also violate Pareto efficiency and envy-freedom, two other properties that are central to fairness [35].

Our scheduler, THEMIS, addresses these challenges, and supports sharing incentive, Pareto efficiency, and envy-freedom for ML workloads.


It multiplexes a GPU cluster across ML applications (Section 2), or apps for short, where every app consists of one or more related ML jobs, each running with different hyper-parameters, to train an accurate model for a given task. To capture the effect of long-running tasks and placement sensitivity, THEMIS uses a new long-term fairness metric, finish-time fairness, which is the ratio of the running time in a shared cluster with N apps to running alone in a 1/N cluster. THEMIS's goal is thus to minimize the maximum finish-time fairness across all ML apps while efficiently utilizing cluster GPUs. We achieve this goal using two key ideas.

First, we propose to widen the API between ML apps and the scheduler to allow apps to specify placement preferences. We do this by introducing the notion of a round-by-round auction. THEMIS uses leases to account for long-running ML tasks, and auction rounds start when leases expire. At the start of a round, our scheduler requests apps for their finish-time fairness metrics, and makes all available GPUs visible to a fraction of apps that are currently farthest in terms of their fairness metric. Each such app has the opportunity to bid for subsets of these GPUs as a part of an auction; bid values reflect the app's new (placement-sensitive) finish-time fairness metric from acquiring different GPU subsets. A central arbiter determines the global winning bids to maximize the aggregate improvement in the finish-time fair metrics across all bidding apps. Using auctions means that we need to ensure that apps are truthful when they bid for GPUs. Thus we use a partial allocation auction that incentivizes truth telling, and ensures Pareto-efficiency and envy-freeness by design.

While a far-from-fair app may lose an auction round, perhaps because it placed less ideally than another app, its bid values for subsequent auctions naturally increase (because a losing app's finish-time fairness worsens), thereby improving the odds of it winning future rounds. Thus, our approach converges to fair allocations over the long term, while staying efficient and placement-sensitive in the short term.

Second, we present a two-level scheduling design that contains a centralized inter-app scheduler at the bottom level, and a narrow API to integrate with existing hyper-parameter tuning frameworks at the top level. A number of existing frameworks such as Hyperdrive [30] and HyperOpt [3] can intelligently apportion GPU resources between various jobs in a single app, and in some cases also terminate a job early if its progress is not promising. Our design allows apps to directly use such existing hyper-parameter tuning frameworks. We describe how THEMIS accommodates various hyper-parameter tuning systems and how its API is exercised in extracting relevant inputs from apps when running auctions.

We implement THEMIS atop Apache YARN 3.2.0, and evaluate by replaying workloads from a large enterprise trace. Our results show that THEMIS is at least 2.25X better than state-of-the-art schedulers while also improving cluster efficiency by 5% to 250%. To further understand our scheduling decisions, we perform an event-driven simulation using the same trace, and our results show that THEMIS offers greater benefits when we increase the fraction of network-intensive apps, and the cluster contention.

2 Motivation

We start by defining the terminology used in the rest of the paper. We then study the unique properties of ML workload traces from an ML training GPU cluster at a large software company. We end by stating our goals based on our trace analysis and conversations with the cluster operators.

2.1 Preliminaries

We define an ML app, or simply an “app”, as a collection of one or more ML model training jobs. Each app corresponds to a user training an ML model for a high-level goal, such as speech recognition or object detection. Users train these models knowing the appropriate hyper-parameters (in which case there is just a single job in the app), or they train a closely related set of models (n jobs) that explore hyper-parameters such as learning rate, momentum, etc. [22, 30] to identify and train the best target model for the activity at hand.

Each job's constituent work is performed by a number of parallel tasks. At any given time, all of a job's tasks collectively process a mini-batch of training data; we assume that the size of the batch is fixed for the duration of a job. Each task typically processes a subset of the batch, and, starting from an initial version of the model, executes multiple iterations of the underlying learning algorithm to improve the model. We assume all jobs use the popular synchronous SGD [4].
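To make the synchronous SGD execution model concrete, the following is a minimal, illustrative sketch (our own, not code from THEMIS or the jobs in the trace) of one data-parallel iteration; the model and its gradient are hypothetical numpy placeholders.

import numpy as np

def local_gradient(weights, shard):
    # Hypothetical stand-in for a real model's backward pass on one data shard.
    return np.mean(shard, axis=0) - weights

def synchronous_sgd_step(weights, minibatch, num_tasks, lr=0.1):
    # Each task processes an equal shard of the mini-batch and computes a local
    # gradient; the gradients are averaged (the synchronization point that makes
    # placement matter) before one identical model update is applied everywhere.
    shards = np.array_split(minibatch, num_tasks)
    local_grads = [local_gradient(weights, shard) for shard in shards]
    avg_grad = np.mean(local_grads, axis=0)
    return weights - lr * avg_grad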

We consider the finish time of an app to be when the best model and relevant hyper-parameters have been identified. Along the course of identifying such a model, the app may decide to terminate some of its constituent jobs early [3, 30]; such jobs may be exploring hyper-parameters that are clearly sub-optimal (the jobs' validation accuracy improvement over iterations is significantly worse than other jobs in the same app). For apps that contain a single job, finish time is the time taken to train this model to a target accuracy or maximum number of iterations.

2.2 Characterizing Production ML Apps

We perform an analysis of the properties of GPU-based ML training workloads by analyzing workload traces obtained from a large internet company. The GPU cluster we study supports over 5000 unique users. We restrict our analysis to a subset of the trace that contains 85 ML training apps submitted using a hyper-parameter tuning framework.

GPU clusters are known to be heavily contended [19], and we find this also holds true in the subset of the trace of ML apps we consider (Figure 1). For instance, we see that GPU demand is bursty and the average GPU demand is ~50 GPUs.

Figure 1: Aggregate GPU demand of ML apps over time (x-axis: Time (hours); y-axis: GPU Demand).
Figure 2: Job count distribution across different apps (x-axis: Number of jobs per app; y-axis: Fraction of apps).
Figure 3: ML app time (= total GPU time across all jobs in app) distribution (x-axis: Total GPU time (hours); y-axis: Fraction of apps).
Figure 4: Distribution of Task GPU times (x-axis: Total GPU time (mins); y-axis: Fraction of Tasks).

We also use the trace to provide a first-of-a-kind view into the characteristics of ML apps. As mentioned in Section 2.1, apps may either train a single model to reach a target accuracy (1 job) or may use the cluster to explore various hyper-parameters for a given model (n jobs). Figure 2 shows that ~10% of the apps have 1 job, and around ~90% of the apps perform hyper-parameter exploration with as many as 100 jobs (median of 75 jobs). Interestingly, there is also a significant variation in the number of hyper-parameters explored, ranging from a few tens to about a hundred (not shown).

We also measure the GPU time of all ML apps in the trace. If an app uses 2 jobs with 2 GPUs each for a period of 10 minutes, then the task GPU times would be 10 minutes each, the job GPU times would be 20 minutes each, and the app GPU time would be 40 GPU minutes. Figure 3 and Figure 4 show the long-running nature of ML apps: the median app takes 11.5 GPU days and the median task takes 3.75 GPU hours. There is a wide diversity, with a significant fraction of jobs and apps that are more than 10X shorter and many that are more than 10X longer.
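The task/job/app GPU-time accounting from this example can be written down directly; the following tiny sketch (ours, using only the numbers quoted above) shows the bookkeeping.

def gpu_time(num_gpus, minutes):
    return num_gpus * minutes

# The example from the text: an app with 2 jobs; each job has 2 tasks on 1 GPU
# each, running for 10 minutes.
task_gpu_time = gpu_time(1, 10)        # 10 GPU-minutes per task
job_gpu_time = 2 * task_gpu_time       # 20 GPU-minutes per job (2 tasks)
app_gpu_time = 2 * job_gpu_time        # 40 GPU-minutes for the app (2 jobs)
assert (task_gpu_time, job_gpu_time, app_gpu_time) == (10, 20, 40)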

From our analysis we see that ML apps are heterogeneous in terms of resource usage and number of jobs submitted. Running times are also heterogeneous, but at the same time much longer than, e.g., running times of big data analytics jobs (typically a few hours [12]). Handling such heterogeneity can be challenging for scheduling frameworks, and the long-running nature may make controlling app performance particularly difficult in a shared setting with high contention.

We next discuss how some of these challenges manifest in practice from both cluster user and cluster operator perspectives, and how that leads to our design goals for THEMIS.

2.3 Our Goal

Our many conversations with operators of GPU clusters revealed a common sentiment, reflected in the following quote:

“We were scheduling with a balanced approach ... with guidance to ’play nice’. Without firm guard rails, however, there were always individuals who would ignore the rules and dominate the capacity.”
— An operator of a large GPU cluster

With long app durations, users who dominate capacity impose high waiting times on many other users. Some such users are forced to “quit” the cluster as reflected in this quote:

“Even with existing fair sharing schemes, we do find users frustrated with the inability to get their work done in a timely way... The frustration frequently reaches the point where groups attempt or succeed at buying their own hardware tailored to their needs.”
— An operator of a large GPU cluster

While it is important to design a cluster scheduler that ensures efficient use of highly contended GPU resources, the above indicates that it is perhaps equally, if not more important for the scheduler to allocate GPU resources in a fair manner across many diverse ML apps; in other words, roughly speaking, the scheduler's goal should be to allow all apps to execute their work in a “timely way”.

In what follows, we explain using examples, measurements, and analysis why existing fair sharing approaches, when applied to ML clusters, fall short of the above goal, which we formalize next. We identify the need both for a new fairness metric, and for a new scheduler architecture and API that supports resource division according to the metric.

3 Finish-Time Fair Allocation

We present additional unique attributes of ML apps and discuss how they, and the above attributes, affect existing fair sharing schemes.

3.1 Fair Sharing Concerns for ML Apps

The central question is: given R GPUs in a cluster C and N ML apps, what is a fair way to divide the GPUs?

As mentioned above, cluster operators indicate that the primary concern for users sharing an ML cluster is performance isolation that results in “timely completion”. We formalize this as: if N ML apps are sharing a cluster then an app should not run slower on the shared cluster compared to a dedicated cluster with 1/N of the resources. Similar to prior work [8], we refer to this property as sharing incentive (SI). Ensuring sharing incentive for ML apps is our primary design goal.

In addition, resource allocation mechanisms must satisfy two other basic properties that are central to fairness [35]: Pareto Efficiency (PE) and Envy-Freeness (EF).¹

While prior systems like Quincy [18], DRF [8], etc. aim at providing SI, PE, and EF, we find that they are ineffective for ML clusters as they fail to consider the long durations of ML tasks and placement preferences of ML apps.

3.1.1 ML Task Durations

We empirically study task durations in ML apps and show how they affect the applicability of existing fair sharing schemes.

Figure 4 shows the distribution of task durations for ML apps in a production cluster. We note that the tasks are, in general, very long, with the median task roughly 3.75 hours long. This is in stark contrast with, e.g., big data analytics jobs, where tasks are typically much shorter in duration [27].

¹Informally, a Pareto Efficient allocation is one where no app's allocation can be improved without hurting some other app. And, envy-freeness means that no app should prefer the resource allocation of another app.

Figure 5: Effect of GPU resource allocation on job throughput. Throughput (Images/sec) for VGG16, VGG19, AlexNet, Inceptionv3, and ResNet50 with 4 P100 GPUs on 1 server vs. 4 P100 GPUs across 2 servers (2x2).

State-of-the-art fair allocation schemes such as DRF [8] provide instantaneous resource fairness. Whenever resources become available, they are allocated to the task from an app with the least current share. For big data analytics, where task durations are short, this approximates instantaneous resource fairness, as frequent task completions serve as opportunities to redistribute resources. However, blindly applying such schemes to ML apps can be disastrous: running the much longer-duration ML tasks to completion could lead to newly arriving jobs waiting inordinately long for resources. This leads to violation of SI for late-arriving jobs.

Recent “attained-service” based schemes address this problem with DRF. In [13], for example, GPUs are leased for a certain duration, and when leases expire, available GPUs are given to the job that received the least GPU time thus far; this is the “least attained service”, or LAS, allocation policy. While this scheme avoids the starvation problem above for late-arriving jobs, it still violates all key fairness properties because it is placement-unaware, an issue we discuss next.

3.1.2 Placement Preferences

Next, we empirically study placement preferences of ML apps. We use examples to show how ignoring these preferences in fair sharing schemes violates key properties of fairness.

Many apps, many preference patterns: ML cluster users today train a variety of ML apps across domains like computer vision, NLP, and speech recognition. These models have significantly different model architectures, and more importantly, different placement preferences arising from different computation and communication needs. For example, as shown in Figure 5, VGG16 has a strict machine-local task placement preference while ResNet50 does not. This preference inherently stems from the fact that VGG-like architectures have a very large number of parameters and incur greater overheads for updating gradients over the network.

We use examples to show the effect of placement on DRF's allocation strategy. Similar examples and conclusions apply for the LAS allocation scheme.

Ignoring placement affects SI: example: Consider Instance 1 in Figure 6. In this example, there are two placement-sensitive ML apps - A1 reflecting VGG16, and A2 reflecting VGG19. Each ML app has just one job in it with 4 tasks, and the cluster has two 4-GPU machines. As shown above, given the same number of GPUs, both apps prefer GPUs to be in the same server rather than spread across servers.

Figure 6: By ignoring placement preference, DRF violates sharing incentive. Instance 1: two 4-GPU machines. Instance 2: one 4-GPU machine and two 2-GPU machines.

For this example, DRF [8] equalizes the dominant resource share of both the apps under resource constraints and allocates 4 GPUs to each ML app. In Instance 1 of Figure 6 we show an example of a valid DRF allocation. Both apps get the same type of placement with GPUs spread across servers. This allocation violates SI for both apps as their performance would be better if each app just had its own dedicated server.

Ignoring placement affects PE, EF: example: Consider Instance 2 in Figure 6 with two apps - A1 (Inceptionv3) which is not placement sensitive and A2 (VGG16) which is placement sensitive. Each app has one job with four tasks, and the cluster has three machines: one 4-GPU machine and two 2-GPU machines.

Now consider the allocation in Instance 2, where A1 is allocated on the 4-GPU machine whereas A2 is allocated across the 2-GPU machines. This allocation violates EF, because A2 would prefer A1's allocation. It also violates PE because swapping the two apps' allocations would improve A2's performance without hurting A1.

In fact, we can formally show that:

Theorem 3.1. Existing fair schemes (DRF, LAS) ignore placement preferences and violate SI, PE, EF for ML apps.
Proof. Refer to Appendix.

In summary, existing schemes fail to provide fair sharing guarantees as they are unaware of ML app characteristics. Instantaneous fair schemes such as DRF fail to account for long task durations. While least-attained-service schemes overcome that limitation, neither approach's input encodes placement preferences. Correspondingly, the fairness metrics used - i.e., dominant resource share (DRF) or attained service (LAS) - do not capture placement preferences.

This motivates the need for a new placement-aware fairness metric, and a corresponding scheduling discipline. Our observations about ML task durations imply that, like LAS, our fair allocation discipline should not depend on rapid task completions, but instead should operate over longer time scales.

3.2 Metric: Finish-Time Fairness

We propose a new metric called finish-time fairness, ρ = Tsh / Tid.

Tid is the independent finish-time and Tsh is the shared finish-time. Tsh is the finish-time of the app in the shared cluster and it encompasses the slowdown due to the placement and any queuing delays that an app experiences in getting scheduled in the shared cluster. The worse the placement, the higher the value of Tsh. Tid is the finish-time of the ML app in its own independent and exclusive 1/N share of the cluster. Given the above definition, sharing incentive for an ML app can be attained if ρ ≤ 1.
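As a concrete rendering of the metric (our own sketch, not THEMIS code), ρ is simply the ratio of the two finish-time estimates, and sharing incentive holds when it is at most 1; the example numbers below are hypothetical.

def finish_time_fairness(t_shared, t_independent):
    # rho = T_sh / T_id; an app has sharing incentive (SI) when rho <= 1.
    return t_shared / t_independent

# Hypothetical app: 5 hours on the shared cluster vs. 4 hours on a private 1/N share.
rho = finish_time_fairness(t_shared=5.0, t_independent=4.0)   # 1.25
meets_sharing_incentive = rho <= 1.0                           # False: unfairly treated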

To ensure this, it is necessary for the allocation mechanism to estimate the values of ρ for different GPU allocations.


~G:  [0,0]      [0,1] = [1,0]      [1,1]
ρ:   ρ_old      200/400 = 1/2      100/400 = 1/4

Table 1: Example table of bids sent from apps to the scheduler

Pseudocode 1 Finish-Time Fair Policy
1:  Applications {Ai}                         . set of apps
2:  Bids {ρi(.)}                              . valuation function for each app i
3:  Resources ~R                              . resource set available for auction
4:  Resource Allocations {~Gi}                . resource allocation for each app i
5:  procedure AUCTION({Ai}, {ρi(.)}, ~R)
6:     {~Gi,pf} = arg max ∏i 1/ρi(~Gi)        . proportional fair (pf) allocation per app i
7:     {~G−i_j,pf} = arg max ∏j!=i 1/ρj(~Gj)  . pf allocation per app j without app i
8:     ci = ∏j!=i 1/ρj(~Gj,pf) / ∏j!=i 1/ρj(~G−i_j,pf)
9:     ~Gi = ci * ~Gi,pf                      . allocation per app i
10:    ~L = ∑i (1 − ci) * ~Gi,pf              . aggregate leftover resource
11:    return {~Gi}, ~L
12: end procedure
13: procedure ROUNDBYROUNDAUCTIONS({Ai}, {ρi(.)})
14:    while True do
15:       ONRESOURCEAVAILABLEEVENT ~R′:
16:          {Ai_sort} = SORT({Ai}) on ρi_current
17:          {Ai_filter} = get top 1 − f fraction of apps from {Ai_sort}
18:          {ρi_filter(.)} = get updated ρ(.) from apps in {Ai_filter}
19:          {~Gi_filter}, ~L = AUCTION({Ai_filter}, {ρi_filter(.)}, ~R′)
20:          {Ai_unfilter} = {Ai} − {Ai_filter}
21:          allocate ~L to {Ai_unfilter} at random
22:    end while
23: end procedure

Given the difficulty in predicting how various apps will react to different allocations, it is intractable for the scheduling engine to predict or determine the values of ρ.

Thus we propose a new wider interface between the app and the allocation engine that can allow the app to express a preference for each allocation. We propose that apps can encode this information as a table. In Table 1, each column has a permutation of a potential GPU allocation and the estimate of ρ on receiving this allocation. We next describe how the scheduling engine can use this to provide fair allocations.

3.3 Mechanism: Partial Allocation Auctions

The finish-time fairness ρi(.) for an ML app Ai is a function of the GPU allocation ~Gi that it receives. The allocation policy takes these ρi(.)'s as inputs and outputs allocations ~Gi.

Intuitively, the scheduling engine would pick the apps that can most benefit from an allocation and assign resources to that app. In other words, allocations which can decrease an app's ρi will be preferred. However, one of the main challenges in a two-level design where apps submit their respective ρi's is that apps could potentially return false information to boost the chances that they receive a particular allocation. Our conversations with cluster operators indicate that apps request more resources than they require, and that this requires monitoring (“We also monitor the usage. If they don't use it, we reclaim it and pass it on to the next approved project”). Thus a simple scheme like sorting the apps based on their reported ρ values would fail to provide another key property, namely, strategy proofness (SP).

To address this challenge, we propose to use auctions in THEMIS. We begin by describing a simple mechanism that runs a single-round auction and then extend to a round-by-round mechanism that also considers online updates.

3.3.1 One-Shot Auction

Details of the inputs necessary to run the auction are given first, followed by how the auction works given these inputs.

Inputs: Resources and Bids. ~R represents the total GPU resources to be auctioned, where each element is 1 and the number of dimensions is the number of GPUs to be auctioned.

Each ML app bids for these resources. The bid for each ML app consists of the estimated finish-time fair metric (ρi) values for several different GPU allocations (~Gi). Each element in ~Gi can be {0,1}. A set bit implies that GPU is allocated to the app. An example of a bid can be seen in Table 1.

Auction Overview. To ensure that the auction can provide strategy proofness, we propose using a partial allocation auction (PA) mechanism [5]. Partial allocation auctions have been shown to incentivize truth telling and are an appropriate fit for modeling subsets of indivisible goods to be auctioned across apps. Pseudocode 1, line 5 shows the PA mechanism. There are two aspects to auctions that are described next.

1. Initial allocation. PA starts by calculating an intrinsically proportionally-fair allocation ~Gi,pf for each app Ai by maximizing the product of the valuation functions, i.e., the finish-time fair metric values, for all apps (∏i 1/ρi(~Gi), Pseudocode 1, line 6). Such an allocation ensures that it is not possible to increase the allocation of an app without decreasing the allocation of at least another app (satisfying PE [5]).

Note that the final result, ~Gi is not a market-clearing alloca-tion and there could be unallocated GPUs~L that are leftoverfrom hidden payments. Hence, PA is not work-conserving.Thus, while the one-shot auction provides a number of prop-erties related to fair sharing it does not ensure SI is met.Theorem 3.2. The one-shot partial allocation auction guaran-tees SP, PE and EF, but does not provide SI.Proof Refer to Appendix.

The intuitive reason for this is that the partial allocation mechanism does not explicitly ensure that ρ ≤ 1 across apps. To address this we next look at multi-round auctions that can maximize sharing incentive for ML apps.
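The PA mechanism of Pseudocode 1 can be sketched as follows. This is our own brute-force Python illustration, assuming bids are tables mapping every GPU count from 0 to R to a finite ρ (with ρ(0) being the app's current ρ, as in Table 1); a real implementation would bid over placements rather than counts and would round the scaled allocations to whole GPUs.

from itertools import product
from math import prod

def best_split(bids, total_gpus, exclude=None):
    # Brute-force the proportional-fair split: the division of GPUs that
    # maximizes the product of 1/rho over the participating apps.
    apps = [a for a in bids if a != exclude]
    best_val, best_alloc = 0.0, {a: 0 for a in apps}
    for split in product(range(total_gpus + 1), repeat=len(apps)):
        if sum(split) > total_gpus:
            continue
        val = prod(1.0 / bids[a][g] for a, g in zip(apps, split))
        if val > best_val:
            best_val, best_alloc = val, dict(zip(apps, split))
    return best_alloc, best_val

def partial_allocation_auction(bids, total_gpus):
    # One-shot PA auction: each app receives a fraction c_i of its
    # proportional-fair allocation; the rest (1 - c_i) is the hidden payment.
    pf_alloc, _ = best_split(bids, total_gpus)
    allocations, leftover = {}, 0.0
    for app in bids:
        others_with_app = prod(1.0 / bids[j][pf_alloc[j]] for j in bids if j != app)
        _, others_without_app = best_split(bids, total_gpus, exclude=app)
        c = others_with_app / others_without_app
        allocations[app] = c * pf_alloc[app]
        leftover += (1 - c) * pf_alloc[app]
    return allocations, leftover

Because of the hidden payments, the result need not allocate every GPU; the multi-round scheme described next hands the leftover GPUs to apps outside the auction.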

3.3.2 Multi-round auctions

To maximize sharing incentive and to ensure work conservation, our goal is to ensure ρ ≤ 1 for as many apps as possible. We do this using three key ideas described below.

Round-by-Round Auctions: With round-by-round auctions, the outcome of an allocation from an auction is binding only for a lease duration. At the end of this lease, the freed GPUs are re-auctioned. This also handles the online case as any auction is triggered on a resource available event. This takes care of app failures and arrivals, as well as cluster reconfigurations.

At the beginning of each round of auction, the policy solicits updated valuation functions ρ(.) from the apps. The estimated work and the placement preferences for the case of ML apps are typically time varying. This also makes our policy adaptive to such changes.

Round-by-Round Filtering: To maximize the number of apps with ρ ≤ 1, at the beginning of each round of auctions we filter the 1 − f fraction of total active apps with the greatest values of the current estimate of their finish-time fair metric ρ. Here, f ∈ (0,1) is a system-wide parameter.

This has the effect of restricting the auctions to the apps that are at risk of not meeting SI. Also, this restricts the auction to a smaller set of apps, which reduces contention for resources and hence results in smaller hidden payments. It also makes the auction computationally tractable.

Over the course of many rounds, filtering maximizes the number of apps that have SI. Consider a far-from-fair app i that lost an auction round. It will appear in future rounds with much greater likelihood relative to another less far-from-fair app k that won the auction round. This is because the winning app k was allocated resources; as a result, it will see its ρ improve over time; thus, it will eventually not appear in the fraction 1 − f of not-so-fairly-treated apps that participate in future rounds. In contrast, i's ρ will increase due to the waiting time, and thus it will continue to appear in future rounds. Further, an app that loses multiple rounds will eventually lose its lease on all resources and make no further progress, causing its ρ to become unbounded. The next auction round the app participates in will likely see the app's bid winning, because any non-zero GPU allocation to that app will lead to a huge improvement in the app's valuation.

As f → 1, our policy provides a greater guarantee on SI. However, this increase in SI comes at the cost of efficiency. This is because f → 1 restricts the set of apps to which available GPUs will be allocated; with f → 0, available GPUs can be allocated to apps that benefit most from better placement, which improves efficiency at the risk of violating SI.

Leftover Allocation: At the end of each round we have leftover GPUs due to hidden payments. We allocate these GPUs at random to the apps that did not participate in the auction in this round. Thus our overall scheme is work-conserving.
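Putting the three ideas together, one round of the multi-round mechanism might look like the sketch below (again our own illustration; partial_allocation_auction is the sketch from Section 3.3.1, bid_fn stands in for soliciting updated ρ(.) tables, and the random leftover hand-out mirrors Pseudocode 1, lines 13-23).

import random

def run_auction_round(current_rho, bid_fn, available_gpus, f=0.8):
    # Offer the available GPUs to the (1 - f) fraction of apps that are farthest
    # from fair (largest current rho), auction among them, and hand the leftover
    # GPUs from hidden payments to the remaining apps at random.
    apps_by_unfairness = sorted(current_rho, key=current_rho.get, reverse=True)
    num_bidders = max(1, int((1 - f) * len(apps_by_unfairness)))
    bidders = apps_by_unfairness[:num_bidders]

    # bid_fn(app, gpus) returns the app's updated rho table {gpu_count: rho}.
    bids = {app: bid_fn(app, available_gpus) for app in bidders}
    allocations, leftover = partial_allocation_auction(bids, available_gpus)

    non_bidders = [app for app in apps_by_unfairness if app not in bidders]
    if non_bidders and leftover > 0:
        allocations[random.choice(non_bidders)] = leftover   # work conservation
    return allocations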

Overall, we prove that:

Theorem 3.3. Round-by-round auctions preserve the PE, EF and SP properties of partial auctions and maximize SI.
Proof. Refer to Appendix.

To summarize, in THEMIS we propose a new finish-time fairness metric that captures fairness for long-running, placement-sensitive ML apps. To perform allocations, we propose using a multi-round partial allocation auction that incentivizes truth telling and provides Pareto-efficient, envy-free allocations. By filtering the apps considered in the auction, we maximize sharing incentive and hence satisfy all the properties necessary for fair sharing among ML applications.

4 System Design

We first list design requirements for an ML cluster scheduler taking into account the fairness metric and auction mechanism described in Section 3, and the implications for the THEMIS scheduler architecture. Then, we discuss the API between the scheduler and the hyper-parameter optimizers.

4.1 Design Requirements

Separation of visibility and allocation of resources. Core to our partial allocation mechanism is the abstraction of making available resources visible to a number of apps but allocating each resource exclusively to a single app. As we argue below, existing scheduling architectures couple these concerns and thus necessitate the design of a new scheduler.

Integration with hyper-parameter tuning systems. Hyper-parameter optimization systems such as Hyperband [22] and Hyperdrive [30] have their own schedulers that decide the resource allocation and execution schedule for the jobs within those apps. We refer to these as app-schedulers. One of our goals in THEMIS is to integrate with these systems with minimal modifications to app-schedulers.

These two requirements guide our design of a new two-level semi-optimistic scheduler and a set of corresponding abstractions to support hyper-parameter tuning systems.

4.2 THEMIS Scheduler Architecture

Existing scheduler architectures are either pessimistic or fully optimistic, and both these approaches are not suitable for realizing multi-round auctions. We first describe their shortcomings and then describe our proposed architecture.

4.2.1 Need for a new scheduling architecture

Two-level pessimistic schedulers like Mesos [17] enforce pessimistic concurrency control. This means that visibility and allocation go hand-in-hand at the granularity of a single app. There is restricted single-app visibility as available resources are partitioned by a mechanism internal to the lower-level (i.e., cross-app) scheduler and offered only to a single app at a time. The tight coupling of visibility and allocation makes it infeasible to realize round-by-round auctions where resources need to be visible to many apps but allocated to just one app.

Shared-state fully optimistic schedulers like Omega [31] enforce fully optimistic concurrency control. This means that visibility and allocation go hand-in-hand at the granularity of multiple apps. There is full multi-app visibility as all cluster resources and their state are made visible to all apps. Also, all apps contend for resources, and resource allocation decisions are made by multiple apps at the same time using transactions. This coupling of visibility and allocation makes it hard to realize a global policy like finish-time fairness and also leads to expensive conflict resolution (needed when multiple apps contend for the same resource) when the cluster is highly contended, which is typically the case in shared GPU clusters.

Figure 7: THEMIS Design. (a) Sequence of events in THEMIS - starts with a resource available event and ends with resource allocations. (b) Shows a typical bid valuation table an app submits to the ARBITER. Each row has a subset of the complete resource allocation and the improved value of ρnew.

Thus, the properties required by multi-round auctions, i.e., multi-app resource visibility and single-app resource allocation granularity, make existing architectures ineffective.

4.2.2 Two-Level Semi-Optimistic Scheduling

The two levels in our scheduling architecture comprise multiple app-schedulers and a cross-app scheduler that we call the ARBITER. The ARBITER has our scheduling logic. The top-level per-app schedulers are minimally modified to interact with the ARBITER. Figure 7 shows our architecture.

Each GPU in a THEMIS-managed cluster has a lease associated with it. The lease decides the duration of ownership of the GPU for an app. When a lease expires, the resource is made available for allocation. THEMIS's ARBITER pools available resources and runs a round of the auctions described earlier. During each such round, the resource allocation proceeds in 5 steps spanning 2 phases (shown in Figure 7):

The first phase, called the visibility phase, spans steps 1-3. (1) The ARBITER asks all apps for current finish-time fair metric estimates. (2) The ARBITER initiates auctions, and makes the same non-binding resource offer of the available resources to a fraction f ∈ [0,1] of ML apps with the worst finish-time fair metrics (according to round-by-round filtering described earlier). To minimize changes in the ML app scheduler to participate in auctions, THEMIS introduces an AGENT that is co-located with each ML app scheduler. The AGENT serves as an intermediary between the ML app and the ARBITER. (3) The apps examine the resource offer in parallel. Each app's AGENT then replies with a single bid that contains preferences for desired resource allocations.

The second phase, the allocation phase, spans steps 4-5. (4) The ARBITER, upon receiving all the bids for this round, picks winning bids according to the previously described partial allocation algorithm and leftover allocation scheme. It then notifies each AGENT of its winning allocation (if any). (5) The AGENT propagates the allocation to the ML app scheduler, which can then decide the allocation among constituent jobs.

In sum, the two-phase resource allocation means that our scheduler enforces semi-optimistic concurrency control. Similar to fully optimistic concurrency control, there is multi-app visibility as the cross-app scheduler offers resources to multiple apps concurrently. At the same time, similar to pessimistic concurrency control, the resource allocations are conflict-free, guaranteeing exclusive access of a resource to every app.

To enable preparation of bids in step 3, THEMIS implements a narrow API from the ML app scheduler to the AGENT that enables propagation of app-specific information. An AGENT's bid contains a valuation function (ρ(.)) that provides, for each resource subset, an estimate of the finish-time fair metric the app will achieve with the allocation of the resource subset. We describe how this is calculated next.

4.3 AGENT and AppScheduler Interaction

An AGENT co-resides with an app to aid participation in auctions. We now describe how AGENTs prepare bids based on inputs provided by apps, the API between an AGENT and its app, and how AGENTs integrate with current hyper-parameter optimization schedulers.

4.3.1 Single-Job ML Apps

For ease of explanation, we first start with the simple case of an ML app that has just one ML training job which can use at most job_demand_max GPUs. We first look at the calculation of the finish-time fair metric, ρ. We then look at a multi-job app example so as to better understand the various steps and interfaces in our system involved in a multi-round auction.

Calculating ρ(~G). Equation 1 shows the steps for calculating ρ for a single job given a GPU allocation of ~G in a cluster C with RC GPUs. When calculating ρ we assume that the allocation ~G is binding till job completion.

ρ(~G) = Tsh(~G) / Tid
Tsh = (Tcurrent − Tstart) + iter_left ∗ iter_time(~G)
Tid = Tcluster ∗ Navg
iter_time(~G) = iter_time_serial ∗ S(~G) / min(||~G||1, job_demand_max)
Tcluster = iter_total ∗ iter_serial_time / min(RC, job_demand_max)
(1)

Tsh is the shared finish-time and is a function of the allocation ~G that the job receives. For the single-job case, it has two terms. First is the time elapsed (= Tcurrent − Tstart). Time elapsed also captures any queuing delays or starvation time. Second is the time to execute the remaining iterations, which is the product of the number of iterations left (iter_left) and the iteration time (iter_time(~G)). iter_time(~G) depends on the allocation received. Here, we consider the common case of the ML training job executing synchronous SGD. In synchronous SGD, the work in an iteration can be parallelized across multiple workers. Assuming linear speedup, this means that the iteration time is the serial iteration time (iter_time_serial) reduced by a factor of the number of GPUs in the allocation, ||~G||1, or job_demand_max, whichever is lesser. However, the linear speedup assumption is not true in the common case as network overheads are involved. We capture this via a slowdown penalty, S(~G), which depends on the placement of the GPUs in the allocation. Values for S(~G) can typically be obtained by profiling the job offline for a few iterations. The slowdown is captured as a multiplicative factor, S(~G) ≥ 1, by which Tsh is increased.

Tid is the estimated finish-time in an independent 1/Navg cluster. Navg is the average contention in the cluster and is the weighted average of the number of other apps in the system during the lifetime of the app. We approximate Tid using Tcluster, the finish-time of the app on the whole cluster, scaled by the average contention. Tcluster assumes linear speedup when the app executes with all the cluster resources RC or the maximum app demand, whichever is lesser. It also assumes no slowdown. Thus, it is approximated as iter_total ∗ iter_serial_time / min(RC, job_demand_max).
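A direct transcription of Equation 1 (our own sketch, not AGENT code) is shown below; the slowdown S(~G) and the iteration counts are assumed to come from offline profiling, and the allocation is summarized simply as a GPU count plus its measured slowdown factor.

def rho_single_job(t_current, t_start, iter_left, iter_total, iter_time_serial,
                   num_gpus, slowdown, cluster_gpus, job_demand_max, n_avg):
    # Finish-time fairness for a single-job app under a given allocation (Equation 1).
    if num_gpus == 0:
        return float("inf")   # no GPUs: the job makes no progress, rho is unbounded

    # Shared finish time: time elapsed so far plus the remaining iterations at
    # the placement-dependent iteration time.
    iter_time = iter_time_serial * slowdown / min(num_gpus, job_demand_max)
    t_shared = (t_current - t_start) + iter_left * iter_time

    # Independent finish time: whole-cluster finish time scaled by average contention.
    t_cluster = iter_total * iter_time_serial / min(cluster_gpus, job_demand_max)
    t_independent = t_cluster * n_avg
    return t_shared / t_independent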

4.3.2 Generalizing to Multiple-Job ML Apps

ML app schedulers for hyper-parameter optimization systems typically go from aggressive exploration of hyper-parameters to aggressive exploitation of the best hyper-parameters. While there are a number of different algorithms for choosing the best hyper-parameters [3, 22] to run, we focus on early stopping criteria as this affects the finish time of ML apps.

As described in prior work [9], automatic stopping algorithms can be divided into two categories: Successive Halving and Performance Curve Stopping. We next discuss how to compute Tsh for each case.

Successive Halving refers to schemes which start with a total time or iteration budget B and apportion that budget by periodically stopping jobs that are not promising. For example, if we start with n hyper-parameter options, then each one is submitted as a job with a demand of 1 GPU for a fixed number of iterations I. After I iterations, only the best n/2 ML training jobs are retained and assigned a maximum demand of 2 GPUs for the same number of iterations I. This continues until we are left with 1 job with a maximum demand of n GPUs. Thus there are a total of log2(n) phases in Successive Halving. This scheme is used in Hyperband [22] and Google Vizier [9].

We next describe how to compute Tsh and Tid for successive halving. We assume that the given allocation ~G lasts till app completion and the total time can be computed by adding up the time the app spends in each phase. Consider the case of phase i which has J = n/2^(i−1) jobs. Equation 2 shows the calculation of Tsh(i), the shared finish time of the phase. We assume a separation of concerns where the hyper-parameter optimizer can determine the optimal allocation of GPUs within a phase and thus estimate the value of S(~Gj). Along with iter_left and serial_iter_time, the AGENT can now estimate Tsh(j) for each job in the phase. We mark the phase as finished when the slowest or last job in the app finishes the phase (max over j). Then the shared finish time for the app is the sum of the finish times of all constituent phases.

To estimate the ideal finish-time we compute the total time to execute the app on the full cluster. We estimate this using the budget B, which represents the aggregate work to be done, and, as before, we assume linear speedup up to the maximum number of GPUs the app can use, app_demand_max.

Tsh(i) = max_j { T(~Gj) }
Tsh = ∑_i Tsh(i)
Tcluster = B / min(RC, app_demand_max)
Tid = Tcluster ∗ Navg
(2)
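The successive-halving case of Equation 2 then sums per-phase finish times, each phase ending with its slowest job. A sketch (ours), assuming the hyper-parameter optimizer reports the estimated per-job finish times for every phase via the getJobsInPhase/getNumPhases API of Figure 8:

def rho_successive_halving(per_phase_job_times, budget, cluster_gpus,
                           app_demand_max, n_avg):
    # per_phase_job_times: for each phase, the estimated finish times T(G_j) of
    # its jobs under the candidate allocation (as reported via getJobsInPhase).
    t_shared = sum(max(job_times) for job_times in per_phase_job_times)  # slowest job ends a phase
    t_cluster = budget / min(cluster_gpus, app_demand_max)               # Equation 2
    t_independent = t_cluster * n_avg
    return t_shared / t_independent

Section 4.3.3 walks through these quantities for a concrete 2-GPU bid.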

The AGENT generates ρ using the above procedure for all possible subsets of {~G} and produces a bid table similar to the one shown in Table 1 before. The API between the AGENT and the hyper-parameter optimizer is shown in Figure 8 and captures the functions that need to be implemented by the hyper-parameter optimizer.

Performance Curve Stopping refers to schemes where the convergence curve of a job is extrapolated to determine which jobs are more promising. This scheme is used by Hyperdrive [30] and Google Vizier [9]. Computing Tsh proceeds by calculating the finish time for each job that is currently running by estimating the iteration at which the job will be terminated (thus Tsh is determined by the job that finishes last). As before, we assume that the given allocation ~G lasts till app completion. Since the estimations are usually probabilistic (i.e., the iteration at which the job will converge has an error bar), we over-estimate and use the most optimistic convergence curve that results in the maximum forecasted completion time for that job. As the job progresses, the estimates of the convergence curve get more accurate, improving the estimated finish time Tsh. The API implemented by the hyper-parameter optimizer is simpler and only involves getting a list of running jobs, as shown in Figure 8.

We next present an end-to-end example of a multi-job app showing our mechanism in action.

4.3.3 End-to-end Example

We now run through a simple example that exercises the various aspects of our API and the interfaces involved.

Consider a 16-GPU cluster and an ML app that has 4 ML jobs and uses successive halving, running along with 3 other ML apps in the same cluster. Each job in the app is tuning a different hyper-parameter and the serial times taken per iteration for the jobs are 80, 100, 100, and 120 seconds, respectively.² The total budget for the app is 10,000 seconds of GPU time and we assume that job_demand_max is 8 GPUs and S(~G) = 1.

²The time per iteration depends on the nature of the hyper-parameter being tuned. Some hyper-parameters like batch size or quantization used affect the iteration time while others like learning rate don't.


class JobInfo(int itersRemaining,
              float avgTimePerIter,
              float localitySensitivity);

// Successive Halving
List<JobInfo> getJobsInPhase(int phase, List<Int> gpuAlloc);
int getNumPhases();
// Performance Curve
List<JobInfo> getJobsRemaining(List<Int> gpuAlloc);

Figure 8: API between AGENT and hyperparameter optimizer

||~G||1:  0      1    2    4    8    16
ρ:        ρ_old  4    2    1    0.5  0.34

Table 2: Example of bids submitted by AGENT

Given we start with 4 ML jobs, the hyper-parameter optimizer divides this into 3 phases having 4, 2, and 1 jobs, respectively. To evenly divide the budget across the phases, the hyper-parameter optimizer budgets ≈ 8, 16, and 36 iterations in each phase. First we calculate Tid by considering the budget, total cluster size, and cluster contention as: 10000 × 4 / 16 = 2500s.

Next we consider the computation of Tsh assuming that 16 GPUs are offered by the ARBITER. The AGENT now computes the bid for each subset of GPUs offered. Consider the case with 2 GPUs. In this case, in the first phase we have 4 jobs which are serialized to run 2 at a time. This leads to Tsh(1) = (120 × 8) + (80 × 8) = 1600 seconds. (Assume the two 100s jobs run serially on one GPU, and the 80s and 120s jobs run serially on the other. Tsh is the time when the last job finishes.)

When we consider the next stage, the hyper-parameter optimizer currently does not know which jobs will be chosen for termination. We use the median job (in terms of per-iteration time) to estimate Tsh(i) for future phases. Thus, in the second phase we have 2 jobs, so we run one job on each GPU, each of which we assume to take the median 100 seconds per iteration, leading to Tsh(2) = (100 × 16) = 1600 seconds. Finally, for the last phase we have 1 job that uses 2 GPUs and runs for 36 iterations, leading to Tsh(3) = (100 × 36)/2 = 1800 seconds (again, the “median” job takes 100s per iteration). Thus Tsh = 1600 + 1600 + 1800 = 5000 seconds, making ρ = 5000/2500 = 2. Note that since placement did not matter here, we considered any 2 GPUs being used. Similarly, ignoring placement, the bids for the other allocations are shown in Table 2.
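The 2-GPU bid above can be recomputed mechanically; the snippet below (our own check of the worked example, with the jobs packed onto the two GPUs exactly as described in the text) reproduces T_sh, T_id, and ρ.

# Phase 1: 8 iterations per job; the four jobs are packed two-per-GPU as in the text.
phase1 = max(8 * (100 + 100),    # GPU 1 runs the two 100 s/iter jobs back to back
             8 * (80 + 120))     # GPU 2 runs the 80 s and 120 s/iter jobs
# Phase 2: 16 iterations; one (median, 100 s/iter) job per GPU.
phase2 = 16 * 100
# Phase 3: 36 iterations; one median job parallelized across both GPUs.
phase3 = 36 * 100 // 2

t_shared = phase1 + phase2 + phase3     # 1600 + 1600 + 1800 = 5000 s
t_independent = 10000 // 16 * 4         # budget / cluster size * contention = 2500 s
rho = t_shared / t_independent          # 2.0
assert (phase1, phase2, phase3, rho) == (1600, 1600, 1800, 2.0)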

We highlight a few more points about our example above. If the jobs that are chosen for the next phase do not match the median iteration time, then the estimates are revised in the next round of the auction. For example, if the jobs that are chosen for the next round have iteration times 120 and 100, then the above bid will be updated with Tsh(2) = (120 × 16) = 1920³ and Tsh(3) = (120 × 36)/2 = 2160. Further, we also see that job_demand_max = 8 means that the ρ value for 16 GPUs does not decrease linearly from that of 8 GPUs.

³Because the two jobs run on one GPU each, and the 120s-per-iteration job is the last to finish in the phase.

5 Implementation

We implement THEMIS on top of a recent release of Apache Hadoop YARN [1] (version 3.2.0) which includes Submarine [2], a new framework for running ML training jobs atop YARN. We modify the Submarine client to support submitting a group of ML training jobs as required by hyper-parameter exploration apps. Once an app is submitted, it is managed by a Submarine Application Master (AM) and we make changes to the Submarine AM to implement the ML app scheduler (we use Hyperband [22]) and our AGENT.

To prepare accurate bids, we implement a profiler in the AM that parses TensorFlow logs, and tracks iteration times and loss values for all the jobs in an app. The allocation of a job changes over time and iteration times are used to accurately estimate the placement sensitivity (S) from different GPU placements. Loss values are used in our Hyperband implementation to determine early stopping. THEMIS's ARBITER is implemented as a separate module in the YARN RM. We add gRPC-based interfaces between the AGENT and the ARBITER to enable offers, bids, and final winning allocations. Further, the ARBITER tracks GPU leases to offer reclaimed GPUs as a part of the offers.

All the jobs we use in our evaluation are TensorFlow programs with configurable hyper-parameters. To handle allocation changes at runtime, the programs checkpoint model parameters to HDFS every few iterations. After a change in allocation, they resume from the most recent checkpoint.
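As an illustration of this checkpoint/resume pattern (a generic TensorFlow 2 sketch, not THEMIS's actual job code; the HDFS path, checkpoint frequency, and compute_loss are made-up placeholders):

import tensorflow as tf

def train(model, optimizer, dataset, ckpt_dir="hdfs:///tmp/themis-ckpts/job-0", steps=1000):
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
    manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=3)
    if manager.latest_checkpoint:
        ckpt.restore(manager.latest_checkpoint)   # resume after an allocation change

    for step, batch in enumerate(dataset.take(steps)):
        with tf.GradientTape() as tape:
            loss = compute_loss(model, batch)     # hypothetical per-job loss function
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        if step % 50 == 0:                        # checkpoint every few iterations
            manager.save(checkpoint_number=step)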

6 Evaluation

We evaluate THEMIS on a 64 GPU cluster and also use an event-driven simulator to model a larger 256 GPU cluster. We compare against other state-of-the-art ML schedulers. Our evaluation shows the following key highlights:

• THEMIS is better than other schemes on finish-time fairness while also offering better cluster efficiency (Figures 9-10). These experiments are done in our testbed.

• THEMIS's benefits compared to other schemes improve with an increasing fraction of placement-sensitive apps and increasing contention in the cluster, and these improvements hold even with errors – random and strategic – in finish-time fair metric estimations (Figures 12-15).

• THEMIS enables a trade-off between finish-time fairness in the long term and placement efficiency in the short term. Sensitivity analysis (Figure 16) using simulations shows that f = 0.8 and a lease time of 10 minutes gives maximum fairness while also utilizing the cluster efficiently.

6.1 Experimental Setup

Testbed Setup. Our testbed is a 64 GPU, 20 machine cluster on Microsoft Azure [24]. We use NC-series instances. We have 8 NC12-series instances each with 2 Tesla K80 GPUs and 12 NC24-series instances each with 4 Tesla K80 GPUs.

Simulator. We develop an event-based simulator to evaluate THEMIS at large scale. The simulator assumes that estimates of the loss function curves for jobs are known ahead of time so as to predict the total number of iterations for the job. Unless stated otherwise, all simulations are done on a heterogeneous 256 GPU cluster. Our simulator assumes a 4-level hierarchical locality model for GPU placements: individual GPUs fit onto slots on machines occupying different cluster racks.⁴

Workload. We experiment with 2 different traces that have different workload characteristics in both the simulator and the testbed. (i) Workload 1: a publicly available trace of DNN training workloads at Microsoft [20, 25]. We scale down the trace, using a two-week snapshot and focusing on the subset of jobs from the trace that correspond to hyper-parameter exploration jobs triggered by Hyperdrive. (ii) Workload 2: we use the app arrival times from Workload 1, generate jobs per app using the successive-halving pattern characteristic of the Hyperband algorithm [22], and increase the number of tasks per job compared to Workload 1. The distribution of the number of tasks per job and the number of jobs per app for the two workloads is shown in Appendix A (Figure 17).

Baselines. We compare THEMIS against four state-of-the-art ML schedulers - Gandiva [40], Tiresias [13], Optimus [28], and SLAQ [43]; these represent the best possible baselines for maximizing efficiency, maximizing fairness, minimizing average app completion time, and maximizing aggregate model quality, respectively. We implement these baselines in our testbed as well as the simulator as described below.

Ideal Efficiency Baseline - Gandiva. Gandiva improves cluster utilization by packing jobs on as few machines as possible. In our implementation, Gandiva introspectively profiles ML job execution to infer placement preferences and migrates jobs to better meet these placement preferences. On any resource availability, all apps report their placement preferences and we allocate resources in a greedy highest-preference-first manner, which has the effect of maximizing the average placement preference across apps. We do not model time-slicing and packing of GPUs as these system-level techniques can be integrated with THEMIS as well and would benefit Gandiva and THEMIS to equal extents.

Ideal Fairness Baseline - Tiresias. Tiresias defines a new service metric for ML jobs – the aggregate GPU-time allocated to each job – and allocates resources using the Least Attained Service (LAS) policy so that all jobs obtain equal service over time. In our implementation, on any resource availability, all apps report their service metric and we allocate the resource to apps that have the least GPU service. We do not implement the Gittins index policy from Tiresias as it enables a trade-off between service-based fairness and average app completion time. Instead, we use a different baseline for average app completion time.

⁴ The heterogeneous cluster consists of 16 8-GPU machines (4 slots and 2 GPUs per slot), 6 4-GPU machines (4 slots and 1 GPU per slot), and 16 1-GPU machines.

[Figure 9: [TESTBED] Comparison of finish-time fairness across schedulers. CDFs of the fraction of apps vs. finish-time fairness (ρ) for Themis, SLAQ, Gandiva, Tiresias, and Optimus; (a) Workload 1, (b) Workload 2.]

Ideal Average App Completion Time Baseline - Optimus. Optimus proposes the Shortest Remaining Time First (SRTF) policy for ML jobs. In our implementation, on any resource availability, all apps report their remaining time with allocations from the available resources and we allocate these resources using the SRTF policy.

Ideal Aggregate Model Quality - SLAQ. SLAQ proposes a greedy scheme for improving aggregate model quality across all jobs. In our implementation, on any resource availability event, all apps report the decrease in loss value with allocations from the available resources and we allocate these resources in a greedy highest-loss-first manner.
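Ignoring placement, each baseline reduces to a different sort key over the values apps report on a resource-availability event; the condensed sketch below (ours, with hypothetical attribute names) captures that common structure.

```python
# Condensed sketch of how each baseline picks the next app to serve on a resource-
# availability event. Attribute names are hypothetical stand-ins for the reported values.
def pick_app(apps, policy):
    if policy == "gandiva":      # ideal efficiency: highest placement preference first
        return max(apps, key=lambda a: a.placement_preference)
    if policy == "tiresias":     # ideal fairness: least attained GPU service (LAS) first
        return min(apps, key=lambda a: a.attained_gpu_service)
    if policy == "optimus":      # ideal avg. completion time: shortest remaining time first
        return min(apps, key=lambda a: a.remaining_time)
    if policy == "slaq":         # ideal aggregate quality: largest loss decrease first
        return max(apps, key=lambda a: a.loss_decrease)
    raise ValueError(f"unknown policy: {policy}")
```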

Metrics. We use a variety of metrics to evaluate THEMIS. (i) Finish-time fairness: we evaluate the fairness of schemes by looking at the distribution of the finish-time fair metric (ρ) and its maximum value across apps. A tighter distribution and a lower maximum value of ρ across apps indicate higher fairness. (ii) GPU Time: we use GPU time as a measure of how efficiently the cluster is utilized. For two scheduling schemes S1 and S2 that have GPU times G1 and G2 for executing the same amount of work, S1 utilizes the cluster more efficiently than S2 if G1 < G2. (iii) Placement Score: we give each allocation a placement score (≤ 1). This is inversely proportional to the slowdown, S, that an app experiences due to this allocation. The slowdown is dependent on the ML app properties and the network interconnects between the allocated GPUs. A placement score of 1.0 is desirable for as many apps as possible.
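Concretely, the placement score of an allocation can be read as the inverse of the slowdown it induces; the numeric values below are illustrative only, not measurements.

```python
# Sketch of the placement score metric: the score is the inverse of the slowdown S
# an allocation induces, so a perfectly placed allocation (S = 1) scores 1.0.
def placement_score(slowdown: float) -> float:
    return 1.0 / slowdown

placement_score(1.0)    # 1.0, e.g., all GPUs packed on one machine for a network-bound job
placement_score(1.25)   # 0.8, e.g., GPUs spread across racks (illustrative slowdown value)
```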

6.2 Macrobenchmarks

In our testbed, we evaluate THEMIS against all baselines on all the workloads. We set the fairness knob value f as 0.8 and the lease as 10 minutes, informed by our sensitivity analysis results in Section 6.4.

Figure 9 shows the distribution of the finish-time fairness metric, ρ, across apps for THEMIS and all the baselines. We see that THEMIS has a narrower distribution for the ρ values, which means that THEMIS comes closest to giving all jobs an equal sharing incentive. Also, THEMIS's maximum ρ value is 2.2X, 2.25X, 1.75X, and 3.25X better than that of Gandiva, Tiresias, Optimus, and SLAQ, respectively. Figure 10 shows a comparison of the efficiency in terms of the aggregate GPU time to execute the complete workload. Workload 1 has similar efficiency across THEMIS and the baselines as all jobs are either 1 or 2 GPU jobs and almost all allocations, irrespective of the scheme, end up as efficient.


Job Type     GPU Time     # GPUs    ρ (THEMIS)    ρ (Tiresias)
Long Job     ~580 mins    4         ~1            ~0.9
Short Job    ~83 mins     2         ~1.2          ~1.9

Table 3: [TESTBED] Details of 2 jobs to understand the benefits of THEMIS

[Figure 10: [TESTBED] Comparison of total GPU times (hours) across schemes (Themis, Gandiva, SLAQ, Tiresias, Optimus); (a) Workload 1, (b) Workload 2. Lower GPU time indicates better utilization of the GPU cluster.]

[Figure 11: [TESTBED] CDF of placement scores across schemes (Themis, Gandiva, SLAQ, Tiresias, Optimus).]

[Figure 12: [TESTBED] Impact of contention (1X, 2X, 4X) on finish-time fairness (max ρ) across schemes (Themis, Gandiva, SLAQ, Tiresias, Optimus).]

With Workload 2, THEMIS betters Gandiva by ~4.8% and outperforms SLAQ by ~250%. This improvement is because THEMIS leads to global optimality in placement-driven packing due to the auction abstraction, whereas Gandiva takes greedy, locally optimal decisions.

6.2.1 Sources of Improvement

In this section, we deep-dive into the reasons behind THEMIS's wins in fairness and cluster efficiency.

Table 3 compares the finish-time fair metric value for a pair of short- and long-lived apps from our testbed run for THEMIS and Tiresias. THEMIS offers better sharing incentive for both the short and long apps. THEMIS induces altruistic behavior in long apps. We attribute this to our choice of the ρ metric. With less than ideal allocations, even though long apps see an increase in Tsh, their ρ values do not increase drastically because of a higher Tid value in the denominator. Shorter apps, in contrast, see a much more drastic degradation, and our round-by-round filtering of the apps farthest from finish-time fairness causes shorter apps to participate in auctions more often. Tiresias offers poor sharing incentive for short apps as it treats short and long apps the same, which only worsens the sharing incentive for short apps.
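A back-of-the-envelope illustration of this asymmetry (ours; the 60-minute delay is hypothetical, and the independent times are taken loosely from the GPU times in Table 3):

```python
# Why long apps can afford to shed resources: the same hypothetical delay in T_sh
# barely moves a long app's rho but sharply degrades a short app's.
def rho(t_sh, t_id):
    return t_sh / t_id

delay = 60                        # hypothetical extra minutes of shared running time
print(rho(580 + delay, 580))      # ~1.10 for the long app
print(rho(83 + delay, 83))        # ~1.72 for the short app
```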

Figure 11 shows the distribution of placement scores for all the schedulers. THEMIS gives the best placement scores (closer to 1.0 is better) in Workload 2, with Gandiva coming closest. Workload 1 has jobs with very low GPU demand and almost all allocations have a placement score of 1 irrespective of the scheme. Tiresias, SLAQ, and Optimus are poor as they are agnostic of placement preferences. Gandiva performs better but is not as efficient because of greedy local packing.

[Figure 13: [SIMULATOR] Impact of placement sensitivity for varying compute-network app distributions (0%-100% network-intensive apps): (a) impact on max ρ, (b) impact on GPU time (hours), for Themis, Gandiva, SLAQ, Tiresias, and Optimus.]

6.2.2 Effect of Contention

In this section, we analyze the effect of contention on finish-time fairness. We decrease the size of the cluster to half and a quarter of the original size to induce contention of 2X and 4X, respectively. Figure 12 shows the change in the max value of ρ as the contention changes with Workload 1. THEMIS is the only scheme that maintains sharing incentive even in high contention scenarios. Optimus comes close as it preferentially allocates resources to shorter apps. This behavior is similar to that in THEMIS: THEMIS induces altruistic shedding of resources by longer apps (Section 6.2.1), giving shorter apps a preference in allocations during higher contention.

6.2.3 Systems Overheads

From our profiling of the experiments above, we find that each AGENT spends 29 milliseconds (334 milliseconds) to compute bids at the median (95th percentile). The 95th percentile is high because the enumeration of possible bids needs to traverse a larger search space when the number of resources up for auction is high.

The ARBITER uses Gurobi [14] to compute the partial allocation of resources to apps based on bids. This computation takes 354 milliseconds (1398 milliseconds) at the median (95th percentile). The high tail is once again observed when both the number of offered resources and the number of apps bidding are high. However, this time is small relative to the lease time. The network overhead for communication between the ARBITER and individual apps is negligible since we use the existing mechanisms of Apache YARN.

Upon receiving new resource allocations, the AGENT changes (adds/removes) the number of GPU containers available to its app. This change takes about 35 seconds (50 seconds) at the median (95th percentile), i.e., an overhead of 0.2% (2%) of the app duration at the median (95th percentile). Prior to relinquishing control over its resources, each application must checkpoint its set of parameters. We find that this is model dependent but takes about 5-10 seconds on average and is driven largely by the overhead of checkpointing to HDFS.

6.3 Microbenchmarks

Placement Sensitivity: We analyze the impact on finish-time fairness and cluster efficiency as the fraction of network-intensive apps in our workload increases. We synthetically construct 6 workloads and vary the percentage of network-intensive apps in these workloads from 0% to 100%.



[Figure 14: [SIMULATOR] Impact of error in bid values (0%, 5%, 10%, 20%) on max fairness.]

[Figure 15: [SIMULATOR] Strategic lying is detrimental: % decrease in average ACT for honest apps and % increase in average ACT for the lying app as the % of lying varies.]

From Figure 13(a), we notice that sharing incentive degrades most when there is a heterogeneous mix of compute- and network-intensive apps (at 40% and 60%). THEMIS has a max ρ value closest to 1 across all these scenarios and is the only scheme to ensure this sharing incentive. When the workload consists solely of network-intensive apps, THEMIS performs ~1.24X, 2.04X, 1.72X, and 1.77X better than Gandiva, SLAQ, Tiresias, and Optimus, respectively, on max fairness.

Figure 13(b) captures the impact on cluster efficiency. With only compute-intensive apps, all scheduling schemes utilize the cluster equally efficiently. As the percentage of network-intensive apps increases, THEMIS has lower GPU times to execute the same workload. This means that THEMIS utilizes the cluster more efficiently than other schemes. In the workload with 100% network-intensive apps, THEMIS performs ~8.1% better than Gandiva (the state of the art for cluster efficiency).

Error Analysis: Here, we evaluate the ability of THEMIS to handle errors in the estimation of the number of iterations and the slowdown (S). For this experiment, we assume that all apps are equally susceptible to making errors in estimation. The percentage error is sampled at random from the [-X, X] range for each app. Figure 14 shows the changes in max finish-time fairness as we vary X. Even with X = 20%, the change in max finish-time fairness is just 10.76% and is not significant.

Truth-Telling: To evaluate strategy-proofness, we use simulations. We use a cluster of 64 GPUs with 8 identical apps with equivalent placement preferences. The cluster has a single 8-GPU machine and the others are all 2-GPU machines. The most preferred allocation in this cluster is the 8-GPU machine. We assume that there is a single strategically lying app and 7 truthful apps. In every round of the auction it participates in, the lying app over-reports the slowdown with staggered machine placement or under-reports the slowdown with dense machine placement by X%. Such a strategy would ensure a higher likelihood of winning the 8-GPU machine. We vary the value of X in the range [0, 100] and analyze the lying app's completion time and the average app completion time of the truthful apps in Figure 15. We see that at first the lying app does not experience any decrease in its own app completion time. On the other hand, we see that the truthful apps do better on their average app completion time. This is because the hidden payment from the partial allocation mechanism in each round of the auction for the lying app remains the same while the payment from the rest of the apps keeps decreasing. We also observe that there is a sudden tipping point at X > 34%. At this point, there is a sudden increase in the hidden payment for the lying app and it loses a big chunk of resources to other apps. In essence, THEMIS incentivizes truth-telling.
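The misreporting itself amounts to skewing the slowdown reported in a bid; a minimal sketch (ours, with hypothetical names; x = X/100) of the strategy described above:

```python
# Sketch of the strategic misreporting evaluated above: the lying app skews its
# reported slowdown so that the dense 8-GPU machine looks relatively more attractive.
def reported_slowdown(true_slowdown: float, dense_placement: bool, x: float) -> float:
    if dense_placement:
        return true_slowdown * (1 - x)   # under-report: dense placement appears even better
    return true_slowdown * (1 + x)       # over-report: staggered placement appears worse
```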

[Figure 16: [SIMULATOR] Sensitivity of the fairness knob f and the lease time: (a) impact on max fairness, (b) impact on GPU time (hours), for lease times of 5, 10, 20, and 40 minutes.]

6.4 Sensitivity Analysis

We use simulations to study THEMIS's sensitivity to the fairness knob f and the lease time. Figure 16(a) shows the impact on max ρ as we vary the fairness knob f. We observe that restricting the auction to the (1−f) fraction of apps farthest from finish-time fairness helps ensure better sharing incentive. As f increases from 0 to 0.8, we observe that fairness improves. Beyond f = 0.8, max fairness worsens by around a factor of 1.5X. The quality of sharing incentive, captured by max ρ, degrades at f = 1 because only the single app with the highest ρ value participates in the auction. This app is forced into sub-optimal allocations because of the poor placement of available resources with respect to the resources already allocated to this app. We also observe that smaller lease times promote better fairness since frequent filtering of apps reduces the time that queued apps wait for an allocation.

Figure 16(b) shows the impact on the efficiency of cluster usage as we vary the fairness knob f. We observe that efficiency decreases as the value of f increases. This is because the number of apps that can bid for an offer reduces as we increase f, leading to fewer opportunities for the ARBITER to pack jobs efficiently. Lower lease values mean that models need to be checkpointed more often (GPUs are released on lease expiry), and hence higher lease values are more efficient.

Thus we choose f = 0.8 and lease = 10 minutes.

7 Related Work

Cluster scheduling for ML workloads has been targeted by a number of recent works including SLAQ [43], Gandiva [40], Tiresias [13], and Optimus [28]. These systems target different objectives and we compare against them in Section 6.

We build on a rich literature on cluster scheduling disciplines [8, 10-12] and two-level schedulers [17, 31, 37]. While those disciplines/schedulers do not apply to our problem, we build upon some of their ideas, e.g., resource offers in [17]. Sharing incentive was outlined by DRF [8], but we focus on long-term fairness with our finish-time metric. Tetris [10] proposes resource-aware packing with an option to trade off fairness, using multi-dimensional bin-packing as the mechanism for achieving that. In THEMIS, we instead focus on fairness with an option to trade off placement-aware packing, and use auctions as our mechanism.

Some earlier schemes [11, 12] also attempted to emulate the long-term effects of fair allocation. Around occasional barriers, unused resources are re-allocated across jobs. THEMIS differs in many respects. First, earlier systems focus on batch analytics. Second, earlier schemes rely on instantaneous resource fairness (akin to DRF), which has issues with placement-insensitivity and does not account for long tasks. Third, in the ML context there are no occasional barriers. While barriers do arise due to synchronization of parameters in ML jobs, they happen at every iteration. Also, resources unilaterally given up by a job may not be usable by another job due to placement sensitivity.

8 Conclusion

In this paper we presented THEMIS, a fair scheduling framework for ML training workloads. We showed how existing fair allocation schemes are insufficient to handle the long-running tasks and placement preferences of ML workloads. To address these challenges we proposed a new long-term fairness objective in finish-time fairness. We then presented a two-level semi-optimistic scheduling architecture where ML apps bid on resources offered in an auction. Our experiments show that THEMIS can improve fairness and efficiency compared to state-of-the-art schedulers.

References

[1] Apache Hadoop NextGen MapReduce (YARN). Retrieved 9/24/2013, URL: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html, 2013.

[2] Apache Hadoop Submarine. https://hadoop.apache.org/submarine/, 2019.

[3] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox. Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1), 2015.

[4] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.

[5] R. Cole, V. Gkatzelis, and G. Goel. Mechanism design for fair division: allocating divisible items without payments. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, pages 251–268. ACM, 2013.

[6] Common Voice Dataset. https://voice.mozilla.org/.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[8] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI, 2011.

[9] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In KDD, 2017.

[10] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. ACM SIGCOMM Computer Communication Review, 44(4):455–466, 2015.

[11] R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan. Altruistic scheduling in multi-resource clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 65–80, 2016.

[12] R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni. GRAPHENE: Packing and dependency-aware scheduling for data-parallel clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 81–97, 2016.

[13] J. Gu, M. Chowdhury, K. G. Shin, Y. Zhu, M. Jeon, J. Qian, H. Liu, and C. Guo. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 485–500, 2019.

[14] Gurobi Optimization. http://www.gurobi.com/.

[15] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, 2011.

[18] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 261–276. ACM, 2009.


[19] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In USENIX ATC, 2019.

[20] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In USENIX ATC, 2019.

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[22] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

[23] M. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. 1993.

[24] Microsoft Azure. https://azure.microsoft.com/en-us/.

[25] Microsoft Philly Trace. https://github.com/msr-fiddle/philly-traces, 2019.

[26] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[27] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: distributed, low latency scheduling. In SOSP, pages 69–84, 2013.

[28] Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference, page 3. ACM, 2018.

[29] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[30] J. Rasley, Y. He, F. Yan, O. Ruwase, and R. Fonseca. HyperDrive: Exploring hyperparameters with POP scheduling. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pages 1–13. ACM, 2017.

[31] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In EuroSys, 2013.

[32] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.

[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[35] H. R. Varian. Equity, envy, and efficiency. Journal of Economic Theory, 9(1):63–91, 1974.

[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[37] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at Google with Borg. In EuroSys, 2015.

[38] WMT16 Dataset. http://www.statmt.org/wmt16/.

[39] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[40] W. Xiao, R. Bhardwaj, R. Ramjee, M. Sivathanu, N. Kwatra, Z. Han, P. Patel, X. Peng, H. Zhao, Q. Zhang, et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595–610, 2018.

[41] J. Yamagishi. English multi-speaker corpus for CSTR voice cloning toolkit. URL: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html, 2012.

[42] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

[43] H. Zhang, L. Stafman, A. Or, and M. J. Freedman. SLAQ: quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing, pages 390–404. ACM, 2017.


A Appendix

PROOF OF THEOREM 3.1. The examples in Figure 6 and Section 3.1.2 show that DRF violates SI, EF, and PE. The same examples hold true for the LAS policy in Tiresias. The service metric (GPU time) in Instance 1 and Instance 2 is the same for A1 and A2 in terms of LAS and is deemed a fair allocation over time. However, Instance 1 violates SI as A1 (VGG16) and A2 (VGG19) would prefer their own independent GPUs, and Instance 2 violates EF as A2 (VGG19) prefers the allocation of A1 (Inceptionv3), and PE as the optimal allocation after taking into account placement preferences would interchange the allocations of A1 and A2.

PROOF OF THEOREM 3.2. We first show that the valuation function, ρ(·), for the case of ML jobs is homogeneous. This means that ρ(·) has the following property: ρ(m · G) = m · ρ(G), where G denotes the job's GPU allocation. Consider a job with GPUs spread across a set of some M machines. If we keep this set of machines the same, and increase the number of GPUs allocated on these same machines by a certain factor, then the shared running time (Tsh) of this job decreases proportionally by the same factor. This is because the slowdown, S, remains the same: slowdown is determined by the slowest network interconnect between the machines, and the increased allocation does not change the set of machines M. The independent running time (Tid) remains the same. This means that ρ also changes proportionally by the same factor.
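Restating the scaling step in symbols (our formatting of the argument above, under the stated assumption that the machine set M, and hence the slowdown S, is unchanged):

```latex
\rho(\vec{G}) = \frac{T_{sh}(\vec{G})}{T_{id}}, \qquad
T_{sh}(m \cdot \vec{G}) = \frac{T_{sh}(\vec{G})}{m}
\quad (\text{same machine set } M \Rightarrow \text{same slowdown } \mathcal{S}), \qquad
T_{id}(m \cdot \vec{G}) = T_{id}(\vec{G}).
```

Hence ρ changes by the same proportional factor as the allocation is scaled, which is the homogeneity property used below.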

Given homogeneous valuation functions, the PA mechanism guarantees SP, PE, and EF [5]. However, PA violates SI due to the presence of hidden payments. This also makes PA not work-conserving.

PROOF OF THEOREM 3.3. With multi-round auctions we ensure truth-telling of ρ estimates in the visibility phase. This is done by the AGENT using the cached ρ(·) estimates from the last auction the app participated in. In case an app gets leftover allocations from the leftover allocation mechanism, the AGENT updates the ρ estimate again by using the cached ρ(·) table. In this way we guarantee SP with multi-round auctions.

As we saw in Theorem 3.2, an auction ensures PE and EF. In each round, we allocate all available resources using auctions. This ensures end-to-end PE and EF.

For maximizing sharing incentive, we always take a fraction 1−f of apps in each round. A wise choice of f ensures that we filter in all the apps with ρ > 1 that have poor sharing incentive. We only auction the resources to such apps, which maximizes sharing incentive.

A.1 Workload Details

We experiment with 2 different traces that have different workload characteristics in both the simulator and the testbed. (i) Workload 1: a publicly available trace of DNN training workloads at Microsoft [20, 25]. We scale down the trace, using a two-week snapshot and focusing on the subset of jobs from the trace that correspond to hyper-parameter exploration jobs triggered by Hyperdrive. (ii) Workload 2: we use the app arrival times from Workload 1 and generate jobs per app using the successive-halving pattern characteristic of the Hyperband algorithm [22]. The distribution of the number of tasks per job and the number of jobs per app for the two workloads is shown in Figure 17.

[Figure 17: Details of the 2 workloads used for the evaluation of THEMIS: (a) CDF of GPUs per job, (b) CDF of jobs per app, for Workload 1 and Workload 2.]

Fraction    Model                 Type      Dataset
10%         Inceptionv3 [34]      CV        ImageNet [7]
            Alexnet [21]          CV        ImageNet
            Resnet50 [16]         CV        ImageNet
            VGG16 [33]            CV        ImageNet
            VGG19 [33]            CV        ImageNet
60%         Bi-Att-Flow [32]      NLP       SQuAD [29]
            LanguageModel [42]    NLP       PTB [23]
            GNMT [39]             NLP       WMT16 [38]
            Transformer [36]      NLP       WMT16
30%         Wavenet [26]          Speech    VCTK [41]
            DeepSpeech [15]       Speech    CommonVoice [6]

Table 4: Models used in our trace.


The traces comprise models from three categories: computer vision (CV, 10%), natural language processing (NLP, 60%), and speech (Speech, 30%). We use the same mix of models for each category as outlined in Gandiva [40]. We summarize the models in Table 4.
