
This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16). November 2–4, 2016, Savannah, GA, USA. ISBN 978-1-931971-33-1. Open access to the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

Graphene: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters

Robert Grandl, Microsoft and University of Wisconsin—Madison; Srikanth Kandula and Sriram Rao, Microsoft; Aditya Akella, Microsoft and University of Wisconsin—Madison; Janardhan Kulkarni, Microsoft

https://www.usenix.org/conference/osdi16/technical-sessions/presentation/grandl_graphene


Graphene: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters

Robert Grandl†+, Srikanth Kandula†, Sriram Rao†, Aditya Akella†+, Janardhan Kulkarni† (†Microsoft, +University of Wisconsin-Madison)

Abstract. We present a new cluster scheduler, Graphene, aimed at jobs that have a complex dependency structure and heterogeneous resource demands. Relaxing either of these challenges, i.e., scheduling a DAG of homogeneous tasks or an independent set of heterogeneous tasks, leads to NP-hard problems. Reasonable heuristics exist for these simpler problems, but they perform poorly when scheduling heterogeneous DAGs. Our key insights are: (1) focus on the long-running tasks and those with tough-to-pack resource demands, (2) compute a DAG schedule, offline, by first scheduling such troublesome tasks and then scheduling the remaining tasks without violating dependencies. These offline schedules are distilled to a simple precedence order and are enforced by an online component that scales to many jobs. The online component also uses heuristics to compactly pack tasks and to trade off fairness for faster job completion. Evaluation on a 200-server cluster and using traces of production DAGs at Microsoft shows that Graphene improves median job completion time by 25% and cluster throughput by 30%.

1 Introduction

Heterogeneous DAGs are increasingly common in data-parallel clusters. We use DAG to refer to a directed acyclic graph where each vertex represents a task and edges encode input-output dependencies. Programming models such as Dryad, SparkSQL and Tez compile user scripts into job DAGs [2, 19, 24, 43, 57, 67]. Our study of a large production cluster at Microsoft shows that jobs have large and complex DAGs; the median DAG has a depth of five and thousands of tasks. Furthermore, there is substantial variation in task durations (sub-second to hundreds of seconds) and in the resource usage of tasks (e.g., compute, memory, network and disk bandwidth). In this paper, we consider the problem of scheduling such heterogeneous DAGs efficiently.

Given job DAGs and a cluster of machines, a cluster scheduler matches tasks to machines online. This matching has tight timing requirements due to the scale of modern clusters. Consequently, schedulers use simple heuristics. The heuristics leave gains on the table because they ignore crucial aspects of the problem. For example, critical path-based schedulers [36] only consider the critical path as determined by predicted task runtime and schedule tasks in the order of their critical path length. When DAGs have many parallel chains, running tasks that use different resources together can lead to a better schedule because it allows more tasks to run at the same time. As another example, multi-resource packers [37] aim to run the maximal number of pending tasks that fit within the available resources. When DAGs are deep, locally optimal choices do not always result in the fastest completion time of the whole DAG. Hence, intuitively, considering both variation in resource demands and dependencies may result in better schedules for heterogeneous DAGs.

By comparing the completion times of jobs in the production cluster with those achieved by an oracle, we estimate that the median job can be sped up by up to 50%. We observe that individual DAGs have fewer tasks running, relative to the optimal schedule, at some point in their lifetime. The cluster has lower overall utilization because (a) resources are idle even when tasks are pending, due to dependencies or resource fragmentation, and (b) fewer jobs are released because users wait for the output of previous jobs. Given the large investment in such clusters, even a modest improvement in utilization and job latency can have business impact [1, 10, 61].

We note that the optimal schedule for heterogeneous DAGs is intractable [54, 55]. Prior algorithmic work exists, especially on simpler versions of the problem [18, 20, 21, 35, 50, 60, 65]. However, we are yet to find one that holds in the practical setting of a data-parallel cluster. Specifically, the solution has to work online, scale to large and complex DAGs as well as many concurrent jobs, cope with machine-level fragmentation as opposed to imagining one cluster-wide resource pool, and handle multiple objectives such as fairness, latency and throughput.

In this paper, we describe a cluster scheduler, Graphene, that efficiently schedules heterogeneous DAGs. To identify a good schedule for one DAG, we observe that the pathologically bad schedules in today's approaches mostly arise for two reasons: (a) long-running tasks have no other work to overlap with them, which reduces parallelism, and (b) the tasks that are runnable do not pack well with each other, which increases resource fragmentation. Our approach is to identify the potentially troublesome tasks, such as those that run for a very long time or are hard to pack, and place them first onto a virtual resource-time space. This space would have d + 1 dimensions when tasks require d resources, the last dimension being time. Our intuition is that placing the troublesome tasks first leads to a good schedule, since the remaining tasks can be placed into the resultant holes in this space.

At job submission time, Graphene builds a preferred schedule for a job as shown in Figure 1. After identifying a subset of troublesome tasks, the remaining tasks are divided into the parent, child and sibling subsets. Graphene first places the troublesome tasks onto a virtual resource-time space and then places the remaining subsets. Realizing this idea has a few challenges. Which choice of troublesome tasks leads to the best schedule? Further, since troublesome tasks are placed first, when a task is considered for placement, some of its parent tasks and some of its child tasks may already have been placed in the virtual space. How do we guarantee that every task can be placed without violating dependencies? Our answers are in §4.

Figure 1: Steps taken by Graphene from a DAG on the left to its schedule on the right. Troublesome tasks T (in red) are placed first. The remaining tasks (parents P, children C and siblings S) are placed on top of T in a careful order to ensure compactness and respect dependencies.

Graphene's online component schedules the tasks of each DAG in the order of their starting time in the virtual resource-time space. Furthermore, across the many DAGs that may be running in the cluster, the online component respects different objectives: low job latency, high cluster throughput and fairness. These objectives can translate to discordant actions. For example, a fairness scheme such as DRF [33] may want to give resources to a certain job, but the shortest-job-first heuristic that reduces job latency may pick a different job. Similarly, the task that is most likely to reduce resource fragmentation [37] may not start early in the virtual resource-time space. Our reconciliation heuristic intuitively picks tasks by consensus (e.g., based on a weighted combination of the scores received by a task from each objective). However, to maintain predictable performance, we limit unfairness to an operator-configured threshold.

We have implemented Graphene as extensions to Apache YARN and Tez and have experimented with jobs from TPC-DS, TPC-H and other benchmarks on a 200-server cluster. Furthermore, we evaluate Graphene in simulations on 20,000 DAGs from a production cluster.

To summarize, our key contributions are:

1. A characterization of the DAGs seen in production at Microsoft and an analysis of the performance of various DAG scheduling algorithms (§2).

2. A novel DAG scheduler that combines multi-resource packing and dependency awareness (§4).

3. An online inter-job scheduler that mimics the preferred per-job schedules while bounding unfairness (§5) for many fairness models [6, 33, 47].

4. In our extended tech report [38], we develop a new lower bound on the completion time of a DAG. Using this, we show that the schedules built by Graphene's offline component are within 1.04 times the theoretically optimal schedule (OPT) for half of the production DAGs; three quarters are within 1.13 times and the worst is 1.75 times OPT.

5. Our experiments show that Graphene improves the completion time of half of the DAGs by 19% to 31% across the various workloads. Production DAGs improve relatively more because those DAGs are more complex and have diverse resource demands. The gains accrue from running more tasks at a time; the cluster's job throughput (e.g., makespan) also improves by about 25%.

While we present our work in the context of cluster scheduling, DAGs are a powerful and general abstraction for scheduling problems. Scheduling the network transfers of a multi-way join or the work in a geo-distributed analytics job, for example, can be represented as DAGs. We offer early results in §8 from applying Graphene to scheduling the DAGs that arise in distributed build systems [3, 34] and in request-response workflows [46, 66].

2 Primer on Scheduling Job DAGs

2.1 Problem definition

Let each job be represented as a directed acyclic graph G = (V, E). Each node in V is a task with demands for various resources. Edges in E encode precedence constraints between tasks. Many jobs can run simultaneously in a cluster. Given a set of concurrent jobs, the cluster scheduler maps tasks to machines while respecting capacity constraints and task dependencies. The goals of a typical cluster scheduler are high performance (measured using job throughput, average job completion time and overall cluster utilization) while offering fairness (measured with respect to how resources are divided).
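For concreteness, one minimal way to represent such an annotated job DAG is sketched below in Python. The field names (duration, demands, children) are ours for illustration and are not the representation used by any particular system.

from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Task:
    name: str
    duration: float                       # estimated runtime
    demands: Dict[str, float]             # e.g., {"cpu": 0.2, "mem": 0.5, "net": 0.1}
    children: List[str] = field(default_factory=list)  # tasks that consume this task's output

@dataclass
class JobDAG:
    tasks: Dict[str, Task]                # name -> Task

    def parents(self, name: str) -> List[str]:
        return [t.name for t in self.tasks.values() if name in t.children]

    def runnable(self, finished: Set[str]) -> List[Task]:
        # precedence constraints: a task is runnable once all of its parents have finished
        return [t for t in self.tasks.values()
                if t.name not in finished
                and all(p in finished for p in self.parents(t.name))]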

2.2 An illustrative example

We use the DAG shown in Figure 2 to illustrate the issues in scheduling DAGs. Each node represents a task: the node labels show the duration (top) and the demands for two resources (bottom). Assume that the capacity is 1 for both resources. Let ε represent a value approaching zero.



Technique   Execution Order                        Time   Worst-case
OPT         t1 → t3 → t4, t0 → t0, t2, t5          T      -
CPSched     t0 → t3 → t4 → t5 → t1 → t2            3T     O(n) × OPT
Tetris      t0 → t1 → t3 → t2 → t4 → t5            3T     O(d) × OPT

Figure 2: An example DAG where a packer (Tetris [37]) and a critical path scheduler take 3× the optimal algorithm OPT. Here, Graphene is close to OPT (see §2.2). Assume ε → 0.

Figure 3: Illustrating the online case. Online, Graphene schedules the tasks of each DAG in the order of their start time in the offline schedule. Resources are to be divided fairly between two jobs, both of which have the same DAG shown in Figure 2. Notice that fairness interleaves allocation between the jobs. Graphene's average job completion time and makespan improve by 25% to 75% and 50% to 100% respectively, depending on the compared baseline.

Intuitively, a good schedule would overlap the long-running tasks shown with a dark background. The resulting optimal schedule (OPT) is listed in the table below the figure. OPT overlaps the execution of all the long-running tasks, t0, t2 and t5, and finishes in T.

Since such long-running or resource-intensive tasks can be present anywhere in the DAG, greedy schedulers often perform poorly, as we show next.

Critical path-based schedulers (CPSched) pick tasks along the critical path (CP) in the DAG. The CP for a task is the longest path from the task to the job output. The table shows the task execution order with CPSched.¹ CPSched ignores the resources needed by tasks. In this example, CPSched performs poorly because it does not schedule tasks off the critical path early (e.g., t1, t3, t4) even though doing so reduces resource fragmentation by overlapping long-running tasks.

Packers, such as Tetris [37], match tasks to machines so as to maximize the number of simultaneously running tasks. Tetris greedily picks the task with the highest value of the dot product between the task's demand vector and the available resource vector. The table also shows the task execution order with Tetris.² Tetris does not account for dependencies. Its packing heuristic only considers the tasks that are currently schedulable. In this example, Tetris performs poorly because it will not choose locally inferior packing options (such as running t1 instead of t0) even though that can lead to a better global packing.

¹ CP of t0, t1, t3 is T(1+5ε), T(1+ε) and T(1+4ε) respectively. Tasks can run simultaneously only if their total demand is below capacity.
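As a concrete illustration of this dot-product heuristic, the short sketch below scores the runnable tasks and picks one; it is a simplification in Python, not Tetris's actual implementation.

def packing_score(demands, available):
    # dot product of the task's demand vector and the machine's available-resource vector;
    # a higher score means the task fills more of the machine's spare capacity
    return sum(d * available.get(r, 0.0) for r, d in demands.items())

def pick_task(runnable, available):
    # runnable: list of (task_name, demands) pairs; keep only tasks that fit, then pick the best score
    fitting = [(name, d) for name, d in runnable
               if all(d[r] <= available.get(r, 0.0) for r in d)]
    if not fitting:
        return None
    return max(fitting, key=lambda nd: packing_score(nd[1], available))[0]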

Graphene comes close to the optimal schedule for this example. When searching for troublesome subsets, it will consider the subset {t0, t2, t5} because these tasks run for much longer. As shown in Figure 1, the troublesome tasks will be placed first. Since there are no dependencies among them, they will run at the same time. The parents (t1, t3, t4) and any children are then placed before and after the troublesome tasks respectively, in a compact manner, while maintaining inter-task dependencies.

Online: Consider two jobs that have the DAG shown in Figure 2. Figure 3 illustrates the online schedule when resources are to be divided evenly between these jobs (e.g., slot fairness [13]).

The offline schedule computed by Graphene for each of the jobs, which overlaps the long-running tasks (t0, t2, t5), is shown on top. The online component distills these schedules into a precedence order over tasks. For example, the order for both jobs is: t1, t3, t4, t0, t2, t5. Figure 3, bottom, shows a time-lapse of the task execution.

Capacity Scheduler (CS) [7], a widely used cluster scheduler, checks which DAG the next available slot should be allocated to, and then picks (in a breadth-first order) a runnable task from the designated DAG that fits the available resources. Figure 3 shows that CS results in an average job completion time (JCT) and makespan of 2.5T and 3T respectively. Fairness causes the scheduler to interleave the tasks of the two jobs. Tetris happens to produce a similar schedule to CS. Note that this online schedule is far from the preferred per-DAG schedule; only a few of the long-running tasks overlap. Similar to CS, most production schedulers, including Spark, schedule tasks based on some topological ordering of the DAG while using fairness to decide which job to give resources to next. Hence, they behave similarly.

Figure 3 also shows that CPSched has an average JCT and makespan of 3.5T and 4T respectively. This is because CPSched finishes the t1 tasks of both jobs late, since t1 has a small CP length. Therefore the t2 tasks from both jobs do not overlap with any other long task.

Finally, the figure shows that Graphene has an average JCT and makespan of 2T. Graphene achieves this by just enforcing a precedence order within each DAG. In particular, note that all of the schedulers are work-conserving; they leave resources idle only if no schedulable task can fit in those resources. The key difference among the schedulers is the order in which they consider the tasks for scheduling. Another difference is whether this order is computed based only on the runnable tasks (e.g., ordering runnable tasks by their CP length, packing score or breadth-first position) versus based on a global optimization. Informally, Graphene's gains arise from looking at the entire DAG and choosing a globally optimal schedule.

² Tetris' packing score for each task, in descending order, is t0=0.9, t1=0.85, t3=0.84, t2=0.8, t4=0.7 and t5=0.3.

Figure 4: Visualizing a few small production DAGs. The legend is in the top left. See §2.3.

2.3 Analyzing DAGs in Production

To understand the problem with actual DAGs and at scale, we examine (a) the production jobs from a cluster of tens of thousands of servers at Microsoft, (b) jobs from a 200-server Hive [67] cluster and (c) jobs from a Condor cluster [5].

Structural properties: As a preliminary, Figure 4 illustrates some production DAGs at Microsoft. Each circle denotes a stage. By stage, we mean a collection of tasks that perform the same computation on different data (e.g., all map tasks). The size of the circle corresponds to the number of tasks on a logarithmic scale, the circle's color corresponds to the average task duration on a linear scale, and the edge color denotes the type of dependency. We see W-shaped DAGs (bottom left) that join multiple datasets, inverted-V-shaped DAGs (middle) that perform different analyses on a dataset, and more complex shapes (right) wherein multiple datasets are analyzed, leading to multiple outputs. Note also the varying average durations of the tasks (circle colors); the resource variations are not shown for simplicity. Further, note cycles in the DAGs, which are possibly due to self-joins and range-partitions.

theDAGs from theMicroso cluster. Since the x-axis is inlog scale, we put the probability mass for x = 0 at x = 0.1.We see that the median DAG has a depth of ûve. To

compare, a map-reduce job has depth of one. A quarterof the DAGs have depth above ten.

While 40% of the DAGs are trees (i.e., no cycle aerignoring the direction of dependency), we see that manyhave cycles (half of the DAGs have at least 3 cycles); theaverage number of tasks in a cycle is 5 (not shown in the

0

0.2

0.4

0.6

0.8

1

0.1 1 10 100 1000CD

F (F

ract

ion o

f D

AG

s w

ith v

alu

e <

x)

Value

# StagesDepth# Cycles# Edges# Barriers

0

0.2

0.4

0.6

0.8

1

0.1 1 10 100 1000

Fract

ion o

f Task

s w

ith V

alu

e <

x

Value

In degreeOut degree

Figure 5: Characterizing structural properties of the DAGs

ûgure). Tree-like DAGs are an important special case asthey are, theoretically,more tractable to schedule [48].

The DAGs can be quite large; the median job has thousands of tasks and tens of stages. To compare, a map-reduce job has two stages.

We also see that most of the edges are not barriers (e.g., those labeled "can be local"). Note the gap between the orange-stars line and the black-squares line in Figure 5, which correspond to counts of all edges and barriers respectively. A barrier edge indicates that every task in the parent stage should finish before any task in the child stage begins.

We observe that DAGs can be cut into portions such that all tasks after the cut can only begin after every task before the cut has finished. An example cut is shown with a red dashed line on the DAG in Figure 4 (bottom left). Cuts often arise because a dataset, perhaps newly generated by upstream tasks, has to be partitioned before downstream tasks can begin. Cuts are convenient because the optimal schedule for the DAG is a concatenation of the optimal schedules of the cut portions of that DAG. We observe that 24% of the production DAGs can be split into four or more parts.

The median (75th percentile) task in-degree and out-degree are 1 (8) and 3 (20) respectively. For a map-reduce job with m mappers and r reducers, the median in-degree will be 0 if m ≥ r and m otherwise. The larger out-degree is because stages that read from the file system are data-reductive; hence, the query optimizer creates fewer downstream tasks overall.

Overall, we conclude that DAGs are both large and have complex structures.

Diversity in resource demands: Similar to prior work [33, 37], we observed substantial variability in the usage of various resources; the details are in [38].

Potential for improvement: To quantify potential gains, we compare the runtime of production DAGs to two measures. The first, CPLength, is the duration of the DAG's critical path. If the available parallelism is infinite, the DAG would finish within CPLength. The second, TWork, is the total work in the DAG normalized by the cluster share of that DAG (a formula is in Table 1). If there were no dependencies and perfect packing, a DAG would finish within TWork. Figure 6 plots a CDF of the relative gap between the runtime of a DAG and these measures. Half of the jobs have a gap of over 70% for both CPLength and TWork.

Figure 6: CDF of the gap between DAG runtime and several measures (CPLength, TWork and NewLB). Gap is computed as 1 − (measure / DAG runtime).

Understanding the gap: A careful reader would notice that about 15% of the DAGs finish faster than some measures. This is because our production scheduler occasionally gives jobs more than their fair share if the cluster has spare resources; hence, measures which assume that the cluster share will be the minimum guaranteed for the DAG can be larger than the actual completion time. We will ignore such DAGs for this analysis.
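For concreteness, the sketch below shows one way to compute the two measures and the gap from an annotated DAG. The input format (per-task durations, per-resource demands, and the DAG's cluster share) is our own illustration and not the production system's.

def cp_length(duration, children):
    # duration: {task: seconds}; children: {task: [child tasks]}
    # CPLength is the longest duration-weighted path through the DAG
    memo = {}
    def longest_from(v):
        if v not in memo:
            memo[v] = duration[v] + max((longest_from(c) for c in children.get(v, [])), default=0.0)
        return memo[v]
    return max(longest_from(v) for v in duration)

def twork(duration, demands, cluster_share):
    # demands: {task: {resource: amount}}; cluster_share: {resource: this DAG's share}
    # TWork is the largest per-resource total of duration-weighted demand, normalized by the share
    return max(sum(duration[v] * demands[v].get(r, 0.0) for v in duration) / cluster_share[r]
               for r in cluster_share)

def gap(measure, runtime):
    return 1.0 - measure / runtime    # as plotted in Figure 6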

Suppose OPT is the optimal completion time for a DAG given a certain cluster share. We know that the actual runtime is larger than OPT and that the above measures are smaller than OPT. Now, the gap could be due to one of two reasons. (1) The measure is loose (i.e., well below OPT). In practice, we found this to be the case because CPLength ignores all the work off the critical path and TWork ignores dependencies. (2) The observed runtimes of DAGs are inflated by runtime artifacts such as task failures, stragglers and performance interference from other cluster activity [17, 74].

To correct for (2), we discount the effects of runtime artifacts on the above computed DAG runtime as follows. First, we chose the fastest completion time from a group of recurring jobs; it is unlikely that every execution suffers from failures. Second, to correct for stragglers (one or a few tasks holding up job progress), we deduct from the completion time the periods when the job ran fewer than ten concurrent tasks. Note that both these changes reduce the gap; hence they under-estimate the potential gain.

Further, to correct for (1), we develop a new, improved lower bound NewLB that uses the specific structure of data-parallel DAGs. Further details are in [38]; intuitively, NewLB leverages the fact that all the tasks in a job stage (e.g., a map or reduce or join) have similar dependencies, durations and resource needs. The gap relative to NewLB is smaller, indicating that the newer bound is tighter, but the gap is still over 50% for half of the jobs. That is, they take over two times longer than they could.

To summarize, (1) production jobs have large DAGs that are neither a bunch of unrelated stages nor a chain of stages, and (2) a packing+dependency-aware scheduler can offer substantial improvements.

2.4 Analytical Results

Lemma 1 (Dependencies). Any scheduling algorithm, deterministic or randomized, that does not account for the DAG structure (e.g., only schedules currently runnable tasks) is Ω(d) times OPT, where d is the number of resources.

The proof, for deterministic algorithms, follows from designing an adversarial DAG for any scheduler [38]. We extend this to randomized algorithms by using Yao's minimax principle [38].

Lemma 1 applies to the following multi-resource packers [37, 58, 69, 70] since they ignore dependencies.

Lemma 2 (Resource Variation). Schedulers that ignore resource heterogeneity have poor worst-case performance. For example, critical path scheduling can be Ω(n) times OPT, where n is the number of tasks in a DAG.

The proof is by designing adversarial DAGs [38]. Combining these two principles, we conjecture that it is possible to find similar examples for any scheduler that ignores dependencies or ignores resource usages.

To place these results in context, note that d is about 4 (cores, memory, network, disk) and can be larger when tasks require resources at other servers or on many network links. Further, the median DAG has hundreds of tasks (n). The key intuition here is that DAGs are hard to schedule because of their complex structure and because of discretization issues when tasks use multiple resources (fragmentation, task placement, etc.). Graphene is close to OPT on all of the described examples and is within 1.04 times OPT for half of the production DAGs (see §7).

2.5 Acquiring annotated DAGs

Acquiring an annotated DAG is non-trivial. Much prior work has similar requirements as Graphene (see Table 2 in [38]). There are two parts to this: the structure of the DAG and the task profiles (resource needs and durations).

DAG structure: In order to launch a task only after its parent tasks finish, every DAG scheduler is aware of the DAG structure. Furthermore, the DAG is often known before the job starts. Runtime changes to the DAG, if they happen, only affect small portions of a DAG. For example, our scheduler adds an aggregation tree in front of a reduce stage depending upon runtime conditions.

Task resource demands and durations: Graphene requires each task to be annotated with the demands for any resource that could be congested; the other resources do not affect scheduling. Here, we consider four resources (cores, memory, disk and network bandwidth).



Figure 7: Graphene builds schedules per DAG at job submission; the runtime component handles online aspects. AM and RM refer to YARN's application and resource manager components. (Each per-job AM runs a DAG Schedule Constructor; the RM runs the online component with multiple objectives; the two interact via node heartbeats and task assignments.)

Schedulers such as YARN, Mesos, Hive and Spark ask users to annotate their tasks with cores and memory requirements; for example, [1 core, 1 GB] is the default in Hadoop 2.6. Graphene requires annotations for more resources as well as the durations of tasks.

There are some early efforts to obtain these profiles (tasks' demands and durations) automatically. For example, in the production cluster at Microsoft, up to 40% of the resources in the examined cluster are used by recurring jobs; the same script executes periodically on newly arriving data. Recurring jobs can be identified based on the job name (e.g., LogMiner date[ time]) and prior work shows that the task profiles of these jobs can be estimated from history (after normalizing for the size of input) [16]. For the remaining jobs, some prior work builds profiles via sampling [59], program analysis [40], or based on online observations of the actual usages of tasks in the same stage [17]. Our method is described in §6; our sensitivity analysis in §7.4 shows that Graphene is robust to modest amounts of estimation error.

3 Novel ideas in Graphene

Cluster scheduling is the problem of matching tasks to machines. Most production schedulers today do so in an online manner and have very tight timing constraints, since clusters have thousands of servers, many jobs that each have many pending tasks, and tasks that finish in seconds or less [8, 73]. Given such a stringent time budget, carefully considering large DAGs seems daunting.

As noted in §1, a key design decision in Graphene is to divide this problem into two parts. An offline component constructs careful schedules for a single DAG; we call these the preferred schedules. A second, online component enforces the preferred schedules of the various jobs running in the cluster. We elaborate on each of these parts below. Figure 7 shows an example of how the two parts may inter-operate in a YARN-style architecture. Dividing a complex problem into parts and independently solving each part often leads to a sub-optimal solution. While we have no guarantees for our particular division, we note that it scales to large clusters and outperforms the state of the art in experiments.

To find a compact schedule for a single DAG, our idea is to place the troublesome tasks, i.e., those that can lead to a poor schedule, first onto a virtual space. Intuitively, this maximizes the likelihood that any holes, unused parts of the resource-time space, can be filled by other tasks. However, finding the best choice of troublesome tasks is as hard as finding a good schedule for the DAG. We use an efficient search strategy that mimics dynamic programming: it picks subsets that are more likely to be useful and avoids redundant exploration. Furthermore, placing troublesome tasks first can lead to dead-ends. We define a dead-end to be an arrangement of a subset of the DAG in the virtual space on which the remaining tasks cannot be placed without violating dependencies. Our strategy is to divide the DAG into subsets of tasks and place one subset at a time. While intra-subset dependencies are handled directly during schedule construction, inter-subset dependencies are handled by restricting the order in which the various subsets are placed. We prove that the resultant placement has no dead-ends.

Term            Definition
Task t          an atomic unit of execution
Stage s         a group of tasks that run the same code on different data
TWork(s)        max over resources r of (1/C_r) Σ_{t∈s} t.duration × t.demand_r
ExecTime(s)     estimated time to execute the tasks in s
V, G            V denotes all stages (and tasks) in a DAG G
S               a virtual schedule, i.e., a placement of a given DAG of tasks in a resource-time space
C(s,G), P(s,G), D(s,G), A(s,G), U(s,G)
                children, parents, descendants, ancestors and unordered neighbors of s in G;
                note that U(s,G) = V − A(s,G) − D(s,G) − {s}

Table 1: Glossary of terms.

The online component has to coordinate between some potentially discordant directives. Each job running in the cluster offers a preferred schedule for its tasks (constructed as above). Fairness models such as DRF may dictate which job (or queue) should be served next. The set of tasks that is advantageous for packing (e.g., maximal use of multiple resources) can be different from both of the above choices. We offer a simple method to reconcile these various directives. Our idea is to compute a real-valued score for each pending task that incorporates the above aspects softly. That is, the score trades off violations of some directives if the other directives weigh strongly against them. For example, we can pick a task that is less useful from a packing perspective if it appears much earlier in the preferred schedule. One key novel aspect is bounding the extent of unfairness.

The offline component of Graphene is described next; the online component is described in Section 5.

4 Scheduling one DAG

Graphene builds the schedule for a DAG in three steps. Figure 1 illustrates these steps and Figure 8 has simplified pseudocode. First, Graphene identifies some troublesome tasks and divides the DAG into four subsets (§4.1). Second, tasks in a subset are packed greedily onto the virtual space while respecting dependencies (§4.2). Third, Graphene carefully restricts the order in which the different subsets are placed, such that the troublesome tasks go first and there are no dead-ends (§4.3). Graphene picks the most compact schedule after iterating over many choices of troublesome tasks. The resulting schedule is passed on to the online component (§5).

 1 Func: BuildSchedule:
 2   Input: G: a DAG, m: number of machines
 3   Output: an ordered list of tasks t ∈ G
 4   S_best ← ∅                         // best schedule for G thus far
 5   foreach set {T, S, P, C} ∈ CandidateTroublesomeTasks(G) do
 6     Space S ← CreateSpace(m)         // resource-time space
 7     S ← PlaceTasks(T, S, G)          // trouble goes first
 8     S ← TrySubsetOrders({SCP, SPC, CSP, PSC}, S, G)
 9     if S < S_best then S_best ← S    // keep the best schedule
10   return OrderTasks(G, S_best)

Figure 8: Pseudocode for constructing the schedule for a DAG. Helper methods are in Figure 9.

4.1 Searching for troublesome tasks

To identify troublesome tasks, Graphene computes two scores per task. The first, LongScore, divides the task's duration by the maximum duration across all tasks. Tasks with a higher score are more likely to be on the critical path and can benefit from being placed first, because other work can overlap with them. The second, FragScore, reflects the packability of the tasks in a stage (e.g., a map or a reduce). It is computed by dividing the total work in the stage (TWork, defined in Table 1) by the time a greedy packer takes to schedule that stage. Tasks that are more difficult to pack have a lower FragScore. Given thresholds l and f, Graphene picks tasks with LongScore ≥ l or FragScore ≤ f. Intuitively, this biases towards selecting tasks that are more likely to hurt the schedule because they are long or difficult to pack. Each value of {l, f} leads to a choice of troublesome tasks T, which leads to a schedule (after placing the tasks in T first and then the other subsets); Graphene iterates over different values for the l and f thresholds and picks the most compact schedule.

To speed up this search, (1) rather than choosing the threshold values arbitrarily, Graphene picks values that are discriminative, i.e., those that lead to different choices of troublesome tasks, and (2) Graphene remembers the sets of troublesome tasks that were already explored (by previous settings of the thresholds) so that only one schedule is built per troublesome set. Note also that the different choices of troublesome tasks can be explored in parallel. Further improvements are in §4.4.

As shown in Figure 9 (line 9), the set T is a closure over the chosen troublesome tasks. That is, T contains the troublesome tasks and all tasks that lie on a path in the DAG between two troublesome tasks. The parent and child subsets P, C consist of tasks that are not in T but have a descendant or an ancestor in T, respectively. The subset S consists of the remaining tasks.

 1 Func: CandidateTroublesomeTasks:
 2   Input: DAG G;  Output: list L of sets {T, S, P, C}
     // choose a candidate set of troublesome tasks; per choice, divide G into four sets
 3   L ← ∅
 4   ∀v ∈ G, LongScore(v) ← v.duration / max_{v′∈G} v′.duration
 5   ∀v ∈ G, v in stage s, FragScore(v) ← TWork(s) / ExecTime(s)
 6   foreach l ∈ {δ, 2δ, ..., 1} do
 7     foreach f ∈ {δ, 2δ, ..., 1} do
 8       T ← {v ∈ G | LongScore(v) ≥ l or FragScore(v) ≤ f}
 9       T ← Closure(T)
10       if T ∈ L then continue        // ignore duplicates
11       P ← ⋃_{v∈T} A(v, G);  C ← ⋃_{v∈T} D(v, G);  S ← V − T − P − C
12       L ← L ∪ {T, S, P, C}

Figure 9: Identifying various candidates for troublesome tasks and dividing the DAG into four subsets.

4.2 Compactly placing tasks of a subset

Given a subset of tasks and a partially occupied space, how best to pack the tasks while respecting dependencies? Graphene uses the following logic for each of the subsets T, P, S and C. One can choose to place the parents first or the children first. We call these the forward and backward placements respectively. More formally, the forward placement recursively picks a task all of whose ancestors have already been placed on the space and puts it at the earliest possible time after its latest-finishing ancestor. The backward placement is analogously defined. Intuitively, both placements respect dependencies but can lead to different schedules, since greedy packing yields different results based on the order in which tasks are placed. Figure 10: PlaceTasks shows some simplified pseudocode. Traversing the tasks in either placement has O(n log n) complexity for a subset of n tasks and, if there are m machines, placing tasks greedily has O(n log(mn)) complexity.
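To make the forward placement concrete, the sketch below places tasks onto a discretized resource-time space (unit time steps, integer durations, one pooled capacity vector). Helper names such as earliest_fit are ours; Figure 10 remains the authoritative description of the actual procedure.

def earliest_fit(space, capacity, demands, duration, not_before):
    # space: list of per-time-step usage dicts, grown on demand;
    # assumes each demand fits within capacity, so a slot always exists
    t = not_before
    while True:
        while len(space) < t + duration:
            space.append({r: 0.0 for r in capacity})
        if all(space[t + k][r] + demands.get(r, 0.0) <= capacity[r]
               for k in range(duration) for r in capacity):
            for k in range(duration):
                for r in demands:
                    space[t + k][r] += demands[r]
            return t
        t += 1

def place_forwards(tasks, parents, capacity, space, placed):
    # tasks: {name: (duration, demands)}; parents: {name: [parent names]}
    # placed: {name: (start, end)} for tasks already in the space (e.g., the troublesome set T)
    while True:
        ready = [v for v in tasks if v not in placed
                 and all(p in placed for p in parents.get(v, []))]
        if not ready:
            break
        v = max(ready, key=lambda x: tasks[x][0])                  # longest runtime first
        lower = max((placed[p][1] for p in parents.get(v, [])), default=0)
        dur, dem = tasks[v]
        start = earliest_fit(space, capacity, dem, dur, lower)     # earliest time after latest-finishing ancestor
        placed[v] = (start, start + dur)
    return placed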

4.3 Subset orders that guarantee feasibility

For each division of the DAG into subsets T, S, P, C, Graphene considers these four orders: TSCP, TSPC, TPSC or TCSP. That is, in the TSCP order, it first places all tasks in T, then tasks in S, then tasks in C and finally all tasks in P. Intuitively, this helps because the troublesome tasks T are always placed first. Further, other orders may lead to dead-ends. For example, consider the order TPCS; by the time some task t in the subset S is considered for placement, parents of t and children of t may already have been placed, since they may belong to the sets P and C respectively. Hence, it may be impossible to place t without violating dependencies. We prove that the above orders avoid dead-ends and are the only orders beginning with T that do so.

 1 Func: PlaceTasks(V, S, G):
 2   Inputs: V: subset of tasks to be placed, S: space (partially filled), G: a DAG
 3   Output: a new space with the tasks in V placed atop S
 4   return min(PlaceTasksF(V, S, G), PlaceTasksB(V, S, G))

 5 Func: PlaceTasksF:   // forwards placement; inputs and outputs same as PlaceTasks
 6   S ← Clone(S)
 7   finished placement set F ← {v ∈ G | v already placed in S}
 8   while true do
 9     ready set R ← {v ∈ V − F | P(v, G) ⊆ F}
10     if R = ∅ then break              // all done
11     v′ ← task in R with longest runtime
12     t ← max_{v ∈ P(v′, G)} EndTime(v, S)
13     // place v′ at the earliest time ≥ t when its resource needs can be met
14     F ← F ∪ {v′}

15 Func: PlaceTasksB:   // only backwards; analogous to PlaceTasksF.

16 Func: TrySubsetOrders:
17   Input: G: a DAG, S_in: space with tasks in T already placed
18   Output: a space that has the most compact placement of all tasks
19   S1, S2, S3, S4 ← Clone(S_in)
20   return min(                        // pick the most compact among all feasible orders
21     PlaceTasksF(C, PlaceTasksB(P, PlaceTasks(S, S1, G), G), G),    // SPC
22     PlaceTasksB(P, PlaceTasksF(C, PlaceTasks(S, S2, G), G), G),    // SCP
23     PlaceTasksB(P, PlaceTasksB(S, PlaceTasksF(C, S3, G), G), G),   // CSP
24     PlaceTasksF(C, PlaceTasksF(S, PlaceTasksB(P, S4, G), G), G))   // PSC

Figure 10: Pseudocode for the description in §4.2 and §4.3.

Note also that only one of the forwards or backwards placements (described in §4.2) is appropriate for some subsets of tasks. For example, tasks in P cannot be placed forwards, since some descendants of these tasks may already have been placed (such as those in T). As noted above, the forwards placement places a task after its last-finishing ancestor but ignores descendants, and can hence violate dependencies if used for P, because by definition every task in the parent subset P has at least one descendant task. Analogously, tasks in C cannot be placed backwards. Tasks in S can be placed in one or both placements, depending on the inter-subset order. Finally, since the tasks in T are placed onto an empty space, they can be placed either forwards or backwards; details are in Figure 10: TrySubsetOrders. We prove the following:

Lemma 3. (Correctness) Our method in §4.1–§4.3 satisfies dependencies and avoids dead-ends. (Completeness) The method explores every order that places troublesome tasks first and is free of dead-ends.

Intuitively, the proof (omitted for space) follows from: (1) all four subsets are closed, and hence intra-subset dependencies are respected by the placement logic in §4.2, whether in the forward or in the backward placement; (2) the inter-subset orders, and the corresponding restrictions to only use forwards and/or backwards placements specified in §4.3, ensure that dependencies across subsets are respected; and (3) every other order that begins with T can lead to dead-ends.

4.4 Enhancements

We note a few enhancements. First, as noted in §2.3, it is possible to partition a DAG into parts that are totally ordered. Hence, any schedule for the DAG is a concatenation of per-partition schedules. This lowers the complexity of schedule construction; 24% of the production DAGs can be split into four or more parts. Second, and along similar lines, whenever possible we reduce complexity by reasoning over stages. Stages are collections of tasks and are 10 to 10³ times fewer in number than tasks. Third, schedule computation can be sped up in a few ways. Parallelizing the search will help the most, i.e., examining different choices for the troublesome tasks T in parallel. Working over more compact representations (e.g., scaling down the DAG and the cluster by a corresponding amount) will also help. Fourth, jobs that are short-lived, or only use a small amount of resources, or do not have complex DAG structures, bypass the offline portion of Graphene. Fifth, the complexity of schedule construction is independent of the sizes of the subsets T, S, P, C that Graphene divides the DAG into. However, if |T| is very large, the approach of placing troublesome tasks first and other tasks carefully around them is unlikely to help. We prune such choices of T without further exploration. Among the schedules built by Graphene for production DAGs, the median DAG has 17% of its tasks considered troublesome; these tasks contribute 32% of the work in that job. Finally, note that it is possible to recursively employ this logic: i.e., given a DAG G, pick a troublesome subset T, let G′ be the sub-DAG over the tasks in T, and repeat the logic on G′. We defer further examination of this approach to future work.

Figure 11: The various aspects considered by Graphene's online component when matching tasks to machines.

5 Scheduling many DAGs

Given the preferred schedules for each job, we describe how the Graphene inter-job scheduler matches tasks to machines online. Recall the example in Figure 3. The scheduling procedure is triggered when a machine m reports its vector of available resources to the cluster-wide resource manager. Given a set of runnable jobs (and their tasks), the scheduler returns a list of tasks to be allocated on that machine. The challenge is to enforce the per-job order computed in §4 while also packing tasks for cluster efficiency, ensuring low JCTs, and enforcing fairness.

5.1 Inter-job Scheduler

Enforcing preferred schedules. Using the per-DAG schedule constructed in §4, a tpriScore is computed for each task t by (1) ranking tasks in increasing order of their start time in the schedule and (2) dividing the rank by the number of tasks in the DAG, so that the result lies between 1 (for the task that begins first) and 0. As noted below, Graphene preferentially schedules tasks with a higher tpriScore first.

Packing efficiency. Graphene borrows ideas from [37] to improve packing efficiency. For every task, it computes a packing score pScore_t as the dot product between the task's demand vector and the machine's available-resource vector. To favor local placement, when remote resources are needed, pScore_t is reduced by multiplying it with a remote penalty rp (∈ [0, 1]). Sensitivity analysis on the value of rp is in §7.4.

Job completion time. Graphene estimates the remaining work in a job j similarly to [37]; srpt_j is a sum, over the remaining tasks to schedule in j, of the product of their duration and resource demands. A lower score implies less work remaining in the job j.

Bounding unfairness. Graphene trades off fairness for better performance while ensuring that the maximum unfairness is below an operator-configured threshold. Specifically, Graphene maintains deficit counters [64] across jobs to measure unfairness. The deficit counters are updated as follows. When a task t from a group g is scheduled, the deficit of g increases by factor_t × (fairShare_g − 1) and the deficit of every other group g′ increases by factor_t × fairShare_g′. This update lowers the deficit counter of g in proportion to the resources allocated to it and increases the deficit counters of the other groups to remember that they were treated unfairly. Further, by varying the value of factor_t, Graphene can support different fairness schemes: e.g., factor_t = 1 mimics slot fairness and factor_t = demand of the dominant resource of g mimics DRF [33].

Combining schedule order, packing, completion time and fairness. Graphene attempts to simultaneously consider the above four aspects; as shown in Figure 11, some of the aspects vary with the task while others vary across jobs. First, Graphene combines the performance-related aspects into a single score, perfScore_t = pScore_t · tpriScore − η · srpt_j. η is a parameter that is automatically updated based on the average srpt and pScore across jobs and tasks. Subtracting η · srpt_j prefers shorter jobs. Sensitivity analysis on the value of η is in §7.4. Intuitively, the combined value perfScore_t softly enforces the various objectives. For example, if a task t is preferred by all individual objectives (it belongs to the shortest job, is the most packable, and is next in the preferred schedule), then it will have the highest perfScore_t. When the objectives are discordant, colloquially, the task preferred by a majority of objectives will have the highest perfScore_t.

Next, to trade off performance while bounding unfairness, let the most unfairly treated group (the one with the highest deficit counter) be g_unfair. If the deficit counter of g_unfair is below the unfairness threshold, then Graphene picks the task with the maximum perfScore from among all groups; else it picks the task with the maximum perfScore from g_unfair. The unfairness threshold is κC, where κ (< 1) is a tunable parameter and C is the cluster capacity.

Further details, including pseudocode, are in [38].
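The sketch below condenses the scoring and the bounded-unfairness selection described above. Function and variable names are illustrative; the actual logic lives inside the YARN resource manager and handles many more details.

def tpri_score(rank, num_tasks):
    # rank 0 for the task that starts first in the preferred schedule; score near 1 for it, near 0 for the last
    return 1.0 - rank / num_tasks

def perf_score(p_score, tpri, srpt_j, eta):
    # soft combination of packing and schedule priority, minus a penalty for jobs with more remaining work
    return p_score * tpri - eta * srpt_j

def update_deficits(deficits, fair_share, scheduled_group, factor):
    # charge the group that received the task; credit every other group (see the update rule in §5.1)
    for g in deficits:
        if g == scheduled_group:
            deficits[g] += factor * (fair_share[g] - 1.0)
        else:
            deficits[g] += factor * fair_share[g]

def pick_task(candidates, deficits, unfairness_threshold):
    # candidates: list of (group, task, perf_score) tuples
    if not candidates:
        return None
    g_unfair = max(deficits, key=deficits.get)
    pool = candidates
    if deficits[g_unfair] >= unfairness_threshold:
        restricted = [c for c in candidates if c[0] == g_unfair]
        if restricted:                 # fall back to all groups if the unfair group has nothing runnable
            pool = restricted
    return max(pool, key=lambda c: c[2])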

6 Graphene System

We have implemented the runtime component (§5) in the Apache YARN resource manager (RM) and the schedule constructor (§4) in the Apache Tez application master (AM). Our (unoptimized) schedule constructor finishes in tens of seconds on the DAGs used in experiments; this is in the same ballpark as the time to compile and query-optimize these DAGs. Recurring jobs use previously constructed schedules. Each DAG is managed by an instance of the Tez AM, which closely resembles other frameworks such as FlumeJava [25] and Dryad [43]. The per-job AMs negotiate with the YARN RM for containers to run the job's tasks; each container is a fixed amount of various resources. As part of implementing Graphene, we expanded the interface between the AM and RM to pass additional information, such as the job's pending work and the tasks' demands, durations and preferred order. Here, we describe some key aspects.

6.1 DAG Annotations

Recall from §2.5 that Graphene requires a more detailed annotation of DAGs than existing systems: specifically, it needs task durations and estimates of network and disk usage; the usages of cores and memory are already available [8, 67, 73].

Our approach is to construct estimates for the average task in each stage using a combination of historical data and prediction. These estimates are used by the offline portion of Graphene (§4). As noted by prior work, recurring jobs are common in our production clusters, and historical usages, after normalizing for the change in data volume, are predictive for such job groups [16]. The online portion of Graphene (§5) refines these estimates based on the actual work of a task (e.g., by noting its input size) and based on the executions of earlier tasks, since (a) tasks in the same stage often run in multiple waves due to capacity limits and (b) running tasks issue periodic progress reports [8, 17].

In our evaluation, we execute the jobs once and use the actual observed usage (from job history) to compute the necessary annotations. We normalize both the duration and usage estimates by the tasks' input size, as appropriate. A sensitivity analysis that introduces different amounts of error into the estimates and shows their effect on performance is in §7.4.
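A minimal sketch of this kind of history-based annotation is shown below: per-stage rates are learned from prior runs and scaled by a new task's input size. The data layout is ours, and treating every usage as linear in input size is a simplification of the "as appropriate" normalization above.

def build_stage_profile(history):
    # history: list of (input_bytes, duration_seconds, usage dict) from earlier runs of this stage
    n = len(history)
    duration_per_byte = sum(d / b for b, d, _ in history) / n
    resources = history[0][2].keys()
    usage_per_byte = {r: sum(u[r] / b for b, _, u in history) / n for r in resources}
    return duration_per_byte, usage_per_byte

def annotate_task(profile, input_bytes):
    # predict the duration and per-resource usage of a new task from its input size
    duration_per_byte, usage_per_byte = profile
    return (duration_per_byte * input_bytes,
            {r: rate * input_bytes for r, rate in usage_per_byte.items()})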

We observe that Graphene is rather robust to estimation error because relatively small differences in tasks' durations and usages do not change the schedule. For example, while it is useful to know that reduce and join tasks are network-heavy, as opposed to map tasks which have no network usage, it is less useful to know precisely how much network usage a reducer or a join task will have; the actual usage will vary at runtime in any case, due to contention, thread or process scheduling, etc. Similarly, while it is useful to know that tasks in a certain stage will take ten times longer, on average, and hence that it is better to overlap those tasks with unrelated work, it is less useful to know the exact duration of a task; again, the exact durations will vary because of contention, machine-specific slowdowns, etc. [17].

6.2 Efficient online matching

Naively implementing our runtime component (§5) would improve schedule quality at the cost of delaying scheduling. We use bundling to offset this issue.

Some background: The matching logic in typical schedulers is heartbeat-based [8]. When a machine heartbeats to the RM, the allocator (0) maintains an ordering over pending tasks, (1) picks the first appropriate task to allocate to that machine, (2) adjusts its data structures (such as re-sorting/re-scoring) and (3) repeats these steps until all resources on the node have been allocated or all allocation requests have been satisfied. A naive implementation of the runtime component would examine all the pending tasks, thereby increasing the time to match.

Instead, we propose to bundle the allocations. Specifically, rather than breaking out of the loop after finding the first schedulable task (step 1 above), we keep a bundle of tasks that could all potentially be scheduled on the machine. At the end of one pass, we assign multiple tasks by choosing from among those in the bundle, as sketched below.
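The sketch assumes tasks carry a demands dict and that score ranks candidates (e.g., by perfScore); the names are ours, and the real change is a refactoring of YARN's allocation loop rather than this standalone function.

def fits(task, available):
    return all(task.demands[r] <= available.get(r, 0.0) for r in task.demands)

def assign(task, available):
    for r, d in task.demands.items():
        available[r] -= d

def bundled_allocate(pending, available, score, max_bundle):
    # one pass over the pending tasks collects a bundle of schedulable candidates ...
    bundle = []
    for task in pending:
        if fits(task, available):
            bundle.append(task)
            if len(bundle) >= max_bundle:   # the pass can also stop once the bundle is "good enough"
                break
    # ... then several tasks are assigned at once, best score first
    assigned = []
    for task in sorted(bundle, key=score, reverse=True):
        if fits(task, available):           # re-check: earlier assignments consumed resources
            assign(task, available)
            assigned.append(task)
    return assigned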

The bundle amortizes the cost of examining the pending tasks. We can allocate multiple tasks in one pass as opposed to one pass per task. It is also easy to see that bundling admits non-greedy choices and that the pass can be terminated early when the bundle has good-enough tasks. We have refactored the YARN scheduler with configurable choices for (1) which tasks to add to the bundle, (2) when to terminate bundling and (3) which tasks to pick from the bundle. From conversations with Hadoop committers, these code changes help improve matching efficiency and code readability.

6.3 Co-existing with other features

We note that a cluster scheduler performs other roles besides matching tasks to machines. Several of these roles, such as handling outliers and failed tasks differently [17, 74], delay scheduling [72], reservations [12, 30], or supporting heterogeneous clusters where only some servers may have GPUs [11], are implemented as preconditions to the main schedule loop, i.e., they are checked first, or are implemented by partitioning the tasks that will be considered in the scheduling loop. Since Graphene's changes only affect the inner core of the schedule loop (e.g., given a set of pending tasks, which subset to allocate to a machine), our implementation co-exists with these features.

7 Evaluation

Our key evaluation results are as follows.
(1) In experiments on a 200-server cluster, relative to Tez jobs running on YARN, Graphene improves the completion time of half of the jobs by 19% to 31% across various benchmarks. 25% of the jobs improve by 30% to 49%. The extent of the gains depends on the workload (complexity of the DAGs, resource usage variations, etc.).
(2) On over 20,000 DAGs from production clusters, the schedules constructed by Graphene are faster by 25% for half of the DAGs. A quarter of the DAGs improve by 57%. Further, by comparing with our new lower bound, these schedules are optimal for 40% of the jobs and within 13% of optimal for 75% of the jobs.
(3) By examining further details, we show that the gains come from better packing of dependent tasks. Makespan (and cluster throughput) improve by a similar amount. More resources are used, on average, by Graphene, and trading off short-term unfairness improves performance.
(4) We also compare with several alternative schedulers and offer a sensitivity analysis to cluster load, various parameter choices, and annotation errors.

7.1 Setup

Our experimental clusterhas 200 serverswith two quad-core IntelE2550 processors (hyperthreading enabled), 128GB RAM, 10 drives, and a 10Gbps network interface. _enetwork has a congestion-free core [39].Workload: Our workload mix consists of jobs frompublic benchmarks—TPC-H [14], TPC-DS [15], Big-Bench [4], and jobs from a production cluster that runsHive jobs (E-Hive). We also use 20K DAGs from a pri-vate production system in our simulations. In each exper-imental run, job arrival is modeled as a Poisson processwith average inter-arrival time of 25s for 50minutes. Eachjob is picked at random from the corresponding bench-mark. We built representative inputs and varied input sizefromGBs to tens of TBs such that the average query com-pletes in a few minutes and the longest query ûnishes inunder ten minutes on the idle cluster. A typical experi-ment run has about 120 jobs. _e results presented arethemedian over three runs.Compared Schemes: We experimentally compareGraphene with the following baselines: (1) Tez ∶ breadth-ûrst order of tasks in the DAG running atop YARN’sCapacity Scheduler (CS), (2) Tez+CP ∶ critical pathlength based order of tasks in the DAG atop CS and (3)Tez+Tetris ∶ breadth-ûrst order of tasks in the DAGatop Tetris [37]. To tease apart the gains from the of-ine and online components, we also oòer results for (4)Tez+G+CS and (5) Tez+G+Tetris which use the of-ine constructed schedules at the jobmanager (to requestcontainers in that order) but the online components areagnostic to the desired schedule (either the default ca-pacity scheduler or Tetris respectively). Using simula-tions, we also compare Graphene against the followingschemes: (6) BFS ∶ breadth ûrst order, (7) CP ∶ criticalpath order, (8) Random order, (9) StripPart [20], (10)

90 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

Page 12: Graphene: Packing and Dependency-Aware Scheduling for ...Graphene: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters Robert Grandl, Microsoft and University of Wisconsin—Madison;

(a) CDF of gains for jobs on the TPC-DS workload [plot: x-axis is the reduction in job duration against Tez default (%), y-axis is the fraction of DAGs; lines for Graphene, Tez+Tetris, and Tez+CP]

             50th percentile        75th percentile
Workload     G      T+C    T+T      G      T+C    T+T
TPC-DS       27.8   4.1    6.5      45.7   8.9    16.6
TPC-H        30.5   3.8    8.9      48.3   7.7    15.0
BigBench     25.0   6.4    6.2      33.3   21.7   18.5
E-Hive       19.0   1.0    5.8      29.7   4.5    14.2

G stands for Graphene. T+C and T+T denote Tez+CP and Tez+Tetris respectively (see §7.1). The improvements are relative to Tez.

(b) Improvements in JCT across all the workloads

Figure 12: Comparing completion time improvements of various schemes relative to Tez.

(a) Running tasks [plot: x-axis is time (1000 s), y-axis is running tasks (100s); lines for Tez, Tez+CP, Tez+Tetris, and Graphene]
(b) Graphene, (c) Tez+Tetris, (d) Tez+CP [plots: x-axis is time (1000 s), y-axis is utilization (%); lines for CPU, MEM, N/R, and D/R]
Figure 13: For a cluster run with 200 jobs, a time lapse of how many tasks are running (leftmost) and how many resources are allocated by each scheme. N/R represents the amount of network read, D/R the disk read and D/W the corresponding disk write.

All of the above schemes except (9) are work-conserving. (6)–(8) and (10) pick only from among the runnable tasks but vary in the specific heuristic. (9) and (11) perform more complex schedule construction, as we discuss later.
Metrics: Improvement in JCT is our key metric. Between two schemes, we measure the normalized gap in JCTs, that is, the difference in the runtime of a job divided by the job runtime; the normalization lets us compare jobs with very different runtimes. We also measure makespan, i.e., the time to finish a given set of jobs, Jain's fairness index [45], and the actual usage of various resources in the cluster.
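To pin down the arithmetic behind these metrics, the sketch below computes the normalized JCT gap of one job and Jain's fairness index over a set of allocations; the helper names are ours, and this is only one way to express the definitions above.

def normalized_jct_gap(runtime_baseline_s, runtime_scheme_s):
    """Normalized gap in JCT for one job between a baseline and a scheme.

    Positive values mean the scheme finished the job faster than the
    baseline; dividing by the baseline runtime lets jobs of very different
    sizes be compared on one CDF.
    """
    return (runtime_baseline_s - runtime_scheme_s) / runtime_baseline_s


def jains_fairness_index(allocations):
    """Jain's fairness index over per-queue (or per-job) allocations.

    Returns a value in (0, 1]; 1 indicates a perfectly even split.
    """
    n = len(allocations)
    total = sum(allocations)
    sum_sq = sum(x * x for x in allocations)
    return (total * total) / (n * sum_sq) if sum_sq > 0 else 1.0

# Example: a 100s job that now takes 75s improves by 0.25 (25%);
# allocations of [6, 4] across two queues score about 0.96 on Jain's index.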

7.2 How does Graphene do in experiments?

Job Completion Time: Relative to Tez, Figure 12 shows that Graphene improves half of the DAGs by 19 to 31%; the extent of gains depends on the workload and varies across benchmarks. A quarter of the DAGs improve by 30 to 49%. We see occasional regressions. Up to 5% of the jobs in the TPC-DS benchmark slow down with Graphene; the maximum slowdown is 16%. We found this to primarily happen on the shorter jobs and believe it is due to noise from runtime artifacts such as stragglers and task failures [17]. The table in Figure 12 shows the results for all the benchmarks; we see that DAGs from E-Hive see the smallest improvement (19% at the median) because the DAGs here are mostly two-stage map-reduce jobs. The other benchmarks have more complex DAGs and hence receive larger gains.

Relative to the alternatives, Figure 12 shows that Graphene is 15% to 34% better. Tez+CP achieves only marginal gains over Tez, hinting that critical path scheduling does not suffice. The exception is the BigBench dataset, where about half the queries are dominated by work on the critical path. Tez+Tetris comes closest to Graphene because Tetris' packing logic reduces fragmentation. The gap is still substantial since Tetris ignores dependencies. In fact, we see that Tez+Tetris does not consistently beat Tez+CP. Our takeaway is that considering both dependencies and packing substantially improves DAG completion time.

Where do the gains come from? Figure 13 offers more detail on an example experimental run. Graphene keeps more tasks running on the cluster and hence finishes faster (Figure 13a). The other schemes take over 20% longer. Graphene runs more tasks by reducing fragmentation and by overbooking resources such as network and disk that do not lose goodput when demand exceeds capacity (unlike, say, memory). Comparing Figure 13b with Figures 13c and 13d, the average allocation of all resources is higher with Graphene. Occasionally, Graphene allocates over 100% of the network and disk. One caveat about our measurement methodology here: we take the peak usage of a task and assume that the task held on to those resources for the entirety of its lifetime; hence, the usages are over-estimates for all schemes. Tez+Tetris, the closest alternative, has fewer tasks running at all times because (a) it does not overbook (resource usages are below 100% in Figure 13c) and (b) it has a worse global packing for a DAG because it ignores dependencies and packs only the runnable tasks. Tez+CP is impacted negatively by two effects: (a) ignoring disk and network usage leads to arbitrary over-allocation (the "total" resource usage is higher because, due to saturation, tasks hold on to allocations for longer) and (b) due to fragmentation, many fewer tasks run on average. Overall, Graphene gains by increasing task throughput.
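To illustrate the overbooking behavior noted above, the following sketch shows one way an allocation check could treat network and disk as overbookable while keeping CPU and memory as hard limits; the overbooking factor and the function name fits are our own illustrative assumptions, not Graphene's actual code.

# Assumed resource vectors: dicts such as {"cpu": ..., "mem": ..., "net": ..., "disk": ...}.
FUNGIBLE = {"net", "disk"}   # exceeding capacity only degrades goodput gradually
OVERBOOK_FACTOR = 1.25       # illustrative cap on how far fungible resources may be overbooked

def fits(task_demand, machine_used, machine_capacity):
    """Return True if the task can be placed on the machine.

    CPU and memory are hard limits; network and disk may be overbooked up to
    OVERBOOK_FACTOR times capacity, mirroring the behavior in Figure 13 where
    allocations occasionally exceed 100%.
    """
    for resource, demand in task_demand.items():
        limit = machine_capacity[resource] * (OVERBOOK_FACTOR if resource in FUNGIBLE else 1.0)
        if machine_used[resource] + demand > limit:
            return False
    return True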



Workload   Tez+CP   Tez+Tetris   Graphene
TPC-DS     +2.1%    +8.2%        +30.9%
TPC-H      +4.3%    +9.6%        +27.5%

Table 2: Makespan, gap from Tez.

Workload   Scheme       2Q vs. 1Q    Jain's fairness index
                        Perf. Gap    10s     60s     240s
TPC-DS     Tez          −13%         0.82    0.86    0.88
           Tez+DRF      −12%         0.85    0.89    0.90
           Tez+Tetris   −10%         0.77    0.81    0.92
           Graphene     +2%          0.72    0.83    0.89

Table 3: Fairness: shows the performance gap and Jain's fairness index when used with 2 queues (even share) versus 1 queue. Here, a score of 1 indicates perfect fairness.

Makespan: To evaluate makespan, we make one change to the experiment setup: all jobs arrive within the first few minutes. Everything else remains the same. Table 2 shows the gap in makespan for different cases. Due to careful packing, Graphene sustains high cluster resource utilization, which in turn enables jobs to finish quickly: makespan improves 30% relative to Tez and over 20% relative to the alternatives.
Fairness: Can we improve performance while also being fair? Intuitively, fairness may hurt performance since fairly dividing resources may lower overall utilization or slow down some jobs. To evaluate fairness, we make one change to the experiment setup: the jobs are evenly and randomly distributed among two queues, and the scheduler has to divide resources evenly.

Table 3 reports the gap in performance (median JCT) for each scheme when run with two queues vs. one. Tez, Tez+DRF and Tez+Tetris lose over 10% in performance relative to their one-queue counterparts. The table shows that with two queues, Graphene has a small gain (perhaps due to experimental noise). Hence, relatively, Graphene performs even better than the alternatives if given more queues. But why? Table 3 also shows Jain's fairness index computed over 10s, 60s and 240s windows. We see that Graphene is less fair at short timescales but is indistinguishable at larger time windows. This is because Graphene bounds unfairness (§5); it leverages short-term slack from precise fairness to make scheduling choices that improve performance.
Value of enforcing preferred schedules online: Recall that Graphene's online component enforces the preferred schedules constructed by the offline component. To tease apart the value of this combination, we consider alternatives wherein the job managers use the preferred schedules (to request containers in that order) but the cluster scheduler is agnostic; i.e., it simply runs the default capacity scheduler or Tetris (we call these Tez+G+CS and Tez+G+Tetris respectively). We find that Graphene offers 26% and 28% better median JCT compared to Tez+G+Tetris and Tez+G+CS. This experiment was conducted on a smaller 50-server cluster with different hardware, so these numbers are not directly comparable with the remaining experiments; we offer them merely as a qualitative validation of Graphene's combination of online and offline components.

(a) Graphene vs. Baselines [plot: x-axis is the reduction in job duration (%), y-axis is the fraction of DAGs; lines for Random, Tetris, Graphene, and New lb]
(b) Graphene vs. Alternates [plot: x-axis is the reduction in job duration (%), y-axis is the fraction of DAGs; lines for Graphene, Crit.Path, and Fit all]
Figure 14: Comparing Graphene with other schemes. We removed the lines for CG and StripPart from the right figure because they hug x = 0; see Table 4.

                              25th   50th   75th   90th
Graphene                       7      25     57     74
Random                        −2       0      1      4
Crit.Path       Fit cpu/mem   −2       0      2      1
                Fit all        1       4     13     16
Tetris          Fit all        0       7     29     42
Strip Part.     Fit all        0       1     12     27
Coffman-Graham  Fit all        0       1     12     26
                Fit cpu/mem   −2       0      0      2

Table 4: Reading out the gaps from Figure 14. Each entry is the improvement relative to BFS.


7.3 Comparing with alternatives

We use simulations to compare a wider set of algorithms (§7.1) on the much larger DAGs that ran in the production clusters. We mimic the actual dependencies, task durations and resource needs from the cluster.
Figure 14 compares the schedules constructed by Graphene with the schedules from other algorithms. Table 4 reads out the gaps at various percentiles. We observe that Graphene's gains at the end of schedule construction are about the same as those at runtime (Figure 12). This is interesting because the runtime component only softly enforces the desired schedules from all the jobs running simultaneously in the cluster. It appears that any loss in performance from not adhering to the desired schedule is made up by the gains from better packing (across DAGs) and trading off some short-term unfairness.

Second, Graphene's gains are considerable compared to the alternatives. CP and Tetris are the closest. The reason is that Graphene looks at the entire DAG and places the troublesome tasks first, leading to a more compact schedule overall.

Third, when tasks have unit durations and nicely shaped demands, CG (Coffman-Graham [29]) is at most 2 times optimal. However, it does not perform well on the heterogeneous DAGs seen in production. Some recent extensions of CG to handle heterogeneity ignore fragmentation when resources are divided across machines [49].
Fourth, StripPart [20] combines resource packing and task dependencies and has the best-known approximation ratio: O(log n) on a DAG with n tasks [20]. The key idea is to partition tasks into levels such that all dependencies go across levels and then to tightly pack each level (a minimal sketch of such a level partition follows).
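To make the level-partitioning idea concrete, here is a minimal sketch that assigns each task its longest-path depth so that every dependency crosses a level boundary and tasks within a level are mutually independent; this rendering (and the name partition_into_levels) is ours, not the algorithm as stated in [20].

def partition_into_levels(tasks, parents):
    """Assign each task the longest-path depth from the DAG's sources.

    parents[t] lists the tasks that t depends on. Because a task's level is
    strictly greater than each parent's level, every dependency edge crosses
    levels, so tasks within one level can be packed together.
    """
    level = {}

    def depth(t):
        if t not in level:
            level[t] = 1 + max((depth(p) for p in parents.get(t, [])), default=-1)
        return level[t]

    for t in tasks:
        depth(t)

    by_level = {}
    for t, l in level.items():
        by_level.setdefault(l, []).append(t)
    return by_level

# Example: tasks a and b are independent; c depends on both.
# partition_into_levels(["a", "b", "c"], {"c": ["a", "b"]}) -> {0: ["a", "b"], 1: ["c"]}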



(a) Job Duration, (b) Makespan [plots: x-axis is the parameter value (curves for m and rp), y-axis is the reduction (%) in average job completion time and in makespan, respectively]
Figure 15: Graphene - sensitivity analysis.

We find that StripPart under-performs simpler heuristics in practice because (a) independent tasks that happen to fall into different levels cannot be packed together, leading to wasted resources between levels, and (b) the recommended per-level packers (e.g., [60]) do not support multiple resources and vector packing [58].

How close is Graphene to Optimal? Comparing Graphene with NewLB, we find that Graphene is optimal for about 40% of the DAGs. For half (three quarters) of the DAGs, Graphene is within 4% (13%) of the new lower bound. A gap still remains: for 10% of DAGs, Graphene takes 25% longer. Manually examining these DAGs shows that NewLB is loose for most of them (deriving a tighter lower bound is an open problem). In sum, Graphene is close to optimal for most of the production DAGs.

7.4 Sensitivity Analysis

Packing vs. Shortest Remaining Processing Time (SRPT): Recall that we combine the packing score and SRPT using a weighted sum with η (§5). Let η be m times the average over the two expressions that it combines. Figure 15 shows the reduction in average JCT (left) and makespan (right) for different values of m. Values of m ∈ [0.1, 0.3] have the most gains. Lower values lead to worse average JCT because the effect of SRPT reduces; larger values lead to moderately worse makespan. Hence, we recommend m = 0.2.
Remote Penalty: Graphene uses a remote penalty rp to prefer local placement. Our analysis shows that both JCT and makespan improve the most when rp ∈ [0.7, 0.85] (Figure 15). Since rp is a multiplicative penalty, lower values of rp cause the scheduler to miss (non-local) scheduling opportunities, whereas higher rp can over-use remote resources. We use rp = 0.8 (a sketch of how m and rp combine into the task score appears after the Cluster Load discussion below).
Cluster Load: We vary cluster load by reducing the number of available servers without changing the workload. Figure 16 shows the JCTs and makespan for a query set derived from TPC-DS. Both Graphene and the alternatives offer more gains at higher loads. This is because the need for careful scheduling and packing increases when resources are scarce. Gains due to Graphene increase by +10% at 2× load and by +15% at 6× load. Further, the gap between Graphene and the alternatives remains similar across load levels.
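As one concrete reading of how these two knobs enter the scoring, the sketch below combines a packing score and an SRPT term with η = m × (average of the two expressions) and applies the multiplicative remote penalty to non-local placements; the exact expressions being combined and the name task_score are simplifications of §5 on our part, not Graphene's implementation.

def task_score(packing_score, srpt_score, avg_packing, avg_srpt,
               local, m=0.2, remote_penalty=0.8):
    """Score a (task, machine) pair for the online scheduler.

    srpt_score should be larger for jobs with less remaining work (e.g., the
    negative of remaining work), so adding it favors SRPT. eta scales that
    term to be comparable with the packing term: eta = m * average of the two
    expressions. Non-local placements are discounted multiplicatively by
    remote_penalty, so local slots are preferred but remote ones are not
    refused. Recommended settings from the sensitivity analysis: m = 0.2,
    remote_penalty = 0.8.
    """
    eta = m * (avg_packing + avg_srpt) / 2.0
    score = packing_score + eta * srpt_score
    if not local:
        score *= remote_penalty
    return score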

(a) Job Duration, (b) Makespan [plots: x-axis is the multiple of the original load, y-axis is the reduction (%) in average job completion time and in makespan, respectively; lines for Graphene, Tez+Tetris, and Tez+CP]
Figure 16: Graphene's gains increase with cluster load.

[Plots: CDF over DAGs of the fractional change in JCT for CPSched, Graphene, and Tetris; one line per error range, e.g., [−0.75 : −0.50], [−0.50 : −0.25], [0.25 : 0.50], [0.50 : 0.75].]
Figure 17: Fractional change in the job completion time (JCT) of DAGs with various schedulers when task durations and resource profiles are mis-estimated.

Impact of mis-estimations: We offer to each scheduler an inaccurate task duration and resource usage vector but have the underlying execution use the true values. Hence, the schedulers match tasks to machines based on imperfect estimates. Once scheduled, the tasks may finish after a different duration or use different amounts of resources. When the total resource demand crosses machine capacity, we delay the completion of tasks further by a proportional amount. Figure 17 shows a CDF of the change in the completion time of the production DAGs for different schedulers. Each line denotes a different amount of error. For example, the red triangle line labeled [−0.75 : −0.50] corresponds to picking a random number in that range for each stage and then changing the task durations and resource needs fractionally by that random number (−0.75 indicates a 75% lower value). We see that the impact of mis-estimates is rather small; Graphene changes roughly similarly to the other schedulers. Under-estimates tend to speed up the job because the scheduler over-allocates tasks, but over-allocation can also slow down jobs. Over-estimates delay jobs because the scheduler wastes resources; it may refrain from allocating a task when its needs appear larger than the available resources at a machine. Overall, Graphene appears robust to mis-estimations.
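A minimal sketch of this perturbation, under our reading of the setup: per stage, draw one fraction from the chosen error range and scale that stage's task durations and resource demands by it; the function perturb_profiles and its argument layout are illustrative, not our actual simulator code.

import random

def perturb_profiles(stages, error_range, seed=0):
    """Mis-estimate task profiles per stage.

    stages maps a stage name to a list of (duration, resources) tuples, where
    resources is a dict of per-resource demands. For each stage we draw one
    fraction f in error_range (e.g., (-0.75, -0.50)) and scale every task's
    duration and demands by (1 + f); the scheduler sees the perturbed values
    while execution uses the true ones.
    """
    rng = random.Random(seed)
    perturbed = {}
    for stage, tasks in stages.items():
        f = rng.uniform(*error_range)
        perturbed[stage] = [
            (duration * (1 + f), {r: v * (1 + f) for r, v in resources.items()})
            for duration, resources in tasks
        ]
    return perturbed

# Example: errors drawn from [-0.75, -0.50] make every estimate 50-75% smaller.
# estimates = perturb_profiles(dag_stages, (-0.75, -0.50))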

8 Applying Graphene to other domains

We evaluate Graphene's effectiveness in scheduling DAGs that arise in distributed compilation jobs [3, 32, 34] and Internet service workflows [46].
Distributed build systems speed up the compilation of large code bases [3, 34]. Each build is a DAG with dependencies between the various tasks (compilation, linking, test, code analysis).



(a) Distributed Build Systems: Compilation time, (b) Request-response workflows: Query latency [bar plots: x-axis is DAG size (100x), y-axis is the reduction (%); bars for G vs. T and G vs. CP]
Figure 18: Comparing Graphene (G) with Tetris (T) and Critical path scheduling (CP) on DAGs from other domains.

The tasks have different runtimes and different resource profiles. Figure 18a shows that Graphene is 20% (30%) faster than Tetris (CP) when scheduling the build DAGs from a production distributed build system [32]. Each bar is centered on the median gain for DAGs of a certain size; the error bars are quartiles.
We also examine the DAGs that arise in datacenter-side workflows for Internet services [46]. For instance, a search query translates into a workflow of dependent RPCs at the datacenter (e.g., spell check before index lookup, video and image lookup in parallel). The RPCs use different resources, have different runtimes and often run on the same server pool [46]. Over several workflows from a production service, Figure 18b shows that Graphene improves upon the alternatives by about 24%. These encouraging early results hint that Graphene may be more broadly useful.

9 Related Work

To structure the discussion, we ask four questions: (Q1) does a scheme consider both packing and dependencies, (Q2) does it make realistic assumptions, (Q3) is it practical to implement in cluster schedulers, and (Q4) does it consider multiple objectives such as fairness? Graphene is unique in positively answering these four questions.
Q1: NO. Substantial prior work ignores dependencies but packs tasks with varying demands for multiple resources [26, 37, 60, 65, 71]. The best results are when the demand vectors are small [21]. Other work considers dependencies but assumes homogeneous demands [29, 36]. A recent multi-resource packing scheme, Tetris [37], succeeds on the three other questions but does not handle dependencies. Hence, we saw in §7 that Tetris performs poorly when scheduling DAGs (it can be up to 2d times off; see [38]). Tetris can also be arbitrarily unfair.
Q1: YES, Q2: NO. The packing+dependencies problem has been considered at length under job-shop scheduling [31, 35, 50, 63]. Most results assume knowledge of job arrival times and profiles [49]. For the case with unknown future job arrivals (the version considered here), no algorithms with bounded competitive ratios are known [54, 55]. Some notable work assumes only two resources [23], applies to a chain but not a general DAG [18], or assumes one cluster-wide resource pool [51].

Q3: NO. Several of the schemes listed above are complex and hence do not meet the tight timing requirements of cluster schedulers. VM allocators [28] also consider multi-resource packing. However, cluster schedulers have to support roughly two to three orders of magnitude higher rates of allocation (tasks are more numerous than VMs).

Q3: YES, Q1: NO. Several works in cluster scheduling exist, such as Quincy [44], Omega [62], Borg [68], Kubernetes [9] and Autopilot [42]. None of these combine multi-resource packing with DAG-awareness, and many do neither. Job managers such as Tez [2] and Dryad [43] use simple heuristics such as breadth-first scheduling and perform poorly in our experiments.

Q4: NO. Recently proposed fairness schemes incorporate multiple resources [33] and some are work-conserving [27]. We note that these fairness schemes neither pack nor are DAG-aware. Graphene can incorporate these fairness methods as one of its multiple objectives and trades off bounded unfairness for performance.

10 Concluding Remarks

DAGs are a common scheduling abstraction. However, we found that existing algorithms make key assumptions that do not hold in the case of cluster schedulers. Our scheduler, Graphene, is an efficient online solution that scales to large clusters. We experimentally validated that it substantially improves the scheduling of DAGs in both synthetic and emulated production traces. The core technical contributions are: (1) constructing a good schedule for a DAG by placing tasks out-of-order onto a virtual resource-time space, and (2) using an online heuristic to softly enforce the desired schedules while simultaneously managing other concerns such as packing and fairness. Much of these innovations use the fact that job DAGs consist of groups of tasks (in each stage) that have similar durations, resource needs, and dependencies. We intend to contribute our Graphene implementation to the Apache YARN/Tez projects.

Acknowledgments

For early discussions, we would like to thank Ganesh Ananthanarayanan and Peter Bodik. For feedback that helped improve this paper, we thank the anonymous reviewers, our shepherd Phil Levis, Mohammad Alizadeh, Chris Douglas, Hongzi Mao, Ishai Menache and Malte Schwarzkopf. For operating the cluster that motivated and inspired this work, we thank the Cosmos/SCOPE production team at Microsoft.



References

[1] 43 bigdata platforms and bigdata analytics software. http://bit.ly/1DROqgt.
[2] Apache Tez. http://tez.apache.org/.
[3] Bazel. http://bazel.io/.
[4] Big-Data-Benchmark. http://bit.ly/1HlFRH0.
[5] Condor. http://research.cs.wisc.edu/htcondor/.
[6] Hadoop: Fair scheduler / slot fairness. http://bit.ly/1PfsT7F.
[7] Hadoop MapReduce - Capacity Scheduler. http://bit.ly/1tGpbDN.
[8] Hadoop YARN Project. http://bit.ly/1iS8xvP.
[9] Kubernetes. http://kubernetes.io/.
[10] Market research on big-data-as-service offerings. http://bit.ly/1V6TXV4.
[11] Node labels in YARN. http://bit.ly/2d92ook.
[12] Reserved containers in YARN. http://bit.ly/2ckArDO.
[13] Slot and resource fairness in YARN. http://bit.ly/2ckArDO.
[14] TPC-H Benchmark. http://bit.ly/1KRK5gl.
[15] TPC-DS Benchmark. http://bit.ly/1J6uDap, 2012.

[16] Agarwal, S., Kandula, S., Bruno, N., Wu, M.-C., Stoica, I., and Zhou, J. Re-optimizing data parallel computing. In NSDI (2012).
[17] Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., and Harris, E. Reining in the Outliers in MapReduce Clusters Using Mantri. In OSDI (2010).
[18] Anderson, E., Beyer, D., Chaudhuri, K., Kelly, T., Salazar, N., Santos, C., Swaminathan, R., Tarjan, R., Wiener, J., and Zhou, Y. Value-maximizing deadline scheduling and its application to animation rendering. In SPAA (2005).
[19] Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., and Zaharia, M. Spark SQL: Relational data processing in Spark. In SIGMOD (2015).
[20] Augustine, J., Banerjee, S., and Irani, S. Strip packing with precedence constraints and strip packing with release times. In SPAA (2006).
[21] Azar, Y., Cohen, I. R., Fiat, A., and Roytman, A. Packing small vectors. In SODA (2016).
[22] Azar, Y., Kalp-Shaltiel, I., Lucier, B., Menache, I., Naor, J., and Yaniv, J. Truthful online scheduling with commitments. In EC (2015).
[23] Belkhale, K. P., and Banerjee, P. An approximate algorithm for the partitionable independent task scheduling problem. Urbana 51 (1990), 61801.
[24] Chaiken, R., Jenkins, B., Larson, P.-A., Ramsey, B., Shakib, D., Weaver, S., and Zhou, J. SCOPE: Easy and Efficient Parallel Processing of Massive Datasets. In VLDB (2008).
[25] Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R. R., Bradshaw, R., and Weizenbaum, N. FlumeJava: Easy, efficient data-parallel pipelines. In PLDI (2010).
[26] Chandra, C., and Sanjeev, K. On multidimensional packing problems. SIAM J. Comput. (2004).
[27] Chowdhury, M., Liu, Z., Ghodsi, A., and Stoica, I. HUG: Multi-resource fairness for correlated and elastic demands. In NSDI (2016).
[28] Chowdhury, N., Rahman, M., and Boutaba, R. Virtual Network Embedding with Coordinated Node and Link Mapping. In INFOCOM (2009).
[29] Coffman, E. G., Jr., and Graham, R. Optimal scheduling for two-processor systems. Acta Informatica (1972).
[30] Curino, C., Difallah, D. E., Douglas, C., Krishnan, S., Ramakrishnan, R., and Rao, S. Reservation-based scheduling: If you're late don't blame us! In SOCC (2014).
[31] Czumaj, A., and Scheideler, C. A New Algorithm Approach to the General Lovasz Local Lemma with Applications to Scheduling and Satisfiability Problems (Extended Abstract). In STOC (2000).
[32] Esfahani, H., Fietz, J., Ke, Q., Kolomiets, A., Lan, E., Mavrinac, E., Schulte, W., Sanches, N., and Kandula, S. CloudBuild: Microsoft's Distributed and Caching Build Service. In ICSE (2016).
[33] Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., and Stoica, I. Dominant Resource Fairness: Fair Allocation Of Multiple Resource Types. In NSDI (2011).



[34] Gligoric, M., Schulte, W., Prasad, C., van Velzen, D., Narasamdya, I., and Livshits, B. Automated migration of build scripts using dynamic analysis and search-based refactoring. In OOPSLA (2014).
[35] Goldberg, L. A., Paterson, M., Srinivasan, A., and Sweedyk, E. Better approximation guarantees for job-shop scheduling. In SODA (1997).
[36] Graham, R. L. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics (1969).
[37] Grandl, R., Ananthanarayanan, G., Kandula, S., Rao, S., and Akella, A. Multi-resource Packing for Cluster Schedulers. In SIGCOMM (2014).
[38] Grandl, R., Kandula, S., Rao, S., Akella, A., and Kulkarni, J. Graphene: Packing and dependency-aware scheduling for data-parallel clusters, extended version. In MSR Technical Report (2016).
[39] Greenberg, A., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In SIGCOMM (2009).
[40] Gulwani, S., Mehra, K., and Chilimbi, T. SPEED: Precise and efficient static estimation of program computational complexity. In POPL (2009).
[41] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: a platform for fine-grained resource sharing in the data center. In NSDI (2011).
[42] Isard, M. Autopilot: Automatic Data Center Management. OSR (2007).
[43] Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In EuroSys (2007).
[44] Isard, M., et al. Quincy: Fair Scheduling For Distributed Computing Clusters. In SOSP (2009).
[45] Jain, R., Chiu, D., and Hawe, W. A quantitative measure of fairness and discrimination for resource allocation in shared computer systems. CoRR cs.NI/9809099 (1998).
[46] Jalaparti, V., Bodik, P., Kandula, S., Menache, I., Rybalkin, M., and Yan, C. Speeding up distributed request-response workflows. In SIGCOMM (2013).
[47] Kelly, F., Maulloo, A., and Tan, D. Rate control for communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research Society (1998).
[48] Kumar, V. S. A., Marathe, M. V., Parthasarathy, S., and Srinivasan, A. Scheduling on unrelated machines under tree-like precedence constraints. Algorithmica (2009).
[49] Kwok, Y., and Ahmad, I. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys (CSUR) (1999).
[50] Leighton, F. T., Maggs, B. M., and Rao, S. Universal packet routing algorithms. In FOCS (1988).
[51] Lepere, R., Trystram, D., and Woeginger, G. J. Approximation Algorithms for Scheduling Malleable Tasks Under Precedence Constraints. International Journal of Foundations of Computer Science (2002).
[52] Lucier, B., Menache, I., Naor, J., and Yaniv, J. Efficient online scheduling for deadline-sensitive batch computing. In SPAA (2013).
[53] Makarychev, K., and Panigrahi, D. Precedence-constrained scheduling of malleable jobs with preemption. In ICALP (2014).
[54] Mastrolilli, M., and Svensson, O. (Acyclic) job shops are hard to approximate. In FOCS (2008).
[55] Monaldo, M., and Ola, S. Improved bounds for flow shop scheduling. In ICALP (2009).
[56] Motwani, R., and Raghavan, P. Randomized Algorithms. Cambridge University Press, New York, NY, USA, 1995.
[57] Olston, C., et al. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD (2008).
[58] Panigrahy, R., Talwar, K., Uyeda, L., and Wieder, U. Heuristics for Vector Bin Packing. In MSR TR (2011).
[59] Peng, D., and Dabek, F. Large-scale incremental processing using distributed transactions and notifications. In OSDI (2010).
[60] Schiermeyer, I. Reverse-fit: A 2-optimal algorithm for packing rectangles. In Proceedings of the Second Annual European Symposium on Algorithms (1994).

[61] Schurman, E., and Brutlag, J. The User and Business Impact of Server Delays, Additional Bytes, and Http Chunking in Web Search. http://velocityconf.com/velocity2009/public/schedule/detail/8523, 2009.

[62] Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., and Wilkes, J. Omega: Flexible, scalable schedulers for large compute clusters. In EuroSys (2013).
[63] Shmoys, D. B., Stein, C., and Wein, J. Improved approximation algorithms for shop scheduling problems. SIAM J. Comput. (1994).
[64] Shreedhar, M., and Varghese, G. Efficient fair queueing using deficit round robin. In SIGCOMM (1995).
[65] Sungjin, I., Nathaniel, K., Janardhan, K., and Debmalya, P. Tight bounds for online vector scheduling. In FOCS (2015).
[66] Suresh, L., Canini, M., Schmid, S., and Feldmann, A. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In NSDI (2015).
[67] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R. Hive - a warehousing solution over a map-reduce framework. In VLDB (2009).
[68] Verma, A., Pedrosa, L., Korupolu, M. R., Oppenheimer, D., Tune, E., and Wilkes, J. Large-scale cluster management at Google with Borg. In EuroSys (2015).
[69] Woeginger, G. J. There Is No Asymptotic PTAS For Two-Dimensional Vector Packing. In Information Processing Letters (1997).
[70] Yossi, A., Ilan, C., Seny, K., and Bruce, S. Tight bounds for online vector bin packing. In STOC (2013).
[71] Yossi, A., Ilan Reuven, C., and Iftah, G. The loss of serving in the dark. In STOC (2013).
[72] Zaharia, M., Borthakur, D., Sarma, J. S., Elmeleegy, K., Shenker, S., and Stoica, I. Delay Scheduling: A Technique For Achieving Locality And Fairness In Cluster Scheduling. In EuroSys (2010).
[73] Zaharia, M., Chowdhury, N. M. M., Franklin, M., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. No. UCB/EECS-2010-53.
[74] Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and Stoica, I. Improving MapReduce Performance in Heterogeneous Environments. In OSDI (2008).

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 97
