
1045-9219 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2362139, IEEE Transactions on Parallel and Distributed Systems.


Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems

Wei Wang, Student Member, IEEE, Ben Liang, Senior Member, IEEE, Baochun Li, Senior Member, IEEE

Abstract—We study the multi-resource allocation problem in cloud computing systems where the resource pool is constructed from a large number of heterogeneous servers, representing different points in the configuration space of resources such as processing, memory, and storage. We design a multi-resource allocation mechanism, called DRFH, that generalizes the notion of Dominant Resource Fairness (DRF) from a single server to multiple heterogeneous servers. DRFH provides a number of highly desirable properties. With DRFH, no user prefers the allocation of another user; no one can improve its allocation without decreasing that of the others; and more importantly, no user has an incentive to form a coalition with others to lie about its resource demand. DRFH also ensures some level of service isolation among the users. As a direct application, we design a simple heuristic that implements DRFH in real-world systems. Large-scale simulations driven by Google cluster traces show that DRFH significantly outperforms the traditional slot-based scheduler, leading to much higher resource utilization with substantially shorter job completion times.

Index Terms—Cloud computing, heterogeneous servers, job scheduling, multi-resource allocation, fairness.


1 INTRODUCTION

Resource allocation under the notion of fairness and efficiency is a fundamental problem in the design of cloud computing systems. Unlike traditional application-specific clusters and grids, cloud computing systems distinguish themselves with unprecedented server and workload heterogeneity. Modern datacenters are likely to be constructed from a variety of server classes, with different configurations in terms of processing capabilities, memory sizes, and storage spaces [2]. Asynchronous hardware upgrades, such as adding new servers and phasing out existing ones, further aggravate such diversity, leading to a wide range of server specifications in a cloud computing system [3]–[7]. Table 1 illustrates the heterogeneity of servers in one of Google's clusters [3], [8]. Similar server heterogeneity has also been observed in public clouds, such as Amazon EC2 and Rackspace [4], [5].

In addition to server heterogeneity, cloud computing systems also exhibit much higher diversity in resource demand profiles. Depending on the underlying applications, the workload spanning multiple cloud users may require vastly different amounts of resources (e.g., CPU, memory, and storage). For example, numerical computing tasks are usually CPU intensive, while database operations typically require high-memory support. The heterogeneity of both servers and workload demands poses significant technical challenges on the resource allocation mechanism, giving rise to many delicate issues—notably fairness and efficiency—that must be carefully addressed.


• W. Wang, B. Liang and B. Li are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada. E-mail: {weiwang, liang}@ece.utoronto.ca, [email protected].

• Part of this paper has appeared in [1]. This new version contains substantial revision with new illustrative examples, property analyses, proofs, and discussions.

TABLE 1
Configurations of servers in one of Google's clusters [3], [8]. CPU and memory units are normalized to the maximum server (the 1.00/1.00 configuration below).

Number of servers   CPUs   Memory
6732                0.50   0.50
3863                0.50   0.25
1001                0.50   0.75
795                 1.00   1.00
126                 0.25   0.25
52                  0.50   0.12
5                   0.50   0.03
5                   0.50   0.97
3                   1.00   0.50
1                   0.50   0.06

Despite the unprecedented heterogeneity in cloud computing systems, state-of-the-art computing frameworks employ a rather simple abstraction that falls short. For example, Hadoop [9] and Dryad [10], the two most widely deployed cloud computing frameworks, partition a server's resources into bundles—known as slots—that contain fixed amounts of different resources. The system then allocates resources to users at the granularity of these slots. Such a single resource abstraction ignores the heterogeneity of both server specifications and demand profiles, inevitably leading to a fairly inefficient allocation [11].

Towards addressing the inefficiency of the current allocation module, many recent works focus on multi-resource allocation mechanisms. Notably, Ghodsi et al. [11] suggest a compelling alternative known as the Dominant Resource Fairness (DRF) allocation, in which each user's dominant share—the maximum ratio of any resource that the user has been allocated—is equalized. The DRF allocation possesses a set of highly desirable fairness properties, and has quickly received significant attention in the literature [12]–[15]. While DRF and its subsequent works address the demand heterogeneity of multiple resources, they all limit the discussion to a simplified model where resources are pooled in one place and the entire resource pool is abstracted as one big server.¹ Such an all-in-one resource model not only contrasts with the prevalent datacenter infrastructure—where resources are distributed to a large number of servers—but also ignores the server heterogeneity: the allocations depend only on the total amount of resources pooled in the system, irrespective of the underlying resource distribution of servers. In fact, when servers are heterogeneous, even the definition of dominant resource is not so clear. Depending on the underlying server configurations, a computing task may bottleneck on different resources in different servers. We shall see that naive extensions, such as applying the DRF allocation to each server separately, may lead to a highly inefficient allocation (details in Sec. 3.4).

This paper presents a rigorous study that proposes a solution with provable operational benefits, bridging the gap between the existing multi-resource allocation models and the state-of-the-art datacenter infrastructure. We propose DRFH, a DRF generalization in Heterogeneous environments where resources are pooled by a large number of heterogeneous servers, representing different points in the configuration space of resources such as processing, memory, and storage. DRFH generalizes the intuition of DRF by seeking an allocation that equalizes every user's global dominant share, which is the maximum ratio of any resource the user has been allocated in the entire resource pool. We systematically analyze DRFH and show that it retains most of the desirable properties that the all-in-one DRF model provides for a single server [11]. Specifically, DRFH is Pareto optimal, where no user is able to increase its allocation without decreasing other users' allocations. Meanwhile, DRFH is envy-free in that no user prefers the allocation of another. More importantly, DRFH is group strategyproof in that no user can schedule more computing tasks by forming a coalition with others to claim resources that are not needed. The users hence have no incentive to misreport their actual resource demands. In addition, DRFH offers some level of service isolation by ensuring the sharing incentive property in a weak sense: it allows users to execute no fewer tasks than under some "equal partition" where the entire resource pool is evenly allocated among all users. DRFH also satisfies a set of other important properties, namely single-server DRF, single-resource fairness, bottleneck fairness, and population monotonicity (details in Sec. 3.3).

As a direct application, we design a heuristic scheduling algorithm that implements DRFH in real-world systems. We conduct large-scale simulations driven by Google cluster traces [8]. Our simulation results show that, compared with the traditional slot schedulers adopted in prevalent cloud computing frameworks, the DRFH algorithm suitably matches demand heterogeneity to server heterogeneity, significantly improving the system's resource utilization while substantially reducing job completion times.

The remainder of this paper is organized as follows. We briefly revisit the DRF allocation and point out its limitations in heterogeneous environments in Sec. 2.

1. While [11] briefly touches on the case where resources are distributed to small servers (known as the discrete scenario), its coverage is rather informal.

Fig. 1. An example of a system consisting of two heterogeneous servers: server 1 has 1 CPU and 14 GB of memory, and server 2 has 8 CPUs and 4 GB of memory. Even if both servers are allocated exclusively to user 1, at most two tasks can be scheduled, each demanding 1 CPU and 4 GB of memory.

We then formulate the allocation problem with heterogeneous servers in Sec. 3, where a set of desirable allocation properties is also defined. In Sec. 4, we propose DRFH and analyze its properties. Sec. 5 discusses practical issues in implementing DRFH. We evaluate the performance of DRFH via trace-driven simulations in Sec. 6. We survey related work in Sec. 7 and conclude the paper in Sec. 8.

2 LIMITATIONS OF DRF ALLOCATION IN HETEROGENEOUS SYSTEMS

In this section, we briefly review the DRF allocation [11] and show that it may lead to an infeasible allocation when a cloud system is composed of multiple heterogeneous servers.

In DRF, the dominant resource is defined for each user as the one that requires the largest fraction of the total availability. The mechanism seeks a maximum allocation that equalizes each user's dominant share, defined as the fraction of the dominant resource the user has been allocated. Consider an example given in [11]. Suppose that a computing system has 9 CPUs and 18 GB memory, and is shared by two users. User 1 wishes to schedule a set of (divisible) tasks each requiring 〈1 CPU, 4 GB〉, and user 2 has a set of (divisible) tasks each requiring 〈3 CPUs, 1 GB〉. In this example, the dominant resource of user 1 is memory, as each of its tasks demands 1/9 of the total CPU and 2/9 of the total memory. On the other hand, the dominant resource of user 2 is CPU, as each of its tasks requires 1/3 of the total CPU and 1/18 of the total memory. The DRF mechanism then allocates 〈3 CPUs, 12 GB〉 to user 1 and 〈6 CPUs, 2 GB〉 to user 2, where user 1 schedules three tasks and user 2 schedules two. It is easy to verify that both users receive the same dominant share (i.e., 2/3) and neither can schedule more tasks with the resources left unallocated (2 GB of memory).
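To make the arithmetic above concrete, the following sketch reproduces the DRF allocation by progressive filling. It is a minimal illustration with the example's numbers hard-coded; it is not the scheduler implementation of [11], and the variable names are ours.

```python
# A minimal sketch of DRF progressive filling for the example above (9 CPUs, 18 GB).
capacity = {"cpu": 9.0, "mem": 18.0}
demand = {1: {"cpu": 1.0, "mem": 4.0},   # user 1's per-task demand
          2: {"cpu": 3.0, "mem": 1.0}}   # user 2's per-task demand

# Fraction of its dominant resource that one task of user i consumes.
dom = {i: max(demand[i][r] / capacity[r] for r in capacity) for i in demand}

# With a common dominant share s, user i runs s / dom[i] tasks; usage is linear in s,
# so the largest feasible s is limited by the first resource to be exhausted.
def usage(s, r):
    return sum((s / dom[i]) * demand[i][r] for i in demand)

s = min(capacity[r] / usage(1.0, r) for r in capacity)

for i in demand:
    print(f"user {i}: dominant share {s:.3f}, tasks {s / dom[i]:.1f}")
# Expected: both users end up with dominant share 2/3, scheduling 3 and 2 tasks.
```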

The DRF allocation above is based on a simplified all-in-one resource model, where the entire system is modeled as one big server. The allocation hence depends only on the total amount of resources pooled in the system. In the example above, no matter how many servers the system has, and what each server's specification is, as long as the system has 9 CPUs and 18 GB memory in total, the DRF allocation will always schedule three tasks for user 1 and two for user 2. However, this allocation may not be possible to implement, especially when the system consists of heterogeneous servers. For example, suppose that the resource pool is provided by two servers. Server 1 has 1 CPU and 14 GB memory, and server 2 has 8 CPUs and 4 GB memory. As shown in Fig. 1, even allocating both servers exclusively to user 1, at most two tasks can be scheduled, one in each server. Moreover, even for some server specifications where the DRF allocation is feasible, the mechanism only gives the total amount of resources each user should receive. It remains unclear how many resources a user should be allocated in each server. These problems significantly limit the application of the DRF mechanism. In general, the allocation is valid only when the system contains a single server or multiple homogeneous servers, which is rarely the case under the prevalent datacenter infrastructure.

Despite the limitation of the all-in-one resource model, DRF is shown to possess a set of highly desirable allocation properties for cloud computing systems [11], [15]. A natural question is: how should the DRF intuition be generalized to a heterogeneous environment to achieve similar properties? Note that this is not an easy question to answer. In fact, with heterogeneous servers, even the definition of dominant resource is not so clear. Depending on the server specifications, a resource most demanded in one server (in terms of the fraction of the server's availability) might be the least demanded in another. For instance, in the example of Fig. 1, user 1 demands CPU the most in server 1, but in server 2 it demands memory the most. Should the dominant resource be defined separately for each server, or should it be defined for the entire resource pool? How should the allocation be conducted? And what properties does the resulting allocation preserve? We shall answer these questions in the following sections.

3 SYSTEM MODEL AND ALLOCATION PROPERTIES

In this section, we model multi-resource allocation in a cloud computing system with heterogeneous servers. We formalize a number of desirable properties that are deemed the most important for allocation mechanisms in cloud computing environments.

3.1 Basic Settings

Let S = {1, . . . , k} be the set of heterogeneous servers a cloud computing system has in its resource pool. Let R = {1, . . . , m} be the set of m hardware resources provided by each server, e.g., CPU, memory, storage, etc. Let cl = (cl1, . . . , clm)T be the resource capacity vector of server l ∈ S, where each component clr denotes the total amount of resource r available in this server. Without loss of generality, we normalize the total availability of every resource to 1, i.e.,

∑_{l∈S} clr = 1, ∀r ∈ R .

Let U = {1, . . . , n} be the set of cloud users sharing the entire system. For every user i, let Di = (Di1, . . . , Dim)T be its resource demand vector, where Dir is the amount of resource r required by each instance of the task of user i. For simplicity, we assume positive demands, i.e., Dir > 0 for all users i and resources r.

Fig. 2. An example of a system containing two heterogeneous servers shared by two users: server 1 has 2 CPUs and 12 GB of memory, while server 2 has 12 CPUs and 2 GB of memory. Each computing task of user 1 requires 0.2 CPU time and 1 GB memory, while each computing task of user 2 requires 1 CPU time and 0.2 GB memory.

We say resource r∗i is the global dominant resource of user i if

r∗i ∈ arg max_{r∈R} Dir .

In other words, resource r∗i is the most heavily demanded resource required by each instance of the task of user i, over the entire resource pool. For each user i and resource r, we define

dir = Dir / Dir∗i

as the normalized demand and denote by di = (di1, . . . , dim)T the normalized demand vector of user i.

As a concrete example, consider Fig. 2, where the system consists of two heterogeneous servers. Server 1 is high-memory with 2 CPUs and 12 GB memory, while server 2 is high-CPU with 12 CPUs and 2 GB memory. Since the system has 14 CPUs and 14 GB memory in total, the normalized capacity vectors of servers 1 and 2 are c1 = (CPU share, memory share)T = (1/7, 6/7)T and c2 = (6/7, 1/7)T, respectively. Now suppose that there are two users. User 1 has memory-intensive tasks each requiring 0.2 CPU time and 1 GB memory, while user 2 has CPU-heavy tasks each requiring 1 CPU time and 0.2 GB memory. The demand vector of user 1 is D1 = (1/70, 1/14)T and its normalized vector is d1 = (1/5, 1)T, with memory being its global dominant resource. Similarly, user 2 has D2 = (1/14, 1/70)T and d2 = (1, 1/5)T, with CPU as its global dominant resource.

For now, we assume users have an infinite number of tasks to be scheduled and all tasks are divisible [11], [13]–[16]. We shall discuss how these assumptions can be relaxed in Sec. 5.
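As a quick sanity check of these definitions, the short sketch below recomputes the normalized capacity and demand vectors of the Fig. 2 example; the variable names are ours, not the paper's notation.

```python
# A minimal sketch recomputing c_l, D_i, d_i and r*_i for the Fig. 2 example.
servers = {1: {"cpu": 2.0, "mem": 12.0}, 2: {"cpu": 12.0, "mem": 2.0}}
tasks   = {1: {"cpu": 0.2, "mem": 1.0},  2: {"cpu": 1.0,  "mem": 0.2}}  # per-task demands

total = {r: sum(s[r] for s in servers.values()) for r in ("cpu", "mem")}  # 14 CPUs, 14 GB

c = {l: {r: servers[l][r] / total[r] for r in total} for l in servers}    # normalized capacities
D = {i: {r: tasks[i][r] / total[r] for r in total} for i in tasks}        # demands as pool fractions

for i in D:
    r_star = max(D[i], key=D[i].get)                      # global dominant resource r*_i
    d = {r: D[i][r] / D[i][r_star] for r in D[i]}         # normalized demand vector d_i
    print(f"user {i}: dominant = {r_star}, d = {d}")
# Expected: user 1 -> mem with d1 = (0.2, 1.0); user 2 -> cpu with d2 = (1.0, 0.2),
# and c1 = (1/7, 6/7), c2 = (6/7, 1/7).
```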

3.2 Resource Allocation

For every user i and server l, let Ail = (Ail1, . . . , Ailm)T be the resource allocation vector, where Ailr is the amount of resource r allocated to user i in server l. Let Ai = (Ai1, . . . , Aik) be the allocation matrix of user i, and A = (A1, . . . , An) the overall allocation for all users. We say an allocation A is feasible if no server is required to use more than any of its total resources, i.e.,

∑_{i∈U} Ailr ≤ clr, ∀l ∈ S, r ∈ R .

For each user i, given allocation Ail in server l, let Nil(Ail) be the maximum number of tasks (possibly fractional) it can schedule. We have

Nil(Ail) Dir ≤ Ailr, ∀r ∈ R .

As a result,

Nil(Ail) = min_{r∈R} {Ailr / Dir} .

The total number of tasks user i can schedule under allocation Ai is hence

Ni(Ai) = ∑_{l∈S} Nil(Ail) . (1)

Intuitively, a user prefers an allocation that allows it to schedule more tasks.
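The two quantities above translate directly into code. The helper below uses our own naming (not the paper's) and assumes divisible tasks, as in (1).

```python
# A minimal sketch of N_il and N_i in (1); helper names are ours.
def tasks_in_server(A_il, D_i):
    """N_il(A_il) = min_r A_ilr / D_ir (fractional tasks allowed)."""
    return min(A_il[r] / D_i[r] for r in D_i)

def total_tasks(A_i, D_i):
    """N_i(A_i) = sum over servers l of N_il(A_il)."""
    return sum(tasks_in_server(A_il, D_i) for A_il in A_i.values())

# Example: user 1 of Fig. 2 (D_1 = (1/70, 1/14)) holding all of server 1 and nothing else.
D1 = {"cpu": 1 / 70, "mem": 1 / 14}
A1 = {1: {"cpu": 1 / 7, "mem": 6 / 7}, 2: {"cpu": 0.0, "mem": 0.0}}
print(total_tasks(A1, D1))   # -> 10.0 tasks
```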

A well-justified allocation should never give a user more resources than it can actually use in a server. Following the terminology used in the economics literature [17], we call such an allocation non-wasteful:

Definition 1: For user i and server l, an allocation Ail is non-wasteful if reducing any resource decreases the number of tasks scheduled, i.e., for all A′il ≺ Ail,² we have

Nil(A′il) < Nil(Ail) .

Further, user i's allocation Ai = (Ail) is non-wasteful if Ail is non-wasteful for every server l, and allocation A = (Ai) is non-wasteful if Ai is non-wasteful for every user i.

2. For any two vectors x and y, we say x ≺ y if xr ≤ yr for all r, and for some r we have strict inequality: xr < yr.

Note that one can always convert an allocation to a non-wasteful one by revoking those resources that are allocated but never actually used, without changing the number of tasks scheduled for any user. Unless otherwise specified, we limit the discussion to non-wasteful allocations.
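Such a conversion is mechanical: in each server, keep only the largest multiple of the user's normalized demand vector that fits inside the allocation and return the rest. A minimal sketch, with a hypothetical helper name of our own:

```python
# A minimal sketch of trimming an allocation to its non-wasteful equivalent.
def make_non_wasteful(A_il, d_i):
    # The usable portion is determined by the scarcest resource relative to d_i;
    # anything beyond g_il * d_i can be revoked without losing any task.
    g_il = min(A_il[r] / d_i[r] for r in d_i)
    return {r: g_il * d_i[r] for r in d_i}

# Giving user 1 of Fig. 2 (d_1 = (0.2, 1)) all of server 1 (1/7 CPU, 6/7 memory)
# keeps only (1/7, 5/7); the extra 1/7 of memory could never be used by its tasks.
print(make_non_wasteful({"cpu": 1 / 7, "mem": 6 / 7}, {"cpu": 0.2, "mem": 1.0}))
```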

3.3 Allocation Mechanism and Desirable Properties

A resource allocation mechanism takes user demands as input and outputs the allocation result. In general, an allocation mechanism should provide the following essential properties, which are widely recognized as the most important fairness and efficiency measures in both cloud computing systems [11], [12], [18] and the economics literature [17], [19].

Envy-freeness: An allocation mechanism is envy-free if no user prefers another's allocation to its own, i.e.,

Ni(Ai) ≥ Ni(Aj), ∀i, j ∈ U .

This property essentially embodies the notion of fairness.

Pareto optimality: An allocation mechanism is Pareto optimal if it returns an allocation A such that for all feasible allocations A′, if Ni(A′i) > Ni(Ai) for some user i, then there exists a user j ≠ i such that Nj(A′j) < Nj(Aj). In other words, allocation A cannot be further improved such that all users are at least as well off and at least one user is strictly better off. This property ensures allocation efficiency and is critical to achieving high resource utilization.

Group strategyproofness: An allocation mechanism is group strategyproof if no user can schedule more tasks by forming a coalition with others to misreport their resource demands (assuming a user's demand is its private information). Specifically, let A′ be the allocation returned when a coalition of users M ⊂ U misreports their demands as D′i ≠ Di, for all i ∈ M. Suppose that a member of the coalition, say, user u ∈ M, changes its mind and instead truthfully reports its demand Du. Let A be the resulting allocation. The allocation mechanism is group strategyproof if

Nu(A′u) ≤ Nu(Au) .

In other words, user u is better off quitting the coalition. Group strategyproofness is of special importance for a cloud computing system, as it is common to observe in a real-world system that users try to manipulate the scheduler for more allocation by lying about their resource demands [11], [18].

Sharing incentive is another critical property that has been frequently mentioned in the literature [11]–[13], [15]. It ensures that every user's allocation is no worse than that obtained by evenly dividing the entire resource pool. While this property is well defined for a single server, it is not for a system containing multiple heterogeneous servers, as there are infinitely many ways to evenly divide the resource pool among users, and it is unclear which one should be selected as a benchmark for comparison. We give a detailed discussion in Sec. 4.5, where we weigh two reasonable alternatives.

In addition to the four essential allocation properties above, we also consider four other important properties:

Single-server DRF: If the system contains only one server, then the resulting allocation should reduce to the DRF allocation.

Single-resource fairness: If there is a single resource in the system, then the resulting allocation should reduce to a max-min fair allocation.

Bottleneck fairness: If all users bottleneck on the same resource (i.e., have the same global dominant resource), then the resulting allocation should reduce to a max-min fair allocation for that resource.

Population monotonicity: If a user leaves the system and relinquishes all its allocations, then the remaining users will not see any reduction in the number of tasks scheduled.

To summarize, our objective is to design an allocation mechanism that guarantees all the properties defined above.

3.4 Naive DRF Extension and Its Inefficiency

It has been shown in [11], [15] that the DRF allocation satisfies all the desirable properties mentioned above when the entire resource pool is modeled as one server. When resources are distributed across multiple heterogeneous servers, a naive generalization is to apply the DRF allocation separately in each server. For instance, consider the example of Fig. 2. We first apply DRF in server 1. Because CPU is the dominant resource of both users in this server, it is equally divided between them, each receiving 1 CPU. As a result, user 1 schedules 5 tasks in server 1, while user 2 schedules one. Similarly, in server 2, memory is the dominant resource of both users and is evenly allocated, leading to one task scheduled for user 1 and five for user 2.

Fig. 3. DRF allocation applied separately per server for the example shown in Fig. 2, where user 1 is allocated 5 tasks in server 1 and 1 in server 2, while user 2 is allocated 1 task in server 1 and 5 in server 2.

The resulting allocations in the two servers are illustrated in Fig. 3, where both users schedule 6 tasks.

Unfortunately, this allocation violates Pareto optimality and is highly inefficient. If we instead allocate server 1 exclusively to user 1, and server 2 exclusively to user 2, then both users schedule 10 tasks, almost twice the number of tasks scheduled under the per-server DRF allocation. In fact, a similar example can be constructed to show that per-server DRF may lead to arbitrarily low resource utilization. The failure of the naive DRF extension in the heterogeneous environment necessitates an alternative allocation mechanism, which is the main theme of the next section.

4 DRFH ALLOCATION AND ITS PROPERTIES

In this section, we present DRFH, a generalization of DRF for a heterogeneous cloud computing system where resources are distributed across a number of heterogeneous servers. We analyze DRFH and show that it provides the desirable properties defined in Sec. 3, with sharing incentive satisfied in the weak sense (see Sec. 4.5).

4.1 DRFH Allocation

Instead of allocating separately in each server, DRFH jointly considers resource allocation across all heterogeneous servers. The key intuition is to achieve the max-min fair allocation for the global dominant resources. Specifically, given allocation Ail, let

Gil(Ail) = Nil(Ail) Dir∗i = min_{r∈R} {Ailr / dir} (2)

be the amount of global dominant resource user i is allocated in server l. Since the total availability of every resource is normalized to 1, we also refer to Gil(Ail) as the global dominant share user i receives in server l. Simply adding up Gil(Ail) over all servers gives the global dominant share user i receives under allocation Ai, i.e.,

Gi(Ai) = ∑_{l∈S} Gil(Ail) = ∑_{l∈S} min_{r∈R} {Ailr / dir} . (3)

The DRFH allocation aims to maximize the minimum global dominant share among all users, subject to the resource constraints per server, i.e.,

max_A min_{i∈U} Gi(Ai)
s.t. ∑_{i∈U} Ailr ≤ clr, ∀l ∈ S, r ∈ R . (4)

Recall that, without loss of generality, we assume the allocation A is non-wasteful (see Sec. 3.2). We have the following structural result; its proof is deferred to the appendix.³

Lemma 1: For user i and server l, an allocation Ail is non-wasteful if and only if there exists some gil such that

Ail = gil di .

In particular, gil is the global dominant share user i receives in server l under allocation Ail, i.e.,

gil = Gil(Ail) .

Intuitively, Lemma 1 indicates that under a non-wasteful allocation, resources are allocated in proportion to the user's demand. Lemma 1 immediately suggests the following relationship for every user i and its non-wasteful allocation Ai:

Gi(Ai) = ∑_{l∈S} Gil(Ail) = ∑_{l∈S} gil .

Problem (4) can hence be equivalently written as

max_{gil} min_{i∈U} ∑_{l∈S} gil
s.t. ∑_{i∈U} gil dir ≤ clr, ∀l ∈ S, r ∈ R , (5)

where the constraints are derived from Lemma 1. Now let g = min_{i∈U} ∑_{l∈S} gil. Via straightforward algebraic operations, we see that (5) is equivalent to the following problem:

max_{{gil}, g} g
s.t. ∑_{i∈U} gil dir ≤ clr, ∀l ∈ S, r ∈ R ,
     ∑_{l∈S} gil = g, ∀i ∈ U . (6)

Note that the second constraint embodies fairness in terms of an equalized global dominant share g. By solving (6), DRFH allocates each user the maximum global dominant share g, under the constraints of both server capacity and fairness. The allocation received by each user i in server l is simply Ail = gil di.

For example, Fig. 4 illustrates the resulting DRFH allocation for the example of Fig. 2. By solving (6), DRFH allocates server 1 exclusively to user 1 and server 2 exclusively to user 2, allowing each user to schedule 10 tasks with the maximum global dominant share g = 5/7.
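Since (6) is a linear program once the normalized demands are fixed, it can be handed to any off-the-shelf LP solver. The sketch below is our own illustration (not the authors' implementation): it solves (6) for the Fig. 2 example with SciPy and recovers g = 5/7.

```python
# A minimal sketch: solve problem (6) for the Fig. 2 example as an LP.
# Decision variables x = (g_11, g_12, g_21, g_22, g).
import numpy as np
from scipy.optimize import linprog

d = {1: np.array([0.2, 1.0]), 2: np.array([1.0, 0.2])}            # normalized demands (CPU, mem)
cap = {1: np.array([1 / 7, 6 / 7]), 2: np.array([6 / 7, 1 / 7])}  # normalized server capacities

# Capacity constraints: sum_i g_il * d_ir <= c_lr for every server l and resource r.
A_ub = np.array([
    [d[1][0], 0, d[2][0], 0, 0],   # server 1, CPU
    [d[1][1], 0, d[2][1], 0, 0],   # server 1, memory
    [0, d[1][0], 0, d[2][0], 0],   # server 2, CPU
    [0, d[1][1], 0, d[2][1], 0],   # server 2, memory
])
b_ub = np.concatenate([cap[1], cap[2]])

# Fairness constraints: sum_l g_il - g = 0 for every user i.
A_eq = np.array([[1, 1, 0, 0, -1],
                 [0, 0, 1, 1, -1]])

res = linprog(c=[0, 0, 0, 0, -1], A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=[0, 0], bounds=(0, None), method="highs")
print(res.x, -res.fun)
# Expected: g = 5/7, with g_11 = g_22 = 5/7 and g_12 = g_21 = 0, i.e., server 1
# goes to user 1 and server 2 to user 2, as in Fig. 4.
```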

We next analyze the properties of the DRFH allocation obtained by solving (6). Our analysis starts with the four essential resource allocation properties, namely envy-freeness, Pareto optimality, group strategyproofness, and sharing incentive.

3. The appendix is given in a supplementary document, as per the TPDS submission guidelines.

Fig. 4. An alternative allocation with higher system utilization for the example of Fig. 2. Servers 1 and 2 are exclusively assigned to users 1 and 2, respectively. Both users schedule 10 tasks.

4.2 Envy-Freeness

We first show by the following proposition that under the DRFH allocation, no user prefers another's allocation to its own.

Proposition 1 (Envy-freeness): The DRFH allocation obtained by solving (6) is envy-free.

Proof: Let {gil}, g be the solution to problem (6). For each user i, its DRFH allocation in server l is Ail = gil di. To show Ni(Aj) ≤ Ni(Ai) for any two users i and j, it is equivalent to prove Gi(Aj) ≤ Gi(Ai). We have

Gi(Aj) = ∑_{l} Gil(Ajl) = ∑_{l} min_{r} {gjl djr / dir} ≤ ∑_{l} gjl = g = Gi(Ai) ,

where the inequality holds because

min_{r} {djr / dir} ≤ djr∗i / dir∗i = djr∗i ≤ 1 ,

with r∗i being user i's global dominant resource.

4.3 Pareto Optimality

We next show that DRFH leads to an efficient allocation under which no user can improve its allocation without decreasing that of the others.

Proposition 2 (Pareto optimality): The DRFH allocation obtained by solving (6) is Pareto optimal.

Proof: Let {gil}, g be the solution to problem (6). For each user i, its DRFH allocation in server l is Ail = gil di. Since (5) and (6) are equivalent, {gil} is also a solution to (5), and g is the maximum value of the objective of (5).

Assume, by way of contradiction, that allocation A is not Pareto optimal, i.e., there exists some allocation A′ such that Ni(A′i) ≥ Ni(Ai) for all users i, and for some user j we have strict inequality: Nj(A′j) > Nj(Aj). Equivalently, this implies Gi(A′i) ≥ Gi(Ai) for all users i, and Gj(A′j) > Gj(Aj) for user j. Without loss of generality, let A′ be non-wasteful. By Lemma 1, for every user i and server l, there exists some g′il such that A′il = g′il di. We show that, based on {g′il}, one can construct some {ĝil} that is a feasible solution to (5) yet leads to a higher objective value than g, contradicting the fact that {gil}, g optimally solve (5).

To see this, consider user j. We have

Gj(Aj) = ∑_{l} gjl = g < Gj(A′j) = ∑_{l} g′jl .

Hence there exist a server l0 and some ε > 0 such that, after reducing g′jl0 to g′jl0 − ε, user j's global dominant share remains at least g, i.e., ∑_{l} g′jl − ε ≥ g. This reduction leaves at least ε dj idle resources in server l0. We construct {ĝil} by redistributing these idle resources to all users so as to increase their global dominant shares, thereby strictly improving the objective of (5).

Denote by {g′′il} the dominant shares after reducing g′jl0 to g′jl0 − ε, i.e.,

g′′il = g′jl0 − ε  if i = j and l = l0 ,  and  g′′il = g′il  otherwise.

The corresponding non-wasteful allocation is A′′il = g′′il di for every user i and server l. Note that allocation A′′ is weakly preferred to the original allocation A by all users: for every user i, we have

Gi(A′′i) = ∑_{l} g′′il = ∑_{l} g′jl − ε ≥ g = Gj(Aj)  if i = j ,  and  Gi(A′′i) = ∑_{l} g′il = Gi(A′i) ≥ Gi(Ai)  otherwise.

We now construct {ĝil} by redistributing the ε dj idle resources in server l0 to all users, increasing each user's global dominant share in server l0 by δ = min_{r} {ε djr / ∑_{i} dir}, i.e.,

ĝil = g′′il0 + δ  if l = l0 ,  and  ĝil = g′′il  otherwise.

It is easy to check that {ĝil} remains feasible. To see this, it suffices to check server l0. For each of its resources r, we have

∑_{i} ĝil0 dir = ∑_{i} (g′′il0 + δ) dir = ∑_{i} g′il0 dir − ε djr + δ ∑_{i} dir ≤ cl0r − (ε djr − δ ∑_{i} dir) ≤ cl0r ,

where the first inequality holds because A′ is a feasible allocation, and the second holds because δ ∑_{i} dir ≤ ε djr by the definition of δ.

On the other hand, for every user i ∈ U, we have

∑_{l} ĝil = ∑_{l} g′′il + δ = Gi(A′′i) + δ ≥ Gi(Ai) + δ > g .

This contradicts the premise that g is the optimal value of (5).

4.4 Group Strategyproofness

So far, all our discussions have been based on a critical assumption that all users truthfully report their resource demands. However, in a real-world system, it is common to observe users attempting to manipulate the scheduler by misreporting their resource demands, so as to receive more allocation [11], [18]. More often than not, these strategic behaviours significantly hurt honest users and reduce the number of their tasks scheduled, inevitably leading to a fairly inefficient allocation outcome. Fortunately, we show by the following proposition that DRFH is immune to such strategic behaviours, as reporting the true demand is always the dominant strategy for all users, even if they form a coalition to misreport together with others.

Proposition 3 (Group strategyproofness): The DRFH allocation obtained by solving (6) is group strategyproof, in that no user can schedule more tasks by forming a coalition with others to misreport its demand.

Proof: Let M ⊂ U be the set of strategic users forming a coalition to report untruthful normalized demand vectors d′M = (d′i)_{i∈M}, where d′i ≠ di, ∀i ∈ M. Let d′ be the collection of normalized demand vectors submitted by all users, where d′i = di, ∀i ∈ U \ M. Let A′ be the resulting allocation, obtained by solving (6). In particular, A′il = g′il d′i for each user i and server l, and g′ = ∑_{l} g′il, where {g′il}, g′ solve (6).

Now suppose that user u ∈ M quits the coalition and instead truthfully reports its normalized demand vector du. Let A be the resulting allocation, obtained by solving (6) with the normalized demand vectors (d′−u, du), and {gil}, g the corresponding solution. For each user j and server l, we have

Ajl = gul du  if j = u ,  and  Ajl = gjl d′j  if j ≠ u ,

and g = ∑_{l} gul. We check the following two cases and show that Gu(A′u) ≤ Gu(Au), which is equivalent to Nu(A′u) ≤ Nu(Au).

Case 1: g′ ≤ g. In this case, let ρu = min_{r} {d′ur / dur}. Clearly,

ρu = min_{r} {d′ur / dur} ≤ d′ur∗u / dur∗u = d′ur∗u ≤ 1 ,

where r∗u is user u's (true) global dominant resource. We then have

Gu(A′u) = ∑_{l} Gul(A′ul) = ∑_{l} Gul(g′ul d′u) = ∑_{l} min_{r} {g′ul d′ur / dur} = ρu g′ ≤ g = Gu(Au) .

Case 2: g′ > g. Let Gi(A′i, d′i) denote the global dominant share of user i ∈ U as if its normalized demand vector were d′i, under allocation A′i, i.e.,

Gi(A′i, d′i) = ∑_{l} min_{r} {A′ilr / d′ir} = ∑_{l} min_{r} {g′il d′ir / d′ir} = ∑_{l} g′il = g′ .

Following the same notation, when user u ∈ M truthfully reports its normalized demand vector du, we have

Gu(Au, du) = ∑_{l} min_{r} {gul dur / dur} = ∑_{l} gul = g .

For every other user j ≠ u, we have

Gj(Aj, d′j) = ∑_{l} min_{r} {gjl d′jr / d′jr} = ∑_{l} gjl = g .

Since g′ > g, we have

Gj(A′j, d′j) > Gj(Aj, d′j), ∀j ≠ u . (7)

We now prove that

Gu(A′u, du) < Gu(Au, du) ,

which is equivalent to Gu(A′u) < Gu(Au). Suppose, by way of contradiction, that Gu(A′u, du) ≥ Gu(Au, du). Together with (7), we see that allocation A′ would be weakly preferred to A by all users and strictly preferred by every user j ≠ u, if (d′−u, du) were the collection of true normalized demand vectors. However, this contradicts Pareto optimality, as A is the DRFH allocation resulting from the input demands (d′−u, du).

4.5 Sharing Incentive

In addition to the three properties above, sharing incentive is another critical allocation property that has been frequently mentioned in the literature, e.g., [11]–[13], [15], [18]. The property ensures that every user can execute at least the number of tasks it would schedule if the entire resource pool were evenly partitioned. It thus provides service isolation among the users.

While the sharing incentive property is well defined in the all-in-one resource model, it is not for a system with multiple heterogeneous servers. In the former case, since the entire resource pool is abstracted as a single server, evenly dividing every resource of this big server leads to a unique allocation. However, when the system consists of multiple heterogeneous servers, there are many different ways to evenly divide these servers, and it is unclear which one should be used as a benchmark for comparison. For instance, in the example of Fig. 2, two users share a system with 14 CPUs and 14 GB memory in total. The following two allocations both give each user 7 CPUs and 7 GB memory: (a) user 1 is allocated 1/2 of the resources of server 1 and 1/2 of the resources of server 2, while user 2 is allocated the rest; (b) user 1 is allocated (1.5 CPUs, 5.5 GB) in server 1 and (5.5 CPUs, 1.5 GB) in server 2, while user 2 is allocated the rest. It is easy to verify that the two allocations lead to different numbers of tasks scheduled for the same user, and they can serve as two different allocation benchmarks. In fact, one can construct many other allocations that evenly divide all resources among the users.

Despite the general ambiguity explained above, in the next two subsections we consider two definitions of the sharing incentive property, strong and weak, depending on the choice of the benchmark for equal partitioning of resources.

4.5.1 Strong Sharing Incentive

Among the various allocations that evenly divide all servers, perhaps the most straightforward is to evenly partition each server's availability cl among all n users. The strong sharing incentive property uses this per-server partitioning as its benchmark.

Definition 2 (Strong sharing incentive): Allocation A satisfies the strong sharing incentive property if each user schedules at least as many tasks as it would by evenly partitioning each server, i.e.,

Ni(Ai) = ∑_{l∈S} Nil(Ail) ≥ ∑_{l∈S} Nil(cl/n), ∀i ∈ U .

Before we proceed, it is worth mentioning that the per-server partitioning above cannot be directly implemented in practice. With a large number of users, everyone would be allocated only a very small fraction of each server's availability. In practice, such a small slice of resources usually cannot be used to run any computing task. However, per-server partitioning may be interpreted as follows. Since a cloud system is constructed by pooling hundreds of thousands of servers [2], [3], the number of users is typically far smaller than the number of servers [11], [18], i.e., k ≫ n. An equal partition could randomly allocate k/n servers to each user, which is equivalent to randomly allocating each server to each user with probability 1/n. It is easy to see that the mean number of tasks scheduled for each user under this random allocation is ∑_{l∈S} Nil(cl/n), the same as that obtained under the per-server partitioning.

Unfortunately, the following proposition shows that DRFH may violate the sharing incentive property in the strong sense. The proof gives a counterexample.

Proposition 4: DRFH does not satisfy the property of strong sharing incentive.

Proof: Consider a system consisting of two servers. Server 1 has 1 CPU and 2 GB memory; server 2 has 4 CPUs and 3 GB memory. There are two users. Each task of user 1 demands 1 CPU and 1 GB memory; each task of user 2 demands 3 CPUs and 2 GB memory. In this case, we have c1 = (1/5, 2/5)T, c2 = (4/5, 3/5)T, D1 = (1/5, 1/5)T, D2 = (3/5, 2/5)T, d1 = (1, 1)T, and d2 = (1, 2/3)T. It is easy to verify that under DRFH, the global dominant share both users receive is 12/25. On the other hand, under the per-server partitioning, the global dominant shares user 1 and user 2 receive are 2/5 and 1/2, respectively. User 2 hence prefers the per-server partitioning to DRFH.

While DRFH may violate the strong sharing incentive property, we shall show via trace-driven simulations in Sec. 6 that this happens only in rare cases.
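The counterexample is easy to check numerically. The sketch below is our own verification code (not the authors'); it reuses the LP formulation of (6) from the earlier sketch and reproduces the dominant shares quoted in the proof.

```python
# A minimal sketch verifying the Proposition 4 counterexample.
import numpy as np
from scipy.optimize import linprog

d = {1: np.array([1.0, 1.0]), 2: np.array([1.0, 2 / 3])}          # normalized demands (CPU, mem)
cap = {1: np.array([1 / 5, 2 / 5]), 2: np.array([4 / 5, 3 / 5])}  # server capacities

# DRFH: solve (6) with variables x = (g_11, g_12, g_21, g_22, g).
A_ub = np.array([[d[1][0], 0, d[2][0], 0, 0], [d[1][1], 0, d[2][1], 0, 0],
                 [0, d[1][0], 0, d[2][0], 0], [0, d[1][1], 0, d[2][1], 0]])
b_ub = np.concatenate([cap[1], cap[2]])
A_eq = np.array([[1, 1, 0, 0, -1], [0, 0, 1, 1, -1]])
res = linprog(c=[0, 0, 0, 0, -1], A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=[0, 0], bounds=(0, None), method="highs")
print("DRFH dominant share:", -res.fun)                           # 12/25 = 0.48 for both users

# Per-server partitioning: each user receives c_l / 2 in every server.
for i in (1, 2):
    share = sum(((cap[l] / 2) / d[i]).min() for l in (1, 2))      # G_i = sum_l min_r (c_lr/2)/d_ir
    print(f"user {i} under per-server partitioning: {share:.2f}")  # user 1: 0.40, user 2: 0.50
```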

4.5.2 Weak Sharing Incentive

The strong sharing incentive property is defined by choosing the per-server partitioning as a benchmark, which is only one of many ways to evenly divide the total availability. In general, any equal partition that allocates every user an equal share of every resource can be used as a benchmark. This allows us to relax the definition of sharing incentive. We first define an equal partition as follows.

Definition 3 (Equal partition): Allocation A is an equal partition if it divides every resource evenly among all users, i.e.,

∑_{l∈S} Ailr = 1/n, ∀r ∈ R, i ∈ U .

It is easy to verify that the aforementioned per-server partitioning is an equal partition. We are now ready to define the weak sharing incentive property as follows.

Definition 4 (Weak sharing incentive): Allocation A satisfies the weak sharing incentive property if there exists an equal partition A′ under which each user schedules no more tasks than under A, i.e.,

Ni(Ai) ≥ Ni(A′i), ∀i ∈ U .

In other words, the weak sharing incentive property only requires the allocation to be no worse than some equal partition, without specifying which one. It is hence a more relaxed requirement than the strong sharing incentive property.

The following proposition shows that DRFH satisfies the sharing incentive property in the weak sense. The proof is constructive.

Proposition 5 (Weak sharing incentive): DRFH satisfies the property of weak sharing incentive.

Proof: Let g be the global dominant share each user receives under the DRFH allocation A, and gil the global dominant share user i receives in server l. We construct an equal partition A′ under which no user schedules more tasks than under A.

Case 1: g ≥ 1/n. In this case, let A′ be any equal partition. We show that each user schedules no more tasks under A′ than under A. To see this, consider the DRFH allocation A. Since it is non-wasteful, the number of tasks user i schedules is

Ni(Ai) = g / Dir∗i ≥ (1/n) / Dir∗i .

On the other hand, the number of tasks user i schedules under A′ is at most

Ni(A′i) = ∑_{l∈S} min_{r} {A′ilr / Dir} ≤ ∑_{l∈S} A′ilr∗i / Dir∗i = (1/n) / Dir∗i ≤ Ni(Ai) .

Case 2: g < 1/n. In this case, no resource has been fully allocated under A, since

∑_{i∈U} ∑_{l∈S} Ailr = ∑_{i∈U} ∑_{l∈S} gil dir ≤ ∑_{i∈U} ∑_{l∈S} gil = ∑_{i∈U} g < 1

for every resource r ∈ R. Let

Llr = clr − ∑_{i∈U} Ailr

be the amount of resource r left unallocated in server l. Further, let

Lr = ∑_{l∈S} Llr = 1 − ∑_{i∈U} ∑_{l∈S} Ailr

be the total amount of resource r left unallocated.

We are now ready to construct an equal partition A′ based on A. Since A′ should allocate each user 1/n of the total availability of every resource r, the additional amount of resource r user i needs to obtain is

uir = 1/n − ∑_{l∈S} Ailr .

It is easy to see that uir > 0, ∀i ∈ U, r ∈ R. The fraction of the unallocated resource r demanded by user i is

fir = uir / Lr .

We can then construct A′ by reallocating the leftover resources in each server to the users in proportion to these fractions, i.e.,

A′ilr = Ailr + Llr fir, ∀i ∈ U, l ∈ S, r ∈ R .

It is easy to verify that A′ is an equal partition:

∑_{l∈S} A′ilr = ∑_{l∈S} Llr fir + ∑_{l∈S} Ailr = (uir/Lr) ∑_{l∈S} Llr + ∑_{l∈S} Ailr = uir + ∑_{l∈S} Ailr = 1/n , ∀i ∈ U, r ∈ R .

We now compare the number of tasks scheduled for each user under the two allocations A and A′. Because A′ allocates to each user at least as many resources as A does in every server, we have Ni(A′i) ≥ Ni(Ai) for all i. On the other hand, by the Pareto optimality of allocation A, no user can schedule more tasks without decreasing the number of tasks scheduled for some other user. Therefore, we must have Ni(A′i) = Ni(Ai) for all i.

4.5.3 Discussion

Strong sharing incentive provides more predictable service isolation than weak sharing incentive does. It assures a user a priori that it can schedule at least as many tasks as it could if every server were evenly allocated among the users. This gives users a concrete idea of the worst quality of service (QoS) they may receive, allowing them to predict their computing performance. While weak sharing incentive also provides some degree of service isolation, a user cannot infer a priori the guaranteed number of tasks it can schedule from this weaker property, and therefore cannot predict its computing performance.

We note that the root cause of this degradation of service isolation is the heterogeneity among servers. When all servers have the same specification of hardware resources, DRFH reduces to DRF, and strong sharing incentive is guaranteed. This is also the case for schedulers adopting the single-resource abstraction. For example, in Hadoop, each server is divided into several slots (e.g., reducers and mappers). The Hadoop Fair Scheduler [20] allocates these slots evenly to all users. We see that predictable service isolation is achieved: each user receives at least ks/n slots, where ks is the number of slots and n is the number of users.

In general, one can view weak sharing incentive as the price paid by DRFH to achieve high resource utilization. In fact, naively applying the DRF allocation separately to each server retains strong sharing incentive: in each server, the DRF allocation ensures that a user can schedule at least as many tasks as it could if the server's resources were evenly allocated [11], [15]. However, as we have seen in Sec. 3.4, such a naive DRF extension may lead to unacceptably low resource utilization. A similar problem also exists for traditional schedulers adopting single-resource abstractions. By artificially dividing servers into slots, these schedulers cannot match computing demands to available resources at a fine granularity, resulting in poor resource utilization in practice [11]. For these reasons, we believe that slightly trading off the degree of service isolation for much higher resource utilization is well justified. We shall use trace-driven simulations in Sec. 6.3 to show that DRFH violates strong sharing incentive only in rare cases in the Google cluster.

4.6 Other Important Properties

In addition to the four essential properties discussed in the previous subsections, DRFH also provides a number of other important properties. First, since DRFH generalizes DRF to heterogeneous environments, it naturally reduces to the DRF allocation when the system contains only one server, in which case the global dominant resource defined in DRFH coincides with the dominant resource defined in DRF.

Proposition 6 (Single-server DRF): DRFH leads to the same allocation as DRF when all resources are concentrated in one server.

Next, by definition, we see that both single-resource fairness and bottleneck fairness trivially hold for the DRFH allocation. We hence omit the proofs of the following two propositions.

Proposition 7 (Single-resource fairness): The DRFH allocation satisfies single-resource fairness.

Proposition 8 (Bottleneck fairness): The DRFH allocation satisfies bottleneck fairness.

Finally, we see that when a user leaves the system and relinquishes all its allocations, the remaining users will not see any reduction in the number of tasks scheduled. Formally,

Proposition 9 (Population monotonicity): The DRFH allocation satisfies population monotonicity.

Proof: Let A be the resulting DRFH allocation; then for every user i and server l, Ail = gil di and Gi(Ai) = g, where {gil} and g solve (6). Suppose that user j leaves the system, changing the resulting DRFH allocation to A′. By DRFH, for every user i ≠ j and server l, we have A′il = g′il di and Gi(A′i) = g′, where {g′il}_{i≠j} and g′ solve the following optimization problem:

max_{{g′il}, i≠j} g′
s.t. ∑_{i≠j} g′il dir ≤ clr, ∀l ∈ S, r ∈ R ,
     ∑_{l∈S} g′il = g′, ∀i ≠ j . (8)

To show Ni(A′i) ≥ Ni(Ai) for every user i ≠ j, it is equivalent to prove Gi(A′i) ≥ Gi(Ai). It is easy to verify that g, {gil}_{i≠j} satisfy all the constraints of (8) and are hence feasible for (8). As a result, g′ ≥ g, which is exactly Gi(A′i) ≥ Gi(Ai).

5 PRACTICAL CONSIDERATIONS

So far, our discussion has been based on several assumptions that may not hold in a real-world system. In this section, we relax these assumptions and discuss how DRFH can be implemented in practice.

5.1 Weighted Users with a Finite Number of Tasks

In the previous sections, users are assumed to have equal weights and infinite computing demands. Both assumptions can be easily removed with minor modifications of DRFH.

When users are assigned uneven weights, let wi be the weight associated with user i. DRFH seeks an allocation that achieves weighted max-min fairness across users. Specifically, we maximize the minimum normalized global dominant share (with respect to the weights) over all users, under the same resource constraints as in (4), i.e.,

max_A min_{i∈U} Gi(Ai)/wi
s.t. ∑_{i∈U} Ailr ≤ clr, ∀l ∈ S, r ∈ R .

When users have a finite number of tasks, the DRFH allocation is computed iteratively. In each round, DRFH increases the global dominant share allocated to all active users, until one of them has all its tasks scheduled, after which the user becomes inactive and will no longer be considered in the following allocation rounds. DRFH then starts a new iteration and repeats the allocation process above, until no user is active or no more resources can be allocated. Because each iteration saturates at least one user's resource demand, the allocation will be accomplished in at most n rounds, where n is the number of users.4 Our analysis presented in Sec. 4 also extends to weighted users with a finite number of tasks.
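To make the weighted formulation above concrete, the following is a minimal sketch (ours, not the authors' implementation) that solves the equalized form of the weighted DRFH allocation for divisible tasks as a linear program with SciPy. It assumes the per-server decomposition A_il = g_il d_i used in (6), with d_i the normalized demand vector, and that the weighted analogue of (6) replaces the fairness constraint with sum_l g_il = w_i g; the function weighted_drfh and all variable names are ours.

import numpy as np
from scipy.optimize import linprog

def weighted_drfh(d, c, w):
    """d: n x m normalized per-task demand matrix (as d_i in (6)),
    c: k x m matrix of per-server capacity shares,
    w: length-n vector of user weights.
    Returns (g, g_il): the common weight-normalized global dominant
    share and its per-server breakdown."""
    d, c, w = np.asarray(d, float), np.asarray(c, float), np.asarray(w, float)
    n, m = d.shape
    k = c.shape[0]
    num = 1 + n * k                      # variables: [g, g_11..g_1k, ..., g_nk]
    obj = np.zeros(num)
    obj[0] = -1.0                        # linprog minimizes, so minimize -g
    # Capacity: sum_i g_il * d_ir <= c_lr for every server l and resource r.
    A_ub = np.zeros((k * m, num))
    b_ub = np.zeros(k * m)
    for l in range(k):
        for r in range(m):
            row = l * m + r
            for i in range(n):
                A_ub[row, 1 + i * k + l] = d[i, r]
            b_ub[row] = c[l, r]
    # Fairness: sum_l g_il = w_i * g for every user i (equalized max-min level).
    A_eq = np.zeros((n, num))
    b_eq = np.zeros(n)
    for i in range(n):
        A_eq[i, 0] = -w[i]
        A_eq[i, 1 + i * k: 1 + (i + 1) * k] = 1.0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * num, method="highs")
    return res.x[0], res.x[1:].reshape(n, k)

For the finite-task case described above, one would re-solve this program in each round with saturated users removed and server capacities reduced by the resources already allocated.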

5.2 Scheduling Tasks as Entities

Until now, we have assumed that all tasks are divisible. In a real-world system, however, fractional tasks may not be accepted. To schedule tasks as entities, one can apply progressive filling as a simple implementation of DRFH.5 That is, whenever there is a scheduling opportunity, the scheduler always accommodates the user with the lowest global dominant share. To do this, it picks the first server whose remaining resources are sufficient to accommodate the request of the user's task. While this First-Fit algorithm offers a fairly good approximation to DRFH, we propose another simple heuristic that can lead to a better allocation with higher resource utilization.

Similar to First-Fit, the heuristic also chooses the user i with the lowest global dominant share to serve. However, instead of simply picking the first feasible server, the heuristic chooses the "best" one, whose remaining resources most suitably match the demand of user i's tasks, and is hence referred to as Best-Fit DRFH. Specifically, for user i with resource demand vector Di = (Di1, . . . , Dim)T and a server l with available resource vector cl = (cl1, . . . , clm)T, where clr is the share of resource r remaining available in server l, we define the following heuristic function to quantitatively measure the fitness of the task for server l:

H(i, l) = ‖Di/Di1 − cl/cl1‖1 , (9)

where ‖·‖1 is the L1-norm. Intuitively, the smaller H(i, l) is, the more similar the resource demand vector Di appears to the server's available resource vector cl, and the better fit user i's task is for server l. Best-Fit DRFH schedules user i's tasks to the server l with the least H(i, l).

4. For medium- and large-sized cloud clusters, n is in the order of thousands [3], [8].

5. Progressive filling has also been used to implement the DRF allocation [11]. However, when the system consists of multiple heterogeneous servers, progressive filling will lead to a DRFH allocation.

As an illustrative example, suppose that only two types of resources are concerned, CPU and memory. A CPU-heavy task of user i with resource demand vector Di = (1/10, 1/30)T is to be scheduled, meaning that the task requires 1/10 of the total CPU availability and 1/30 of the total memory availability of the system. Only two servers have sufficient remaining resources to accommodate this task. Server 1 has the available resource vector c1 = (1/5, 1/15)T; Server 2 has the available resource vector c2 = (1/8, 1/4)T. Intuitively, because the task is CPU-bound, it is a better fit for Server 1, which is CPU-abundant. This is indeed the case, as H(i, 1) = 0 < H(i, 2) = 5/3, and Best-Fit DRFH places the task onto Server 1.
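The calculation above can be reproduced with a few lines of code. The sketch below is ours (the names fitness and best_fit are not from the paper); it evaluates the heuristic (9) and picks the feasible server with the smallest H(i, l).

import numpy as np

def fitness(demand, avail):
    """H(i, l): L1 distance between the demand and the server's available
    resources, both normalized by their first (CPU) component, as in (9)."""
    d = np.asarray(demand, dtype=float)
    c = np.asarray(avail, dtype=float)
    return np.abs(d / d[0] - c / c[0]).sum()

def best_fit(demand, servers):
    """Return the index of the feasible server with the smallest H(i, l)."""
    feasible = [(l, avail) for l, avail in enumerate(servers)
                if np.all(np.asarray(avail) >= np.asarray(demand))]
    return min(feasible, key=lambda la: fitness(demand, la[1]))[0]

D_i = (1/10, 1/30)                    # CPU-heavy task of user i
servers = [(1/5, 1/15), (1/8, 1/4)]   # Servers 1 and 2 from the example
print(fitness(D_i, servers[0]))       # 0.0
print(fitness(D_i, servers[1]))       # 1.666..., i.e., 5/3
print(best_fit(D_i, servers))         # 0, i.e., Server 1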

Both First-Fit and Best-Fit DRFH can be easily implemented by searching over all k servers in O(k) time. Even for a large-sized cluster, k is in the order of tens of thousands [21], so the server onto which the task should be placed can be found quickly. This computation can also be accelerated by dividing the k servers into M groups, each consisting of k/M servers, and searching the groups in parallel, as in the sketch below.
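The grouped search just mentioned can be sketched as follows (our own illustration, reusing the fitness() helper from the previous sketch; thread workers here merely stand in for whatever parallel workers a production scheduler would use).

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def best_fit_parallel(demand, servers, M=4):
    """Split the k servers into M groups, find each group's best feasible
    server concurrently, then return the overall best-fit index."""
    demand = np.asarray(demand, dtype=float)
    groups = np.array_split(np.arange(len(servers)), M)
    def best_in(group):
        feasible = [l for l in group
                    if np.all(np.asarray(servers[l]) >= demand)]
        return min(feasible, key=lambda l: fitness(demand, servers[l]),
                   default=None)
    with ThreadPoolExecutor(max_workers=M) as pool:
        winners = [l for l in pool.map(best_in, groups) if l is not None]
    return min(winners, key=lambda l: fitness(demand, servers[l]), default=None)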

It is worth mentioning that the definition of the heuristic function (9) is not unique. In fact, one can use a more complex heuristic function than (9) to measure the fitness of a task for a server, e.g., cosine similarity [22]. However, as we shall show in the next section, Best-Fit DRFH with (9) as its heuristic function already improves the utilization to a level where the system capacity is almost saturated. Therefore, the benefit of using a more complex fitness measure is very limited, at least for the Google cluster traces [8].

6 TRACE-DRIVEN SIMULATION

In this section, we evaluate the performance of DRFH via extensive simulations driven by Google cluster-usage traces [8]. The traces contain resource demand/usage information of over 900 users (i.e., Google services and engineers) on a cluster of 12K servers. The server configurations are summarized in Table 1, where the CPUs and memory of each server are normalized so that the maximum server is 1. Each user submits computing jobs, divided into a number of tasks, each requiring a set of resources (i.e., CPU and memory). From the traces, we extract the computing demand information (the required amount of resources and task running time) and use it as the demand input of the allocation algorithms for evaluation.

6.1 Dynamic Allocation

Our first evaluation focuses on the allocation fairness of the proposed Best-Fit DRFH when users dynamically join and depart the system. We simulate 3 users submitting tasks with different resource requirements to a small cluster of 100 servers. The server configurations are randomly drawn from the distribution of Google cluster servers in Table 1, leading to a resource pool containing 52.75 CPU units and 51.32 memory units in total. User 1 joins the system at the beginning, requiring 0.2 CPU and 0.3 memory for each of its tasks. As shown in Fig. 5, since only user 1 is active at the beginning, it is allocated a 40% CPU share and a 62% memory share. This allocation continues until 200 s.


Fig. 5. CPU, memory, and global dominant share for three users on a 100-server system with 52.75 CPU units and 51.32 memory units in total.

TABLE 2
Resource utilization of the Slots scheduler with different slot sizes.

Number of Slots           CPU Utilization   Memory Utilization
10 per maximum server     35.1%             23.4%
12 per maximum server     42.2%             27.4%
14 per maximum server     43.9%             28.0%
16 per maximum server     45.4%             24.2%
20 per maximum server     40.6%             20.0%

At that time, user 2 joins and submits CPU-heavy tasks, each requiring 0.5 CPU and 0.1 memory. Both users now compete for computing resources, leading to a DRFH allocation in which both users receive a 44% global dominant share. At 500 s, user 3 starts to submit memory-intensive tasks, each requiring 0.1 CPU and 0.3 memory. The algorithm now allocates the same global dominant share of 26% to all three users until user 1 finishes its tasks and departs at 1080 s. After that, only users 2 and 3 share the system, each receiving the same share of its global dominant resource. A similar process repeats until all users finish their tasks. Throughout the simulation, we see that the Best-Fit DRFH algorithm precisely achieves the DRFH allocation at all times.
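As a quick sanity check of the first interval quoted above (our own arithmetic, not from the paper), a 40% CPU share and a 62% memory share are mutually consistent: with 0.2 CPU and 0.3 memory per task, both imply roughly the same number of scheduled tasks for user 1.

# Our back-of-the-envelope check of the shares reported while user 1 is alone.
cpu_total, mem_total = 52.75, 51.32       # total capacity of the 100 servers
tasks_from_cpu = 0.40 * cpu_total / 0.2   # ~105.5 tasks
tasks_from_mem = 0.62 * mem_total / 0.3   # ~106.1 tasks
print(tasks_from_cpu, tasks_from_mem)     # the two estimates agree to ~0.5%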

6.2 Resource Utilization

We next evaluate the resource utilization of the proposed Best-Fit DRFH algorithm. We take the 24-hour computing demand data from the Google traces and simulate it on a smaller cloud computing system of 2,000 servers, so that fairness becomes relevant. The server configurations are randomly drawn from the distribution of Google cluster servers in Table 1.

Fig. 6. Time series of CPU and memory utilization.

We compare Best-Fit DRFH with two other benchmarks: the traditional Slots scheduler that schedules tasks onto slots of servers (e.g., Hadoop Fair Scheduler [20]), and First-Fit DRFH, which chooses the first server that fits the task. For the former, we try different slot sizes and choose the one with the highest CPU and memory utilization. Table 2 summarizes our observations, where dividing the maximum server (1 CPU and 1 memory in Table 1) into 14 slots leads to the highest overall utilization.

Fig. 6 depicts the time series of CPU and memory utilization of the three algorithms. We see that the two DRFH implementations significantly outperform the traditional Slots scheduler, with much higher resource utilization, mainly because the latter ignores the heterogeneity of both servers and workload. This observation is consistent with findings in the homogeneous environment where all servers have the same hardware configuration [11]. As for the DRFH implementations, we see that Best-Fit DRFH leads to uniformly higher resource utilization than the First-Fit alternative at all times.

The high resource utilization of Best-Fit DRFH naturally translates into the shorter job completion times shown in Fig. 7a, which depicts the CDFs of job completion times for both Best-Fit DRFH and the Slots scheduler. Fig. 7b offers a more detailed breakdown, where jobs are classified into 5 categories based on the number of their computing tasks, and the mean completion time reduction is computed for each category. While DRFH shows no improvement over the Slots scheduler for small jobs, a significant completion time reduction is observed for jobs containing more tasks. Generally, the larger the job, the more improvement one may expect. Similar observations have also been made in homogeneous environments [11].

Fig. 7 does not account for partially completed jobs and focuses only on those having all tasks finished under both Best-Fit and Slots. As a complementary study, Fig. 8 shows the task completion ratio, i.e., the number of tasks completed over the number of tasks submitted, for every user under the Best-Fit DRFH and Slots schedulers, respectively. The radius of each circle is scaled logarithmically to the number of tasks the user submitted.


Fig. 7. DRFH improvements on job completion times over the Slots scheduler. (a) CDF of job completion times. (b) Job completion time reduction by job size: 1–50 tasks: −1%; 51–100: 2%; 101–500: 25%; 501–1000: 43%; >1000: 62%.

Fig. 8. Task completion ratio of users using the Best-Fit DRFH and Slots schedulers, respectively. Each bubble's size is logarithmic to the number of tasks the user submitted.

We see that Best-Fit DRFH leads to a higher task completion ratio for almost all users. Around 20% of users have all their tasks completed under Best-Fit DRFH but not under Slots.

6.3 Sharing Incentive

Our final evaluation concerns the sharing incentive property of DRFH. While we have shown in Sec. 4.5 that DRFH may not satisfy this property in the strong sense, it remains unclear how often the property would be violated in practice. Does this happen frequently, or only in rare cases? We answer this question in this subsection.

We first compute the task completion ratio under the per-server partition benchmark. To do this, we randomly choose ⌈k/n⌉ servers for each user, where k is the number of servers and n is the number of users in the traces.

Fig. 9. Comparison of task completion ratios under DRFH and those obtained in dedicated clouds (DCs). Each circle's radius is logarithmic to the number of tasks submitted.

These ⌈k/n⌉ servers are then allocated to the user, forming a dedicated cloud exclusive to it, and the user's tasks are scheduled onto them. We compute the task completion ratio obtained in this dedicated cloud and compare it with the one obtained under the DRFH allocation. Fig. 9 illustrates the comparison results for all users. We see that most users would prefer the DRFH allocation to running their tasks in a dedicated cloud. In particular, only 2% of users see fewer tasks finished under the DRFH allocation, and even for these users, the task completion ratio decreases only slightly, as shown in Fig. 9. We therefore see that DRFH violates the strong sharing incentive property only in rare cases in the Google traces.
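For reference, the following is a minimal sketch (our reading of the setup above, with hypothetical names) of how the dedicated-cloud baseline can be formed: each user independently receives ⌈k/n⌉ randomly drawn servers, and its task completion ratio there is then compared against the DRFH run.

import math, random

def dedicated_clouds(servers, users, seed=0):
    """Map each user to a list of ceil(k/n) randomly drawn servers,
    forming that user's dedicated cloud."""
    rng = random.Random(seed)
    per_user = math.ceil(len(servers) / len(users))
    return {u: rng.sample(servers, per_user) for u in users}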

7 RELATED WORK

Despite the extensive computing system literature on fair resource allocation, many existing works limit their discussions to the allocation of a single resource type, e.g., CPU time [23], [24] and link bandwidth [25]–[29]. Various fairness notions have also been proposed throughout the years, ranging from application-specific allocations [30], [31] to general fairness measures [25], [32], [33].

As for multi-resource allocation, state-of-the-art cloud computing systems employ naive single-resource abstractions. For example, the two fair sharing schedulers currently supported in Hadoop [20], [34] partition a node into slots with fixed fractions of resources, and allocate resources jointly at the slot granularity. Quincy [35], a fair scheduler developed for Dryad [10], models the fair scheduling problem as a min-cost flow problem to schedule jobs into slots. The recent work [18] takes job placement constraints into consideration, yet it still uses a slot-based single-resource abstraction.

Ghodsi et al. [11] are the first in the literature to present a systematic investigation of the multi-resource allocation problem in cloud computing systems. They propose DRF to equalize the dominant shares of all users, and show that a number of desirable fairness properties are guaranteed in the resulting allocation. DRF has quickly attracted a substantial amount of attention and has been generalized in many directions. Notably, Joe-Wong et al. [12] generalize the DRF measure and incorporate it into a unifying framework that captures the trade-offs between allocation fairness and efficiency. Dolev et al. [13] suggest another notion of fairness for multi-resource allocation, known as Bottleneck-Based Fairness (BBF), under which two fairness properties that DRF possesses are also guaranteed.


Gutman and Nisan [14] consider another setting of DRF with a more general domain of user utilities, and show its connections to the BBF mechanism. Parkes et al. [15], on the other hand, extend DRF in several ways, including the presence of zero demands for certain resources, weighted user endowments, and in particular the case of indivisible tasks. They also study the loss of social welfare under the DRF rules. Kash et al. [16] extend the DRF model to allow users to join the system over time but never leave. More recently, Bhattacharya et al. [36] generalize DRF to a hierarchical scheduler that offers service isolation in a computing system with a hierarchical structure. All these works assume, explicitly or implicitly, that all resources are either concentrated in one super computer, or distributed across a set of homogeneous servers with exactly the same resource configuration.

However, server heterogeneity has been widely observed in today's cloud computing systems. Specifically, Ahmad et al. [6] noted that datacenter clusters usually consist of both high-performance servers and low-power nodes with different hardware architectures. Reiss et al. [3], [8] illustrated a wide range of server specifications in Google clusters. As for public clouds, Farley et al. [4] and Ou et al. [5] observed significant hardware diversity among Amazon EC2 servers that may lead to substantially different performance across supposedly equivalent VM instances. Ou et al. [5] also pointed out that such server heterogeneity is not limited to EC2, but generally exists in long-lasting public clouds such as Rackspace.

Other related works include fair-division problems in the economics literature, in particular egalitarian division under Leontief preferences [17] and the cake-cutting problem [19]. These works also assume the all-in-one resource model, and hence cannot be directly applied to cloud computing systems with heterogeneous servers.

8 CONCLUDING REMARKS

In this paper, we study a multi-resource allocation problem in a heterogeneous cloud computing system where the resource pool is composed of a large number of servers with different configurations in terms of resources such as processing, memory, and storage. The proposed multi-resource allocation mechanism, known as DRFH, equalizes the global dominant share allocated to each user, and hence generalizes the DRF allocation from a single server to multiple heterogeneous servers. We analyze DRFH and show that it retains almost all desirable properties that DRF provides in the single-server scenario. Notably, DRFH is envy-free, Pareto optimal, and group strategyproof. It also offers the sharing incentive in a weak sense. We design a Best-Fit heuristic that implements DRFH in a real-world system. Our large-scale simulations driven by Google cluster traces show that, compared to traditional single-resource abstractions such as slot schedulers, DRFH achieves significant improvements in resource utilization, leading to much shorter job completion times.

REFERENCES

[1] W. Wang, B. Li, and B. Liang, "Dominant resource fairness in cloud computing systems with heterogeneous servers," in Proc. IEEE INFOCOM, 2014.

[2] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Commun. ACM, vol. 53, no. 4, pp. 50–58, 2010.

[3] C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch, "Heterogeneity and dynamicity of clouds at scale: Google trace analysis," in Proc. ACM SoCC, 2012.

[4] B. Farley, A. Juels, V. Varadarajan, T. Ristenpart, K. D. Bowers, and M. M. Swift, "More for your money: Exploiting performance heterogeneity in public clouds," in Proc. ACM SoCC, 2012.

[5] Z. Ou, H. Zhuang, A. Lukyanenko, J. Nurminen, P. Hui, V. Mazalov, and A. Yla-Jaaski, "Is the same instance type created equal? Exploiting heterogeneity of public clouds," IEEE Trans. Cloud Computing, vol. 1, no. 2, pp. 201–214, 2013.

[6] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. Vijaykumar, "Tarazu: Optimizing MapReduce on heterogeneous clusters," in Proc. ACM ASPLOS, 2012.

[7] R. Nathuji, C. Isci, and E. Gorbatov, "Exploiting platform heterogeneity for power efficient data centers," in Proc. USENIX ICAC, 2007.

[8] C. Reiss, J. Wilkes, and J. L. Hellerstein, "Google Cluster-Usage Traces," http://code.google.com/p/googleclusterdata/.

[9] "Apache Hadoop," http://hadoop.apache.org.

[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," in Proc. EuroSys, 2007.

[11] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, "Dominant resource fairness: Fair allocation of multiple resource types," in Proc. USENIX NSDI, 2011.

[12] C. Joe-Wong, S. Sen, T. Lan, and M. Chiang, "Multi-resource allocation: Fairness-efficiency tradeoffs in a unifying framework," in Proc. IEEE INFOCOM, 2012.

[13] D. Dolev, D. Feitelson, J. Halpern, R. Kupferman, and N. Linial, "No justified complaints: On fair sharing of multiple resources," in Proc. ACM ITCS, 2012.

[14] A. Gutman and N. Nisan, "Fair allocation without trade," in Proc. AAMAS, 2012.

[15] D. Parkes, A. Procaccia, and N. Shah, "Beyond dominant resource fairness: Extensions, limitations, and indivisibilities," in Proc. ACM EC, 2012.

[16] I. Kash, A. Procaccia, and N. Shah, "No agent left behind: Dynamic fair division of multiple resources," in Proc. AAMAS, 2013.

[17] J. Li and J. Xue, "Egalitarian division under Leontief preferences," Econ. Theory, vol. 54, no. 3, pp. 597–622, 2013.

[18] A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Choosy: Max-min fair sharing for datacenter jobs with constraints," in Proc. ACM EuroSys, 2013.

[19] A. D. Procaccia, "Cake cutting: Not just child's play," Commun. ACM, 2013.

[20] "Hadoop Fair Scheduler," http://hadoop.apache.org/docs/r0.20.2/fairscheduler.html.

[21] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes, "Omega: Flexible, scalable schedulers for large compute clusters," in Proc. ACM EuroSys, 2013.

[22] A. Singhal, "Modern information retrieval: A brief overview," IEEE Data Eng. Bull., vol. 24, no. 4, pp. 35–43, 2001.

[23] S. Baruah, J. Gehrke, and C. Plaxton, "Fast scheduling of periodic tasks on multiple resources," in Proc. IEEE IPPS, 1995.

[24] S. Baruah, N. Cohen, C. Plaxton, and D. Varvel, "Proportionate progress: A notion of fairness in resource allocation," Algorithmica, vol. 15, no. 6, pp. 600–625, 1996.

[25] F. Kelly, A. Maulloo, and D. Tan, "Rate control for communication networks: Shadow prices, proportional fairness and stability," J. Oper. Res. Soc., vol. 49, no. 3, pp. 237–252, 1998.

[26] J. Mo and J. Walrand, "Fair end-to-end window-based congestion control," IEEE/ACM Trans. Networking, vol. 8, no. 5, pp. 556–567, 2000.

[27] J. Kleinberg, Y. Rabani, and E. Tardos, "Fairness in routing and load balancing," in Proc. IEEE FOCS, 1999.

[28] J. Blanquer and B. Ozden, "Fair queuing for aggregated multiple links," in Proc. ACM SIGCOMM, 2001.

[29] Y. Liu and E. Knightly, "Opportunistic fair scheduling over multiple wireless channels," in Proc. IEEE INFOCOM, 2003.


[30] C. Koksal, H. Kassab, and H. Balakrishnan, "An analysis of short-term fairness in wireless media access protocols," in Proc. ACM SIGMETRICS (poster session), 2000.

[31] M. Bredel and M. Fidler, "Understanding fairness and its impact on quality of service in IEEE 802.11," in Proc. IEEE INFOCOM, 2009.

[32] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System. Eastern Research Laboratory, Digital Equipment Corporation, 1984.

[33] T. Lan, D. Kao, M. Chiang, and A. Sabharwal, "An axiomatic theory of fairness in network resource allocation," in Proc. IEEE INFOCOM, 2010.

[34] "Hadoop Capacity Scheduler," http://hadoop.apache.org/docs/r0.20.2/capacity_scheduler.html.

[35] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in Proc. ACM SOSP, 2009.

[36] A. A. Bhattacharya, D. Culler, E. Friedman, A. Ghodsi, S. Shenker, and I. Stoica, "Hierarchical scheduling for diverse datacenter workloads," in Proc. ACM SoCC, 2013.


Wei Wang received the B.Engr. and M.A.Sc. degrees from the Department of Electrical Engineering, Shanghai Jiao Tong University, in 2007 and 2010. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Toronto. His general research interests cover the broad area of computer networking, with special emphasis on resource management and scheduling in cloud computing systems. He is also interested in problems at the intersection of computer networking and economics.


Ben Liang received honors-simultaneous B.Sc. (valedictorian) and M.Sc. degrees in Electrical Engineering from Polytechnic University in Brooklyn, New York, in 1997, and the Ph.D. degree in Electrical Engineering with a Computer Science minor from Cornell University in Ithaca, New York, in 2001. In the 2001–2002 academic year, he was a visiting lecturer and post-doctoral research associate at Cornell University. He joined the Department of Electrical and Computer Engineering at the University of Toronto in 2002, where he is now a Professor. His current research interests are in mobile communications and networked systems. He is an editor for the IEEE Transactions on Wireless Communications and an associate editor for the Wiley Security and Communication Networks journal, in addition to regularly serving on the organizational or technical committees of a number of conferences. He is a senior member of IEEE and a member of ACM and Tau Beta Pi.


Baochun Li received the B.Engr. degree from the Department of Computer Science and Technology, Tsinghua University, China, in 1995, and the M.S. and Ph.D. degrees from the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, in 1997 and 2000. Since 2000, he has been with the Department of Electrical and Computer Engineering at the University of Toronto, where he is currently a Professor. He held the Nortel Networks Junior Chair in Network Architecture and Services from October 2003 to June 2005, and has held the Bell Canada Endowed Chair in Computer Engineering since August 2005. His research interests include large-scale multimedia systems, cloud computing, peer-to-peer networks, applications of network coding, and wireless networks. Dr. Li was the recipient of the IEEE Communications Society Leonard G. Abraham Award in the Field of Communications Systems in 2000. In 2009, he was a recipient of the Multimedia Communications Best Paper Award from the IEEE Communications Society, and a recipient of the University of Toronto McLean Award. He is a member of ACM and a senior member of IEEE.