TEL-AVIV UNIVERSITY

RAYMOND AND BEVERLY SACKLER

FACULTY OF EXACT SCIENCES

SCHOOL OF COMPUTER SCIENCE

Joint Cache Partition and Job Assignment on Multi-Core Processors

This thesis is submitted in partial fulfillment of the requirements for the M.Sc. degree

in the School of Computer Science, Tel-Aviv University

by

Omry Tuval

The research work for this thesis has been carried out at Tel-Aviv University

under the supervision of Prof. Haim Kaplan

January 2013


Acknowledgements

I would like to thank my advisor Prof. Haim Kaplan for the countless hours of invaluable

discussions and work that went into this research and thesis. I am honored to have worked with

him and thank him greatly for all the encouragement along the journey. It has been a wonderful

experience.

I would like to thank Dr. Avinatan Hassidim for numerous ideas and amazing intuitions.

Without his help this thesis would not come to be. Above all, I thank him for being a true

friend.

Finally, I would like to thank my mother, Myriam Tuval, who supported and encouraged

me in taking time off to do this research. I strive and hope to be half the person that she is.



Abstract

Multicore shared cache processors pose a challenge for designers of embedded systems who try

to achieve minimal and predictable execution time of workloads consisting of several jobs. One

way in which this challenge was addressed is by statically partitioning the cache among the

cores and assigning the jobs to the cores with the goal of minimizing the makespan. Several

heuristic algorithms have been proposed that jointly decide how to partition the cache among

the cores and how to assign the jobs. We initiate a theoretical study of this problem which we

call the joint cache partition and job assignment problem.

In this problem the input is a set of jobs, where each job is specified by a function that

gives the running time of the job for each possible cache allocation. The goal is to statically

partition a cache of size K among c cores and assign each job to a core such that the makespan

is minimized.

By a careful analysis of the space of possible cache partitions we obtain a constant ap-

proximation algorithm for this problem. We give better approximation algorithms for a few

important special cases. We also provide lower and upper bounds on the improvement that can

be obtained by allowing dynamic cache partitions and dynamic job assignments.

We show that our joint cache partition and job assignment problem generalizes an interesting

special case of the problem of scheduling on unrelated machines that is still more general than

scheduling on related machines. In this special case the machines are ordered by their "strength"

and the running time of each job decreases when it is scheduled on a stronger machine. We call

this problem the ordered unrelated machines scheduling problem. We give a polynomial time

algorithm for scheduling on ordered unrelated machines for instances where each job has only

two possible load values and the sets of load values for all jobs are of constant size.



Contents

Abstract

1 Introduction
1.1 Our Contribution
1.2 Related Work
1.2.1 Motivating Practical Work
1.2.2 Machine Scheduling Theory
1.2.3 Single Core Caching
1.2.4 Multi Core Caching

2 The ordered unrelated machines problem

3 Joint cache partition and job assignment
3.1 A constant approximation algorithm
3.2 Single load and minimal cache demand
3.2.1 2-approximation
3.2.2 3/2-approximation with 2K cache
3.2.3 4/3-approximation with 3K cache, using dominant matching
3.2.4 Approximate optimization algorithms for the single load, minimal cache model
3.2.5 Dominant perfect matching in threshold graphs
3.2.6 PTAS for jobs with correlative single load and minimal cache demand
3.3 Step functions with a constant number of load types
3.3.1 The corresponding special case of ordered unrelated machines
3.4 Joint dynamic cache partition and job scheduling

4 Static partitions under bijective analysis

5 Cache partitions in the speed-aware model
5.1 Finding the optimal static partition
5.2 Variable cache partitioning

Bibliography


Chapter 1

Introduction

We study the problem of assigning n jobs to c cores on a multi-core processor, and simultaneously

partitioning a shared cache of size K among the cores. Each job j is given by a non-increasing

function Tj(x) indicating the running time of job j on a core with cache of size x. A solution is a

cache partition p, assigning p(i) cache to each core i, and a job assignment S assigning each job

j to core S(j). The total cache allocated to the cores in the solution is K, that is, $\sum_{i=1}^{c} p(i) = K$.
The makespan of a cache partition p and a job assignment S is $\max_i \sum_{j : S(j)=i} T_j(p(i))$. Our

goal is to find a cache partition and a job assignment that minimize the makespan.
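For concreteness, the objective can be evaluated directly; the following minimal sketch (ours, with illustrative names) computes the makespan of a cache partition p and a job assignment S, representing each load function as a Python callable:

```python
def makespan(T, p, S):
    """Makespan of cache partition p and job assignment S.

    T[j] is a non-increasing callable giving job j's running time as a
    function of its core's cache; p[i] is the cache allocated to core i;
    S[j] is the core that job j is assigned to.
    """
    load = [0] * len(p)
    for j, core in S.items():
        load[core] += T[j](p[core])  # job j runs with its core's cache p[core]
    return max(load)


# Tiny example: two cores, cache split as (3, 1), three jobs.
T = {0: lambda x: 4 - x, 1: lambda x: 5 - x, 2: lambda x: 2}
print(makespan(T, [3, 1], {0: 0, 1: 1, 2: 1}))  # -> 6
```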

Multi-core processors are the prevalent computational architecture used today in PC’s, mo-

bile devices and high performance computing. Having multiple cores running jobs concurrently,

while sharing the same level 2 and/or level 3 cache, results in complex interactions between the

jobs, thereby posing a significant challenge in determining the makespan of a set of jobs. Cache

partitioning has emerged as a technique to increase run time predictability and increase perfor-

mance on multi-core processors [LLD+08, MCHvE06]. Theoretic research on online multi-core

caching shows that the cache partition (which may be dynamic) has more influence on the per-

formance than the eviction policy [Has10, LOS12]. To obtain effective cache partitions, methods

have been developed to estimate the running time of jobs as a function of allocated cache, that

is the functions Tj(x) (see for example the cache locking technique of [LZLX10]).

Recent empirical research [LZLX11, LLX12] suggests that jointly solving for the cache par-

tition and for the job assignment leads to significant improvements over combining separate

algorithms for the two problems. The papers [LZLX11, LLX12] suggest and test heuristic al-

gorithms for the joint cache partition and job assignment problem. Our work initiates the

theoretic study of this problem.


We study this problem in the context of multi-core caching, but our formulation and results

are applicable in a more general setting, where the running time of a job depends on the

availability of some shared resource (cache, CPU, RAM, budget, etc.) that is allocated to the

machines. This setting is applicable, for example, for users of a public cloud infrastructure like

Amazon’s Elastic Cloud. When a user decides on her public cloud setup, there is usually a

limited resource (e.g. budget) that can be spent on different machines in the cloud. The more
budget is spent on a machine, the faster it runs its jobs, and the user is interested in minimizing the
makespan of her set of jobs, while staying within the given budget.

We also study the problem of cache partitioning for multi-core caching in more classical

settings, where each core is already assigned a specific job to run, and our algorithms only

determine the cache partition. In Chapter 4 we consider the model in which each core has

a sequence of page requests to serve. We assume that an algorithm consists of a static cache

partition and an eviction policy and the goal is to find an algorithm that minimizes the maximal

number of cache misses by any core. Assuming all inputs are possible and using bijective analysis

we characterize the optimal algorithm.

In Chapter 5 we study the speed aware multi-core caching problem in which each core is

specified by a speed function vi(x) indicating the speed of core i if it is allocated x cache pages

and we want to partition the cache such that the speed of the slowest core is maximized. We give

algorithms that find the optimal static cache partition and the optimal variable cache partition.

1.1 Our Contribution

We show that the joint cache partition and job assignment problem is related to an interesting

special case of scheduling on unrelated machines that we call the Ordered Unrelated Machines

scheduling problem. In this problem there is a total order on the machines which captures

their relative strength. Each job has a different running time on each machine and these

running times are non-increasing with the strength of the machine. In Chapter 2 we define this

scheduling problem and show a general reduction and an approximation preserving reduction

from scheduling on ordered unrelated machines to the joint cache partition and job assignment

problem.

We present a 36-approximation algorithm for the joint cache partition and job assignment

problem in Section 3.1. We obtain this algorithm by showing that it suffices to consider a subset


of polynomial size of the cache partitions.

We obtain better approximation guarantees for special cases of the joint cache partition

and job assignment problem. When each job has a fixed running time aj and a minimal

cache demand xj , we present, in Section 3.2, a 2-approximation algorithm, a 3/2-approximation
algorithm that uses 2K cache and a 4/3-approximation algorithm that uses 3K cache. We call this
special case the single load minimal cache demand problem. Our 4/3-approximation algorithm

is based on an algorithm that finds a dominant perfect matching in a threshold graph that

has a perfect matching, presented in Section 3.2.5. This algorithm and the existence of such a

matching in such a threshold graph are of independent interest.

We present a polynomial time approximation scheme for the single load minimal cache

demand problem, in the case where the jobs’ loads and cache demands are correlative, that is

$a_j \le a_{j'}$ iff $x_j \le x_{j'}$ (Section 3.2.6).

We study, in Section 3.3, the case where the load functions of the jobs, Tj(x), are step

functions. That is, job j takes lj time to run if given at least xj cache, and otherwise it takes

hj > lj . We show that if there are a constant number of different lj ’s and hj ’s then the problem

can be reduced to the single load minimal cache demand problem and thereby obtain the same

approximation results as for that problem (Section 3.3). We further show that if we consider

the special case of scheduling on ordered unrelated machine that corresponds to this model,

then there is a dynamic programming algorithm that finds the optimal schedule in polynomial

time (Section 3.3.1).

In Section 3.4 we generalize the joint cache partition and job assignment problem and

consider dynamic cache partitions and dynamic job schedules. We show upper and lower bounds

on the makespan improvement that can be gained by using dynamic partitions and dynamic

assignments.

In Chapter 4 we use bijective analysis to prove that if the cores are already assigned specific

jobs and if the cores request pages from disjoint sets of the same size, then any algorithm that

partitions the cache equally among the cores is at least as good as any algorithm that uses

another static partition, regardless of the eviction policies used by the algorithms. Our proof is

based on showing a majorization theorem for the cumulative distribution function of a random

variable that is the maximum of several binomial random variables. We also present some

observations on the problem when cores request pages from sets of different sizes.

For the speed aware model, our main contribution (Chapter 5) is defining a variable cache


partition and presenting a linear program, of exponential size, for finding the optimal variable

cache partition. We then provide a separation oracle [GLS81] for this linear program and thus

obtain a polynomial time algorithm for finding the optimal variable cache partition using the

ellipsoid method [Kha79].

1.2 Related Work

1.2.1 Motivating Practical Work

Cache locking emerged in the real time embedded systems community as a technique to reduce

the unpredictability that is introduced by paging to the running time of computational jobs.

The term cache locking refers to a technique in which a job is analyzed in advance to select

instructions and data to lock in the cache in order to minimize the job’s worst case execution

time. Falk et al [FPT07] describe a heuristic greedy algorithm that picks a subset of the

functions to lock in the cache. As long as some function can fit in the remaining free cache it

picks the function that if locked would generate the largest decrease in execution time and marks

it for locking. For a comparative survey of instruction cache locking heuristics, see [CPIM05].

Vera et al present an algorithm for data cache locking [VLX03] that statically analyzes data

dependencies in the program to decide what data to lock and heuristically locks data that is

likely to be used, to augment the static analysis. Liu et al [LX09] formalize the problem of cache

locking and prove that it is NP-hard by a reduction from set cover. The paper also presents

optimal polynomial algorithms under the assumption that each function in the job’s code is

used only once, and shows that the previously best known heuristic ([FPT07]) for the cache locking
problem performs, experimentally, similarly to their optimal algorithm for this special case.

The joint cache partition and job assignment problem is defined and studied empirically

by Liu et al in [LZLX11]. Cache-locking techniques are used in order to estimate the jobs’

execution time for each cache allocation. They use this mapping from cache allocation to

running time as the input for the joint cache partition and job assignment problem. The

main observation of this paper is that the job assignment and the cache partition influence

each other and therefore should be solved simultaneously. Their heuristic algorithm starts by

allocating the same amount of cache to each core. It assigns the jobs to the cores, using
Graham's algorithm ([Gra69]) for scheduling on identical machines, which is a 4/3-approximation

algorithm. Graham’s algorithm assigns the jobs, in a non-increasing order of their load, to the


currently least loaded core. Given the resulting job assignment, the algorithm computes the

optimal cache partition using a simple greedy algorithm (See Section 5.1). They further adjust

their solution by trying to move the smallest job currently on the most loaded core to the least

loaded core and recompute the optimal cache partition. If this improves the makespan, the

change is accepted and the adjustment process is repeated. This is performed until the

adjustment is no longer beneficial or until a set number of iterations is reached. The paper also

provides an empirical study of the performance of this technique. The work described in this

thesis initiates a theoretical study of the joint cache partition and job assignment problem.
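For reference, the job-assignment step used by this heuristic, Graham's algorithm on identical machines, can be sketched as follows (our illustrative code, not taken from [LZLX11]; the greedy cache-partition step and the adjustment loop are omitted):

```python
import heapq

def graham_assign(loads, c):
    """Assign jobs (a list of loads) to c identical cores: consider jobs in
    non-increasing order of load and place each on the least loaded core."""
    cores = [(0.0, i) for i in range(c)]       # (current load, core index)
    heapq.heapify(cores)
    assignment = {}
    for j in sorted(range(len(loads)), key=lambda j: -loads[j]):
        load, core = heapq.heappop(cores)      # currently least loaded core
        assignment[j] = core
        heapq.heappush(cores, (load + loads[j], core))
    return assignment, max(load for load, _ in cores)
```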

1.2.2 Machine Scheduling Theory

The joint cache partition and job assignment problem generalizes well known machine scheduling

problems where the objective function is the makespan.

In the problem of scheduling on unrelated machines, for each job j and each machine i

we are given T (i, j) which is the running time of job j on machine i. We want to find an

assignment of the jobs to the machines that minimizes the makespan. Lenstra et al. [LST90]

gave a 2-approximation algorithm for scheduling on unrelated machines that is based on a 2-

approximation algorithm for the decision problem. It formulates the decision problem as a linear

program, finds a vertex v of the feasible region if the problem is feasible, and then rounds it

to an integral solution with a makespan that is at most twice that of the fractional assignment

defined by v. The rounding is based on the fact that any vertex of the feasible region has at

most m + |J| nonzero variables, and therefore at most m jobs that are not integrally assigned.

Consider any vertex of the linear program and consider a bipartite graph over the jobs and the

machines such that there is an edge between job j and machine i if a fraction of job j is assigned

to machine i, according to v. Lenstra et al. show how to find, in this graph and in polynomial

time, a matching that covers all the jobs. This matching defines an integral assignment of

makespan at most twice the makespan of the fractional solution defined by v. Lenstra et al. also
show that it is NP-hard to approximate the unrelated machines scheduling problem to within
a factor better than 3/2. This is based on showing that deciding whether an instance in which
T(i, j) ∈ {1, 2} for all i, j has an assignment of makespan at most 2 is NP-complete, by
a reduction from 3-dimensional matching. Shchepin and Vakhania [SV05] improved Lenstra's
rounding technique and obtained a (2 − 1/m)-approximation algorithm for unrelated machines.
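For concreteness, the decision version for a target makespan M can be written as the standard feasibility linear program underlying this approach (the notation here is ours): one seeks fractional assignments $x_{ij} \ge 0$, defined only for pairs with $T(i, j) \le M$, such that
$$\sum_{i :\, T(i,j) \le M} x_{ij} = 1 \ \text{ for every job } j, \qquad \sum_{j :\, T(i,j) \le M} T(i, j)\, x_{ij} \le M \ \text{ for every machine } i.$$
A vertex of this polytope is then rounded, as described above, to an integral assignment of makespan at most 2M.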

The problem of scheduling on uniformly related machines is a special case of the unrelated


machines scheduling problem. In this problem, the input consists of a load lj for each job j and

a speed si for each machine i, and we assume that machine i runs job j in time lj/si. The goal
is again to assign the jobs to the machines such that the makespan is minimized. Hochbaum

and Shmoys [HS88] present a polynomial time approximation scheme for related machines, by

reducing the decision problem of scheduling on related machines to a bin packing problem with

bins of variable size. They give a (1+ε)-approximation algorithm for this bin packing problem.
They also give a more practical 3/2-approximation algorithm.

Scheduling with processing set restrictions is a family of scheduling problems, where each

job j is allowed to be assigned to a given subset of the machines, denoted by Mj . Epstein and

Levin [EL11] show polynomial time approximation schemes for the following special cases of

scheduling with processing set restrictions:

• The nested model, where for any two jobs j, j′ such that |Mj| ≤ |Mj′|, either Mj ∩ Mj′ = ∅
or Mj ⊂ Mj′, and each job j has an identical load on all the machines in Mj.

• The tree hierarchical model where the machines are vertices of a given rooted tree and for

any job j, Mj is defined by a path starting from the root and not necessarily ending at a

leaf and each job j has an identical load on all the machines in Mj .

• The speed hierarchical model, in which we have an instance of the related machines

scheduling problem and each job j has a minimal speed requirement. A job can be

assigned only to machines that meet its minimal speed requirement.

Their PTASs for these problems are based on solving a rounded down instance using dynamic

programming. Our polynomial time algorithm for the special case of ordered unrelated ma-

chines, in Section 3.3.1, is inspired by these PTASs.

Bonifaci and Wiese [BW12] consider a special case of unrelated machines in which there is

a constant number of different machine types. Each job may have different loads on machines

of different types but it has the same load for all machines of the same type. They present a

polynomial time approximation scheme for the problem. Their algorithm is based on classifying

jobs as large or small for each machine type. By rounding the loads of the large jobs they

are able to enumerate, in polynomial time, on a set of preallocated slots for large jobs on the

machines, without deciding the identity of the jobs in the slots. They assign the large jobs to

the slots and the small jobs to the remaining space by iteratively solving a series of related

linear programs.


Ebenlendr et al. [EKS08] gave a 7/4-approximation algorithm for the special case of scheduling

with processing set restrictions where each job can be assigned to at most two machines, and

it has the same load on both. They reformulate this problem as the following graph balancing

problem. Consider an undirected graph where the vertices correspond to the machines, and

the weight of each vertex is the sum of the loads of all jobs that must run on the machine

corresponding to this vertex. Each job j that can run on two machines corresponds to an edge

between the two corresponding vertices and its weight is the load of job j. Given an orientation

of the edges, they define the cost of a vertex v to be the sum of v’s weight and the weights of

all edges directed toward v. The goal is to direct the edges such that the maximum cost of any

vertex is minimized. They obtain the 7/4-approximation algorithm by rounding a linear program
similar to Lenstra's, but with additional constraints that make sure that any vertex has at most
one edge of weight greater than 1/2 oriented toward it. Finally, they show that it is NP-hard to
achieve an approximation ratio better than 3/2 for graph balancing.

1.2.3 Single Core Caching

In the classical paging problem we serve a single sequence of page requests. If a requested

page is in the cache when it is requested, it is a cache hit and otherwise it is a cache miss. In

case of a cache miss the requested page is fetched into the cache and we need to decide which

page in the cache to evict. The goal is to serve the sequence with a minimum number of cache

misses. Furthest-in-the-future [Bel66] is an optimal offline algorithm for the paging problem

that upon any cache miss, evicts the page in the cache whose next request is furthest in the

request sequence.
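A minimal sketch of Furthest-in-the-future on a single request sequence (our illustrative code, assuming the whole sequence is known in advance):

```python
def furthest_in_the_future(requests, K):
    """Number of cache misses of Belady's offline policy with K cache pages."""
    cache, misses = set(), 0
    for t, page in enumerate(requests):
        if page in cache:
            continue                      # cache hit
        misses += 1
        if len(cache) >= K:
            def next_use(q):
                # Position of q's next request, or infinity if never requested again.
                for s in range(t + 1, len(requests)):
                    if requests[s] == q:
                        return s
                return float('inf')
            cache.remove(max(cache, key=next_use))  # evict the page needed furthest ahead
        cache.add(page)
    return misses
```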

Most of the theoretic work on paging considers the online problem, where the request se-

quence is not given in advance and the algorithms’ performance is studied under the framework

of competitive analysis [ST85]. In competitive analysis, the performance of an online algorithm

is compared to the performance of the optimal offline algorithm. The competitive ratio of an

online algorithm is the maximal ratio between the cost of the online algorithm and the cost of

the offline optimal algorithm, for any input. That is, the competitive ratio of algorithm A is

max_σ A(σ)/OPT(σ), where A(σ) is the number of misses by A on input σ, and OPT(σ) is the smallest

possible number of misses of any (offline) algorithm on σ.

In the same paper, Sleator and Tarjan [ST85] show that any deterministic online paging

algorithm has a competitive ratio of Ω(K) where K is the cache size. This lower bound follows


since an adversary can cause any deterministic online algorithm to have a cache miss on every

page request and on the other hand Furthest-in-the-future has at most 1 cache miss for every

K page requests. They further prove that the commonly used eviction policies Least-Recently-

Used (LRU) and First-In-First-Out (FIFO) are K-competitive. LRU is a special case of a wider

class of Marking algorithms that are all K-competitive [KMRS88]. The K-competitiveness of

these algorithms is proved by splitting the page requests sequence into phases such that the

optimal offline algorithm has at least one cache miss in each phase and any marking algorithm

has at most K cache misses in each phase. For more on competitive analysis of deterministic

and randomized online paging algorithms, see [Ira96].

Bijective analysis [ADLO07] was first introduced as an alternative way to analyze the per-

formance of online algorithms for the paging problem. Bijective analysis directly compares

between online algorithms, allowing it to better differentiate between different algorithms that

have the same competitive ratio. Online algorithm A is at least as good as online algorithm B,

under bijective analysis, for inputs of size n if there is a bijection π of the inputs of size n such

that A(π(σ)) ≤ B(σ). Note that bijective analysis considers the performance of the algorithms

for every possible input and not just the worst case.

In [AS09] bijective analysis is used to prove the optimality of LRU when locality of reference

is assumed. This provides a theoretic justification to the experimental results that show that

LRU performs in practice much better than some other marking algorithms.

1.2.4 Multi Core Caching

In the multicore caching problem, c cores share a cache of size K and each core has a separate

sequence of page requests of length n. When a core requests a page that is currently not in

the cache, this core is delayed by τ > 1 time units while this page is fetched into the cache

from main memory and some page currently in the cache (that may have been requested by

another core) is evicted to make room for the fetched page. While a core is fetching a page from

main memory, the other cores continue to advance through their request sequences. The goal

is to design an algorithm that decides, for each cache miss, which page currently in the cache

is evicted, such that the maximal number of cache misses by any core is minimized.

Much of the difficulty in designing competitive online algorithms for multi-core caching

stems from the fact that the way in which the request sequences of the different cores interleave

depends on the decisions of the algorithm. An algorithm with competitive ratio logarithmic in


the number of cores is obtained in [BGV00], for a different model in which the interleaving of

the request sequences is fixed.

Hassidim [Has10] considers the more realistic scenario in which the interleaving of the request

sequences does depend on the algorithm. He proves that even if we restrict the sequences of

different cores to be disjoint, then the offline problem is NP-hard. He further shows that if we

compare LRU with K cache to an optimal offline solution with K/α cache (a technique called

resource augmentation [ST85]), then LRU is Ω(τ/α) competitive. Note that an algorithm that

does not use the cache at all is Θ(τ) competitive.

The work in [Has10] also shows that whenever the optimal offline algorithm evicts a page of core

i it is the page that core i requests furthest in the future. This means that given the amount of

cache allocated to each core at each time, it is easy to decide (offline) which page to evict and

therefore the main challenge is to (dynamically) partition the cache between the cores.

Lopez-Ortiz and Salinger [LOS12] continued the work in the model presented in [Has10].

They show that online algorithms that use static cache partitions have a competitive ratio

of Ω(n) when compared to an optimal offline algorithm that uses dynamic cache partitions.

They also show that combining dynamic cache partitions with the traditional single-core online

eviction policies (like LRU) results in an arbitrarily large competitive ratio, as a function of

n. This paper criticizes [Has10] for considering algorithms that intentionally evict cache pages

and cause cache-misses that may lead to a more favourable interleaving of the cores’ request

sequences. The paper also defines an algorithm to be honest if it only evicts a page when

it incurs a cache miss. They show that for any offline algorithm, there is an honest offline

algorithm that is at least as good as the original algorithm. Finally, they also show that even

if a cache miss takes the same amount of time to serve as a cache hit, and thus does not affect

the sequences’ interleaving, then the offline problem of deciding whether a certain input can be

served such that each core does not have more cache misses than a given threshold per core is

NP-complete.


Chapter 2

The ordered unrelated machines

problem

The ordered unrelated machines scheduling problem is defined as follows. There are c machines

and a set J of jobs. The input is a matrix T (i, j) giving the running time of job j on machine i,

such that for each two machines i1 < i2 and any job j, T (i1, j) ≥ T (i2, j). The goal is to assign

the jobs to the machines such that the makespan is minimized.

The ordered unrelated machines scheduling problem is a special case of scheduling on un-

related machines in which there is a total order on the machines that captures their relative

strengths. This special case is natural since in many practical scenarios the machines have some

underlying notion of strength and jobs run faster on a stronger machine. For example a newer

computer typically dominates an older one in all parameters, or a more experienced employee

does any job faster than a new recruit.

Lenstra et al. [LST90] gave a 2-approximation algorithm for scheduling on unrelated machines
based on rounding an optimal fractional solution to a linear program, and proved that it is NP-
hard to approximate the problem to within a factor better than 3/2. It is currently an open

question if there are better approximation algorithms for ordered unrelated machines than the

more general algorithms that approximate unrelated machines.

Another well-studied scheduling problem is scheduling on uniformly related machines. In

this problem, the time it takes for machine i to run job j is lj/si, where lj is the load of job j and
si is the speed of machine i. A polynomial time approximation scheme for related machines

is described in [HS88]. It is easy to see that the problem of scheduling on related machines is

a special case of the problem of scheduling on ordered unrelated machines, and therefore the


ordered unrelated machines problem is NP-hard.

The ordered unrelated machines problem is closely related to the joint cache partition and

job assignment problem. Consider an instance of the joint cache partition and job assignment

problem with c cores, K cache and a set of jobs J such that Tj(x) is the load function of job

j. If we fix the cache partition to be some arbitrary partition p, and we index the cores in

non-decreasing order of their cache allocation, then we get an instance of the ordered unrelated

machines problem, where T (i, j) = Tj(p(i)). Our constant approximation algorithm for the joint

cache partition and job assignment problem, described in Section 3.1, uses this observation as

well as Lenstra’s 2-approximation algorithm for unrelated machines. In the rest of this section

we prove that the joint cache partition and job assignment problem is at least as hard as the

ordered unrelated machines scheduling problem.

We reduce the ordered unrelated machine problem to the joint cache partition and job

assignment problem. Consider the decision version of the ordered unrelated scheduling problem,

with c machines and n = |J | jobs, where job j takes time T (i, j) to run on machine i. We want

to decide if it is possible to schedule the jobs on the machines with makespan at most M .

Define the following instance of the joint cache partition and job assignment problem. This

instance has c cores, a total cache K = c(c + 1)/2 and n′ = n + c jobs. The first n jobs

(1 ≤ j ≤ n) correspond to the jobs in the original ordered unrelated machines problem, and c

jobs are new jobs (n + 1 ≤ j ≤ n + c). The load function Tj(x) of job j, where 1 ≤ j ≤ n,

equals T (x, j) if x ≤ c and equals T (c, j) if x > c. The load function Tj(x) of job j, where

n + 1 ≤ j ≤ n + c, equals M + δ if x ≥ j − n for some δ > 0 and equals ∞ if x < j − n. Our

load functions Tj(x) are non-increasing because the original T (i, j)’s are non-increasing in the

machine index i.
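The construction can be summarized by the following sketch (ours), which builds the load functions of the joint instance from the matrix T(i, j); machines and cache sizes are 1-indexed in the text, so the 0-indexed matrix access is shifted accordingly:

```python
def reduce_ordered_to_joint(T, c, M, delta):
    """T[i][j]: running time of job j on machine i+1 (rows non-increasing in i).
    Returns the total cache K and the load functions of the joint cache
    partition and job assignment instance produced by the reduction."""
    n = len(T[0])
    K = c * (c + 1) // 2

    def original_job(j):
        # T_j(x) = T(x, j) for 1 <= x <= c, and T(c, j) for x > c (x clamped into [1, c]).
        return lambda x: T[min(max(x, 1), c) - 1][j]

    def new_job(i):
        # Job n + i runs in time M + delta with at least i cache, and never finishes otherwise.
        return lambda x: (M + delta) if x >= i else float('inf')

    loads = [original_job(j) for j in range(n)] + [new_job(i) for i in range(1, c + 1)]
    return K, loads
```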

Lemma 2.1. The makespan of the joint cache partition and job assignment instance defined

above is at most 2M+δ if and only if the makespan of the original unrelated scheduling problem

is at most M .

Proof. Assume there is an assignment S′ of the jobs in the original ordered unrelated machines

instance of makespan at most M . We show a cache partition p and job assignment S for the

joint cache partition and job assignment instance with makespan at most 2M + δ.

The cache partition p is defined such that p(i) = i for each core i. The partition p uses

exactly K = c(c + 1)/2 cache. The job assignment S is defined such that for a job j > n,


S(j) = j − n and for a job j ≤ n, S(j) = S′(j). The partition p assigns i cache to core i, which

is exactly enough for job n+ i, which is assigned to core i by S, to run in time M + δ. It is easy

to verify that p, S is a solution to the joint cache partition and job assignment instance with

makespan at most 2M + δ.

Assume there is a solution p, S for the joint cache partition and job assignment instance,

with makespan at most 2M + δ. Job j, such that n < j ≤ n+ c, must run on a core with cache

at least j−n, or else the makespan would be infinite. Moreover, no two jobs j1 > n and j2 > n

are assigned by S to the same core, as this would give a makespan of at least 2M + 2δ. Combining

these observations with the fact that the total available cache is K = c(c + 1)/2, we get that

the cache partition must be p(i) = i for each core i. Furthermore, each job j > n is assigned

by S to core j − n and all the other jobs assigned by S to core j − n are jobs corresponding to

original jobs in the ordered unrelated machines instance. Therefore, the total load of original

jobs assigned by S to core i is at most M .

We define S′, a job assignment for the original ordered unrelated machines instance, by

setting S′(j) = S(j) for each j ≤ n. Since S assigns original jobs of total load at most M on

each core, it follows that the makespan of S′ is at most M.

The following theorem follows immediately from Lemma 2.1.

Theorem 2.2. There is a polynomial-time reduction from the ordered unrelated machines

scheduling problem to the joint cache partition and job assignment problem.

The reduction in the proof of Lemma 2.1 does not preserve approximation guarantees.

However by choosing δ carefully we can get the following result.

Theorem 2.3. Given an algorithm A for the joint cache partition and job assignment problem

that approximates the optimal makespan up to a factor of 1+ε, for 0 < ε < 1, we can construct an

algorithm for the ordered unrelated machines scheduling problem that approximates the optimal

makespan up to a factor of 1 + 2ε + 2ε²/(1−ε−χ), for any χ > 0.

Proof. We first obtain a (1 + 2ε + 2ε²/(1−ε−χ))-approximation algorithm for the decision version of
the ordered unrelated machines scheduling problem. That is, an algorithm that given a value
M, either decides that there is no assignment of makespan M or finds an assignment with
makespan (1 + 2ε + 2ε²/(1−ε−χ))M.

Given an instance of the ordered unrelated machines scheduling problem, we construct an

instance of the joint cache partition and job assignment as described before lemma 2.1, and set


δ = 2εM/(1−ε−χ), for an arbitrarily small χ > 0. We use algorithm A to solve the resulting instance of

the joint cache partition and job assignment problem. Let p, S be the solution returned by A.

We define S′(j) = S(j) for each 1 ≤ j ≤ n. If the makespan of S′ is at most (1 + 2ε + 2ε²/(1−ε−χ))M,

we return S′ as the solution and otherwise decide that there is no solution with makespan at

most M .

If the makespan of the original instance is at most M , then by lemma 2.1 there is a solution

to the joint cache partition and job assignment instance resulting from the reduction, with

makespan at most 2M+δ. Therefore p, S, the solution returned by algorithm A, is of makespan

at most (1 + ε)(2M + δ).

By our choice of δ we have that (1+ε)(2M+δ) < 2M+2δ and therefore each core is assigned

by S at most one job j, such that j > n. In addition, any job j such that n < j ≤ n+ c, must

run on a core with cache at least j − n, or else the makespan would be infinite. Combining

these observations with the fact that the total available cache is K = c(c + 1)/2, we get that

the cache partition must be p(i) = i for each core i. Furthermore, each job j > n is assigned

by S to core j − n and all the other jobs assigned by S to core j − n are jobs corresponding to

original jobs in the ordered unrelated machines instance. Therefore, the total load of original

jobs assigned by S to core i is at most (1 + ε)(2M + δ)−M − δ. It follows that the makespan

of S′ is at most (1 + ε)(2M + δ) − M − δ = M(1 + 2ε + 2ε²/(1−ε−χ)).

We obtained a (1 + 2ε + 2ε²/(1−ε−χ))-approximation algorithm for the decision version of the
ordered unrelated machines scheduling problem. In order to approximately solve the optimization
problem, we can perform a binary search for the optimal makespan using the approximation
algorithm for the decision version of the problem and get a (1 + 2ε + 2ε²/(1−ε−χ))-approximation

algorithm for the optimization problem. We obtain an initial search range for the binary search

by using $\sum_{j=1}^{n} T(c, j)$ as an upper bound on the makespan of the optimal schedule and $\frac{1}{c}\sum_{j=1}^{n} T(c, j)$

as a lower bound. (See section 3.2.4 for a detailed discussion of a similar case of using an approx-

imate decision algorithm in a binary search framework to obtain an approximate optimization

algorithm.)
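A generic sketch of this binary-search wrapper (ours; it assumes integral running times and a decision procedure decide(M) that returns a schedule of makespan at most ρM whenever some schedule of makespan M exists, and None only when no such schedule exists):

```python
def approx_optimize(decide, lower, upper):
    """Approximate optimization via binary search over the target makespan M.
    Returns a schedule of makespan at most rho * OPT, assuming
    lower <= OPT <= upper and integral running times."""
    best = decide(upper)                 # feasible, since OPT <= upper
    while lower < upper:
        mid = (lower + upper) // 2
        schedule = decide(mid)
        if schedule is None:             # no schedule of makespan mid, so OPT > mid
            lower = mid + 1
        else:
            best, upper = schedule, mid  # keep the schedule, tighten the range
    return best
```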


Chapter 3

Joint cache partition and job

assignment

3.1 A constant approximation algorithm

We first obtain an 18-approximation algorithm for the joint cache partition and job assignment

problem that uses (1 + (5/2)ε)K cache, for some constant 0 < ε < 1/2. We then show another

algorithm that uses K cache and approximates the makespan up to a factor of 36.

Our first algorithm, denoted by A, enumerates over a subset of cache partitions, denoted

by P (K, c, ε). For each partition in this set A approximates the makespan of the corresponding

scheduling problem, using Lenstra’s algorithm, and returns the partition and associated job

assignment with the smallest makespan.

Let $K' = (1 + \varepsilon)^{\lceil \log_{1+\varepsilon}(K) \rceil}$, the smallest integral power of (1 + ε) which is at least K. The
set P(K, c, ε) contains cache partitions in which the cache allocated to each core is an integral
power of (1 + ε) and the number of different integral powers used by the partition is at most
$\log_2(c)$. We denote by b the number of different cache sizes in a partition. Each core is allocated
$\frac{K'}{(1+\varepsilon)^{l_j}}$ cache, where $l_j \in \mathbb{N}$ and $1 \le j \le b$. The smallest possible cache allocated to any core is
the smallest integral power of (1 + ε) which is at least $\frac{K\varepsilon}{c}$, and the largest possible cache allocated
to a core is $K'$. We denote by $\sigma_j$ the number of cores with cache at least $\frac{K'}{(1+\varepsilon)^{l_j}}$. It follows that
there are $(\sigma_j - \sigma_{j-1})$ cores with $\frac{K'}{(1+\varepsilon)^{l_j}}$ cache. We require that $\sigma_j$ is an integral power of 2 and


that the total cache used is at most $(1 + \frac{5}{2}\varepsilon)K$. Formally, $P(K, c, \varepsilon)$ is the set of pairs $(l = \langle l_1, \ldots, l_b\rangle,\ \sigma = \langle \sigma_0, \sigma_1, \ldots, \sigma_b\rangle)$ such that
$$b \in \mathbb{N},\quad 1 \le b \le \log_2 c, \qquad (3.1)$$
$$\forall j,\ l_j \in \mathbb{N},\quad 0 \le l_j \le \log_{1+\varepsilon}\!\left(\tfrac{c}{\varepsilon}\right) + 1,\quad \forall j,\ l_{j+1} > l_j, \qquad (3.2)$$
$$\forall j\ \exists u_j \in \mathbb{N} \text{ s.t. } \sigma_j = 2^{u_j},\quad \sigma_0 = 0,\quad \sigma_b \le c,\quad \forall j,\ \sigma_{j+1} > \sigma_j, \qquad (3.3)$$
$$\sum_{j=1}^{b} (\sigma_j - \sigma_{j-1}) \frac{K'}{(1+\varepsilon)^{l_j}} \le \left(1 + \frac{5}{2}\varepsilon\right) K. \qquad (3.4)$$

When the parameters are clear from the context, we use P to denote P (K, c, ε). Let M(p, S)

denote the makespan of cache partition p and job assignment S. The following theorem specifies

the main property of P , and is proven in the remainder of this section.

Theorem 3.1. Let p, S be any cache partition and job assignment. There are a cache partition
and a job assignment $\bar{p}, \bar{S}$ such that $\bar{p} \in P$ and $M(\bar{p}, \bar{S}) \le 9M(p, S)$.

An immediate corollary of Theorem 3.1 is that algorithm A described above finds a cache

partition and job assignment with makespan at most 18 times the optimal makespan.

Lemma 3.2 shows that A is a polynomial time algorithm.

Lemma 3.2. The size of P is polynomial in c.

Proof. Let (l, σ) ∈ P. The vector σ is a strictly increasing vector of integral powers of 2, where
each power is at most c. Therefore the number of possible vectors for σ is bounded by the
number of subsets of $\{2^0, \ldots, 2^{\lfloor \log_2(c) \rfloor}\}$, which is $O(2^{\log_2 c}) = O(c)$. The vector l is a strictly
increasing vector of integers, each integer at most $\log_{1+\varepsilon}(\frac{c}{\varepsilon}) + 1$. Therefore the number of
vectors l is bounded by the number of subsets of integers that are at most $\log_{1+\varepsilon}(\frac{c}{\varepsilon}) + 1$, which is
$O(2^{\log_{1+\varepsilon}(c/\varepsilon)}) = O(2^{\frac{\log_2(c/\varepsilon)}{\log_2(1+\varepsilon)}})$, which is polynomial in c since ε is a constant. Therefore $|P| = O(c \cdot 2^{\frac{\log_2(c/\varepsilon)}{\log_2(1+\varepsilon)}})$.

Let (p, S) be a cache partition and a job assignment that use c cores, K cache and have

a makespan M(p, S). Define a cache partition $p_1$ such that for each core i, if $p(i) < \frac{K\varepsilon}{c}$ then
$p_1(i) = \frac{K\varepsilon}{c}$, and if $p(i) \ge \frac{K\varepsilon}{c}$ then $p_1(i) = p(i)$. For each core i, $p_1(i) \le p(i) + \frac{K\varepsilon}{c}$ and hence the
total amount of cache allocated by $p_1$ is bounded by (1 + ε)K. For each core i, $p_1(i) \ge p(i)$ and
therefore $M(p_1, S) \le M(p, S)$.

Let $p_2$ be a cache partition such that for each core i, $p_2(i) = (1+\varepsilon)^{\lceil \log_{1+\varepsilon}(p_1(i)) \rceil}$, the smallest
integral power of (1 + ε) that is at least $p_1(i)$. For each i, $p_2(i) \ge p_1(i)$ and thus $M(p_2, S) \le M(p_1, S) \le M(p, S)$. We increased the total cache allocated by at most a multiplicative factor
of (1 + ε) and therefore the total cache used by $p_2$ is at most $(1 + \varepsilon)^2 K \le (1 + \frac{5}{2}\varepsilon)K$, since ε < 1/2.

Let ϕ be any cache partition that allocates to each core an integral power of (1 + ε) cache.

We define the notion of cache levels. We say that core i is of cache level l in ϕ if $\varphi(i) = \frac{K'}{(1+\varepsilon)^l}$.
Let $c_l(\varphi)$ denote the number of cores in cache level l in ϕ. The vector of $c_l$'s, which we call the
cache levels vector of ϕ, defines the partition ϕ completely, since any two partitions that have
the same cache levels vector are identical up to a renaming of the cores.

Let σ(ϕ) be the vector of prefix sums of the cache levels vector of ϕ. Formally, $\sigma_l(\varphi) = \sum_{i=0}^{l} c_i(\varphi)$. Note that $\sigma_l(\varphi)$ is the number of cores in cache partition ϕ with at least $\frac{K'}{(1+\varepsilon)^l}$ cache,
and that for each l, $\sigma_l(\varphi) \le c$.

For each such cache partition ϕ, we define the significant cache levels $l_i(\varphi)$ recursively as
follows. The first significant cache level $l_1(\varphi)$ is the first cache level l such that $c_l(\varphi) > 0$.
Assume we have already defined the first i − 1 significant cache levels and let $l' = l_{i-1}(\varphi)$; then $l_i(\varphi)$
is the smallest cache level $l > l'$ such that $\sigma_l(\varphi) \ge 2\sigma_{l'}(\varphi)$.
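These definitions translate directly into the following sketch (ours), which computes the cache levels vector, its prefix sums, and the significant cache levels of a partition ϕ whose cache sizes are integral powers of (1 + ε):

```python
import math

def significant_levels(phi, K_prime, eps, max_level):
    """phi: list of per-core cache sizes, each of the form K'/(1+eps)^l.
    Returns the cache levels vector, its prefix sums, and the list of
    significant cache levels."""
    c_vec = [0] * (max_level + 1)
    for cache in phi:
        c_vec[round(math.log(K_prime / cache, 1 + eps))] += 1
    sigma, total = [], 0
    for cl in c_vec:
        total += cl
        sigma.append(total)
    significant = []
    for l in range(max_level + 1):
        if not significant:
            if c_vec[l] > 0:                           # first level with a core
                significant.append(l)
        elif sigma[l] >= 2 * sigma[significant[-1]]:   # prefix sum at least doubled
            significant.append(l)
    return c_vec, sigma, significant
```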

Lemma 3.3. Let $l_j$ and $l_{j+1}$ be two consecutive significant cache levels of ϕ; then the total
number of cores in cache levels in between $l_j$ and $l_{j+1}$ is at most $\sigma_{l_j}(\varphi)$. Let $l_b$ be the last
significant cache level of ϕ; then the total number of cores in cache levels larger than $l_b$ is at
most $\sigma_{l_b}(\varphi)$.

Proof. Assume to the contrary that $\sum_{f=l_j+1}^{l_{j+1}-1} c_f(\varphi) \ge \sigma_{l_j}(\varphi)$. This implies that for $l' = l_{j+1} - 1$,
$\sigma_{l'}(\varphi) \ge 2\sigma_{l_j}(\varphi)$, which contradicts the assumption that there are no significant cache levels in
between $l_j$ and $l_{j+1}$ in ϕ. The proof of the second part of the lemma is analogous.

Let $c_l = c_l(p_2)$. For each core i, $\frac{K\varepsilon}{c} \le p_2(i) \le K'$, so we get that if l is a cache level in $p_2$
such that $c_l \ne 0$ then $0 \le l \le \log_{1+\varepsilon}(\frac{c}{\varepsilon}) + 1$. Let $\sigma_l = \sigma_l(p_2)$ and $\sigma = \langle \sigma_1, \ldots, \sigma_{b'} \rangle$, where
$b' = \log_{1+\varepsilon}(\frac{c}{\varepsilon}) + 1$. Let $l_i = l_i(p_2)$, for $1 \le i \le b$, where b is the number of significant cache
levels in $p_2$.

We adjust p2 and S to create a new cache partition p3 and a new job assignment S3. Cache

partition p3 has cores only in the significant cache levels l1, . . . , lb of p2. We obtain p3 from p2

as follows. Let f be a non-significant cache level in p2. If there is a j such that lj−1 < f < lj

then we take the cf cores in cache level f in p2 and reduce their cache so they are now in cache

level lj in p3. If f > lb then we remove the cf cores at level f from our solution. It is easy to

check that the significant cache levels of p3 are the same as of p2, that is l1, . . . , lb. Since we


only reduce the cache allocated to some cores, the new cache partition p3 uses no more cache

than $p_2$, which is at most $(1 + \frac{5}{2}\varepsilon)K$.

We construct S3 by changing the assignment of the jobs assigned by S to cores in non-

significant cache levels in p2. As before, let f be a nonsignificant cache level and let lj−1 be the

maximal significant cache level such that lj−1 < f . For each core i in cache level f in p2 we

move all the jobs assigned by S to core i, to a target core in cache level lj−1 in p3. Lemma 3.4

specifies the key property of this job-reassignment.

Lemma 3.4. We can construct S3 such that each core in a significant level of p3 is the target

of the jobs from at most two cores in a nonsignificant level of p2.

Proof. Let $c^3$ denote the cache levels vector of $p_3$ and let $\sigma^3$ denote the vector of prefix sums
of $c^3$. From the definition of $p_3$ it follows that for all j, $\sigma^3_{l_j} = \sigma_{l_j}$, and that for $j > 1$, $c^3_{l_j} = \sigma^3_{l_j} - \sigma^3_{l_{j-1}} = \sigma_{l_j} - \sigma_{l_{j-1}}$.

By Lemma 3.3 the number of cores in nonsignificant levels in $p_2$ whose jobs are reassigned
to one of the $c^3_{l_j}$ cores in level $l_j$ in $p_3$ is at most $\sigma_{l_j}$. So for $j > 1$ the ratio between the number
of cores whose jobs are reassigned and the number of target cores in level $l_j$ in $p_3$ is at most
$\frac{\sigma_{l_j}}{\sigma_{l_j} - \sigma_{l_{j-1}}} = 1 + \frac{\sigma_{l_{j-1}}}{\sigma_{l_j} - \sigma_{l_{j-1}}} \le 2$. For $j = 1$ the number of target cores in level $l_1$ of $p_3$ is $c^3_{l_1} = \sigma_{l_1}$,
which is at least as large as the number of cores at nonsignificant levels between $l_1$ and $l_2$ in $p_2$,
so we can reassign the jobs of a single core of a nonsignificant level between $l_1$ and $l_2$ in $p_2$ to
each target core.

Corollary 3.5. M(p3, S3) ≤ 3M(p, S)

Proof. In the new cache partition p3 and job assignment S3 we have added to each core at a

significant level in p3 the jobs from at most 2 other cores at nonsignificant levels in p2. The

target core always has more cache than the original core, thus the added load from each original

core is at most M(p2, S). It follows that M(p3, S3) ≤ 3M(p2, S) ≤ 3M(p, S).

Let $c^3$ denote the cache levels vector of $p_3$ and let $\sigma^3$ denote the vector of prefix sums of $c^3$.
We now define another cache partition $\bar{p}$ based on $p_3$. Let $u_j = \lfloor \log_2(\sigma^3_{l_j}) \rfloor$. The partition $\bar{p}$ has
$2^{u_1}$ cores in cache level $l_1$, and $2^{u_j} - 2^{u_{j-1}}$ cores in cache level $l_j$ for $1 < j \le b$. The cache levels
$l_1, \ldots, l_b$ are the significant cache levels of $\bar{p}$, and $\bar{p}$ has cores only in its significant cache levels.
Let $\bar{c}_{l_j}$ denote the number of cores in the significant cache level $l_j$ in $\bar{p}$.

Lemma 3.6. $3\bar{c}_{l_j} \ge c^3_{l_j}$.


Proof. By the definition of $u_j$, we have that $2^{u_j} \le \sigma^3_{l_j} < 2^{u_j+1}$. So for $j > 1$
$$\frac{\bar{c}_{l_j}}{c^3_{l_j}} = \frac{2^{u_j} - 2^{u_{j-1}}}{\sigma^3_{l_j} - \sigma^3_{l_{j-1}}} > \frac{2^{u_j} - 2^{u_{j-1}}}{2^{u_j+1} - 2^{u_{j-1}}} = \frac{2^{u_j-u_{j-1}} - 1}{2^{u_j-u_{j-1}+1} - 1}. \qquad (3.5)$$
Since $l_j$ and $l_{j-1}$ are two consecutive significant cache levels we have that $u_j - u_{j-1} \ge 1$. The
ratio in (3.5) is an increasing function of $u_j - u_{j-1}$ and thus minimized by $u_j - u_{j-1} = 1$, yielding
a lower bound of $\frac{1}{3}$. For $j = 1$, $\frac{\bar{c}_{l_1}}{c^3_{l_1}} = \frac{2^{u_1}}{\sigma^3_{l_1}} > \frac{2^{u_1}}{2^{u_1+1}} = \frac{1}{2}$.

Lemma 3.6 shows that the cache partition $\bar{p}$ has in each cache level $l_j$ at least a third of the
cores that $p_3$ has at level $l_j$. Therefore, there exists a job assignment $\bar{S}$ that assigns to each
core of cache level $l_j$ in $\bar{p}$ the jobs that $S_3$ assigns to at most 3 cores in cache level $l_j$ in $p_3$.
We only moved jobs within the same cache level and thus their load remains the same, and the
makespan $M(\bar{p}, \bar{S}) \le 3M(p_3, S_3) \le 9M(p, S)$.

Lemma 3.7. Cache partition $\bar{p}$ is in the set P(K, c, ε).

Proof. Let $\bar{\sigma}$ be the vector of prefix sums of $\bar{c}$. The vectors $\langle l_1, \ldots, l_b \rangle, \langle \bar{\sigma}_{l_1}, \ldots, \bar{\sigma}_{l_b} \rangle$
clearly satisfy properties (3.1)-(3.3) in the definition of P(K, c, ε). It remains to show that $\bar{p}$ uses at
most $(1 + \frac{5}{2}\varepsilon)K$ cache (property (3.4)).

Consider the core with the xth largest cache in $\bar{p}$. Let $l_j$ be the cache level of this core.
Thus $\bar{\sigma}_{l_j} \ge x$. Since $\bar{\sigma}_{l_j}$ is the result of rounding down $\sigma^3_{l_j}$ to the nearest integral power of 2,
we have that $\bar{\sigma}_{l_j} \le \sigma^3_{l_j}$. It follows that $\sigma^3_{l_j} \ge x$, and therefore the core with the xth largest cache
in $p_3$ is in cache level $l_j$ or smaller, and thus it has at least as much cache as the xth largest
core in $\bar{p}$. So $\bar{p}$ uses at most the same amount of cache as $p_3$, which is at most $(1 + \frac{5}{2}\varepsilon)K$.

This concludes the proof of Theorem 3.1, and establishes that our algorithm A is an 18-

approximation algorithm for the problem, using $(1 + \frac{5}{2}\varepsilon)K$ cache.

We provide a variation of algorithm A that uses at most K cache, and finds a 36-approximation
for the optimal makespan. Algorithm B enumerates over r, 1 ≤ r ≤ K, the amount of cache
allocated to the first core. It then enumerates over the set of partitions $P = P(\frac{K-r}{2}, \lceil \frac{c}{2} \rceil - 1, \frac{2}{5})$.
For each partition in P it adds another core with r cache and applies Lenstra's approximation
algorithm on the resulting instance of the unrelated machines scheduling problem, to assign all
the jobs in J to the $\lceil \frac{c}{2} \rceil$ cores. Algorithm B returns the partition and assignment with the

minimal makespan it encounters.


Theorem 3.8. If there is a solution of makespan M that uses at most K cache and at most c

cores then algorithm B returns a solution of makespan 36M that uses at most K cache and at

most c cores.

Proof. Let (p, S) be a solution of makespan M, K cache and c cores. W.l.o.g. assume that
the cores are indexed according to the non-increasing order of their cache allocation in this
solution, that is p(i + 1) ≤ p(i).

Let J′ = {j ∈ J | S(j) ≥ 3}. Consider the following job assignment S′ of the jobs in J′ to

the cores of odd indices greater than 1 in (p, S). The assignment S′ assigns to core 2i − 1, for

i ≥ 2, all the jobs that are assigned by S to cores 2i− 1 and 2i. Note that all the jobs assigned

by S′ to some core are assigned by S to a core with at most the same amount of cache and thus

the makespan of S′ is at most 2M .

Assume r = p(1). Then $K = r + \sum_{\text{odd } i \ge 3} (p(i) + p(i-1)) \ge r + \sum_{\text{odd } i \ge 3} 2p(i)$, since p is non-increasing. Therefore we get that $\sum_{\text{odd } i \ge 3} p(i) \le \frac{K-r}{2}$. Therefore we can assign the jobs in J′ to
$\lceil \frac{c}{2} \rceil - 1$ cores with a total cache of $\frac{K-r}{2}$, such that the makespan is at most 2M. By Theorem
3.1, there is a partition $p' \in P(\frac{K-r}{2}, \lceil \frac{c}{2} \rceil - 1, \frac{2}{5})$ that allocates at most $(1 + \frac{5}{2} \cdot \frac{2}{5})\frac{K-r}{2} = K - r$
cache to $\lceil \frac{c}{2} \rceil - 1$ cores, and a job assignment S′ of the jobs in J′ to these cores such that the
makespan of p′, S′ is at most 18M.

Let $\bar{p}$ be a cache partition that adds to p′ another core (called "core 1") with r cache. The
total cache used by $\bar{p}$ is at most K. Let $\bar{S}$ be a job assignment such that $\bar{S}(j) = S'(j)$ for $j \in J'$,
and for a job $j \in J \setminus J'$ (a job that was assigned by S either to core 1 or to core 2), $\bar{S}(j) = 1$.
Since the makespan of (p, S) is M we know that the load on core 1 in the solution $\bar{p}, \bar{S}$ is at
most 2M. It follows that the makespan of $\bar{p}, \bar{S}$ is at most 18M.

When algorithm B fixes the size of the cache of the first core to be r = p(1) and considers p′ ∈ P((K − r)/2, ⌈c/2⌉ − 1, 2/5), it obtains the cache partition p′′. We know that S′′ is a solution to the corresponding scheduling problem with makespan at most 18M. Therefore Lenstra's approximation algorithm finds an assignment with makespan at most 36M.

3.2 Single load and minimal cache demand

We consider a special case of the general joint cache partition and job assignment problem where

each job has a minimal cache demand xj and single load value aj . Job j must run on a core

with at least xj cache and it contributes a load of aj to the core. We want to decide if the jobs


can be assigned to c cores, using K cache, such that the makespan is at most M? W.l.o.g. we

assume M = 1.

In Section 3.2.1 we describe a 2-approximate decision algorithm that, if the given instance has a solution of makespan at most 1, returns a solution with makespan at most 2, and otherwise may fail. In Sections 3.2.2 and 3.2.3 we improve the approximation guarantee to 3/2 and 4/3 at the expense of using 2K and 3K cache, respectively. In Section 3.2.4 we show how to obtain an

approximate optimization algorithm using an approximate decision algorithm and a standard

binary search technique.

3.2.1 2-approximation

We present a 2-approximate decision algorithm, denoted by A2. Algorithm A2 sorts the jobs in

a non-increasing order of their cache demand. It then assigns the jobs to the cores in this order.

It keeps assigning jobs to a core until the load on the core exceeds 1. Then, A2 starts assigning

jobs to the next core. Note that among the jobs assigned to a specific core the first one is the

most cache demanding and it determines the cache allocated to this core by A2. Algorithm A2

fails if the generated solution uses more than c cores or more than K cache. Otherwise, A2

returns the generated cache partition and job assignment.
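The following Python sketch makes the greedy rule of A2 concrete; the representation of a job as a (cache demand, load) pair is an assumption made for the example.

```python
def algorithm_A2(jobs, c, K):
    """Greedy 2-approximate decision algorithm A2.
    `jobs` is a list of (x_j, a_j) pairs: minimal cache demand and load.
    Returns (cache partition, assignment) or None if A2 fails."""
    order = sorted(range(len(jobs)), key=lambda j: jobs[j][0], reverse=True)
    partition, assignment = [], {}
    load = 0.0
    for j in order:
        x_j, a_j = jobs[j]
        if not partition or load > 1:         # open a new core
            partition.append(x_j)             # its first (most demanding) job fixes the cache
            load = 0.0
        assignment[j] = len(partition) - 1
        load += a_j
    if len(partition) > c or sum(partition) > K:
        return None                           # A2 fails
    return partition, assignment
```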

Theorem 3.9. If there is a cache partition and job assignment of makespan at most 1 that use

c cores and K cache then algorithm A2 finds a cache partition and job assignment of makespan

at most 2 that use at most c cores and at most K cache.

Proof. Let Y = (p, S) be a cache partition and job assignment with makespan at most 1 whose existence is assumed by the theorem. Y has makespan at most 1, so the sum of the loads

of all jobs is at most c. Since A2 loads each core, except maybe the last one, with a load of more than 1, it follows that A2 uses at most c cores.

Since Y has makespan at most 1 the load of each of the jobs is at most 1. Algorithm A2

only exceeds a load of 1 on a core by the load of the last job assigned to this core and thus A2

yields a solution with makespan at most 2.

Assume w.l.o.g that the cores in Y are indexed such that for any core i, p(i + 1) ≤ p(i).

Assume that the cores in A2 are indexed in the order in which they were loaded by A2. By

the definition of A2 the cores are also sorted by non-increasing order of their cache allocation.

Denote by z(i) the amount of cache A2 allocates to core i. We show that for all i ∈ {1, . . . , c},


z(i) ≤ p(i). This implies that algorithm A2 uses at most K cache.

A2 allocates to the first core the cache required by the most demanding job, so z(1) = maxj xj. This job must be assigned in Y to some core and therefore z(1) ≤ p(1). Assume to the contrary that z(i) > p(i) for some i. Each job j with cache demand xj > p(i) must be assigned in Y to one of the first (i − 1) cores, because all the other cores do not have enough cache to run this job. Since Y has makespan at most 1 we know that ∑_{j | xj > p(i)} aj ≤ i − 1. Consider all the jobs with cache demand at least z(i). Algorithm A2 failed to assign all these jobs to the first (i − 1) cores, and we know that A2 assigns more than 1 load to each of these cores, so ∑_{j | xj ≥ z(i)} aj > i − 1. On the other hand, since z(i) > p(i), every job with xj ≥ z(i) also has xj > p(i), and therefore ∑_{j | xj ≥ z(i)} aj ≤ ∑_{j | xj > p(i)} aj ≤ i − 1, which is a contradiction. Therefore z(i) ≤ p(i) for all i and algorithm A2 uses at most K cache.

3.2.2 3/2-approximation with 2K cache

We define a job to be large if aj > 1/2 and small otherwise. Our algorithm A_{3/2} assigns one large job to each core. Let si be the load on core i after the large jobs are assigned. Let ri = 1 − si. We process the small jobs by non-increasing order of their cache demand xj, and assign them to the cores in non-increasing order of the cores' ri's. We stop assigning jobs to a core when its load exceeds 1 and start loading the next core. Algorithm A_{3/2} allocates to each core the cache demand of its most demanding job. Algorithm A_{3/2} fails if the resulting solution uses more than c cores or more than 2K cache.
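A sketch of A_{3/2} in the same style as the sketch of A2 above, again assuming jobs are given as (cache demand, load) pairs:

```python
def algorithm_A_3_2(jobs, c, K):
    """Sketch of A_{3/2}: one large job (load > 1/2) per core, then small jobs greedily.
    Returns (cache per core, assignment) or None on failure."""
    large = [j for j, (x, a) in enumerate(jobs) if a > 0.5]
    small = [j for j, (x, a) in enumerate(jobs) if a <= 0.5]
    if len(large) > c:
        return None
    loads, caches, assignment = [0.0] * c, [0] * c, {}
    for core, j in enumerate(large):                 # one large job per core
        assignment[j] = core
        loads[core] = jobs[j][1]
        caches[core] = jobs[j][0]
    # process cores in non-increasing order of r_i = 1 - s_i (non-decreasing load)
    core_order = sorted(range(c), key=lambda i: loads[i])
    small.sort(key=lambda j: jobs[j][0], reverse=True)   # non-increasing cache demand
    pos = 0
    for j in small:
        while pos < c and loads[core_order[pos]] > 1:    # core's load exceeded 1: move on
            pos += 1
        if pos == c:
            return None                                  # would need more than c cores
        core = core_order[pos]
        assignment[j] = core
        loads[core] += jobs[j][1]
        caches[core] = max(caches[core], jobs[j][0])
    if sum(caches) > 2 * K:
        return None
    return caches, assignment
```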

Theorem 3.10. If there is a cache partition and job assignment of makespan at most 1 that use c cores and K cache then A_{3/2} finds a cache partition and job assignment that use at most 2K cache, at most c cores and have a makespan of at most 3/2.

Proof. Let Y = (p, S) be a cache partition and job assignment with makespan at most 1 whose existence is assumed by the theorem. The existence of Y implies that there are at most c large jobs in our input and that the total volume of all the jobs is at most c. Therefore algorithm A_{3/2} uses at most c cores to assign the large jobs. Furthermore, when A_{3/2} assigns the small jobs it loads each core, except maybe the last one, with a load of at least 1 and thus uses at most c cores. Algorithm A_{3/2} provides a solution with makespan at most 3/2 since it can only exceed a load of 1 on any core by the load of a single small job.

Let z be the cache partition generated by A_{3/2}. Let Cl be the set of cores whose most cache


demanding job is a large job and Cs be the set of cores whose most cache demanding job is a

small job. For a core i ∈ Cl, let ji be the most cache demanding job assigned to core i, so we have z(i) = xji. The solution Y = (p, S) is a valid solution, thus xji ≤ p(S(ji)) and so z(i) ≤ p(S(ji)). Since large jobs have load greater than 1/2 and Y has makespan at most 1, any two large jobs j1, j2 satisfy S(j1) ≠ S(j2), and we get that ∑_{i∈Cl} z(i) ≤ ∑_{i∈Cl} p(S(ji)) ≤ ∑_{i=1}^{c} p(i) ≤ K.

In the rest of the proof we index the cores in the solution of A_{3/2} such that r1 ≥ r2 ≥ . . . ≥ rc. This is the same order in which A_{3/2} assigns small jobs to the cores. In Y we assume that the cores are indexed such that p(i) ≥ p(i + 1). We now prove that z(i) ≤ p(i) for any core i ∈ Cs.

Assume, to the contrary, that for some i, z(i) > p(i). Let α be the cache demand of the most cache demanding small job assigned by Y to cores i, . . . , c. Let J1 = {j | aj ≤ 1/2, xj ≥ z(i)} and let J2 = {j | aj ≤ 1/2, xj > α}. Since α ≤ p(i) and by our assumption p(i) < z(i), we get that α < z(i) and therefore J1 ⊆ J2.

A_{3/2} does not assign all the jobs of J1 to its first (i − 1) cores and therefore the total load of the jobs in J1 is greater than ∑_{l=1}^{i−1} rl. On the other hand we know that in Y, assignment S assigns all the jobs in J2 to its first i − 1 cores while not exceeding a load of 1. Thus the total load of the jobs in J2 is at most the space available for small jobs on the first (i − 1) cores in solution Y. Since r1 ≥ r2 ≥ . . . ≥ rc, and since in any solution each core runs at most one large job, we get that ∑_{l=1}^{i−1} rl is at least as large as the space available for small jobs in any subset of (i − 1) cores in any solution. It follows that the total load of the jobs in J2 is smaller than the total load of the jobs in J1, which contradicts the fact that J1 ⊆ J2.

We conclude that for every i ∈ Cs, z(i) ≤ p(i). This implies that the total cache allocated to cores in Cs is at most K. We previously showed that the total cache allocated to cores in Cl is at most K and thus the total cache used by algorithm A_{3/2} is at most 2K.

3.2.3 4/3-approximation with 3K cache, using dominant matching

We present a 4/3-approximate decision algorithm, A_{4/3}, that uses at most 3K cache. The main challenge is assigning the large jobs, which here are defined as jobs of load greater than 1/3. There are at most 2c large jobs in our instance, because we assume there is a solution of makespan at most 1 that uses c cores. Algorithm A_{4/3} matches these large jobs into pairs, and assigns each pair to a different core. In order to perform the matching, we construct a graph G where each vertex represents a large job j of weight aj > 1/3. If needed, we add artificial vertices


of weight zero to have a total of exactly 2c vertices in the graph. Each two vertices have an

edge between them if the sum of their weights is at most 1. The weight of an edge is the sum

of the weights of its endpoints.

A perfect matching in a graph is a subset of edges such that every vertex in the graph is

incident to exactly one edge in the subset. We note that there is a natural bijection between

perfect matchings in the graph G and assignments of makespan at most 1 of the large jobs to

the cores. The c edges in any perfect matching define the assignment of the large jobs to the

c cores as follows: Let (a, b) be an edge in the perfect matching. If both a and b correspond

to large jobs, we assign both these jobs to the same core. If a corresponds to a large job and

b is an artificial vertex, we assign the job corresponding to a to its own core. If both a and b

are artificial vertices, we leave a core without any large jobs assigned to it. Similarly we can

injectively map any assignment of the large jobs of makespan at most 1 to a perfect matching

in G: For each core that has 2 large jobs assigned to it, we select the edge in G corresponding

to these jobs, for each core with a single large job assigned to it, we select an edge between the

corresponding real vertex and an arbitrary artificial vertex, and for each core with no large jobs

assigned to it we select an edge in G between two artificial vertices.

A dominant perfect matching in G is a perfect matching Q such that for every i, the i heaviest

edges in Q are a maximum weight matching in G of i edges. The graph G is a threshold graph

[MP95], and in Section 3.2.5 we provide a polynomial time algorithm that finds a dominant

perfect matching in any threshold graph that has a perfect matching. If there is a solution for

the given instance of makespan at most 1 then the assignment of the large jobs in that solution

corresponds to a perfect matching in G, and thus algorithm A_{4/3} can apply the algorithm from

Section 3.2.5 and find a dominant perfect matching, Q, in G.

Algorithm A_{4/3} then assigns the small jobs (load ≤ 1/3) similarly to algorithms A2 and A_{3/2} described in Sections 3.2.1 and 3.2.2, respectively. It greedily assigns jobs to a core until the core's load exceeds 1. Jobs are assigned in a non-increasing order of their cache demand and the algorithm goes through the cores in a non-decreasing order of the sum of the loads of the large jobs on each core. Once all the jobs are assigned, the algorithm allocates cache to the cores according to the cache demand of the most demanding job on each core. Algorithm A_{4/3} fails if it does not find a dominant perfect matching in G or if the resulting solution uses more than c cores or more than 3K cache.
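The large-job phase of A_{4/3} can be sketched as follows; dominant_perfect_matching stands for the greedy procedure of Section 3.2.5 (a sketch of it appears there), and the job representation is again an assumed (cache demand, load) pair.

```python
def pair_large_jobs(jobs, c, dominant_perfect_matching):
    """Pair the large jobs (load > 1/3) of A_{4/3} using a dominant perfect matching of
    the threshold graph G (threshold t = 1).  Returns a list of c tuples of job indices
    (artificial zero-weight vertices are filtered out), or None if G has no perfect
    matching."""
    large = [j for j, (x, a) in enumerate(jobs) if a > 1 / 3]
    # pad with artificial zero-weight vertices so that G has exactly 2c vertices
    weights = [jobs[j][1] for j in large] + [0.0] * (2 * c - len(large))
    matching = dominant_perfect_matching(weights, threshold=1.0)
    if matching is None:
        return None
    pairs = []
    for (u, v) in matching:                       # vertices are indices into `weights`
        real = [large[w] for w in (u, v) if w < len(large)]
        pairs.append(tuple(real))                 # 2, 1 or 0 real large jobs per core
    return pairs
```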

Theorem 3.11. If there is a solution that assigns the jobs to c cores with makespan at most 1


and uses K cache then algorithm A_{4/3} assigns the jobs to c cores with makespan at most 4/3 and uses at most 3K cache.

Proof. Let Y = (p, S) be a solution of makespan at most 1, that uses c cores and K cache.

Algorithm A_{4/3} provides a solution with makespan at most 4/3 since it may only exceed a load of 1 on any core by the load of a single small job.

Algorithm A_{4/3} uses at most c cores to assign the large jobs because the assignment is based on a perfect matching of size c in G. The existence of Y implies that the total load of all jobs is at most c. When A_{4/3} assigns the small jobs it exceeds a load of 1 on all cores it processes, except maybe the last one, and therefore we get that A_{4/3} uses at most c cores.

Let z be the cache partition generated by A_{4/3}. Let Cl be the set of cores whose most

demanding job is a large job and Cs be the set of cores whose most demanding job is a small

job.

Consider any core i ∈ Cl. Let j be the most cache demanding large job assigned to core i.

Job j runs in solution Y on some core S(j). Therefore z(i) = xj ≤ p(S(j)). Since each core in

Y runs at most two large jobs, we get that the total cache allocated by our algorithm to cores

in Cl is at most 2K.

Consider the large jobs assigned to cores according to the dominant perfect matching Q. Denote by si the load on core i after the large jobs are assigned (and before the small jobs are assigned) and let ri = 1 − si. W.l.o.g. we assume the cores in A_{4/3} are indexed such that r1 ≥ . . . ≥ rc. For every i, ∑_{l=i}^{c} sl is at least as large as this sum in any assignment of the large jobs of makespan at most 1, because any such assignment defines a perfect matching in the graph G, and if ∑_{l=i}^{c} sl were larger in some other assignment then Q would not be a dominant perfect matching in G. Since the total volume of all large jobs is fixed, we get that for every core i the amount of free volume on cores 1 through i, ∑_{l=1}^{i} rl, is maximal and cannot be exceeded by any other assignment of the large jobs of makespan at most 1.

W.l.o.g we assume that the cores in solution Y = (p, S) are indexed such that p(i) ≥ p(i+1).

Let i be any core in Cs. We show that z(i) ≤ p(i). Assume, to the contrary, that z(i) > p(i).

Let α be the cache demand of the most cache demanding small job assigned by Y to cores i, . . . , c. Let J1 = {j | aj ≤ 1/3, xj ≥ z(i)} and J2 = {j | aj ≤ 1/3, xj > α}. Since α ≤ p(i) < z(i), we get that J1 ⊆ J2.

Solution Y assigns all the jobs in J2 to its first (i− 1) cores, without exceeding a makespan

of 1. Therefore the total volume of jobs in J2 is at most the total available space solution Y


has on its first (i − 1) cores after assigning the large jobs. Since we know that for every i, ∑_{l=1}^{i} rl is maximal and cannot be exceeded by any assignment of the large jobs of makespan at most 1, we get that the total volume of the jobs in J2 is at most ∑_{l=1}^{i−1} rl. Algorithm A_{4/3} does not assign all the jobs in J1 to its first (i − 1) cores, and since A_{4/3} loads each of the first (i − 1) cores with a load of at least 1, we get that the total volume of the jobs in J1 is greater than ∑_{l=1}^{i−1} rl. So we get that the total volume of the jobs in J2 is less than the total volume of the jobs in J1, but that is a contradiction to the fact that J1 ⊆ J2. Therefore z(i) ≤ p(i) for every i ∈ Cs. It follows that the total cache allocated by our algorithm to cores in Cs is at most K, and this concludes the proof that our algorithm allocates a total of at most 3K cache to all cores.

3.2.4 Approximate optimization algorithms for the single load, minimal cache

model

We presented approximation algorithms for the decision version of the joint cache partition

and job assignment problem in the single load and minimal cache demand model. If there is

a solution with makespan m, algorithms A2, A_{3/2} and A_{4/3} find a solution of makespan 2m, 3m/2 and 4m/3, that uses K, 2K and 3K cache, respectively.

algorithms into approximate optimization algorithms using a standard binary search technique

[LST90].

Lemma 3.12. Given m, K and c, assume there is a polynomial time approximate decision

algorithm that if there is a solution of makespan m, K cache and c cores, returns a solution of

makespan αm, βK cache and c cores, where α and β are at least 1. Then, there is a polynomial

time approximation algorithm that finds a solution of makespan αmopt, βK cache and c cores,

where mopt is the makespan of the optimal solution with K cache and c cores.

Proof. Let’s temporarily assume that the loads of all jobs are integers. This implies that for

any cache partition and job assignment the makespan is an integer.

Our approximate optimization algorithm performs a binary search for the optimal makespan

and maintains a search range [L, U]. Initially, U = ∑_{j=1}^{n} aj and L = ⌈U/c⌉. Clearly these initial

values of L and U are a lower and an upper bound on the optimal makespan, respectively. Let

A be the approximate decision algorithm whose existence is assumed in the lemma’s statement.

In each iteration, we run algorithm A with parameters K, c and m = ⌊(L + U)/2⌋. If A succeeds

and returns a solution with makespan at most αm we update the upper bound U := m. If A


fails, we know there is no solution of makespan at most m, and we update the lower bound

L := m + 1. It is easy to see that the binary search maintains the invariant that after any

iteration, if the search range is [L,U ] then mopt ∈ [L,αU ] and we have a solution of makespan

at most αU . The binary search stops when L = U .

The makespan of the solution when the binary search stops is at most αU = αL ≤ αmopt.

The binary search stops after O(log_2(∑_{j=1}^{n} aj)) iterations, and since A runs in polynomial time,

we get that our algorithm runs in polynomial time. This shows that our binary search algorithm

is a polynomial time α-approximation algorithm.

If the loads in our instance are not integers, let 1/2^φ be the precision in which the loads are given. By multiplying all loads by 2^φ we get an equivalent instance where all the loads of the

jobs are integers. Note that this only adds φ iterations to the binary search and our algorithm

still runs in polynomial time.
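A minimal sketch of this binary search, assuming integer loads and an approximate decision procedure decide(m, K, c) that returns a solution or None:

```python
import math

def approximate_optimize(loads, K, c, decide):
    """Binary search of Lemma 3.12 over integer makespans.
    `loads` are the (integer) job loads; `decide(m, K, c)` is the assumed approximate
    decision algorithm: it returns a solution of makespan at most alpha*m whenever a
    solution of makespan m exists, and may return None otherwise."""
    U = sum(loads)
    L = math.ceil(U / c)
    best = decide(U, K, c)        # succeeds whenever the instance has any feasible solution
    while L < U:
        m = (L + U) // 2
        solution = decide(m, K, c)
        if solution is not None:
            U, best = m, solution   # we hold a solution of makespan at most alpha*U
        else:
            L = m + 1               # no solution of makespan at most m exists
    return best
```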

The following theorem follows immediately from Lemma 3.12.

Theorem 3.13. Using the approximate decision algorithms presented in this section, we obtain

polynomial time approximate optimization algorithms for the single load, minimal cache demand

problem with approximation factors 2, 3/2 and 4/3 that use K, 2K and 3K cache, respectively.

3.2.5 Dominant perfect matching in threshold graphs

Let G = (V,E) be an undirected graph with 2c vertices where each vertex x ∈ V has a weight

w(x) ≥ 0. The edges in the graph are defined by a threshold t > 0 to be E = {(x, y) | w(x) + w(y) ≤ t, x ≠ y}. Such a graph G is known as a threshold graph [CH73, MP95]. We say

that the weight of an edge (x, y) is w(x, y) = w(x) + w(y).

A perfect matching A in G is a subset of the edges such that every vertex in V is incident to

exactly one edge in A. Let Ai denote the i-th heaviest edge in A. We assume, w.l.o.g, that there

is some arbitrary predefined order of the edges in E that is used, as a secondary sort criterion,

to break ties in case several edges have the same weight. In particular, this implies that Ai is

uniquely defined.

Definition 3.14. A perfect matching A dominates a perfect matching B if for every x ∈ {1, . . . , c}, ∑_{i=1}^{x} w(Ai) ≥ ∑_{i=1}^{x} w(Bi).

Definition 3.15. A perfect matching A is a dominant matching if A dominates any other

perfect matching B.


Let A and B be two perfect matchings in G. We say that A and B share a prefix of length l

if Ai = Bi for i ∈ {1, . . . , l}. The following greedy algorithm finds a dominant perfect matching

in a threshold graph G that has a perfect matching. We start with G0 = G. At step i, the

algorithm selects the edge (x, y) with maximum weight in the graph Gi. If there are several

edges of maximum weight, then (x, y) is the first by the predefined order on E. The graph Gi+1

is obtained from Gi by removing vertices x, y and all edges incident to x or y. The algorithm

stops when it has selected c edges and Gc is empty.
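A direct Python sketch of this greedy procedure (tie-breaking by vertex index plays the role of the predefined order on E):

```python
def dominant_perfect_matching(weights, threshold):
    """Greedy dominant perfect matching in a threshold graph.
    `weights` is a list of 2c non-negative vertex weights; two vertices are adjacent when
    their weights sum to at most `threshold`.  Returns a list of c vertex pairs, or None
    if at some step no edge remains."""
    remaining = set(range(len(weights)))
    matching = []
    while remaining:
        best = None
        for x in sorted(remaining):
            for y in sorted(remaining):
                if x < y and weights[x] + weights[y] <= threshold:
                    w = weights[x] + weights[y]
                    if best is None or w > best[0]:
                        best = (w, x, y)          # keep the heaviest edge seen so far
        if best is None:
            return None                           # no perfect matching can be completed
        _, x, y = best
        matching.append((x, y))
        remaining -= {x, y}
    return matching
```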

Lemma 3.16. For every x ∈ {0, . . . , c − 1}, if the graph Gx has a perfect matching, then the graph Gx+1 has a perfect matching.

Proof. Let Mx denote a perfect matching in the graph Gx. Let (a, b) be the edge of maximum weight in Gx that we remove, with its vertices and their incident edges, to obtain Gx+1. If (a, b) ∈ Mx then clearly Mx \ {(a, b)} is a perfect matching in Gx+1. If (a, b) ∉ Mx, then since Mx is a perfect matching of Gx, there are two vertices c and d such that (a, c) and (b, d) are in Mx. The edge (a, b) is the maximum weight edge in Gx and thus w(b) ≥ w(c) and w(a) ≥ w(d). Therefore (c, d) must be an edge in Gx because w(c) + w(d) ≤ w(a) + w(b) ≤ t, which is the threshold defining the edges in our threshold graph. Let Mx+1 = Mx \ {(a, c), (b, d)} ∪ {(c, d)}. It is easy to see that Mx+1 is a perfect matching in the graph Gx+1.

Theorem 3.17. If G is a threshold graph with 2c vertices that has a perfect matching, then the

greedy algorithm described above finds a dominant perfect matching.

Proof. Lemma 3.16 implies that our greedy algorithm is able to select a set of c edges that is a

perfect matching in G. Denote this matching by Q.

Assume, to the contrary, that Q is not a dominant perfect matching in G. Let A be a perfect

matching that is not dominated by Q and that shares the longest possible prefix with Q. Let

x denote the length of the shared prefix of Q and A. Let Gx denote the graph obtained from

G by removing the x edges that are the heaviest in both A and Q, their vertices and all edges

incident to these vertices.

Let (a, b) = Qx+1. Since A and Q share a maximal prefix of length x, Ax+1 ≠ (a, b). Since (a, b) is of maximum weight in Gx, it follows that (a, b) ∉ A (otherwise, it would have been Ax+1). The set of edges {Ax+1, . . . , Ac} forms a perfect matching of Gx, so there must be two edges and two indices l1 > x and l2 > x such that Al1 = (a, d) and Al2 = (b, c). We

assume w.l.o.g. that l1 < l2. The edge (a, b) is of maximum weight in Gx and therefore


w(a) ≥ w(c) and w(b) ≥ w(d). It follows that w(c, d) ≤ w(a, b) ≤ t, and therefore (c, d) ∈ Gx.

Let A′ = A \ {(a, d), (b, c)} ∪ {(a, b), (c, d)}. Clearly, A′ is a perfect matching in G, A′x+1 = (a, b)

and therefore A′ shares a prefix of length x + 1 with Q. If A′ dominates A, then since Q does

not dominate A, it follows that Q does not dominate A′. Thus A′ is a perfect matching that

shares a prefix of length x + 1 with Q and is not dominated by Q. This is a contradiction to

the choice of A. We finish the proof by showing that A′ dominates A.

Let l3 be the index such that A′l3 = (c, d). Since w(b) ≥ w(d), l3 > l2. Let ∆(l) = ∑_{i=1}^{l} w(A′i) − ∑_{i=1}^{l} w(Ai). The matchings A′ and A share a prefix of length x, so for every 1 ≤ l ≤ x, ∆(l) = 0. For x + 1 ≤ l < l1, ∆(l) = w(a, b) − w(Al) ≥ 0 since (a, b) is the edge of maximum weight in Gx. For l1 ≤ l < l2, ∆(l) = w(a, b) − w(a, d) ≥ 0, also by the maximality of (a, b). For l2 ≤ l < l3, ∆(l) = w(A′l) − w(c) − w(d), which is non-negative because l < l3 and therefore w(A′l) ≥ w(A′l3) = w(c) + w(d). For l ≥ l3, ∆(l) = 0. This shows that A′ dominates A and

concludes our proof that Q is a dominant perfect matching in G.

On dominant perfect matchings in d-uniform hypergraphs

The problem of finding a dominant perfect matching in a d-uniform threshold hypergraph¹ that has a perfect matching is interesting in the context of the single load, minimal cache demand special case of the joint cache partition and job assignment problem. If we can find such a matching then an algorithm, similar to Algorithm A_{4/3} in Section 3.2.3, would give a solution that uses (d + 1)K cache and approximates the makespan up to a factor of (d + 2)/(d + 1).

¹ A d-uniform threshold hypergraph is defined on a set of vertices V, each with a non-negative weight w(v). The set of edges E contains all the subsets S ⊂ V of size d such that the sum of the weights of the vertices in S is at most some fixed threshold t > 0.

However, the following example shows that in a 3-uniform threshold hypergraph that has a perfect matching, a dominant perfect matching does not necessarily exist. Let ε > 0 be an arbitrarily small constant. Consider a hypergraph with 12 vertices, 3 vertices of each weight in {1/3, 2/9, 4/9 − ε, ε}. Each triplet of vertices is an edge if the sum of its weights is at most 1. This hypergraph has a perfect matching. In fact, let us consider two perfect matchings in this hypergraph. Matching A consists of the edges (1/3, 1/3, 1/3), (4/9 − ε, 4/9 − ε, ε), (4/9 − ε, 2/9, 2/9) and (2/9, ε, ε). Matching B consists of three edges of the form (1/3, 2/9, 4/9 − ε) and one edge of the form (ε, ε, ε). It is easy to check that A and B are valid perfect matchings in this hypergraph. Any dominant perfect matching in this hypergraph must contain the edge (1/3, 1/3, 1/3) in order to dominate A, since this is the only edge of weight 1 in this hypergraph. The sum of the two heaviest edges


in matching B is 2 − 2ε, and therefore any dominant perfect matching must contain, in addition, an edge of weight at least 1 − 2ε, as otherwise the matching will not dominate matching B. But if the edge (1/3, 1/3, 1/3) is in the dominant matching, then all edges disjoint from (1/3, 1/3, 1/3) have a weight smaller than 1 − 2ε. Thus no dominant perfect matching exists in this hypergraph.
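The example can also be checked by brute force. The following sketch, which is not part of the thesis, fixes a concrete ε, enumerates all perfect matchings of the 12-vertex hypergraph, and verifies that none of them dominates all others:

```python
from itertools import combinations

eps = 0.01
weights = [1/3] * 3 + [2/9] * 3 + [4/9 - eps] * 3 + [eps] * 3   # 12 vertices

def perfect_matchings(vertices):
    """Yield all partitions of `vertices` into triples of total weight at most 1."""
    if not vertices:
        yield []
        return
    first = vertices[0]
    for pair in combinations(vertices[1:], 2):
        triple = (first,) + pair
        if sum(weights[v] for v in triple) <= 1 + 1e-12:
            rest = [v for v in vertices[1:] if v not in pair]
            for rest_matching in perfect_matchings(rest):
                yield [triple] + rest_matching

def prefix_sums(matching):
    edge_weights = sorted((sum(weights[v] for v in e) for e in matching), reverse=True)
    sums, total = [], 0.0
    for w in edge_weights:
        total += w
        sums.append(total)
    return sums

all_matchings = list(perfect_matchings(list(range(12))))
# best achievable prefix sum for every prefix length, over all perfect matchings
best = [max(prefix_sums(m)[i] for m in all_matchings) for i in range(4)]
dominant = [m for m in all_matchings
            if all(prefix_sums(m)[i] >= best[i] - 1e-9 for i in range(4))]
print(len(all_matchings), "perfect matchings, dominant:", len(dominant))   # expect 0 dominant
```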

Matching A in the example above is the perfect matching found by applying the greedy

algorithm to this hypergraph. It is interesting to note that in a 3-uniform threshold hypergraph,

the greedy algorithm does not necessarily find a perfect matching at all. This is because Lemma

3.16 does not extend to 3-uniform threshold hypergraphs. Let ε > 0 be an arbitrarily small

constant. Consider a hypergraph with 9 vertices, 3 vertices of each weight in {1/3, 2/9, 4/9 − ε}. Each triplet of vertices is an edge if the sum of its weights is at most 1. This hypergraph has a perfect matching since the 3 edges of the form (1/3, 2/9, 4/9 − ε) are a perfect matching in this hypergraph. However, the greedy algorithm first selects the edge (1/3, 1/3, 1/3) and then selects an edge of the form (2/9, 2/9, 4/9 − ε). The remaining hypergraph now contains three vertices and no edges, so the

greedy algorithm is stuck and fails to find a perfect matching.

3.2.6 PTAS for jobs with correlative single load and minimal cache demand

The main result in this section is a polynomial time approximation scheme for instances of the

single load minimal cache demand problem, where there is a correlation between the load and

the cache demand of jobs with non-zero cache demand. This special case is motivated by the

observation that often there is some underlying notion of a job’s “hardness” that affects both

its load and its minimal cache demand.

Consider an instance of the single load minimal cache demand problem such that for any

two jobs j, j′ such that xj and xj′ are non-zero, aj ≤ aj′ ⇐⇒ xj ≤ xj′ . We call a job j such

that xj > 0 a demanding job and a job j such that xj = 0 a non-demanding job. We consider

the following decision problem: is there a cache partition of K cache to c cores and an assignment of jobs to the cores such that the jobs' minimal cache demands are satisfied and the resulting makespan is at most m? By scaling down the loads of the jobs

by m, we assume w.l.o.g that m = 1.

Let ε > 0. We present an algorithm that if there is a cache partition and a job assignment

with makespan at most 1, returns a cache partition and a job assignment with makespan at

most (1 + 2ε). Otherwise, our algorithm either decides that there is no solution of makespan at

most 1 or returns a solution of makespan at most (1 + 2ε). Combining this algorithm with a


binary search, we obtain a PTAS.

If there is a job j such that aj > 1 then our algorithm decides that there is no solution of

makespan at most 1. Thus we assume that for any j, aj ≤ 1.

Let J = J1 ∪ J2, where J1 = {j ∈ J | aj ≥ ε} and J2 = J \ J1. In the first phase, we deal only with jobs in J1. For each j ∈ J1 let uj = max{u ∈ N | ε + uε² ≤ aj}. We say that ε + ujε² is the rounded-down load of job j.
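For concreteness, the rounding and the split into J1 and J2 can be computed as in the following sketch (the (cache demand, load) job representation is an assumption):

```python
def round_down_loads(jobs, eps):
    """Split jobs into J1 (load >= eps) and J2, and compute for each job in J1 its
    rounded-down load eps + u_j * eps**2, where u_j = max{u : eps + u*eps**2 <= a_j}.
    `jobs` is a list of (x_j, a_j) pairs."""
    J1, J2, rounded = [], [], {}
    for j, (x_j, a_j) in enumerate(jobs):
        if a_j >= eps:
            J1.append(j)
            u_j = int((a_j - eps) / eps ** 2)    # floor, since a_j >= eps
            rounded[j] = eps + u_j * eps ** 2
        else:
            J2.append(j)
    return J1, J2, rounded
```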

Let UD = {uj | j ∈ J1, xj > 0} and UND = {uj | j ∈ J1, xj = 0}. An assignment pattern of a core is a table that indicates for each u ∈ UD how many demanding jobs of rounded-down load ε + uε² are assigned to the core and for each u ∈ UND how many non-demanding jobs of rounded-down load ε + uε² are assigned to the core. Note that an assignment pattern of a core

rounded-down load ε+ uε2 are assigned to the core. Note that an assignment pattern of a core

does not identify the actual jobs assigned to the core. We only consider assignment patterns

whose rounded-down load is at most 1.

A configuration of cores is a table indicating how many cores we have of each possible

assignment pattern. A configuration of cores T is valid if for every u ∈ UD, the number of

demanding jobs in J1 whose uj = u equals the sum of the numbers of demanding jobs with

uj = u in all assignment patterns in T and, similarly, for every u ∈ UND, the number of non-

demanding jobs in J1 whose uj = u equals the sum of the numbers of non-demanding jobs with

uj = u in all assignment patterns in T .

The outline of our algorithm is as follows. The algorithm enumerates over all valid configu-

rations of cores. For each valid configuration T , we find an actual assignment of the jobs in J1

that matches T and minimizes the total cache used. We then proceed to assign the jobs in J2,

in a way that guarantees that if there is a solution of makespan 1 and K cache that matches

this configuration of cores, then we obtain a solution of makespan at most (1 + 2ε) and at most

K cache. If our algorithm does not generate a solution of makespan at most (1 + 2ε) and at

most K cache, for all valid configurations of cores, then our algorithm decides that no solution

of makespan at most 1 exists.

Let T be a valid configuration of cores. For each core i ∈ {1, . . . , c}, let qi be the maximal

rounded-down load of a demanding job assigned to core i according to the assignment pattern

of core i in T . Let αi be the number of demanding jobs of rounded-down load qi on core

i, according to T . We assume w.l.o.g that the cores are indexed such that qi ≥ qi+1. Let

Q = qi | i ∈ 1, . . . , c. For each q ∈ Q, let s(q) be the index of the first core i with qi = q

and let e(q) be the index of the last core i with qi = q. Assume that the cores s(q), . . . , e(q) are


indexed such that αs(q) ≥ . . . ≥ αe(q). Let J1(q) = {j ∈ J1 | xj ≠ 0, ε + ujε² = q}, the set of all demanding jobs in J1 whose rounded-down load is q. Let Y(q) be the set of the ∑_{i=s(q)}^{e(q)} αi jobs of smallest cache demands in J1(q).

Our algorithm builds an assignment matching T of minimal cache usage among all assign-

ments matching T . To do so, our algorithm goes over Q in a decreasing order and distributes the

jobs in Y (q) to the cores s(q), . . . , e(q) in this order of the cores such that core i ∈ [s(q), e(q)], in

turn, gets the αi most cache demanding jobs in Y (q) that are not yet assigned. After we assign

the demanding jobs with the maximal rounded-down load on each core, our algorithm arbi-

trarily chooses the identity of all other jobs in the configuration T . These are non-demanding

jobs and demanding jobs whose rounded-down load is not of the maximal rounded-down load

on their core. Each core is allocated cache according to the cache demand of the most cache

demanding job that is assigned to it.

The algorithm continues with the jobs in J2. It first assigns the demanding jobs in J2, in

the following greedy manner. Order these jobs from the most cache demanding to the least

cache demanding. For each core, we consider two load values: its actual load which is the sum

of the actual loads of jobs in J1 assigned to the core, and its rounded down load which is the

sum of rounded down loads of jobs in J1 assigned to the core. We order the cores such that

first we have all the cores that already had some cache allocated to them in the previous phase

of the algorithm, in an arbitrary order. Following these cores, we order the cores with no cache

allocated to them, from the least loaded core to the most loaded core, according to their rounded

down loads. These cores are either empty or have only non demanding jobs, from J1, assigned

to them. The algorithm assigns the jobs to the cores in these orders (of the jobs and of the

cores) and stops adding more jobs to a core and moves to the next one when the core’s actual

load exceeds 1 + ε. After all these jobs are assigned, the algorithm adjusts the cache allocation

of the cores whose most cache demanding job is now a job of J2.

Finally, it assigns the non-demanding jobs in J2. Each such job is assigned arbitrarily to a

core whose actual load does not already exceed 1 + ε.

Lemma 3.18. The number of valid configurations of cores is O(c^{O(1)}).

Proof. We first consider the number of assignment patterns with rounded-down load at most 1. Since for each job j, aj ≤ 1, the size of UD and the size of UND are at most ⌊(1 − ε)/ε²⌋ = O(1/ε²) = O(1). In an assignment pattern of load at most 1, there are at most 1/ε jobs in J1


assigned to each core and thus we get that the number of possible assignment patterns is at most O((1/ε)^{1/ε²}) = O(1). Since the number of assignment patterns we consider is O(1), it follows that the number of possible configurations of cores is O(c^{O(1)}).

Since our algorithm spends polynomial time per configuration of cores, Lemma 3.18 implies that our algorithm runs in polynomial time.

Lemma 3.19. For any configuration of cores T there is an assignment matching T of minimal cache usage among all assignments matching T, that for each q ∈ Q assigns the ∑_{i=s(q)}^{e(q)} αi least cache demanding jobs in J1(q) (i.e., the set of jobs Y(q)) to the cores s(q), . . . , e(q).

Proof. Consider a job assignment S of minimal cache usage that matches T . Assume that for

some q ∈ Q assignment S does not assign all the jobs in Y (q) to the cores s(q), . . . , e(q). So

there is a core i ∈ [s(q), e(q)] that runs a job j ∈ J1(q) \ Y (q).

Since S assigns ∑_{i=s(q)}^{e(q)} αi jobs from J1(q) to cores s(q), . . . , e(q), and since jobs in J1(q) cannot be assigned to cores i′ > e(q), it follows that there is a core i′ < s(q) and a job j′ ∈ Y(q) such that S(j′) = i′. Suppose we switch the assignment of jobs j and j′ and run job j on core i′ and job

j′ on core i. Let S′ denote the resulting assignment. The cache required by core i′ does not

increase, as it runs demanding jobs of rounded-down load greater than q and therefore of cache demand at least as large as the cache demand of job j. By the choice of the jobs j and j′ we know

that xj′ ≤ xj and therefore the cache required by core i in S′ can only decrease compared to

the cache required by core i in S. It follows that the cache usage of S′ is at most that of S and

since S is of minimal cache usage among all assignments that match T, we get that the cache usage of S′ must be the same as that of S.

By repeating this argument as long as there is a job that violates Lemma 3.19, we obtain

an assignment as required.

Lemma 3.20. For any configuration of cores T, let S be an assignment matching T such that for each q ∈ Q and for each core i ∈ [s(q), e(q)], if we index the jobs in Y(q) from the most cache demanding to the least cache demanding, assignment S assigns to core i the jobs in Y(q) of indices ∑_{j=s(q)}^{i−1} αj + 1, . . . , ∑_{j=s(q)}^{i} αj. Then assignment S is of minimal cache usage among all assignments matching T.

Proof. Assume to the contrary that assignment S is not of minimal cache usage, among all

assignments matching T . Let S′ be an assignment whose existence is guaranteed by Lemma


3.19. Since S and S′ have different cache usages, there exists q ∈ Q such that S and S′ differ on

their assignment of the jobs in Y (q). We index the jobs in Y (q) from the most cache demanding

to the least cache demanding. Let j ∈ Y (q) be the first job (most cache demanding) in Y (q)

such that S(j) ≠ S′(j). We select S′ such that it maximizes j among all assignments satisfying

Lemma 3.19 that disagree with S on the assignment of the jobs in Y (q).

Denote i = S(j) and i′ = S′(j). Since S and S′ both assign αi jobs from Y (q) to core i and

since j is the first job in Y (q) on which S and S′ disagree, there is a job j2 ∈ Y (q), j2 > j

such that S′(j2) = i.

We first assume that there is a job j1 < j such that S(j1) = i. Let S′′ be the assignment

such that S′′(j) = i, S′′(j2) = i′ and for any job h ∉ {j, j2}, S′′(h) = S′(h). The cache required

by core i′ in S′′ is at most the cache required by core i′ in S′, since j < j2. Since j1 < j and

S(j1) = i, we know that S′(j1) = i and also S′′(j1) = i. This implies that in S′′, core i requires

the same amount of cache as in S′. It follows that S′′ is also an assignment of minimal cache

usage, and that it satisfies Lemma 3.19. Since S′′(j) = S(j), we get a contradiction to the way

we selected S′. Thus S is of minimal cache usage, among all assignments matching T .

We now assume that j is the first job in Y (q) such that S(j) = i. Let S′′ be the following

assignment. Any job that is assigned by S′ to a core different than i and i′ is assigned by S′′ to

the same core. For any job x such that S′(x) = i′, S′′(x) = i. All the αi′ least cache demanding

jobs assigned by S′ to core i are assigned by S′′ to core i′. Note that αi ≥ αi′ and therefore

assignment S′′ is well defined.

Since S and S′ agree on the assignment of the jobs in Y (q) that precede j, and assign them to cores l < i, job j is the most cache demanding job assigned to cores l ≥ i by S′ and S′′. Therefore

in assignment S′, core i′ requires xj cache and in assignment S′′ core i requires xj cache. In

assignment S′′, core i′ is assigned a set of jobs that is a subset of the jobs assigned to core i by

S′. Thus the cache required by core i′ in assignment S′′, is at most the cache required by core

i in assignment S′. It follows that S′′ is also an assignment of minimal cache usage, and that it

satisfies Lemma 3.19. This contradicts the choice of S′ and concludes the proof that assignment

S is of minimal cache usage, among all assignments matching T .

Corollary 3.21. For each configuration of cores T our algorithm builds an actual assignment

of minimal cache usage of the jobs in J1 that matches T .

Proof. The assignment returned by our algorithm is an assignment S, as in the statement of


Lemma 3.20.

Lemma 3.22. Consider an instance of the correlative single load minimal cache demand prob-

lem. If there is a cache partition and job assignment that schedules the jobs on c cores, uses at

most K cache and has a makespan of at most 1 then our algorithm finds a cache partition and

job assignment that schedules the jobs on c cores, uses at most K cache and has a makespan of

at most (1 + 2ε).

Proof. Let A be a solution of makespan at most 1 with c cores and K cache, whose existence

is assumed by the lemma. Let TA be the configuration of the cores corresponding to the

assignment of the jobs in J1 by solution A and assume our algorithm currently considers TA in

its enumeration.

We show that our algorithm succeeds in assigning all the jobs to c cores. Let’s assume to

the contrary that it fails. It can only fail if all cores are assigned an actual load of more than

(1 + ε) and there are still remaining jobs to assign. This indicates that the total volume to

assign is larger than c(1 + ε), which contradicts the fact that assignment A is able to assign the

jobs to c cores with makespan at most 1.

Let S denote the assignment of all jobs on c cores that our algorithm returns when it

considers TA. We know that S matches TA for jobs in J1. We now show that in S each core has

an actual load of at most 1+2ε. When we restrict S to J1 we know that the rounded down load

on each core is at most 1 and that each core has at most 1/ε jobs from J1 assigned to it. Since the actual load of any job in J1 is at most ε² larger than its rounded-down load, we get that if we restrict assignment S to J1, the actual load on each core is at most 1 + (1/ε) · ε² = 1 + ε. The way

our algorithm assigns the jobs in J2 implies that the actual load of a core in assignment S can

only exceed 1 + ε by the load of a single job from J2. Therefore the actual load on any core in

assignment S is at most 1 + 2ε.

We show that assignment S uses at most K cache. Cache is allocated by our algorithm in

two steps: when it decides on the actual assignment of the jobs in J1 that matches TA and when

it assigns the demanding jobs in J2. Corollary 3.21 shows that S restricted to J1 is of minimal

cache usage of all assignments matching TA and thus uses at most the same amount of cache

as assignment A restricted to J1.

We show that when we also take into account the demanding jobs in J2, S uses at most the

same amount of cache as A. Assume the cores in S are indexed according to the order in which


our algorithm assigns demanding jobs from J2 to them. Assume the cores in A are indexed

such that core i in S and core i in A have the same assignment pattern. For any core in S, we

say that its free space is (1 + ε) minus the sum of the actual loads of all jobs in J1 assigned to

it by S. For any core in A, we say that its free space is 1 minus the sum of the actual loads of

all jobs in J1 assigned to it by A. For any i, core i in S has the same rounded down load as

core i in A and the actual load of core i in S is at most ε larger than the actual load of core i

in A. Therefore, by the definition of free space, the free space of core i in solution S is at least

the free space of core i in solution A.

Let i2 be the number of cores in S that have a demanding job from J1 assigned to them.

When our algorithm assigns jobs in J2 to a core i ≤ i2, it does not increase the cache required

by core i since any demanding job in J1 is at least as cache demanding as any demanding job

in J2. It follows that the total cache required by cores 1, . . . , i2 in S is at most the total cache

required by cores 1, . . . , i2 in A.

Let i > i2 be a core in S whose cache demand is determined by a job from J2. We now show

that core i in S requires no more cache than core i in A. This will conclude the proof that S

uses at most K cache.

The total load of demanding jobs in J2 that S assigns to cores 1, . . . , i − 1 is at least the

sum of the free space of these cores, since our algorithm exceeds an actual load of 1 + ε on each

core before moving to the next. The sum of the free space of cores 1, . . . , i− 1 in S is at least the

sum of the free space of the cores 1, . . . , i−1 in A, which in turn is an upper bound on the total

load of demanding jobs from J2 that are assigned in A to cores 1, . . . , i−1. Since our algorithm

assigns the demanding jobs in J2 in a non-increasing order of their cache demand we get that

the cache demand of the most cache demanding job from J2 on core i in S is at most the cache

demand of the most cache demanding job in J2 on core i in A.

Lemma 3.22 shows that for any ε′ > 0, we have a polynomial time (1 + 2ε′)-approximate

decision algorithm. Given ε > 0, by applying our algorithm with ε′ = ε/2 we obtain a polynomial

time (1 + ε)-approximate decision algorithm.

By using a binary search similar to the one in Lemma 3.12 we obtain a (1 + ε)-approximation

for the optimization problem, using our (1 + ε)-approximate decision algorithm. To conclude,

we have proven the following theorem.

Theorem 3.23. There is a polynomial time approximation scheme for the joint cache partition


and job assignment problem, when the jobs have a correlative single load and minimal cache

demand.

3.3 Step functions with a constant number of load types

Empirical studies [Dre07] suggest that the load of a job, as a function of available cache, is

often similar to a step-function. The load of the job drops at a few places where the cache size

exceeds the working-set required by some critical part. In between these critical cache sizes the

load of the job decreases negligibly with additional cache. The problems we consider in this

section are motivated by this observation.

Formally, each job j ∈ J is described by two load values lj < hj and a cache demand

xj ∈ {0, . . . , K}. If job j is running on a core with at least xj cache then it takes lj time and

otherwise it takes hj time. If a job is assigned to a core that meets its cache demand, xj , we

say that it is assigned as a small job. If it is assigned to a core that doesn’t meet its cache

demand we say that it is assigned as a large job. At first we study the case where the number

of different load types is constant and then we show a polynomial time scheduling algorithm for

the corresponding special case of the ordered unrelated machines scheduling problem.

Let L = {lj | j ∈ J} and H = {hj | j ∈ J}, the sets of small and large loads, respectively.

Here we assume that |L| and |H| are both bounded by a constant.

For each α ∈ L, β ∈ H, we say that job j is of small type α if lj = α and we say that job

j is of large type β if hj = β. If job j is of small type α and large type β we say that it is of

load type (α, β). Note that jobs j1, j2 of the same load type may have different cache demands

xj1 ≠ xj2 and thus the number of different job types is Ω(K) and not O(1).

We reduce this problem to the single load minimal cache demand problem studied in Section

3.2. For each load type (α, β), we enumerate on the number, x(α, β), of the jobs of load type

(α, β) that are assigned as large jobs. For each setting of the values x(α, β) for all load types,

we create an instance of the single load minimal cache demand problem in which each job

corresponds to a job in our original instance. For each job j which is one of the x(α, β) most

cache demanding jobs of load type (α, β) we create a job of load β and cache demand 0. For

each job j of load type (α, β) which is not one of the x(α, β) most cache demanding jobs of

this load type, we create a job of load α and cache demand xj . We solve each of the resulting

instances using any algorithm for the single load minimal cache demand problem presented in


Section 3.2, and choose the solution with the minimal makespan. We transform this solution

back to a solution of the original instance, by replacing each job with its corresponding job in

the original instance. Note that this does not affect the makespan or the cache usage.
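A sketch of the enumeration driving this reduction; solve_single_load stands for any of the algorithms of Section 3.2 and its exact interface is an assumption:

```python
from itertools import product

def solve_step_functions(jobs, K, c, solve_single_load):
    """Reduction for step-function loads with a constant number of load types.
    `jobs` is a list of (x_j, l_j, h_j) triples.  For every choice of how many jobs of
    each load type are assigned as large jobs, build a single-load minimal-cache-demand
    instance and keep the best returned solution.
    `solve_single_load(instance, K, c)` is assumed to return (makespan, solution) or None."""
    types = {}
    for j, (x, l, h) in enumerate(jobs):
        types.setdefault((l, h), []).append(j)
    for jobs_of_type in types.values():
        jobs_of_type.sort(key=lambda j: jobs[j][0], reverse=True)   # most demanding first
    type_list = list(types)
    best = None
    for counts in product(*[range(len(types[t]) + 1) for t in type_list]):
        instance = []
        for t, x_count in zip(type_list, counts):
            l, h = t
            for rank, j in enumerate(types[t]):
                if rank < x_count:
                    instance.append((0, h))              # assigned as a large job
                else:
                    instance.append((jobs[j][0], l))     # assigned as a small job
        result = solve_single_load(instance, K, c)
        if result is not None and (best is None or result[0] < best[0]):
            best = result
    return best
```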

Lemma 3.24. Given a polynomial time α-approximation algorithm for the single load minimal

cache demand problem that uses at most βK cache, the reduction described above gives a poly-

nomial time α-approximation algorithm for the problem where job loads are step functions with

a constant number of load types, that uses at most βK cache.

Proof. Consider an instance of the joint cache partition and job assignment problem with load

functions that are step functions with a constant number of load types. Assume there is a

solution A for this instance of makespan m that uses at most K cache. Let x(α, β) be the

number of jobs of load type (α, β) that are assigned in A as large jobs. W.l.o.g we can assume

that for each (α, β), the x(α, β) jobs that are assigned as large jobs are the x(α, β) most

cache demanding jobs of load type (α, β). The existence of A implies that when our algorithm

considers the same values for x(α, β), for each (α, β), it generates an instance of the single

load cache demand problem that has a solution of makespan at most m and at most K cache.

Applying the α-approximation algorithm for the single load minimal cache demand problem,

whose existence in assumed by the lemma, on this instance yields a solution of makespan at

most αm that uses at most βK cache. This solution is transformed to a solution of our original

instance without affecting the makespan or the cache usage.

Our algorithm runs in polynomial time since the size of the enumeration is O(n^{|L||H|}).

Corollary 3.25. For instances in which the load functions are step functions with a constant

number of load types there are polynomial time approximation algorithms that approximate the

makespan up to a factor of 2, 3/2 and 4/3 and use at most K, 2K and 3K cache, respectively.

3.3.1 The corresponding special case of ordered unrelated machines

Recall that if we fix the cache partition in an instance of the joint cache partition and job

assignment problem then we obtain an instance of the ordered unrelated machines scheduling

problem. For the case where the load functions are step functions with a constant number of

load types, the resulting ordered unrelated machines instance can be solved in polynomial time

using the dynamic programming algorithm described below. The dynamic program follows a


structure similar to the one used in [EL11], where polynomial time approximation schemes are

obtained for several variants of scheduling with restricted processing sets.

In this special case of the ordered unrelated scheduling problem job j runs in time lj on

some prefix of the machines, and in time hj on the suffix (we assume that the machines are

ordered in non-increasing order of their strength/cache allocation). For simplicity, we assume

xj is given as the index of the first machine on which job j has load hj . If job j takes the same

amount of time to run regardless of cache, we assume xj = c + 1 and its load on any machine

is lj . As before, we assume that L = {lj | j ∈ J} and H = {hj | j ∈ J} are of constant size.

We design a polynomial time algorithm that finds a job assignment that minimizes the

makespan. The algorithm does a binary search for the optimal makespan, as in Section 3.2.4,

using an algorithm for the following decision problem: Is there an assignment of the jobs J to

the c machines with makespan at most M? By scaling the loads, we assume that M = 1.

For every machine m, we define Sm = {j ∈ J | xj = m + 1}, the set of all jobs that are large on machine m + 1 and small on any machine i ≤ m. Let Sm(α, β) = {j ∈ Sm | lj = α, hj = β} and bm(α, β) = |Sm(α, β)|. It is convenient to think of bm as a vector in {0, . . . , n}^{L×H}.

Let a ∈ {0, . . . , n}^{L×H}, δ ∈ {0, . . . , n}^{H}, and let m be any machine. Let J(m, a) be a set of jobs which contains all the jobs in ∪_{i=1}^{m} Si together with additional a(α, β) jobs of load type (α, β) from ∪_{i=m+1}^{c} Si, for each load type (α, β). Let πm(a, δ) be 1 if we can schedule all the jobs in J(m, a), except for δ(β) jobs of each large load type β, on the first m machines with makespan at most 1. Note that since the additional jobs specified by a are small on all machines 1, . . . , m, πm(a, δ) does not depend on the additional jobs' identity. Our original decision problem has a solution if and only if πc(0, 0) = 1, where 0 denotes the all-zeros vector.

Consider the decision problem π1(a, δ). We want to decide if it is possible to schedule the

jobs in J(1, a), except for δ(β) jobs of each large load type β, on machine 1. To decide this,

our algorithm chooses the δ(β) jobs of each large job type β that have the largest small loads

and removes them from J(1, a). If the sum of the small loads of the remaining jobs is at most

1, then π1(a, δ) = 1, and otherwise π1(a, δ) = 0.
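The base case can be written directly; in the sketch below jobs are (small load, large load) pairs and delta maps each large load type to the number of jobs that may be left unscheduled (the representation is an assumption):

```python
def pi_1(jobs_J1a, delta):
    """Base case pi_1(a, delta): can the jobs in J(1, a) be scheduled on machine 1 with
    makespan at most 1, leaving out delta[beta] jobs of each large load type beta?
    `jobs_J1a` is a list of (l_j, h_j) pairs (all of them are small on machine 1)."""
    remaining = list(jobs_J1a)
    for beta, count in delta.items():
        # drop the `count` jobs of large type beta with the largest small loads
        candidates = sorted((j for j in remaining if j[1] == beta),
                            key=lambda j: j[0], reverse=True)[:count]
        for j in candidates:
            remaining.remove(j)
    return sum(l for l, _ in remaining) <= 1
```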

To solve πm(a, δ) we enumerate, for each load type (α, β), on ξ(α, β), the number of jobs in

J(m, a) of this load type that are assigned as small jobs to machine m. Note that these jobs

are either in Sm(α, β) or in the additional set of a(α, β) jobs of type (α, β). For each β ∈ H, we

enumerate on the number λ(β) of jobs in J(m, a) of large load type β that are assigned as large

jobs to machine m. The following lemma is the basis for our dynamic programming scheme.


Its proof is straightforward.

Lemma 3.26. We can schedule all the jobs in J(m, a) except for δ(β) jobs of large load type

β (for each β ∈ H) on machines 1, . . . ,m with makespan at most 1 such that ξ(α, β) jobs of

load type (α, β) are assigned to machine m as small jobs and λ(β) jobs of large load type β are

assigned to machine m as large jobs if and only if the following conditions hold:

• For each (α, β) ∈ L ×H, ξ(α, β) ≤ a(α, β) + bm(α, β): The number of jobs of each load

type that we assign as small jobs to machine m is at most the number of jobs in J(m, a)

of this load type that are small on machine m.

• ∑_{β∈H} λ(β) · β + ∑_{(α,β)∈L×H} ξ(α, β) · α ≤ 1: The total load of the jobs assigned to machine m is at most 1.

• Let a′ = a + bm − ξ and δ′ = δ + λ; then πm−1(a′, δ′) = 1: The jobs in J(m − 1, a′), except for δ′(β) jobs of large load β for each β ∈ H, can be scheduled on machines 1, . . . , m − 1 with makespan at most 1.

The algorithm for solving πm(a, δ) sets πm(a, δ) = 1 if it finds λ and ξ such that the

conditions in Lemma 3.26 are met. If the conditions are not met for any λ and ξ, then πm(a, δ) = 0.

Our dynamic program solves πm(a, δ) in increasing order of m from 1 to c and returns the

result of πc(~0,~0). The correctness of the dynamic program follows from Lemma 3.26 and from

the fact that for m = 1, our algorithm chooses the jobs that it does not assign to machine 1

such that the remaining load on machine 1 is minimized. Therefore we set π1(a, δ) = 1 if and

only if there is a solution of makespan at most 1.

By adding backtracking links, our algorithm can also construct a schedule with makespan at

most 1. We maintain links between each πm(a, δ) that is 1 to a corresponding πm−1(a′, δ′) that

is also 1, according to the last condition in Lemma 3.26. Tracing back the links from πc(~0,~0)

gives us an assignment with makespan at most 1 as follows. Consider a link between πm(a, δ)

and πm−1(a′, δ′). This defines λ = δ′ − δ and ξ = a + bm − a′. For each (α, β) we assign to

machine m, ξ(α, β) arbitrary jobs of load type (α, β) from ⋃_{i=m}^{c} Si that we have not assigned

already, and we reserve λ(β) slots of load β on machine m to be populated with jobs later. Our

algorithm guarantees that the load on machine m is at most 1. When we reach π1(a, δ), for

some a and δ, in the backtracking phase, we have δ(β) slots of size β allocated on machines

2, . . . , c. The δ(β) jobs of large load β with the largest small loads in J(1, a) are assigned to


these slots. Note that these jobs may be large on their machine and have a load of β or they

may be small and have a load smaller than β. In any case, the resulting assignment assigns all

the jobs in J and has a makespan of at most 1.

The number of problems πm(a, δ) that our dynamic program solves is O(c · n^(|L||H|)) = O(c · n^(O(1))).
To solve each problem, we check the conditions in Lemma 3.26 for O(n^(|L||H|)) possible λ's and
ξ's. This takes O(1) per λ and ξ since we already computed πm−1(a′, δ′) for every a′ and δ′.

Thus the total complexity of this algorithm is polynomial. This concludes the proof of the

following theorem.

Theorem 3.27. Our dynamic programming algorithm is a polynomial-time exact optimization

algorithm for the special case of the ordered unrelated machines scheduling problem, where each

job j has load lj on some prefix of the machines, and load hj ≥ lj on the corresponding suffix.

3.4 Joint dynamic cache partition and job scheduling

We consider a generalization of the joint cache partition and job assignment problem that allows

for dynamic cache partitions and dynamic job assignments. We define the generalized problem

as follows. As before, J denotes the set of jobs, there are c cores and a total cache of size K.

Each job j ∈ J is described by a non-increasing function Tj(x).

A dynamic cache partition p = p(t, i) indicates the amount of cache allocated to core i at
time unit t (to simplify the presentation, we assume that time is discrete). For each time unit t,
∑_{i=1}^{c} p(t, i) ≤ K. A dynamic assignment S = S(t, i) indicates for each core i and time unit t
the index of the job that runs on core i at time t. If no job runs on core i at time t then
S(t, i) = −1. If S(t, i) = j ≠ −1 then for any other core i′ ≠ i, S(t, i′) ≠ j. Each job has to
perform 1 work unit. If job j runs for α time units on a core with x cache, then it completes
α/Tj(x) work. A partition and schedule p, S are valid if all jobs complete their work. Formally,
p, S are valid if for each job j, ∑_{⟨t,i⟩ ∈ S^(−1)(j)} 1/Tj(p(t, i)) = 1. The load of core i is defined
as the maximum t such that S(t, i) ≠ −1. The makespan of p, S is defined as the maximum load
on any core. The goal is to find a valid dynamic cache partition and dynamic job assignment
with a minimal makespan.

It is easy to verify that dynamic cache partition and dynamic job assignment, as defined
above, generalize the static partition and static job assignment. The partition is static if for
every fixed core i, p(t, i) is constant with respect to t. The schedule is a static assignment if for
every job j, there are times t1 < t2 and a core i such that S^(−1)(j) = {⟨t, i⟩ | t1 ≤ t ≤ t2}.

We consider four variants of the joint cache partition and job assignment problem: the static
partition and static assignment variant studied so far, the variant in which the cache partition
is dynamic and the job assignment is static, the variant in which the job assignment is dynamic
and the cache partition is static, and the variant in which both are dynamic.

Note that in the variant where the cache partition is dynamic but the job assignment is

static we still have to specify for each core, in which time units it runs each job that is assigned

to this core. That is, we have to specify a function S(t, i) for each core i. This is due to the fact

that different schedules that have the same set of jobs assigned to a particular core, when the

cache partition is dynamic, may have different loads, since jobs may run with different cache

allocations. When the cache partition is also static, the different schedules that have the same

set of jobs on a particular core have the same load, and it suffices to specify which jobs are

assigned to which core.

We study the makespan improvement that can be gained by allowing a dynamic solution. We

show that allowing a dynamic partition and a dynamic assignment can improve the makespan

by a factor of at most c, the number of cores. We also show an instance where by using a

dynamic partition and a static assignment we achieve an improvement factor arbitrarily close

to c. We show that allowing a dynamic assignment of the jobs, while keeping the cache partition

static, improves the makespan by at most a factor of 2, and that there is an instance where an

improvement of 2 − 2/c is achieved, for c ≥ 2.

Given an instance of the joint cache partition and job assignment problem, we denote by OSS

the optimal static cache partition and static job assignment, by ODS the optimal dynamic cache

partition and static job assignment, by OSD the optimal static cache partition and dynamic job

schedule and by ODD the optimal dynamic cache partition and dynamic job schedule. For any

solution A we denote its makespan by M(A).

Lemma 3.28. For any instance of the joint cache partition and job assignment problem,

M(OSS) ≤ cM(ODD).

Proof. Let A be the trivial static partition and schedule, that assigns all jobs to the first core

and allocates all the cache to this core. Let’s consider any job j that takes a total of α time

to run in the solution ODD. Whenever a fraction of job j runs on some core with some cache

partition, it has at most K cache available to it. Therefore, in solution A, when we run job j


continuously on one core with K cache, it takes at most α time. Since the total running time of

all the jobs in solution ODD is at most cM(ODD), we get M(OSS) ≤M(A) ≤ cM(ODD).

Corollary 3.29. For any instance of the joint cache partition and job assignment problem,

M(OSS) ≤ cM(ODS).

Proof. Clearly, M(ODS) ≥ M(ODD) for any instance. Combining this with Lemma 3.28, we
get that M(OSS) ≤ cM(ODS).

Lemma 3.30. For any ε > 0 there is an instance of the joint cache partition and job assignment

problem, such that M(OSS) > (c− ε)M(ODS).

Proof. Let b be an arbitrary constant. Let’s consider the following instance with two types of

jobs. There are c jobs of type 1, such that for each such job j, Tj(x) = ∞ for x < K and
Tj(K) = 1. There are bc jobs of type 2, such that for each such job j, Tj(x) = bc if x < K/c and
Tj(x) = 1 if x ≥ K/c.

Consider the following solution. The static job assignment runs b jobs of type 2 on each

core. After b time units, it runs the c jobs of type 1 on core 1. The dynamic cache partition

starts with each core getting K/c cache. The cache partition changes after b time units and core

1 gets all the cache. This solution has a makespan of b+ c and therefore M(ODS) ≤ b+ c.

There is an optimal static cache partition and static job assignment that allocates to each
core 0, K/c or K cache, because otherwise we can reduce the amount of cache allocated to a core
without changing the makespan of the solution. This implies that there are only two static cache
partitions that may be used by the optimal static solution: the partition in which p(i) = K/c for
each core i, and the partition that gives all the cache to a single core. It is easy to see that if
we use the cache partition where p(i) = K/c we get a solution with an infinite makespan because

of the jobs of type 1. Therefore this optimal static solution uses a cache partition that gives

all the cache to a single core. Given this partition, the optimal job assignment is to run all the

c jobs of type 1 on the core with all the cache, and assign to that core additional bc − (c − 1)

jobs of type 2. So the load on that core is bc + 1. Each of the c − 1 cores with no cache is

assigned exactly one job of type 2, and each such core has a load of bc. Therefore the ratio
M(OSS)/M(ODS) ≥ (bc + 1)/(b + c). The lower bound on this ratio approaches c as b approaches infinity. Since b is

an arbitrarily chosen constant, we can choose it large enough such that we get a lower bound

that is greater than c− ε, for any ε > 0.
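The following few lines are only a numeric illustration of this bound, not part of the proof; they evaluate the two makespans from the argument above for a few values of b.

    # Makespans for the instance in Lemma 3.30: the optimal static solution has makespan
    # b*c + 1 (c type-1 jobs plus b*c - c + 1 type-2 jobs on the all-cache core), while the
    # dynamic-partition/static-assignment solution has makespan at most b + c, so the
    # ratio tends to c as b grows.

    def ratio(b, c):
        static = b * c + 1
        dynamic = b + c
        return static / dynamic

    for b in (10, 100, 10000):
        print(b, ratio(b, c=4))    # approaches 4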


Corollary 3.31. For any ε > 0 there is an instance of the joint cache partition and job assign-

ment problem, such that M(OSS) > (c− ε)M(ODD).

Proof. Consider the same instance as in the proof of Lemma 3.30. For that instance, M(OSS) >

(c − ε)M(ODS). It follows that M(OSS) > (c − ε)M(ODD) for the instance in Lemma 3.30,
since M(ODS) ≥ M(ODD).

Lemma 3.32. For any instance of the joint cache partition and job assignment problem,

M(OSS) ≤ 2M(OSD).

Proof. Consider any instance of the joint cache partition and job assignment problem and let

OSD = (p, S). Let xij be the fraction of job j's work unit that is carried out by core i. Formally,
xij = |{t | (t, i) ∈ S^(−1)(j)}| / Tj(p(i)). Let's consider the instance of scheduling on unrelated machines where
job j runs on core i in time Tj(p(i)). Since for every job j, ∑_{i=1}^{c} xij = 1, the matrix x = (xij) is a fractional
assignment for that instance of the unrelated machines scheduling problem. The makespan of

this fractional solution is M(OSD). Let y be the optimal fractional assignment of the defined

instance of unrelated machines. We know that if we apply Lenstra’s algorithm [LST90] to this

instance, we get an integral assignment, denoted by z, such that the makespan of z is at most

twice the makespan of y and therefore at most twice the makespan of x. Assignment z is a

static job assignment and therefore (p, z) is a solution to the joint static cache partition and

static job assignment problem of our original instance, with makespan at most twice M(OSD).

It follows that M(OSS) ≤ 2M(OSD).

Lemma 3.33. For c ≥ 2, there is an instance of the joint cache partition and job assignment

problem such that M(OSS)/M(OSD) = 2 − 2/c.

Proof. Consider the following instance. There are c jobs, where each takes 1 − 1/c time regardless
of the cache allocation, and one job that takes 1 time unit, regardless of cache. The optimal
static schedule for this instance assigns two jobs of size 1 − 1/c to the first core, assigns one job
of size 1 − 1/c to each of the cores 2, . . . , c − 1, and assigns the unit sized job to the last core.
This yields a makespan of 2 − 2/c. The optimal dynamic assignment assigns one job of size 1 − 1/c
fully to each core, and then splits the unit job equally among the cores, to yield a makespan of
1. Notice that this can be scheduled such that the unit job never runs simultaneously on more
than one core. This is achieved by running the ith fraction, of size 1/c, of the unit job on core i at
time (i − 1)/c. The other jobs, that are fully assigned to a single core, are paused and resumed later,


if necessary, to accommodate the fractions of the unit sized job. Therefore in this instance the
ratio M(OSS)/M(OSD) is exactly 2 − 2/c.


Chapter 4

Static partitions under bijective analysis

We consider a model where c cores share a cache of size K. Each core has a single job to perform

that is defined by a sequence of page requests. We assume that the sequences of all cores are

of the same length. Core i requests pages from a working set Wi of possible pages. We assume

that for any two cores i and j, Wi and Wj are disjoint. The cache is statically partitioned

between the cores according to a cache partition p. Core i has p(i) cache allocated to it which

is populated by a subset of Wi of size p(i). Core i serves its page requests according to their

order in the core’s sequence. For each page request by core i, if the requested page is currently

in the cache then core i has a cache hit and otherwise it has a cache miss. When core i has

a cache miss on page x, it can replace one of its currently cached pages by x. We denote by E

an eviction policy that decides for each core and each cache miss, which cached page, if any, to

replace. We study this problem in an online setting where the request sequences are not known

in advance. An algorithm for the problem is a pair (p,E) where p is a static cache partition and

E is an eviction policy. The goal is to find an algorithm that minimizes the maximum number

of cache misses for any core.

Bijective analysis [ADLO07] is a technique to directly compare online algorithms. Bijective

analysis defines that online algorithm A is at least as good as online algorithm B if there is a

bijection π that maps any input s to an input π(s) of the same length, such that for any input

s, A(π(s)) ≤ B(s).

Theorem 4.3, which gives the main result of this section, shows that if |Wi| = |Wj| for any i, j


and the set of possible inputs of length n is ∏_{i=1}^{c} Wi^n, the set of all possible combinations of c
sequences of length n such that the pages in sequence i are in Wi, then any online algorithm that
partitions the cache equally among the cores is at least as good as any other online algorithm,
regardless of the eviction policies.

Bijective analysis was first introduced in [ADLO07] in the context of online algorithms for

the single-core paging problem. Bijective analysis can differentiate between different algorithms

that may appear to be equal under competitive analysis [ST85]. Angelopoulos and Schweitzer
[AS09] used bijective analysis to prove the optimality of Least-Recently-Used as a page eviction
policy for single-core caching, assuming locality of reference. In contrast, the competitive ratio
of LRU is the same as that of a wider class of marking algorithms [KMRS88], some of which have

a poor performance in practice. This optimality result is strongly suggested by experimental

work on caching but is not captured by competitive analysis.

Definition 4.1. Let A and B be two online algorithms for a minimization problem and let In

denote all possible inputs of size n. Algorithm A is at least as good as algorithm B, on inputs of

size n, if there is a bijection π : In → In such that for any s ∈ In, A(π(s)) ≤ B(s). We denote

this by A ⪯n B.

Directly showing the bijection, as required by Definition 4.1, can be difficult in various

scenarios. Theorem 4.2 provides a technique to show the existence of such bijections using

stochastic dominance. Stochastic dominance was previously used to measure performance of

online algorithms in [HV08] and applied to the online bin coloring problem. The following result

shows the equivalence of bijective analysis and stochastic dominance analysis.

Theorem 4.2. Let A and B be two algorithms for a minimization problem. Let In be the set

of all inputs of size n. Then A ⪯n B if and only if for any x, Pr(A(s) ≤ x) ≥ Pr(B(s) ≤ x)

assuming s is uniformly sampled from In.

Proof. Assume A ⪯n B. There is a bijection π such that for any s ∈ In, A(π(s)) ≤ B(s). For
any algorithm Alg, let IAlg(x) = {s ∈ In | Alg(s) ≤ x}.

For any s ∈ IB(x), A(π(s)) ≤ B(s) ≤ x and therefore π(s) ∈ IA(x). This shows that

π(IB(x)) ⊆ IA(x) and therefore |IA(x)| ≥ |π(IB(x))| = |IB(x)|. Assuming s is uniformly

sampled from In we get Pr(A(s) ≤ x) = |IA(x)||In| ≥

|IB(x)||In| = Pr(B(s) ≤ x).

For the other direction, assume that for every x, Pr(A(s) ≤ x) ≥ Pr(B(s) ≤ x), assuming

s is uniformly sampled from In. This implies that |IA(x)| ≥ |IB(x)|, for every x.


Let c1 < c2 < . . . < cb be the possible cost values returned by algorithm B for inputs

in In. We build a bijection π from In to In as follows. We start by selecting an arbitrary

injection π1 from IB(c1) to IA(c1). Let Q1 = IA(c1) \ π1(IB(c1)) be the set of unmatched

elements in IA(c1). In the ith step, 2 ≤ i ≤ b, we select an arbitrary injection πi of the

elements in IB(ci) \ IB(ci−1) to elements in Qi−1 ∪ (IA(ci) \ IA(ci−1)). We then set Qi =

(IA(ci) \ (π1(IB(c1))) \i⋃

j=1πj(IB(cj) \ IB(cj−1)), the set of the unmatched elements in IA(ci).

Let π be the bijection from In to In such that π(x) = πi(x) for any x such that B(x) = ci.

It is easy to see that this is a well defined bijection, if we can find injections πi for any i. For

any i, we show that the size of the domain of πi is at most the size of the range allowed by this

algorithm. For i = 1, we know that |IB(c1)| ≤ |IA(c1)| because of the stochastic dominance

assumption. For i ≥ 2, the size of the domain of πi is |IB(ci) \ IB(ci−1)| = |IB(ci)|−|IB(ci−1)| ≤

|IA(ci)|− |IA(ci−1)|+ |IA(ci−1)|− |IB(ci−1)| = |IA(ci) \ IA(ci−1)|+ |Qi−1| which is the size of the

allowed range. Therefore this algorithm is able to injectively match, in each step i, the elements

in IB(ci) \ IB(ci−1) to elements in Qi−1 ∪ (IA(ci) \ IA(ci−1)).

Let s ∈ In and let i be such that B(s) = ci. By π’s construction, we know that π(s) ∈ IA(ci)

and therefore A(π(s)) ≤ ci = B(s). The existence of π implies that A ⪯n B.
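An equivalent way to realize such a bijection, when the input set and both cost functions are available explicitly, is to match inputs rank by rank by cost. The sketch below illustrates this on a toy example; the dictionaries and example costs are hypothetical and only meant to show the matching step.

    # Sketch: build a bijection pi with A(pi(s)) <= B(s), given that the costs of A
    # stochastically dominate the costs of B over a finite input set.  costA and costB
    # map every input in I_n to its cost.

    def build_bijection(costA, costB):
        inputs = list(costA)
        by_A = sorted(inputs, key=lambda s: costA[s])   # ranks under algorithm A
        by_B = sorted(inputs, key=lambda s: costB[s])   # ranks under algorithm B
        pi = {s_B: s_A for s_B, s_A in zip(by_B, by_A)} # match rank by rank
        assert all(costA[pi[s]] <= costB[s] for s in inputs), "dominance fails"
        return pi

    # Toy example with four inputs on which A's cost distribution dominates B's.
    costA = {"s1": 1, "s2": 1, "s3": 2, "s4": 3}
    costB = {"s1": 3, "s2": 1, "s3": 2, "s4": 2}
    print(build_bijection(costA, costB))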

We now prove our main result, under the assumption that for any core i, |Wi| = mK.

Theorem 4.3. Let p′ be the cache partition that allocates cache equally to the cores. Let p be

any other cache partition. Let E and E′ be any two page eviction policies. Then for any n,

(p′, E′) ⪯n (p, E).

Proof. We assume w.l.o.g. that for each i, p(i) is the fraction of the cache of size K allocated to
core i and therefore ∑_{i=1}^{c} p(i) = 1.

Let In be the set of all c request sequences, each of length n, and assume we uniformly

sample an input from In. This implies that each page request, for each core, is uniformly and

independently selected from Wi. Thus for any cache partition p and any eviction policy E, the

probability of any page request of core i to be a cache hit is p(i)/m. Therefore the number of cache
misses of core i is a binomial random variable, Xi = Binom(1 − p(i)/m, n), with success probability
1 − p(i)/m and n trials. Let X = max_i Xi. For any cache partition p and for any 0 ≤ b ≤ n, let
Fp(b) = Pr(X(s) ≤ b | p) denote the cumulative distribution function of X at point b.

For b = n, Fp(n) = 1 for any p, and in particular Fp(b) is maximized by p = p′. For b = 0,

Fp(0) = ∏_{i=1}^{c} (p(i)/m)^n = (1/m^(nc)) (∏_{i=1}^{c} p(i))^n and it is easy to see that Fp(0) is maximized by p = p′.


The set of all p's such that ∑_{i=1}^{c} p(i) = 1 is closed and bounded and Fp(b) is a continuous function
of p, and thus the maximum of Fp(b) is attained by some cache partition. We show that for any
b ∈ {1, . . . , n − 1}, any partition p ≠ p′ does not maximize Fp(b). It then follows that Fp(b) is

maximized by p = p′. By Theorem 4.2, this implies the optimality of any algorithm that uses

cache partition p′, regardless of the eviction policy.

Let p ≠ p′ be a cache partition. We assume that for every i, p(i) ≠ 0 as otherwise X = n
and any cache partition with non-zero cache allocations is better than p.

We can represent the cumulative distribution function of Xi for any i as a regularized
incomplete beta function:

    Pr(Xi ≤ b | p(i)) = (n − b) (n choose b) ∫_0^(p(i)/m) t^(n−b−1) (1 − t)^b dt

And therefore:

    Fp(b) = Pr(X ≤ b | p) = ∏_{l=1}^{c} Pr(Xl ≤ b | p(l)) = (n choose b)^c (n − b)^c ∏_{l=1}^{c} ∫_0^(p(l)/m) t^(n−b−1) (1 − t)^b dt    (4.1)

Since p ≠ p′ there are i, j such that p(i) < p(j). To simplify notation we denote α =
p(i) + p(j), q = p(i) and p(j) = α − q. We know that q < α/2. We are interested in how Fp(b)
changes if we increase q, and keep the cache allocated to any core other than i, j fixed. There
are two terms in the product in Equation 4.1 that depend on q, the term for core i and the term
for core j, and thus if we denote

    C = (n choose b)^c (n − b)^c ∏_{l ∉ {i,j}} ∫_0^(p(l)/m) t^(n−b−1) (1 − t)^b dt,

which is independent of q, we get that:

    ∂F/∂q = C [ (1/m) (q/m)^(n−b−1) (1 − q/m)^b ∫_0^((α−q)/m) t^(n−b−1) (1 − t)^b dt
                − (1/m) ((α − q)/m)^(n−b−1) (1 − (α − q)/m)^b ∫_0^(q/m) t^(n−b−1) (1 − t)^b dt ]

We show that this derivative is positive, which implies that we can increase Fp(b) by increasing q.
Since b < n, C is a positive constant, and clearly 1/m is positive. By ignoring both


constants and substituting x = qt/(α − q) in the first integral we get:

    (q/m)^(n−b−1) (1 − q/m)^b ∫_0^(q/m) (x(α − q)/q)^(n−b−1) (1 − x(α − q)/q)^b ((α − q)/q) dx
        − ((α − q)/m)^(n−b−1) (1 − (α − q)/m)^b ∫_0^(q/m) t^(n−b−1) (1 − t)^b dt

Rearranging we get:

    ((α − q)/q) ((α − q)/m)^(n−b−1) (1 − q/m)^b ∫_0^(q/m) x^(n−b−1) (1 − x(α − q)/q)^b dx
        − ((α − q)/m)^(n−b−1) (1 − (α − q)/m)^b ∫_0^(q/m) t^(n−b−1) (1 − t)^b dt

Dividing both operands by the positive value ((α − q)/m)^(n−b−1), we get that it suffices to prove
that:

    ((α − q)/q) (1 − q/m)^b ∫_0^(q/m) x^(n−b−1) (1 − x(α − q)/q)^b dx > (1 − (α − q)/m)^b ∫_0^(q/m) t^(n−b−1) (1 − t)^b dt    (4.2)

Since the integration boundaries are the same for both sides of the inequality, it suffices to

prove

    ((α − q)/q) (1 − q/m)^b x^(n−b−1) (1 − x(α − q)/q)^b ≥ (1 − (α − q)/m)^b x^(n−b−1) (1 − x)^b

for any x ∈ [0, q/m], with strict inequality for x ∈ (0, q/m). Since (α − q)/q > 1, it suffices to prove

    (1 − q/m)^b x^(n−b−1) (1 − x(α − q)/q)^b ≥ (1 − (α − q)/m)^b x^(n−b−1) (1 − x)^b    (4.3)

If x = 0 we clearly get an equality. Dividing the inequality by x^(n−b−1) and taking the b-th
root from both sides, we get that it is sufficient to show that:

    (1 − q/m)(1 − x(α − q)/q) ≥ (1 − (α − q)/m)(1 − x)    (4.4)


which is equivalent to:

    x (α − 2q)/q ≤ (α − 2q)/m

This last inequality strictly holds for any x < q/m and there is an equality for x = q/m.

Therefore, any partition p ≠ p′ does not maximize Fp(b) and this concludes the proof that
the cache partition p′ maximizes Fp(b) for any b and thus proves the theorem.
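For intuition, the statement of Theorem 4.3 is easy to check numerically: Fp(b) is just a product of binomial cumulative distribution functions, one per core. The following sketch (using scipy, with illustrative parameters and a coarse grid of partitions) evaluates it under the assumption |Wi| = mK, so that a core holding a fraction p(i) of the cache hits each request with probability p(i)/m.

    from scipy.stats import binom

    n, m = 50, 3

    def F(p, b):
        out = 1.0
        for p_i in p:
            out *= binom.cdf(b, n, 1.0 - p_i / m)   # Pr(X_i <= b)
        return out

    partitions = [(x / 10.0, 1.0 - x / 10.0) for x in range(1, 10)]   # two cores
    for b in (30, 35, 40):
        print(b, max(partitions, key=lambda p: F(p, b)))   # expect the equal split (0.5, 0.5)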

A natural extension of this result is to consider cores of arbitrary working set sizes. That
is, the case where |Wi| and |Wj| may be different for different cores i ≠ j. An intuitive
conjecture is that the partition p(i) = |Wi| / ∑_{j=1}^{c} |Wj| is the optimal cache partition under bijective
analysis.

This conjecture turns out to be false. Fixing the problem’s parameters n,K and the |Wi|’s,

we can directly compute Pr(X ≤ b | p) for any b and p by using the representation of the

cumulative distribution function of each binomial random variable as a regularized incomplete

beta function. By computationally going over all 0 ≤ b ≤ n and all cache partitions, we found

that for different b’s, Pr(X ≤ b | p) is maximized by a different partition pb. This means that no

static cache partition stochastically dominates all other partitions, and thus there is no optimal

static partition under bijective analysis when cores have working sets of different sizes.

For example, consider the case of c = 3, K = 30, |W1| = 40, |W2| = 60, |W3| = 80 and the

length of the request sequences n = 100. Figure 4.1 shows that no single partition maximizes

the cumulative distribution function for all values of b.
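A sketch of this kind of exhaustive computation is given below for the example parameters. It assumes the probabilistic model in which core i, holding a_i of the K cache pages, hits each request with probability a_i/|Wi|; the exact maximizing partitions it prints depend on the parameters, which is precisely what Figure 4.1 illustrates.

    # Sketch: c = 3, K = 30, |W| = (40, 60, 80), n = 100.  Core i with a_i cache pages
    # misses each request with probability 1 - a_i/|W_i|, so its miss count is
    # Binom(n, 1 - a_i/|W_i|); scan all integer partitions of K and record, for several
    # values of b, which partition maximizes Pr(max_i X_i <= b).
    from scipy.stats import binom

    K, n, W = 30, 100, (40, 60, 80)

    def F(a, b):
        out = 1.0
        for a_i, w_i in zip(a, W):
            out *= binom.cdf(b, n, 1.0 - a_i / w_i)
        return out

    for b in (60, 70, 80, 90):
        best = max(((a1, a2, K - a1 - a2)
                    for a1 in range(K + 1) for a2 in range(K + 1 - a1)),
                   key=lambda a: F(a, b))
        print(b, best)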

Looking at several similar figures for different parameters gives rise to the following conjec-

ture.

Conjecture 4.4. Let pb be the cache partition that maximizes Pr(X ≤ b | p) over all cache partitions p. Then for any i,
pb(i) is between 1/c and |Wi| / ∑_{j=1}^{c} |Wj|.

Notice that this conjecture is a generalization of Theorem 4.3. The proof of Theorem 4.3

is based on showing that for any partition that does not allocate the cache equally among the

cores, there is a pair p(i) and p(j) such that by changing them while maintaining their sum

fixed we can increase Pr(X ≤ b | p) for any b.

One may try to prove Conjecture 4.4 by a similar approach. That is, by showing that we

can improve a partition by locally changing p(i) and p(j) for some two cores i, j such that at

least one of them violates the condition in the conjecture.


Figure 4.1: No partition maximizes the cumulative distribution function for all values of b. Consider c = 3, K = 30, |W1| = 40, |W2| = 60, |W3| = 80 and request sequences of length n = 100. For each value of b, the three lines indicate the cache partition that maximizes Pr(X ≤ b | p): the red line is p(3), the green line is p(2), and the blue line is p(1).

Unfortunately, we can show that this approach fails. Assume, for example, that K = 30, c = 3, n = 100 and |W1| = 50, |W2| = 40

and |W3| = 10. Let p be the partition p(1) = 0.42, p(2) = 0.41 and p(3) = 0.17. It is easy to see

that p does not meet the criteria of Conjecture 4.4. Specifically, p(2) is too large and it is the

only cache allocation in p that is out of the range specified by Conjecture 4.4. If we consider

the pair p(2) and p(3), fix their sum and consider the derivative of Pr(X ≤ b | p) as a function

of p(2) we get that for b ≥ 85 the derivative is positive. Similarly, if we consider the pair p(2)

and p(1), we get that the derivative of Pr(X ≤ b | p) as a function of p(2) is positive for b ≤ 52.

This leads to two possible future research directions for proving Conjecture 4.4:

1. Assuming partition p does not meet the criteria of the conjecture, can we show that for any
b there is a pair p(i(b)) and p(j(b)) such that, while keeping p(i(b)) + p(j(b)) fixed, we can increase
Pr(X ≤ b | p) and reach a partition where both p(i(b)) and p(j(b)) are closer to the required
ranges?

2. Can we find a more global way to modify the partition p such that in some metric it gets
closer to satisfying Conjecture 4.4 and Pr(X ≤ b | p) increases, for any b?


Chapter 5

Cache partitions in the speed-aware model

As in the previous chapter, we assume that each core has a single job to perform. However,
rather than specifying the job by a request sequence, we specify the job by a speed function
vi(a) that indicates the speed at which core i progresses through its job if it is allocated a cache
pages. We assume that for any core i, vi(a) is a non-decreasing function of a. Our goal is to
find the cache partition for which the speed of the slowest core is maximized.

To motivate the assumption that the cores’ speed functions are known, we consider the

following two scenarios:

• Consider an offline multi-core caching problem where each core has a sequence of page

requests to serve. Assume that a cache hit takes 1 time unit and a cache miss takes τ > 1

time units. Assume also that each core statically populates its allocated cache of size a

with the optimal subset of pages, that is the set of the a most frequent pages in the core’s

sequence. Let fi(a) denote the sum of the frequencies of these a most frequent pages for

core i. Then the speed vi(a) = 1τ (1−fi(a))+fi(a). Note that given the requests sequences

we can compute vi(a) for any i and a.

• Speeds are also available in a probabilistic model, where the pages requested by each

core are drawn independently from a given distribution on its working set. Let qi(x) be

the probability of core i requesting page x. Assume, w.l.o.g., that the pages of core i
are indexed such that qi(x) is a non-increasing function of the page x. It is clear that if core
i has a cache pages allocated to it, the optimal pages to place in the cache are the a


pages with the highest probabilities according to qi, even if it can dynamically change the
contents of its cache. If core i is allocated a pages in the cache then its expected speed is
vi(a) = (1/τ) ∑_{x=a+1}^{wi} qi(x) + ∑_{x=1}^{a} qi(x), where wi = |Wi|. (A short sketch of this computation follows the list.)
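As referenced above, here is a small sketch of this computation for a single core, assuming its page distribution q is given as a list sorted in non-increasing order; the example numbers are hypothetical.

    # Sketch: expected speed of a core that caches its a most probable pages, when a hit
    # takes 1 time unit and a miss takes tau time units.

    def speed(q, a, tau):
        hit = sum(q[:a])                  # probability mass of the a cached pages
        return hit + (1.0 - hit) / tau    # v_i(a) = sum_{x<=a} q(x) + (1/tau) sum_{x>a} q(x)

    q = [0.4, 0.3, 0.2, 0.1]              # hypothetical distribution over 4 pages
    print([round(speed(q, a, tau=10), 3) for a in range(5)])

Note that the resulting values are non-decreasing and have decreasing increments, matching the concavity assumption used later in this chapter.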

5.1 Finding the optimal static partition

The following greedy algorithm finds the optimal static partition.

Algorithm 1 Greedy Static Partition

    for all i = 1, . . . , c do
        ai = 0                           ▷ start cores with no cache at all
    end for
    for i = 1, . . . , K do
        j = arg min_j vj(aj)             ▷ find the currently slowest core
        aj = aj + 1                      ▷ give it an additional cache page
    end for
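A direct transcription of Algorithm 1 into Python may be useful as a sketch; the speed functions below are hypothetical examples passed in as callables.

    # Sketch of Algorithm 1: repeatedly give the next cache page to the currently
    # slowest core.  speeds is a list of c non-decreasing functions v_i(a).

    def greedy_static_partition(speeds, K):
        a = [0] * len(speeds)                  # start cores with no cache at all
        for _ in range(K):
            j = min(range(len(speeds)), key=lambda i: speeds[i](a[i]))
            a[j] += 1                          # give the slowest core one more page
        return a

    # Example with three concave, non-decreasing speed functions (illustrative only).
    speeds = [lambda x, w=w: 1 - 0.5 ** (x / w) for w in (2, 4, 8)]
    print(greedy_static_partition(speeds, K=12))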

Lemma 5.1. Algorithm 1 generates the optimal static cache partition.

Proof. Let p be the cache partition generated by Algorithm 1. Assume to the contrary that

there is a partition ψ such that the slowest core under partition ψ is faster than the slowest core

under partition p. Let i be the index of the slowest core under partition p. By the selection

of ψ we know that vi(p(i)) < vi(ψ(i)). Since vi is non-decreasing this implies that ψ(i) > p(i).

Since both partitions allocate a total of K cache we know that there is another core j such that

ψ(j) < p(j). Let’s consider the iteration in which Algorithm 1 gives core j its (ψ(j) + 1) cache

page. Core j’s speed at the beginning of that iteration is vj(ψ(j)) and it is at most the speed

of core i, at that iteration. Since in Algorithm 1 the speeds of the cores only increase as more

cache pages are allocated, it follows that vi(p(i)) ≥ vj(ψ(j)). Since vj(ψ(j)) is at least the speed
of the slowest core under ψ, this contradicts the selection of ψ.

5.2 Variable cache partitioning

In this section we consider a variable cache partition. Let β = β(K, c) = (K−1 choose c−1) denote the
number of possible cache partitions. Let p1, . . . , pβ be all possible cache partitions. A variable
cache partition is a distribution x1, . . . , xβ over p1, . . . , pβ. We are given the cores' speed functions
v1, . . . , vc and our goal is to find a variable cache partition that maximizes the expected speed
of the slowest core. We assume that for any core, allocating more cache has a marginally

decreasing benefit. Formally, we assume for any i, vi is non-decreasing and concave. Note


that in both scenarios described in the beginning of Chapter 5, the vi’s are non-decreasing and

concave.

The following example shows that the expected speed of the slowest core in a variable

cache partition can in fact be better than the speed of the slowest core in the optimal static

cache partition. We consider a probabilistic model which, as we previously established, is a

special case of the speed-aware model. Assume there are 2 cores and a cache of size 3. Each

core accesses one of two pages, each with probability 1/2. In this case the speed functions are
v1(1) = v2(1) = 1/2 + 1/(2τ), and v1(2) = v2(2) = v1(3) = v2(3) = 1. The optimal static partition gives 2
pages to one core and gives 1 page to the other core. The speed of the slowest core under this
partition is 1/2 + 1/(2τ). Let j denote the index of the above mentioned static partition and let j′
denote the index of the symmetric cache partition (switch the core that gets 2 pages with the one
that gets one page); then the optimal variable cache partition is xj = xj′ = 1/2 and xl = 0 for any
l ∉ {j, j′}. Under this variable cache partition, both cores have an expected speed of 1/(4τ) + 3/4,
which is better than the speed of the slowest core in the optimal static partition.
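The numbers in this example are easy to verify directly; τ = 4 below is an arbitrary choice.

    # The example above: static slowest-core speed vs. expected speed under the variable
    # partition that mixes the two asymmetric partitions with weight 1/2 each.
    tau = 4.0
    static_slowest = 0.5 + 0.5 / tau                   # core with one of its two pages cached
    variable_expected = 0.5 * 1.0 + 0.5 * static_slowest
    print(static_slowest, variable_expected)           # variable_expected = 3/4 + 1/(4*tau)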

Let pj denote the jth cache partition. Each cache partition pj corresponds to a vector

of cores’ speeds v1(pj(1)), . . . , vc(pj(c)). To simplify notation, we denote vi,j = vi(pj(i)), and

the vector of speeds is v1,j , . . . , vc,j . The optimal variable cache partition is a solution of the

following linear program:

Maximize λ subject to

    ∀ 1 ≤ i ≤ c:  ∑_{j=1}^{β} vi,j xj ≥ λ

    ∑_{j=1}^{β} xj = 1

    λ ≥ 0 and ∀ 1 ≤ j ≤ β: xj ≥ 0

The first c constraints ensure that the expected speed of each core is at least λ and the last

constraint ensures that the variable cache partition is a distribution. The number of variables in

this linear program is the number of cache partitions, β = Θ(K^c), and the number of constraints
is c + 1.
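For very small instances the primal program can be written down explicitly and handed to a generic LP solver. The following sketch uses scipy.optimize.linprog on the two-core example above; it is an illustration only, since β grows far too quickly for this approach to scale.

    # Sketch: solve the variable-partition LP explicitly for a tiny instance.
    # Variables are (x_1, ..., x_beta, lambda); we minimize -lambda.
    import numpy as np
    from scipy.optimize import linprog

    # speeds[i][j] = v_{i,j}: speed of core i under partition j.
    # Two cores, cache of size 3, partitions (1, 2) and (2, 1), tau = 4 as in the example.
    speeds = np.array([[0.625, 1.0],
                       [1.0, 0.625]])
    c_cores, beta = speeds.shape

    cost = np.zeros(beta + 1)
    cost[-1] = -1.0                                      # maximize lambda
    A_ub = np.hstack([-speeds, np.ones((c_cores, 1))])   # lambda - sum_j v_ij x_j <= 0
    b_ub = np.zeros(c_cores)
    A_eq = np.append(np.ones(beta), 0.0).reshape(1, -1)  # sum_j x_j = 1
    b_eq = np.array([1.0])

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (beta + 1), method="highs")
    print(res.x)       # expect x = (1/2, 1/2) and lambda = 0.8125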


Consider the dual linear program:

Minimize z subject to

    ∀ 1 ≤ j ≤ β:  ∑_{i=1}^{c} vi,j yi ≤ z

    ∑_{i=1}^{c} yi = 1

    z ≥ 0 and ∀ 1 ≤ i ≤ c: yi ≥ 0

In the dual linear program the variable yi represents the weight of core i. We define the
weighted speed of cache partition pj, given core weights y1, . . . , yc, as ∑_{i=1}^{c} vi,j yi, which is the
weighted average of the speeds of the cores under partition pj. A solution of the dual program
is a set of weights y1, . . . , yc that minimizes the weighted speed of the fastest cache partition.

This linear program has c+1 variables and an exponential number of inequalities. In [GLS81]

Grotschel et al proposed a technique to optimize a linear program with an exponential number

of inequalities in polynomial time, independent of the number of inequalities. The technique is
based on Khachiyan's ellipsoid method [Kha79], which does not require the inequalities to be
given explicitly but only requires a separation oracle for the solution set of the linear program. A
thorough review of linear programming, including the ellipsoid and separation oracle techniques,

can be found in [Sch98].

Definition 5.2. A separation oracle for a linear program of n variables is an algorithm that

given x ∈ Rn decides if x is a feasible solution of the linear program or finds a constraint that

is violated by x.

The following theorem is proved in [GLS81]:

Theorem 5.3. Let Q be a solution set of linear inequalities, Q = {x | Ax ≤ b} ⊂ R^n, and d ∈ R^n
the optimization direction, i.e. maximizing d^T x for x ∈ Q. Let T be the maximal bit-encoding

length of any value in A and b. Let S be the maximal bit-encoding length of any value in d.

Then:

Given a separation oracle for Q that for every x ∈ Rn runs in time ψ, which is polynomial in n

and T , and returns a violated constraint of encoding length polynomial in n and T , the ellipsoid

method can optimize the linear program in time polynomial in ψ, n, T and S.

A separation oracle for our dual linear program is an algorithm that, given a point in R^(c+1)
which is composed of y ∈ R^c and z ∈ R, determines if any inequality is violated by y, z and if


so returns one such violated inequality. A violated inequality corresponds to a cache partition

whose weighted speed with respect to y is greater than z. Our separation oracle checks the last

constraint explicitly and if it holds it then computes the fastest cache partition. If the weighted

speed of this partition is at most z then y, z is a feasible solution to the dual linear program.

Otherwise, the fastest cache partition defines a violated inequality that our separation oracle

returns.

Our separation oracle computes the fastest cache partition as follows. Let ∆vi(x + 1) =

vi(x+ 1)− vi(x) be the speed increase for core i if it is given an additional cache page on top of

the x pages already allocated to it. We start with all the cores having no cache, that is p(i) = 0,

for any i ∈ 1, . . . , c. We find core i = arg max1≤i≤c yi∆vi(p(i) + 1) and give it another cache

page, p(i) := p(i) + 1. We repeat this step K times.

After all the cache pages are allocated to the cores, we check if the weighted speed of

the resulting cache partition, ∑_{i=1}^{c} yi vi(p(i)), is greater than z and if so, we return the inequality

corresponding to this partition as a violated inequality. Otherwise, our separation oracle decides

that the given point is a feasible solution of the dual linear program.
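A sketch of this separation oracle in Python follows; the speed functions are again hypothetical callables, and the explicit check of the constraint ∑ yi = 1 is omitted for brevity.

    # Sketch of the separation oracle for the dual LP: greedily build the cache partition
    # with the largest weighted speed for the query weights y, then compare it against z.

    def separation_oracle(y, z, speeds, K):
        c = len(speeds)
        p = [0] * c
        for _ in range(K):
            # core whose next page yields the largest weighted marginal speed gain
            i = max(range(c), key=lambda i: y[i] * (speeds[i](p[i] + 1) - speeds[i](p[i])))
            p[i] += 1
        weighted = sum(y[i] * speeds[i](p[i]) for i in range(c))
        if weighted > z:
            return ("violated", p)     # the inequality of partition p is violated
        return ("feasible", None)

    speeds = [lambda x, w=w: 1 - 0.5 ** (x / w) for w in (2, 4, 8)]
    print(separation_oracle([0.2, 0.3, 0.5], z=0.6, speeds=speeds, K=12))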

Theorem 5.4. The cache partition generated by the above process is the fastest cache partition,

with respect to core weights y.

Proof. Let p be the cache partition generated by the above process. Assume to the contrary

that it is not the fastest with respect to y. Let ψ be a cache partition of the fastest weighted

speed with respect to y. We further select ψ among all partitions of the fastest weighted speed

as one that minimizes ∑_{i=1}^{c} |ψ(i) − p(i)|, the sum of absolute distances from p.

Let i be the first index such that for any i′ < i, ψ(i′) = p(i′) and ψ(i) ≠ p(i). We assume

w.l.o.g that p(i) < ψ(i) (otherwise we can rename the cores). Since both partitions allocate a

total of K pages, there is a core j > i such that p(j) > ψ(j).

Consider the iteration in which our algorithm decides to give core j its (ψ(j) + 1) page. Let

φ be the cache partition our algorithm has before this iteration. We know that φ(j) = ψ(j) and

p(i) ≥ φ(i). Since vi is a concave non-decreasing function we get that ∆vi is a non-increasing

function. Thus ∆vi(φ(i)) ≥ ∆vi(p(i)) ≥ ∆vi(ψ(i)). Since our algorithm decides in this iteration

to give the next cache page to core j we know that yj∆vj(ψ(j) + 1) ≥ yi∆vi(φ(i) + 1) ≥

yi∆vi(p(i) + 1) ≥ yi∆vi(ψ(i)). This inequality implies that if we take partition ψ and move one

page from core i to core j, it will not decrease the weighted speed of the partition, as moving


this page changes the weighted speed by yj∆vj(ψ(j) + 1) − yi∆vi(ψ(i)) ≥ 0. If this difference

is positive, it means that we have generated a faster partition which contradicts the choice of

ψ. Therefore we obtained a new partition whose weighted speed is the same as ψ's. Notice that
the sum of absolute distances between this partition and p is smaller than the sum of absolute

distances between ψ and p. This contradicts the selection of ψ and thus concludes the proof

that p is of the fastest weighted speed with respect to y.

Our separation oracle builds the fastest cache partition by sequentially allocating the K

cache pages. It allocates each page by comparing the speed improvement gained by each of

the c alternatives, which involves arithmetic operations on the core speeds and the given query

point y ∈ Rc. Thus the separation oracle’s time complexity involves O(Kc) steps, each taking a

time polynomial in the bit-encoding length of the given query point and the bit-encoding length

of the core speeds. The violated constraint that is returned by this separation oracle is always

a speeds vector of some cache partition, and thus its bit-encoding length is polynomial in c and

the encoding length of the core speeds.

Going back to Theorem 5.3 we can see that we have a linear program in R^(c+1) with upper

bound T on the encoding length of individual elements in the constraints and an optimization

direction of constant encoding length. Our separation oracle runs in polynomial time in K, c

and in the speeds encoding length and returns violated constraints whose encoding length is

polynomial in c and the encoding length of the core speeds. Thus Theorem 5.3 provides us with

an algorithm to solve this dual linear program, and the variable cache partition problem, in

time polynomial in K, c and the speeds encoding length.


Bibliography

[ADLO07] S. Angelopoulos, R. Dorrigiv, and A. Lopez-Ortiz. On the separation and equiva-

lence of paging strategies. In SODA, pages 229–237, 2007.

[AS09] S. Angelopoulos and P. Schweitzer. Paging and list update under bijective analysis.

In SODA, pages 1136–1145, 2009.

[Bel66] L. A. Belady. A study of replacement algorithms for a virtual-storage computer.

IBM Syst. J., 5(2):78–101, June 1966.

[BGV00] R. D. Barve, E. F. Grove, and J. S. Vitter. Application-controlled paging for a

shared cache. SIAM J. Comput., 29(4):1290–1303, February 2000.

[BW12] Vincenzo Bonifaci and Andreas Wiese. Scheduling unrelated machines of few dif-

ferent types. CoRR, abs/1205.0974, 2012.

[CH73] V. Chvatal and P. L. Hammer. Set-packing problems and threshold graphs. Tech-

nical Report CORR 73-21, Dep. of Combinatorics and Optimization, Waterloo,

Ontario, 1973.

[CPIM05] A. M. Campoy, I. Puaut, A. P. Ivars, and J. V. B. Mataix. Cache contents selection

for statically-locked instruction caches: An algorithm comparison. In Proceedings

of the 17th Euromicro Conference on Real-Time Systems, ECRTS ’05, pages 49–56,

Washington, DC, USA, 2005. IEEE Computer Society.

[Dre07] U. Drepper. What every programmer should know about memory, 2007.

http://people.redhat.com/drepper/cpumemory.pdf.

[EKS08] Tomas Ebenlendr, Marek Krcal, and Jiří Sgall. Graph balancing: a special case

of scheduling unrelated parallel machines. In Proceedings of the nineteenth an-


nual ACM-SIAM symposium on Discrete algorithms, SODA ’08, pages 483–490,

Philadelphia, PA, USA, 2008. Society for Industrial and Applied Mathematics.

[EL11] L. Epstein and A. Levin. Scheduling with processing set restrictions: PTAS results

for several variants. Int. J. Prod. Econ., 133(2):586 – 595, 2011.

[FPT07] H. Falk, S. Plazar, and H. Theiling. Compile-time decided instruction cache

locking using worst-case execution paths. In Proceedings of the 5th IEEE/ACM

international conference on Hardware/software codesign and system synthesis,

CODES+ISSS ’07, pages 143–148, New York, NY, USA, 2007. ACM.

[GLS81] M. Grotschel, L. Lovasz, and A. Schrijver. The ellipsoid method and its conse-

quences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.

[Gra69] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on
Applied Mathematics, 17:416–429, 1969.

[Has10] A. Hassidim. Cache replacement policies for multicore processors. In ICS, pages

501–509, 2010.

[HS88] D. S. Hochbaum and D. B. Shmoys. A polynomial approximation scheme for

scheduling on uniform processors: Using the dual approximation approach. SIAM

J. Comput., 17(3):539–551, 1988.

[HV08] Benjamin Hiller and Tjark Vredeveld. Probabilistic analysis of online bin coloring

algorithms via stochastic comparison. In ESA, pages 528–539, 2008.

[Ira96] S. Irani. Competitive analysis of paging: A survey. In In Proceedings of the

Dagstuhl Seminar on Online Algorithms, Dagstuhl, 1996.

[Kha79] L. G. Khachiyan. A polynomial algorithm in linear programming. Doklady

Akademii Nauk SSSR, 244:1093–1096, 1979.

[KMRS88] A. R. Karlin, M. S. Manasse, L. Rudolph, and D. D. Sleator. Competitive snoopy

caching. Algorithmica, 3:77–119, 1988.

[LLD+08] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights

into multicore cache partitioning: Bridging the gap between simulation and real

systems. In HPCA, pages 367–378, 2008.


[LLX12] T. Liu, M. Li, and C. J. Xue. Instruction cache locking for multi-task real-time

embedded systems. Real-Time Systems, 48(2):166–197, 2012.

[LOS12] A. Lopez-Ortiz and A. Salinger. Paging for multi-core shared caches. In ITCS,

pages 113–127. ACM, 2012.

[LST90] J. K. Lenstra, D. B. Shmoys, and Eva Tardos. Approximation algorithms for

scheduling unrelated parallel machines. Math. Program., 46:259–271, 1990.

[LX09] T. Liu, M. Li, and C. J. Xue. Minimizing WCET for real-time embedded

systems via static instruction cache locking. In Proceedings of the 2009 15th IEEE

Symposium on Real-Time and Embedded Technology and Applications, RTAS ’09,

pages 35–44, Washington, DC, USA, 2009. IEEE Computer Society.

[LZLX10] T. Liu, Y. Zhao, M. Li, and C. J. Xue. Task assignment with cache partitioning

and locking for WCET minimization on MPSoC. In ICPP, pages 573–582. IEEE

Computer Society, 2010.

[LZLX11] T. Liu, Y. Zhao, M. Li, and C. J. Xue. Joint task assignment and cache partition-

ing with cache locking for WCET minimization on MPSoC. J. Parallel Distrib.

Comput., 71(11):1473–1483, 2011.

[MCHvE06] A. M. Molnos, S. D. Cotofana, M. J. M. Heijligers, and J. T. J. van Eijndhoven.

Throughput optimization via cache partitioning for embedded multiprocessors. In

ICSAMOS, pages 185–192, 2006.

[MP95] N. V. R. Mahadev and U. N. Peled. Threshold graphs and related topics, volume 56

of Annals of Discrete Mathematics. Elsevier, 1995.

[Sch98] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley, June

1998.

[ST85] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging

rules. Commun. ACM, 28(2):202–208, 1985.

[SV05] E. V. Shchepin and N. Vakhania. An optimal rounding gives a better approxi-

mation for scheduling unrelated machines. Oper. Res. Lett., 33(2):127–133, March

2005.


[VLX03] X. Vera, B. Lisper, and J. Xue. Data cache locking for higher program predictabil-

ity. In Proceedings of the 2003 ACM SIGMETRICS international conference on

Measurement and modeling of computer systems, SIGMETRICS ’03, pages 272–

282, New York, NY, USA, 2003. ACM.