

The Data Locality of Work Stealing

Umut A. Acar
School of Computer Science
Carnegie Mellon University
[email protected]

Guy E. Blelloch
School of Computer Science
Carnegie Mellon University
[email protected]

Robert D. Blumofe
Department of Computer Sciences
University of Texas at Austin
[email protected]

    Abstract

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation.

As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total instructions (work), for which, when using work stealing, the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of the cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing.

For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves the performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.

    1 Introduction

Many of today's parallel applications use sophisticated, adaptive algorithms which are best realized with parallel programming systems that support dynamic, lightweight threads, such as Cilk [8], Nesl [5], Hood [10], and many others [3, 16, 17, 21, 32]. The core of these systems is a thread scheduler that balances load among the processes. In addition to a good load balance, however, good data locality is essential in obtaining high performance from modern parallel systems.

Several researchers have studied techniques to improve the data locality of multithreaded programs. One class of such techniques is based on software-controlled distribution of data among the local memories of a distributed shared-memory system [15, 22, 26]. Another class of techniques is based on hints supplied by the programmer so that "similar" tasks might be executed on the same processor [15, 31, 34]. Both these classes of techniques rely on the programmer or compiler to determine the data access patterns in the program, which may be very difficult when the program has complicated data access patterns. Perhaps the earliest class of techniques was to attempt to execute threads that are close in the computation graph on the same processor [1, 9, 20, 23, 26, 28]. The work-stealing algorithm is the most studied of these techniques [9, 11, 19, 20, 24, 36, 37]. Blumofe et al. showed that fully-strict computations achieve provably good data locality [7] when executed with the work-stealing algorithm on a dag-consistent distributed shared-memory system. In recent work, Narlikar showed that work stealing improves the performance of space-efficient multithreaded applications by increasing the data locality [29]. None of this previous work, however, has studied upper or lower bounds on the data locality of multithreaded computations executed on existing hardware-controlled shared-memory systems.

In this paper, we present theoretical and experimental results on the data locality of work stealing on hardware-controlled shared-memory systems (HSMSs). Our first set of results are upper and lower bounds on the number of cache misses in multithreaded computations executed by the work-stealing algorithm. Let $M_1(C)$ denote the number of cache misses in the uniprocessor execution and $M_P(C)$ denote the number of cache misses in a $P$-processor execution of a multithreaded computation by the work-stealing algorithm on an HSMS with cache size $C$. Then, for a multithreaded computation with $T_1$ work (total number of instructions) and $T_\infty$ critical path (longest sequence of dependences), we show the following results for the work-stealing algorithm running on an HSMS.

Lower bounds on the number of cache misses for general computations: We show that there is a family of computations $G_n$ with $T_1 = \Theta(n)$ such that $M_1(C) = 3C$, while even on two processors the number of misses is $M_2(C) = \Omega(n)$.

Upper bounds on the number of cache misses for nested-parallel computations: For a nested-parallel computation, we show that $M_P(C)$ exceeds $M_1(C)$ by at most $2C$ times the number of steals in the $P$-processor execution. We then show that the expected number of steals is $O(\lceil m/s \rceil P T_\infty)$, where $m$ is the time for a cache miss and $s$ is the time for a steal.

Upper bound on the execution time of nested-parallel computations: We show that the expected execution time of a nested-parallel computation on $P$ processors is $O(T_1(C)/P + m \lceil m/s \rceil C T_\infty + (m+s) T_\infty)$, where $T_1(C)$ is the uniprocessor execution time of the computation including cache misses.



Figure 1: The speedup obtained by three different over-relaxation algorithms. The graph plots speedup against the number of processes for work stealing, locality-guided work stealing, and static partitioning, together with a linear-speedup reference.

As in previous work [6, 9], we represent a multithreaded computation as a directed acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested-parallel computation [5, 6] is a race-free computation that can be represented with a series-parallel dag [33]. Nested-parallel computations include computations consisting of parallel loops and fork-and-joins and any nesting of them. This class includes most computations that can be expressed in Cilk [8], and all computations that can be expressed in Nesl [5]. Our results show that nested-parallel computations have much better locality characteristics under work stealing than do general computations. We also briefly consider another class of computations, computations with futures [12, 13, 14, 20, 25], and show that they can be as bad as general computations.

The second part of our results is on further improving the data locality of multithreaded computations with work stealing. In work stealing, a processor steals a thread from a randomly (with uniform distribution) chosen processor when it runs out of work. In certain applications, such as iterative data-parallel applications, random steals may cause poor data locality. Locality-guided work stealing is a heuristic modification to work stealing that allows a thread to have an affinity for a process. In locality-guided work stealing, when a process obtains work it gives priority to a thread that has affinity for the process. Locality-guided work stealing can be used to implement a number of techniques that researchers suggest to improve data locality. For example, the programmer can achieve an initial distribution of work among the processes, or schedule threads based on hints, by appropriately assigning affinities to threads in the computation.
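To make the priority rule concrete, the following is a minimal, illustrative sketch (in C++) of how a process might select its next thread when it obtains work: it first scans its pool for a thread whose affinity matches its own id and otherwise falls back to the ordinary pool order. The Thread record, the affinity field, and the pool representation are assumptions of this sketch, not the authors' implementation.

// Illustrative sketch only: a process's work pool in which threads may carry
// an affinity for a particular process.  When the process obtains work, it
// gives priority to a thread whose affinity matches its own id, as described
// above.  The Thread struct and pool layout are assumptions of this sketch.
#include <deque>
#include <optional>

struct Thread {
    int id;
    int affinity;   // process id this thread prefers, or -1 for none
};

struct WorkPool {
    std::deque<Thread> pool;

    // Pop the next thread for process `self`, preferring affinity matches.
    std::optional<Thread> obtain_work(int self) {
        for (auto it = pool.begin(); it != pool.end(); ++it) {
            if (it->affinity == self) {          // priority: affinity match
                Thread t = *it;
                pool.erase(it);
                return t;
            }
        }
        if (pool.empty()) return std::nullopt;   // caller would now try to steal
        Thread t = pool.back();                  // default: bottom of the pool
        pool.pop_back();
        return t;
    }
};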

Our preliminary experiments with locality-guided work stealing give encouraging results, showing that for certain applications the performance is very close to that of static partitioning in dedicated mode (i.e., when the user can lock down a fixed number of processors), but does not suffer the performance cliff problem [10] in multiprogrammed mode (i.e., when processors might be taken by other users or the OS). Figure 1 shows a graph comparing work stealing, locality-guided work stealing, and static partitioning for a simple over-relaxation algorithm on a 14-processor Sun Ultra Enterprise. The over-relaxation algorithm iterates over a one-dimensional array, performing a 3-point stencil computation on each step. The superlinear speedup for static partitioning and locality-guided work stealing is due to the fact that the data for each run does not fit into the L2 cache of one processor but fits into the collective L2 cache of 6 or more processors. For this benchmark the following can be seen from the graph.

1. Locality-guided work stealing does significantly better than standard work stealing since on each step the cache is prewarmed with the data it needs.

2. Locality-guided work stealing does approximately as well as static partitioning for up to 14 processes.

3. When trying to schedule more than 14 processes on 14 processors, static partitioning has a serious performance drop. The initial drop is due to load imbalance caused by the coarse-grained partitioning. The performance then approaches that of work stealing as the partitioning gets more fine-grained.
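The relaxation step in the benchmark described above is, at its core, a 3-point stencil over a one-dimensional array. The sketch below shows one sequential step; the averaging coefficients and boundary handling are our assumptions, not taken from the paper, and the benchmark partitions this loop among threads on each step (see Section 7).

#include <vector>

// One over-relaxation step: a 3-point stencil over a one-dimensional array.
// The coefficients and fixed boundaries are assumptions made for this sketch;
// the benchmark runs many such steps, partitioned among threads.
void relax_step(const std::vector<double>& in, std::vector<double>& out) {
    if (in.size() < 2 || out.size() != in.size()) return;  // sizes assumed equal
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    out.front() = in.front();   // boundaries left unchanged
    out.back()  = in.back();
}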

We are interested in the performance of work-stealing computations on hardware-controlled shared-memory systems (HSMSs). We model an HSMS as a group of identical processors, each of which has its own cache, together with a single shared memory. Each cache contains $C$ blocks and is managed by the memory subsystem automatically. We allow for a variety of cache organizations and replacement policies, including both direct-mapped and associative caches. We assign a server process to each processor and associate the cache of a processor with the process assigned to that processor. One limitation of our work is that we assume that there is no false sharing.

    2 Related Work

As mentioned in Section 1, there are three main classes of techniques that researchers have suggested to improve the data locality of multithreaded programs. In the first class, the program data is distributed among the nodes of a distributed shared-memory system by the programmer, and a thread in the computation is scheduled on the node that holds the data that the thread accesses [15, 22, 26]. In the second class, data-locality hints supplied by the programmer are used in thread scheduling [15, 31, 34]. Techniques from both classes are employed in distributed shared-memory systems such as COOL and Illinois Concert [15, 22] and are also used to improve the data locality of sequential programs [31]. However, the first class of techniques does not apply directly to HSMSs, because HSMSs do not allow software-controlled distribution of data among the caches. Furthermore, both classes of techniques rely on the programmer to determine the data access patterns in the application and thus may not be appropriate for applications with complex data-access patterns.

The third class of techniques, which is based on the execution of threads that are close in the computation graph on the same process, is applied in many scheduling algorithms including work stealing [1, 9, 19, 23, 26, 28]. Blumofe et al. showed bounds on the number of cache misses in a fully-strict computation executed by the work-stealing algorithm under the dag-consistent distributed shared memory of Cilk [7]. Dag consistency is a relaxed memory-consistency model that is employed in the distributed shared-memory implementation of the Cilk language. In a distributed Cilk application, processes maintain dag consistency by means of the BACKER algorithm. In [7], Blumofe et al. bound the number of shared-memory cache misses in a distributed Cilk application for caches that are maintained with the LRU replacement policy. They assumed that accesses to the shared memory are distributed uniformly and independently, which is not generally true because threads may concurrently access the same pages by algorithm design. Furthermore, they assumed that processes do not generate steal attempts frequently, by making processes do additional page transfers before they attempt to steal from another process.



Figure 2: A dag (directed acyclic graph) for a multithreaded computation. Threads are shown as gray rectangles.


    3 The Model

In this section, we present a graph-theoretic model for multithreaded computations, describe the work-stealing algorithm, define series-parallel and nested-parallel computations, and introduce our model of an HSMS (hardware-controlled shared-memory system).

As with previous work [6, 9] we represent a multithreaded computation as a directed acyclic graph, a dag, of instructions (see Figure 2). Each node in the dag represents an instruction and the edges represent ordering constraints. There are three types of edges: continuation, spawn, and dependency edges. A thread is a sequential ordering of instructions, and the nodes that correspond to the instructions are linked in a chain by continuation edges. A spawn edge represents the creation of a new thread and goes from the node representing the instruction that spawns the new thread to the node representing the first instruction of the new thread. A dependency edge from instruction i of a thread to instruction j of some other thread represents a synchronization between two instructions such that instruction j must be executed after i. We draw spawn and continuation edges with straight arrows and dependency edges with curly arrows throughout this paper. We also show paths with wavy lines.

For a computation with an associated dag G, we define the computational work, $T_1$, as the number of nodes in G and the critical path, $T_\infty$, as the number of nodes on the longest path of G.

Let u and v be any two nodes in a dag. Then we call u an ancestor of v, and v a descendant of u, if there is a path from u to v. Every node is both an ancestor and a descendant of itself. We say that two nodes are relatives if there is a path from one to the other; otherwise we say that the nodes are independent. The children of a node are independent, because otherwise the edge from the node to one child would be redundant. We call a common descendant y of u and v a merger of u and v if the paths from u to y and from v to y have only y in common. We define the depth of a node u as the number of edges on the shortest path from the root node to u. We define the least common ancestor of u and v as the ancestor of both u and v with maximum depth. Similarly, we define the greatest common descendant of u and v as the descendant of both u and v with minimum depth. An edge (u, v) is redundant if there is a path between u and v that does not contain the edge (u, v). The transitive reduction of a dag is the dag with all the redundant edges removed.

In this paper we are only concerned with the transitive reduction of the computational dags. We also require that the dags have a single node with in-degree 0, the root, and a single node with out-degree 0, the final node.

In a multiprocess execution of a multithreaded computation, independent nodes can execute at the same time. If two independent nodes read or modify the same data, we say that they are RR or WW sharing, respectively. If one node is reading and the other is modifying the data, we say they are RW sharing. RW or WW sharing can cause data races, and the output of a computation with such races usually depends on the scheduling of nodes. Such races are typically indicative of a bug [18]. We refer to computations that do not have any RW or WW sharing as race-free computations. In this paper we consider only race-free computations.

The work-stealing algorithm is a thread scheduling algorithm for multithreaded computations. The idea of work stealing dates back to the research of Burton and Sleep [11] and has been studied extensively since then [2, 9, 19, 20, 24, 36, 37]. In the work-stealing algorithm, each process maintains a pool of ready threads and obtains work from its pool. When a process spawns a new thread, the process adds the thread to its pool. When a process runs out of work and finds its pool empty, it chooses a random process as its victim and tries to steal work from the victim's pool.

In our analysis, we imagine the work-stealing algorithm operating on individual nodes in the computation dag rather than on the threads. Consider a multithreaded computation and its execution by the work-stealing algorithm. We divide the execution into discrete time steps such that at each step, each process is either working on a node, which we call the assigned node, or is trying to steal work. The execution of a node takes 1 time step if the node does not incur a cache miss and $m$ steps otherwise. We say that a node is executed at the time step that a process completes executing the node. The execution time of a computation is the number of time steps that elapse between the time step at which a process starts executing the root node and the time step at which the final node is executed. The execution schedule specifies the activity of each process at each time step.

During the execution, each process maintains a deque (doubly ended queue) of ready nodes; we call the ends of a deque the top and the bottom. When a node u is executed, it enables some other node v if u is the last parent of v that is executed. We call the edge (u, v) an enabling edge and u the designated parent of v. When a process executes a node that enables other nodes, one of the enabled nodes becomes the assigned node and the process pushes the rest onto the bottom of its deque. If no node is enabled, then the process obtains work from its deque by removing a node from the bottom of the deque. If a process finds its deque empty, it becomes a thief and steals from a randomly chosen process, the victim. This is a steal attempt and takes at least $s$ and at most $ks$ time steps, for some constant $k \ge 1$, to complete. A thief process might make multiple steal attempts before succeeding, or might never succeed. When a steal succeeds, the thief process starts working on the stolen node at the step following the completion of the steal. We say that a steal attempt occurs at the step it completes.
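A minimal sketch of the per-process behavior just described is given below, under assumptions of our own: work is taken from the bottom of a process's own deque, newly enabled nodes are pushed onto the bottom, and a process with an empty deque makes a steal attempt on the top of a uniformly chosen victim's deque. The timing model (the $s$-step steal cost and the $m$-step miss cost) is ignored, a single mutex protects each deque, and all names are hypothetical; this is not the authors' implementation.

// Sketch of the work-stealing loop described above (assumptions: a Node type,
// one mutex-protected deque per process; timing and termination are elided).
#include <deque>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

struct Node { int id; };

struct ProcessDeque {
    std::deque<Node> dq;            // bottom = back, top = front
    std::mutex       mtx;

    void push_bottom(Node n) { std::lock_guard<std::mutex> g(mtx); dq.push_back(n); }
    std::optional<Node> pop_bottom() {
        std::lock_guard<std::mutex> g(mtx);
        if (dq.empty()) return std::nullopt;
        Node n = dq.back(); dq.pop_back(); return n;
    }
    std::optional<Node> steal_top() {
        std::lock_guard<std::mutex> g(mtx);
        if (dq.empty()) return std::nullopt;
        Node n = dq.front(); dq.pop_front(); return n;
    }
};

// One scheduling decision for process `self`: use the local deque if possible,
// otherwise make a steal attempt on a uniformly chosen victim (which may fail).
std::optional<Node> next_node(std::vector<ProcessDeque>& deques, int self,
                              std::mt19937& rng) {
    if (auto n = deques[self].pop_bottom()) return n;        // local work
    std::uniform_int_distribution<int> pick(0, (int)deques.size() - 1);
    int victim = pick(rng);
    if (victim == self) return std::nullopt;                 // failed attempt
    return deques[victim].steal_top();                       // may also fail
}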

The work-stealing algorithm can be implemented in various ways. We say that an implementation of work stealing is deterministic if, whenever a process enables other nodes, the implementation always chooses the same node as the assigned node for the next step on that process, and the remaining nodes are always placed in the deque in the same order. This must be true for both multiprocess and uniprocess executions. We refer to a deterministic implementation of the work-stealing algorithm together with the HSMS that runs the implementation as a work stealer. For brevity, we refer to an execution of a multithreaded computation with a work stealer as an execution. We define the total work as the number of steps taken by a uniprocess execution, including the cache misses, and denote it by $T_1(C)$, where $C$ is the cache size. We denote the number of cache misses in a $P$-process execution with $C$-block caches as $M_P(C)$. We define the cache overhead of a $P$-process execution as $M_P(C) - M_1(C)$, where $M_1(C)$ is the number of misses in the uniprocess execution on the same work stealer.

We refer to a multithreaded computation for which the transitive reduction of the corresponding dag is series-parallel [33] as a series-parallel computation.



Figure 3: Illustrates the recursive definition of series-parallel dags. Figure (a) is the base case, figure (b) depicts the serial composition, and figure (c) depicts the parallel composition.

A series-parallel dag $G(V, E)$ is a dag with two distinguished vertices, a source $s \in V$ and a sink $t \in V$, and can be defined recursively as follows (see Figure 3).

    Base: G consists of a single edge connecting s to t .

Series Composition: G consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets, such that s is the source of $G_1$, u is the sink of $G_1$ and the source of $G_2$, and t is the sink of $G_2$. Moreover, $V_1 \cap V_2 = \{u\}$.

Parallel Composition: The graph consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets, such that s and t are the source and the sink of both $G_1$ and $G_2$. Moreover, $V_1 \cap V_2 = \{s, t\}$.

A nested-parallel computation is a race-free series-parallel computation [6].
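The recursive definition above maps directly onto a recursive datatype. The sketch below (names and representation are our own, not from the paper) describes a series-parallel dag by its recursive structure as an expression tree of base edges, series compositions, and parallel compositions; expanding such a tree into explicit vertices and edges recovers $G(V, E)$.

// A series-parallel dag described by its recursive structure, mirroring the
// definition above.  Names and representation are assumptions of this sketch.
#include <memory>

struct SPDag {
    enum class Kind { Base, Series, Parallel } kind;
    std::unique_ptr<SPDag> g1, g2;   // children for Series/Parallel, null for Base

    static std::unique_ptr<SPDag> base() {
        return std::unique_ptr<SPDag>(new SPDag{Kind::Base, nullptr, nullptr});
    }
    static std::unique_ptr<SPDag> series(std::unique_ptr<SPDag> a,
                                         std::unique_ptr<SPDag> b) {
        return std::unique_ptr<SPDag>(new SPDag{Kind::Series, std::move(a), std::move(b)});
    }
    static std::unique_ptr<SPDag> parallel(std::unique_ptr<SPDag> a,
                                           std::unique_ptr<SPDag> b) {
        return std::unique_ptr<SPDag>(new SPDag{Kind::Parallel, std::move(a), std::move(b)});
    }
};

// Example: a single fork-join, i.e. two parallel strands composed in series
// with a trailing edge:
//   auto g = SPDag::series(SPDag::parallel(SPDag::base(), SPDag::base()),
//                          SPDag::base());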

We also consider multithreaded computations that use futures [12, 13, 14, 20, 25]. The dag structures of computations with futures are defined elsewhere [4]. This is a superclass of nested-parallel computations, but still much more restrictive than general computations. The work-stealing algorithm for futures is a restricted form of the work-stealing algorithm, where a process starts executing a newly created thread immediately, putting its assigned thread onto its deque.

In our analysis, we consider several cache organizations and replacement policies for an HSMS. We model a cache as a set of (cache) lines, each of which can hold the data belonging to a memory block (a consecutive, typically small, region of memory). One instruction can operate on at most one memory block. We say that an instruction accesses a block, or the line that contains the block, when the instruction reads or modifies the block. We say that an instruction overwrites a line that contains the block b when the instruction accesses some other block that replaces b in the cache. We say that a cache replacement policy is simple if it satisfies two conditions. First, the policy is deterministic. Second, whenever the policy decides to overwrite a cache line l, it makes the decision to overwrite l using only information pertaining to the accesses that are made after the last access to l. We refer to a cache managed with a simple cache-replacement policy as a simple cache. Simple caches and replacement policies are common in practice. For example, the least-recently-used (LRU) replacement policy, direct-mapped caches, and set-associative caches where each set is maintained by a simple cache-replacement policy are all simple.
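As a concrete instance of a simple cache, the sketch below counts the misses incurred by a sequence of block accesses in a fully associative cache of $C$ lines under LRU replacement; this is the kind of per-cache miss count that $M_1(C)$ and $M_P(C)$ tally. The block-id representation and the fully associative organization are assumptions made for this sketch.

// Count cache misses for a sequence of memory-block accesses in a fully
// associative C-line cache with LRU replacement (a "simple" policy).
// Block ids and the fully associative organization are assumptions here.
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

std::size_t count_misses(const std::vector<int>& accesses, std::size_t C) {
    if (C == 0) return accesses.size();               // degenerate cache
    std::list<int> lru;                                // front = most recently used
    std::unordered_map<int, std::list<int>::iterator> where;
    std::size_t misses = 0;
    for (int block : accesses) {
        auto it = where.find(block);
        if (it != where.end()) {                       // hit: move to front
            lru.erase(it->second);
        } else {                                       // miss: maybe evict LRU line
            ++misses;
            if (lru.size() == C) {
                where.erase(lru.back());
                lru.pop_back();
            }
        }
        lru.push_front(block);
        where[block] = lru.begin();
    }
    return misses;
}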

With regard to the definition of RW or WW sharing, we assume that reads and writes pertain to the whole block. This means we do not allow for false sharing, where two processes accessing different portions of a block invalidate the block in each other's caches. In practice, false sharing is an issue, but it can often be avoided with knowledge of the underlying memory system and by appropriately padding the shared data to prevent two processes from accessing different portions of the same block.
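The padding remedy mentioned above is commonly expressed by aligning each process's portion of the shared data to a block boundary, so that two processes never touch different portions of the same block. The 64-byte line size below is an assumed, machine-dependent constant, not something the paper specifies.

// Padding per-process data to an assumed 64-byte block boundary so that two
// processes never access different portions of the same block (no false sharing).
#include <cstddef>

constexpr std::size_t kAssumedLineBytes = 64;

struct alignas(kAssumedLineBytes) PerProcessCounter {
    long value = 0;
    // sizeof(PerProcessCounter) is rounded up to a multiple of the alignment,
    // so each array element occupies its own block.
};

PerProcessCounter counters[16];   // one padded slot per process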

Figure 4: The structure of the dag of a computation with a large cache overhead.

    4 General Computations

In this section, we show that the cache overhead of a multiprocess execution of a general computation, and of a computation with futures, can be large even though the uniprocess execution incurs a small number of misses.

Theorem 1 There is a family of computations $\{G_n : n = kC \text{ for } k \in \mathbb{Z}^+\}$ with $O(n)$ computational work, whose uniprocess execution incurs $3C$ misses while any 2-process execution of the computation incurs $\Omega(n)$ misses on a work stealer with a cache size of $C$, assuming that $S = O(C)$, where $S$ is the maximum steal time.

Proof: Figure 4 shows the structure of the dag $G_{4C}$ for $n = 4C$. Each node except the root node represents a sequence of $C$ instructions accessing a set of $C$ distinct memory blocks. The root node represents $C + S$ instructions that access $C$ distinct memory blocks. The graph has two symmetric components, $L_{4C}$ and $R_{4C}$, which correspond to the left and the right subtrees of the root, excluding the leaves. We partition the nodes in $G_{4C}$ into three classes such that all nodes in a class access the same memory blocks while nodes from different classes access mutually disjoint sets of memory blocks. The first class contains the root node only, the second class contains all the nodes in $L_{4C}$, and the third class contains the rest of the nodes, which are the nodes in $R_{4C}$ and the leaves of $G_{4C}$. For general $n = kC$, $G_n$ can be partitioned similarly into $L_n$, $R_n$, the $k$ leaves of $G_n$, and the root. Each of $L_n$ and $R_n$ contains $2\lceil k/2 \rceil - 1$ nodes and has the structure of a complete binary tree, with the additional $k$ leaves at the lowest level. There is a dependency edge from the leaves of both $L_n$ and $R_n$ to the leaves of $G_n$.

Consider a work stealer that executes the nodes of $G_n$ in the order that they are numbered in a uniprocess execution. In the uniprocess execution, no node in $L_n$ incurs a cache miss except the root of $L_n$, since all nodes in $L_n$ access the same memory blocks as the root of $L_n$. The same argument holds for $R_n$ and the $k$ leaves of $G_n$. Hence the execution of the nodes in $L_n$, $R_n$, and the leaves causes $2C$ misses. Since the root node causes $C$ misses, the total number of misses in the uniprocess execution is $3C$. Now, consider a 2-process execution with the same work stealer and call the processes process 0 and process 1. At time step 1, process 0 starts executing the root node, which enables the root of $R_n$ no later than time step $m$. Since process 1 starts stealing immediately and there are no other processes to steal from, process 1 steals and starts working on the root of $R_n$ no later than time step $m + S$. Hence, the root of $R_n$ executes before the root of $L_n$, and thus all the nodes in $R_n$ execute before the corresponding symmetric nodes in $L_n$. Therefore, for any leaf of $G_n$, the parent that is in $R_n$ executes before the parent in $L_n$. Therefore a leaf node of $G_n$ is executed immediately after its parent in $L_n$ and thus causes $C$ cache misses. Thus, the total number of cache misses is $\Omega(kC) = \Omega(n)$.


Figure 5: The structure of the dag of a computation with futures that can incur a large cache overhead.

There exist computations, similar to the computation in Figure 4, that generalize Theorem 1 to an arbitrary number of processes by making sure that all the processes but 2 steal throughout any multiprocess execution. Even in the general case, however, where the average parallelism is higher than the number of processes, Theorem 1 can be generalized with the same bound on the expected number of cache misses by exploiting the symmetry in $G_n$ and by assuming a symmetrically distributed steal time. With a symmetrically distributed steal time, a steal that takes some number of steps more than the mean steal time is equally likely to happen as a steal that takes that many steps less than the mean. Theorem 1 holds for computations with futures as well. Multithreaded computing with futures is a fairly restricted form of multithreaded computing compared to computing with events such as synchronization variables. The graph $F$ in Figure 5 shows the structure of a dag whose 2-process execution causes a large number of cache misses. In a 2-process execution of $F$, the enabling parents of the leaf nodes in the right subtree of the root are in the left subtree, and therefore the execution of each such leaf node causes $C$ misses.

    5 Nested-Parallel Computations

In this section, we show that the cache overhead of an execution of a nested-parallel computation with a work stealer is at most twice the product of the number of steals and the cache size. Our proof has two steps. First, we show that the cache overhead is bounded by the product of the cache size and the number of nodes that are executed "out of order" with respect to the uniprocess execution order. Second, we prove that the number of such out-of-order executions is at most twice the number of steals.

Consider a computation G, its $P$-process execution $X_P$ with a work stealer, and the uniprocess execution $X_1$ with the same work stealer. Let v be a node in G and let u be the node that executes immediately before v in $X_1$. Then we say that v is drifted in $X_P$ if node u is not executed immediately before v by the process that executes v in $X_P$.

Lemma 2 establishes a key property of an execution with simple caches.

Lemma 2 Consider a process with a simple cache of $C$ blocks. Let $X_1$ denote the execution of a sequence of instructions on the process starting with cache state $S_1$, and let $X_2$ denote the execution of the same sequence of instructions starting with cache state $S_2$. Then $X_1$ incurs at most $C$ more misses than $X_2$.

Proof: We construct a one-to-one mapping between the cache lines in $X_1$ and $X_2$ such that an instruction that accesses a line $l_1$ in $X_1$ accesses the line $l_2$ in $X_2$ if and only if $l_1$ is mapped to $l_2$. Consider $X_1$ and let $l_1$ be a cache line. Let i be the first instruction that accesses or overwrites $l_1$. Let $l_2$ be the cache line that the same instruction accesses or overwrites in $X_2$, and map $l_1$ to $l_2$. Since the caches are simple, an instruction that overwrites $l_1$ in $X_1$ overwrites $l_2$ in $X_2$. Therefore the number of misses that overwrite $l_1$ in $X_1$ is equal to the number of misses that overwrite $l_2$ in $X_2$ after instruction i. Since i itself can cause 1 miss, the number of misses that overwrite $l_1$ in $X_1$ is at most 1 more than the number of misses that overwrite $l_2$ in $X_2$. We construct the mapping for each cache line in $X_1$ in the same way. Now, let us show that the mapping is one-to-one. For the sake of contradiction, assume that two cache lines, $l_1$ and $l_2$, in $X_1$ map to the same line in $X_2$. Let $i_1$ and $i_2$ be the first instructions accessing these cache lines in $X_1$, such that $i_1$ is executed before $i_2$. Since $i_1$ and $i_2$ map to the same line in $X_2$ and the caches are simple, $i_2$ accesses the line that $i_1$ accesses in $X_1$, but then $l_1 = l_2$, a contradiction. Hence, the total number of cache misses in $X_1$ is at most $C$ more than the misses in $X_2$.

Theorem 3 Let $D$ denote the total number of drifted nodes in an execution of a nested-parallel computation with a work stealer on $P$ processes, each of which has a simple cache of $C$ blocks. Then the cache overhead of the execution is at most $CD$.

Proof: Let $X_P$ denote the $P$-process execution and let $X_1$ be the uniprocess execution of the same computation with the same work stealer. We divide the multiprocess computation into $D$ pieces, each of which can incur at most $C$ more misses than in the uniprocess execution. Let u be a drifted node and let q be the process that executes u. Let v be the next drifted node executed on q (or the final node of the computation). Let the ordered set O represent the execution order of all the nodes that are executed after u (u is included) and before v (v is excluded if it is drifted, included otherwise) on q in $X_P$. Then the nodes in O are executed on the same process and in the same order in both $X_1$ and $X_P$.

Now consider the number of cache misses during the execution of the nodes in O in $X_1$ and $X_P$. Since the computation is nested-parallel and therefore race-free, a process that executes in parallel with q does not cause q to incur cache misses due to sharing. Therefore, by Lemma 2, during the execution of the nodes in O the number of cache misses in $X_P$ is at most $C$ more than the number of misses in $X_1$. This bound holds for each of the $D$ sequences of such instructions O corresponding to the $D$ drifted nodes. Since the sequence starting at the root node and ending at the first drifted node incurs the same number of misses in $X_1$ and $X_P$, $X_P$ incurs at most $CD$ more misses than $X_1$, and the cache overhead is at most $CD$.

Lemma 2 (and thus Theorem 3) does not hold for caches that are not simple. For example, consider the execution of a sequence of instructions on a cache with the least-frequently-used replacement policy, starting at two different cache states. In the first cache state, the blocks that are frequently accessed by the instructions are in the cache with high frequencies, whereas in the second cache state, the blocks that are in the cache are not accessed by the instructions and have low frequencies. The execution starting with the second cache state, therefore, incurs many more misses than the size of the cache compared to the execution starting with the first cache state.

Now we show that the number of drifted nodes in an execution of a series-parallel computation with a work stealer is at most twice the number of steals. The proof is based on the representation of series-parallel computations as sp-dags. We call a node with out-degree of at least 2 a fork node and partition the nodes of an sp-dag, except the root, into three categories: join nodes, stable nodes, and nomadic nodes. We call a node that has in-degree of at least 2 a join node, and we partition the nodes that have in-degree 1 into two classes: a nomadic node has a parent that is a fork node, and a stable node has a parent that has out-degree 1. The root node has in-degree 0 and does not belong to any of these categories. Lemma 4 lists two fundamental properties of sp-dags; one can prove both properties by induction on the number of edges in an sp-dag.



    Figure 6: Children of s and their merger.


Figure 7: The joint embedding of u and v.


Lemma 4 Let G be an sp-dag. Then G has the following properties.

    1. The least common ancestor of any two nodes in G is unique.

    2. The greatest common descendant of any two nodes in G isunique and is equal to their unique merger.

    Lemma 5 Let s be a fork node. Then no child of s is a join node.

Proof: Let u and v denote two children of s and suppose that u is a join node, as in Figure 6. Let t denote some other parent of u, and let z denote the unique merger of u and v. Then both z and u are mergers for s and t, which contradicts Lemma 4. Hence u is not a join node.

    Corollary 6 Only nomadic nodes can be stolen in an execution of a series-parallel computation by the work-stealing algorithm.

Proof: Let u be a stolen node in an execution. Then u was pushed onto a deque, and thus the enabling parent of u is a fork node. By Lemma 5, u is not a join node and has in-degree 1. Therefore u is nomadic.

Consider a series-parallel computation and let G be its sp-dag. Let u and v be two independent nodes in G, and let s and t denote their least common ancestor and greatest common descendant, respectively, as shown in Figure 7. Let $G_1$ denote the graph that is induced by the relatives of u that are descendants of s and also ancestors of t. Similarly, let $G_2$ denote the graph that is induced by the relatives of v that are descendants of s and ancestors of t. Then we call $G_1$ the embedding of u with respect to v, and $G_2$ the embedding of v with respect to u. We call the graph that is the union of $G_1$ and $G_2$ the joint embedding of u and v, with source s and sink t. Now consider an execution of G, and let y and z be the children of s such that y is executed before z. Then we call y the leader and z the guard of the joint embedding.

Figure 8: The join node s is the least common ancestor of y and z. Nodes u and v are the children of s.

Lemma 7 Let $G(V, E)$ be an sp-dag and let y and z be two parents of a join node t in G. Let $G_1$ denote the embedding of y with respect to z and $G_2$ the embedding of z with respect to y, and let s denote the source and t the sink of the joint embedding. Then the parents of any node in $G_1$, except for s and t, are in $G_1$, and the parents of any node in $G_2$, except for s and t, are in $G_2$.

Proof: Since y and z are independent, both s and t are different from y and z (see Figure 8). First, we show that there is no edge that starts at a node in $G_1$ other than s and ends at a node in $G_2$ other than t, and vice versa. For the sake of contradiction, assume there is an edge $(m, n)$ such that $m \ne s$ is in $G_1$ and $n \ne t$ is in $G_2$. Then m is a least common ancestor of y and z; but the least common ancestor of y and z is unique, by Lemma 4, and equal to s, while $m \ne s$. Hence no such $(m, n)$ exists. A similar argument holds when m is in $G_2$ and n is in $G_1$.

Second, we show that there does not exist an edge that originates from a node outside of $G_1$ and $G_2$ and ends at a node in $G_1$ or $G_2$. For the sake of contradiction, let $(w, x)$ be an edge such that x is in $G_1$ and w is in neither $G_1$ nor $G_2$. Then x is the unique merger for the two children of the least common ancestor of w and s, which we denote by r. But then t is also a merger for the children of r. The children of r are independent and have a unique merger; hence there is no such edge $(w, x)$. A similar argument holds when x is in $G_2$. Therefore we conclude that the parents of any node in $G_1$ except for s and t are in $G_1$, and the parents of any node in $G_2$ except for s and t are in $G_2$.

Lemma 8 Let G be an sp-dag and let y and z be two parents of a join node t in G. Consider the joint embedding of y and z and let u be the guard node of the embedding. Then y and z are executed in the same respective order in a multiprocess execution as they are executed in the uniprocess execution if the guard node u is not stolen.

Proof: Let s be the source, t the sink, and v the leader of the joint embedding. Since u is not stolen, v is not stolen. Hence, by Lemma 7, before it starts working on u, the process that executes s has executed v and all of its descendants in the embedding except for t. Hence, z is executed before u and y is executed after u, as in the uniprocess execution. Therefore, y and z are executed in the same respective order as they execute in the uniprocess execution.

Lemma 9 A nomadic node is drifted in an execution only if it is stolen.

Proof: Let u be a nomadic and drifted node. Then, by Lemma 5, u has a single parent s that enables u. If u is the first child of s to execute in the uniprocess execution, then u is not drifted in the multiprocess execution. Hence, u is not the first child to execute. Let v be the last child of s that is executed before u in the uniprocess execution. Now, consider the multiprocess execution and let q be the process that executes v. For the sake of contradiction, assume that u is not stolen. Consider the joint embedding of u and v, as shown in Figure 8. Since all parents of the nodes in $G_2$ except for s and t are in $G_2$ by Lemma 7, q executes all the nodes in $G_2$ before it executes u, and thus z precedes u on q. But then u is not drifted, because z is the node that is executed immediately before u in the uniprocess computation. Hence u is stolen.


Figure 9: Nodes $t_1$ and $t_2$ are two join nodes with the common guard u.


Let us define the cover of a join node t in an execution as the set of all the guard nodes of the joint embeddings of all possible pairs of parents of t in the execution. The following lemma shows that a join node is drifted only if a node in its cover is stolen.

Lemma 10 A join node is drifted in an execution only if a node in its cover is stolen in the execution.

Proof: Consider the execution and let t be a join node that is drifted. Assume, for the sake of contradiction, that no node in the cover of t, $C(t)$, is stolen. Let y and z be any two parents of t, as in Figure 8. Then y and z are executed in the same order as in the uniprocess execution, by Lemma 8. But then all parents of t execute in the same order as in the uniprocess execution. Hence, the enabling parent of t in the execution is the same as in the uniprocess execution. Furthermore, the enabling parent of t has out-degree 1, because otherwise t would not be a join node, by Lemma 5, and thus the process that enables t executes t. Therefore, t is not drifted, a contradiction. Hence a node in the cover of t is stolen.

Lemma 11 The number of drifted nodes in an execution of a series-parallel computation is at most twice the number of steals in the execution.

Proof: We associate each drifted node in the execution with a steal such that no steal has more than 2 drifted nodes associated with it. Consider a drifted node u. Then u is not the root node of the computation, and it is not stable either. Hence, u is either a nomadic or a join node. If u is nomadic, then u is stolen, by Lemma 9, and we associate u with the steal that steals u. Otherwise, u is a join node and there is a node in its cover $C(u)$ that is stolen, by Lemma 10; we associate u with the steal that steals a node in its cover. Now, assume there are more than 2 nodes associated with a steal that steals node u. Then there are at least two join nodes $t_1$ and $t_2$ that are associated with u. Therefore, node u is in the joint embedding of two parents of $t_1$ and also of two parents of $t_2$. Let $x_1$, $y_1$ be these parents of $t_1$ and $x_2$, $y_2$ be the parents of $t_2$, as shown in Figure 9. But then u has a parent that is a fork node and is itself a join node, which contradicts Lemma 5. Hence no such u exists.

Theorem 12 The cache overhead of an execution of a nested-parallel computation with simple caches is at most twice the product of the number of steals in the execution and the cache size.

    Proof: Follows from Theorem 3 and Lemma 11.

    6 An Analysis of Nonblocking Work Stealing

The non-blocking implementation of the work-stealing algorithm delivers provably good performance under traditional and multiprogrammed workloads. A description of the implementation and its analysis is presented in [2]; an experimental evaluation is given in [10]. In this section, we extend the analysis of the non-blocking work-stealing algorithm for classical workloads and bound the execution time of a nested-parallel computation with a work stealer to include the number of cache misses, the cache-miss penalty, and the steal time. First, we bound the number of steal attempts in an execution of a general computation by the work-stealing algorithm. Then we bound the execution time of a nested-parallel computation with a work stealer using results from Section 5. The analysis that we present here is similar to the analysis given in [2] and uses the same potential-function technique.

We associate a nonnegative potential with the nodes in a computation's dag and show that the potential decreases as the execution proceeds. We assume that a node in a computation dag has out-degree at most 2. This is consistent with the assumption that each node represents one instruction. Consider an execution of a computation with dag $G(V, E)$ by the work-stealing algorithm. The execution grows a tree, the enabling tree, that contains each node in the computation and its enabling edge. We define the distance of a node $u \in V$ as $d(u) = T_\infty - \mathrm{depth}(u)$, where $\mathrm{depth}(u)$ is the depth of u in the enabling tree of the computation. Intuitively, the distance of a node indicates how far the node is from the end of the computation. We define the potential function in terms of distances. At any given step i, we assign a positive potential to each ready node; all other nodes have 0 potential. A node is ready if it is enabled and not yet executed to completion. Let u denote a ready node at time step i. Then we define $\phi_i(u)$, the potential of u at time step i, as

$$\phi_i(u) = \begin{cases} 3^{2d(u)-1} & \text{if } u \text{ is assigned;} \\ 3^{2d(u)} & \text{otherwise.} \end{cases}$$

The potential at step i, $\Phi_i$, is the sum of the potentials of the ready nodes at step i. When an execution begins, the only ready node is the root node, which has distance $T_\infty$ and is assigned to some process, so we start with $\Phi_0 = 3^{2T_\infty - 1}$. As the execution proceeds, nodes that are deeper in the dag become ready and the potential decreases. There are no ready nodes at the end of an execution, and the potential is 0.

Let us give a few more definitions that enable us to associate a potential with each process. Let $R_i(q)$ denote the set of ready nodes that are in the deque of process q, along with q's assigned node, if any, at the beginning of step i. We say that each node u in $R_i(q)$ belongs to process q. Then we define the potential of q's deque as

$$\Phi_i(q) = \sum_{u \in R_i(q)} \phi_i(u).$$

In addition, let $A_i$ denote the set of processes whose deque is empty at the beginning of step i, and let $D_i$ denote the set of all other processes. We partition the potential $\Phi_i$ into two parts,

$$\Phi_i = \Phi_i(A_i) + \Phi_i(D_i), \qquad \text{where} \qquad \Phi_i(A_i) = \sum_{q \in A_i} \Phi_i(q) \quad \text{and} \quad \Phi_i(D_i) = \sum_{q \in D_i} \Phi_i(q),$$

and we analyze the two parts separately.


Lemma 13 lists four basic properties of the potential that we use frequently. The proofs of these properties are given in [2], and the properties hold independently of the time that the execution of a node or a steal takes. Therefore, we give only a short proof sketch.

Lemma 13 The potential function satisfies the following properties.

1. Suppose node u is assigned to a process at step i. Then the potential decreases by at least $(2/3)\phi_i(u)$.

2. Suppose node u is executed at step i. Then the potential decreases by at least $(5/9)\phi_i(u)$ at step i.

3. Consider any step i and any process q in $D_i$. The topmost node u in q's deque contributes at least $3/4$ of the potential associated with q. That is, we have $\phi_i(u) \ge (3/4)\Phi_i(q)$.

4. Suppose a process p chooses process q in $D_i$ as its victim at time step i (a steal attempt of p targeting q occurs at step i). Then the potential decreases by at least $(1/2)\Phi_i(q)$ due to the assignment or execution of a node belonging to q at the end of step i.

Property 1 follows directly from the definition of the potential function. Property 2 holds because a node enables at most two children with smaller potential, one of which becomes assigned. Specifically, the potential after the execution of node u decreases by at least $\phi_i(u)(1 - \tfrac{1}{3} - \tfrac{1}{9}) = \tfrac{5}{9}\phi_i(u)$. Property 3 follows from a structural property of the nodes in a deque: the distances of the nodes in a process's deque decrease monotonically from the top of the deque to the bottom. Therefore, the potential in the deque is the sum of geometrically decreasing terms and is dominated by the potential of the top node. The last property holds because when a process chooses process q in $D_i$ as its victim, the node u at the top of q's deque is assigned at the next step. Therefore, the potential decreases by at least $(2/3)\phi_i(u)$ by property 1. Moreover, $\phi_i(u) \ge (3/4)\Phi_i(q)$ by property 3, and the result follows.

Lemma 16 bounds the number of steal attempts by showing that the potential decreases as the computation proceeds. The proof of Lemma 16 utilizes the balls-and-weighted-bins bound from Lemma 14.

Lemma 14 (Balls and Weighted Bins) Suppose that at least $P$ balls are thrown independently and uniformly at random into $P$ bins, where bin $i$ has a weight $W_i$, for $i = 1, \ldots, P$. The total weight is $W = \sum_{i=1}^{P} W_i$. For each bin $i$, define the random variable $X_i$ as

$$X_i = \begin{cases} W_i & \text{if some ball lands in bin } i; \\ 0 & \text{otherwise.} \end{cases}$$

If $X = \sum_{i=1}^{P} X_i$, then for any $\beta$ in the range $0 < \beta < 1$, we have $\Pr\{X \ge \beta W\} > 1 - 1/((1-\beta)e)$.

This lemma can be proven with an application of Markov's inequality. The proof of a weaker version of this lemma, for the case of exactly $P$ throws, is similar and is given in [2]. Lemma 14 also follows from the weaker lemma because $X$ does not decrease with more throws.
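For completeness, here is a sketch of that Markov step, reconstructed by us (it is not the proof from [2]): with at least $P$ independent, uniform throws, each bin is hit with probability at least $1 - (1 - 1/P)^P \ge 1 - 1/e$, and the claim follows by bounding the weight that is missed.

% Our reconstruction of the Markov's-inequality step for Lemma 14.
\begin{align*}
\mathbb{E}[X_i] &\ge W_i (1 - 1/e), \quad\text{so}\quad \mathbb{E}[W - X] \le W/e, \\
\Pr\{X < \beta W\} &= \Pr\{W - X > (1-\beta)W\}
   \le \frac{\mathbb{E}[W - X]}{(1-\beta)W} \le \frac{1}{(1-\beta)e},
\end{align*}

which gives $\Pr\{X \ge \beta W\} \ge 1 - 1/((1-\beta)e)$.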

We now show that whenever $P$ or more steal attempts occur, the potential decreases by a constant fraction of $\Phi_i(D_i)$ with constant probability.

Lemma 15 Consider any step $i$ and any later step $j$ such that at least $P$ steal attempts occur at steps from $i$ (inclusive) to $j$ (exclusive). Then we have

$$\Pr\left\{\Phi_i - \Phi_j \ge \tfrac{1}{4}\Phi_i(D_i)\right\} > \tfrac{1}{4}.$$

Moreover, the potential decrease is due to the execution or assignment of nodes belonging to a process in $D_i$.

Proof: Consider all $P$ processes and the $P$ steal attempts that occur at or after step $i$. For each process $q$ in $D_i$, if one or more of the $P$ attempts target $q$ as the victim, then the potential decreases by at least $(1/2)\Phi_i(q)$ due to the execution or assignment of nodes that belong to $q$, by property 4 of Lemma 13. If we think of each attempt as a ball toss, then we have an instance of the Balls and Weighted Bins Lemma (Lemma 14). For each process $q$ in $D_i$, we assign a weight $W_q = (1/2)\Phi_i(q)$, and for each other process $q$ in $A_i$, we assign a weight $W_q = 0$. The weights sum to $W = (1/2)\Phi_i(D_i)$. Using $\beta = 1/2$ in Lemma 14, we conclude that the potential decreases by at least $\beta W = (1/4)\Phi_i(D_i)$ with probability greater than $1 - 1/((1-\beta)e) > 1/4$, due to the execution or assignment of nodes that belong to a process in $D_i$.

We now bound the number of steal attempts in a work-stealing computation.

Lemma 16 Consider a $P$-process execution of a multithreaded computation with the work-stealing algorithm. Let $T_1$ and $T_\infty$ denote the computational work and the critical path of the computation. Then the expected number of steal attempts in the execution is $O(\lceil m/s \rceil P T_\infty)$. Moreover, for any $\varepsilon > 0$, the number of steal attempts is $O(\lceil m/s \rceil P (T_\infty + \lg(1/\varepsilon)))$ with probability at least $1 - \varepsilon$.

Proof: We analyze the number of steal attempts by breaking the execution into phases of $\lceil m/s \rceil P$ steal attempts. We show that with constant probability, a phase causes the potential to drop by a constant factor. The first phase begins at step $t_1 = 1$ and ends at the first step $t_1'$ such that at least $\lceil m/s \rceil P$ steal attempts occur during the interval of steps $[t_1, t_1']$. The second phase begins at step $t_2 = t_1' + 1$, and so on. Let us first show that there are at least $m$ steps in a phase. A process has at most 1 outstanding steal attempt at any time, and a steal attempt takes at least $s$ steps to complete. Therefore, at most $P$ steal attempts occur in a period of $s$ time steps. Hence a phase of steal attempts takes at least $\lceil (\lceil m/s \rceil P) / P \rceil\, s \ge m$ time units.

Consider a phase beginning at step $i$, and let $j$ be the step at which the next phase begins. Then $i + m \le j$. We will show that we have $\Pr\{\Phi_j \le (3/4)\Phi_i\} > 1/4$. Recall that the potential can be partitioned as $\Phi_i = \Phi_i(A_i) + \Phi_i(D_i)$. Since the phase contains $\lceil m/s \rceil P$ steal attempts, $\Pr\{\Phi_i - \Phi_j \ge (1/4)\Phi_i(D_i)\} > 1/4$, due to execution or assignment of nodes that belong to a process in $D_i$, by Lemma 15. Now we show that the potential also drops by a constant fraction of $\Phi_i(A_i)$, due to the execution of assigned nodes that are assigned to the processes in $A_i$. Consider a process, say $q$, in $A_i$. If $q$ does not have an assigned node, then $\Phi_i(q) = 0$. If $q$ has an assigned node $u$, then $\Phi_i(q) = \phi_i(u)$. In this case, process $q$ completes executing node $u$ at step $i + m - 1 < j$ at the latest, and the potential drops by at least $(5/9)\phi_i(u)$, by property 2 of Lemma 13. Summing over each process $q$ in $A_i$, we have $\Phi_i - \Phi_j \ge (5/9)\Phi_i(A_i)$. Thus, we have shown that the potential decreases by at least a quarter of $\Phi_i(A_i)$ and of $\Phi_i(D_i)$. Therefore, no matter how the total potential is distributed over $A_i$ and $D_i$, the total potential decreases by a quarter with probability more than $1/4$; that is, $\Pr\{\Phi_i - \Phi_j \ge (1/4)\Phi_i\} > 1/4$.

We say that a phase is successful if it causes the potential to drop by at least a $1/4$ fraction. A phase is successful with probability at least $1/4$. Since the potential starts at $\Phi_0 = 3^{2T_\infty - 1}$ and ends at 0 (and is always an integer), the number of successful phases is at most $(2T_\infty - 1)\log_{4/3} 3 < 8 T_\infty$. The expected number of phases needed to obtain $8T_\infty$ successful phases is at most $32 T_\infty$. Thus, the expected number of phases is $O(T_\infty)$, and because each phase contains $\lceil m/s \rceil P$ steal attempts, the expected number of steal attempts is $O(\lceil m/s \rceil P T_\infty)$. The high-probability bound follows by an application of the Chernoff bound.


Theorem 17 Let $M_P(C)$ be the number of cache misses in a $P$-process execution of a nested-parallel computation with a work stealer that has simple caches of $C$ blocks each, and let $M_1(C)$ be the number of cache misses in the uniprocess execution. Then

$$M_P(C) = M_1(C) + O\!\left(\left\lceil \tfrac{m}{s} \right\rceil C P T_\infty + \left\lceil \tfrac{m}{s} \right\rceil C P \ln(1/\varepsilon)\right)$$

with probability at least $1 - \varepsilon$. The expected number of cache misses is

$$M_1(C) + O\!\left(\left\lceil \tfrac{m}{s} \right\rceil C P T_\infty\right).$$

    Proof: Theorem 12 shows that the cache overhead of a nested-parallel computation is at most twice the product of the number of steals and thecache size. Lemma 16 shows that the number of stealattempts is O ( d m

    s

    e P ( T

    1

    + l n ( 1 = " ) ) ) withprobability at least 1 ; "and the expected number of steals is O ( d m

    s

    e P T

    1

    ) . The numberof steals is not greater than the number of steal attempts. Thereforethe bounds follow.
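The composition used in this proof can be written out explicitly (a sketch only; $S$ denotes the number of steals, a symbol introduced here just for this calculation). By Theorem 12 the cache overhead is at most $2CS$, and $S$ is at most the number of steal attempts, so

\[
M_P(C) \;\le\; M_1(C) + 2CS
  \;\le\; M_1(C) + 2C \cdot O\!\big(\lceil m/s\rceil P\,(T_\infty + \ln(1/\varepsilon))\big)
  \;=\; M_1(C) + O\!\big(\lceil m/s\rceil C P\,T_\infty + \lceil m/s\rceil C P\,\ln(1/\varepsilon)\big)
\]

with probability at least $1 - \varepsilon$; taking expectations instead gives the expected-case bound.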

Theorem 18 Consider a $P$-process, nested-parallel, work-stealing computation with simple caches of $C$ blocks. Then, for any $\varepsilon > 0$, the execution time is

\[
O\!\left(\frac{T_1(C)}{P} + m\,\lceil m/s\rceil\,C\,(T_\infty + \ln(1/\varepsilon)) + (m+s)\,(T_\infty + \ln(1/\varepsilon))\right)
\]

with probability at least $1 - \varepsilon$. Moreover, the expected running time is

\[
O\!\left(\frac{T_1(C)}{P} + m\,\lceil m/s\rceil\,C\,T_\infty + (m+s)\,T_\infty\right).
\]

Proof: We use an accounting argument to bound the running time. At each step in the computation, each process puts a dollar into one of two buckets that matches its activity at that step. We name the two buckets the work and the steal bucket. A process puts a dollar into the work bucket at a step if it is working on a node in that step. The execution of a node in the dag adds either $1$ or $m$ dollars to the work bucket. Similarly, a process puts a dollar into the steal bucket for each step that it spends stealing. Each steal attempt takes $O(s)$ steps. Therefore, each steal adds $O(s)$ dollars to the steal bucket. The number of dollars in the work bucket at the end of the execution is at most $O(T_1 + (m-1)M_P(C))$, which is

\[
O\!\left(T_1(C) + (m-1)\,\lceil m/s\rceil\,C\,P\,(T_\infty + \ln(1/\varepsilon'))\right)
\]

with probability at least $1 - \varepsilon'$. The total number of dollars in the steal bucket is the total number of steal attempts multiplied by the number of dollars added to the steal bucket for each steal attempt, which is $O(s)$. Therefore the total number of dollars in the steal bucket is

\[
O\!\left(s\,\lceil m/s\rceil\,P\,(T_\infty + \ln(1/\varepsilon'))\right)
\]

with probability at least $1 - \varepsilon'$. Each process adds exactly one dollar to a bucket at each step, so we divide the total number of dollars by $P$; taking $\varepsilon' = \varepsilon/2$ and applying the union bound to the two bucket bounds gives the high-probability bound in the theorem. A similar argument holds for the expected time bound.
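The final division by $P$ can be spelled out as follows (only a restatement of the accounting above; $W$ and $S_{\$}$ are symbols introduced here for the dollars in the work and steal buckets). Since the execution time $T_P$ satisfies $P\,T_P = W + S_{\$}$,

\[
T_P \;=\; \frac{W + S_{\$}}{P}
    \;=\; O\!\left(\frac{T_1(C)}{P}
      + m\,\lceil m/s\rceil\,C\,(T_\infty + \ln(1/\varepsilon))
      + s\,\lceil m/s\rceil\,(T_\infty + \ln(1/\varepsilon))\right),
\]

and $s\,\lceil m/s\rceil \le m + s$, which yields the last term of the theorem.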

Figure 10: The tree of threads created in a data-parallel work-stealing application. (Figure: two structurally identical trees, one forked in Step 1 and one in Step 2; each node is labeled with the process, 0 or 1, that executes it, the leaves are shaded gray, and a dashed rectangle marks a pair of corresponding leaves.)

    7 Locality-Guided Work Stealing

The work-stealing algorithm achieves good data locality by executing nodes that are close in the computation graph on the same process. For certain applications, however, regions of the program that access the same data are not close in the computation graph. As an example, consider an application that takes a sequence of steps, each of which operates in parallel over a set or array of values. We will call such an application an iterative data-parallel application. Such an application can be implemented using work stealing by forking a tree of threads on each step, in which each leaf of the tree updates a region of the data (typically disjoint). Figure 10 shows an example of the trees of threads created in two steps. Each node represents a thread and is labeled with the process that executes it. The gray nodes are the leaves. The threads synchronize in the same order as they fork. The first and second steps are structurally identical, and each pair of corresponding gray nodes updates the same region, often using much of the same input data. The dashed rectangle in Figure 10, for example, shows a pair of such gray nodes.

To get good locality for this application, threads that update the same data on different steps ideally should run on the same processor, even though they are not "close" in the dag. In work stealing, however, this is highly unlikely to happen because of the random steals. Figure 10, for example, shows an execution where all pairs of corresponding gray nodes run on different processes.

In this section, we describe and evaluate locality-guided work stealing, a heuristic modification to work stealing which is designed to allow locality between nodes that are distant in the computation graph. In locality-guided work stealing, each thread can be given an affinity for a process, and when a process obtains work it gives priority to threads with affinity for it. To enable this, in addition to a deque each process maintains a mailbox: a first-in-first-out (FIFO) queue of pointers to threads that have affinity for the process. There are then two differences between the locality-guided work-stealing and work-stealing algorithms. First, when creating a thread, a process will push the thread onto both the deque, as in normal work stealing, and also onto the tail of the mailbox of the process that the thread has affinity for. Second, a process will first try to obtain work from its mailbox before attempting a steal. Because threads can appear twice, once in a mailbox and once on a deque, there needs to be some form of synchronization between the two copies to make sure the thread is not executed twice.
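The following C++ sketch shows the two differences just described, together with the marking flag that Section 7.1 uses to keep a thread from being executed twice. All names (Thread, Worker, Scheduler, claim, and so on) are invented for the example, and the standard containers stand in for the nonblocking deques and queues that Hood actually uses; this is a sketch of the idea, not Hood's implementation.

// Sketch of locality-guided work stealing (illustrative names only).
#include <atomic>
#include <cstdlib>
#include <deque>
#include <queue>
#include <vector>

struct Thread {
  int affinity = -1;               // process this thread has affinity for; -1 if none
  std::atomic<bool> taken{false};  // set by the first process that executes it
  void run() { /* user code */ }
};

struct Worker {
  std::deque<Thread*> deque;       // owner works at the bottom; thieves steal from the top
  std::queue<Thread*> mailbox;     // FIFO of threads with affinity for this worker
};

struct Scheduler {
  std::vector<Worker> workers;
  explicit Scheduler(int p) : workers(p) {}

  // Difference 1: a new thread is pushed onto the creator's deque and onto the
  // mailbox of the process it has affinity for.
  void create(int self, Thread* t) {
    workers[self].deque.push_back(t);
    if (t->affinity >= 0) workers[t->affinity].mailbox.push(t);
  }

  // Difference 2: a process first tries its mailbox, then its own deque, and
  // only then attempts a random steal.
  Thread* next_work(int self) {
    Worker& w = workers[self];
    while (!w.mailbox.empty()) {
      Thread* t = w.mailbox.front();
      w.mailbox.pop();
      if (claim(t)) return t;          // skip copies that were already executed
    }
    while (!w.deque.empty()) {
      Thread* t = w.deque.back();
      w.deque.pop_back();
      if (claim(t)) return t;
    }
    int victim = std::rand() % static_cast<int>(workers.size());
    if (victim != self && !workers[victim].deque.empty()) {
      Thread* t = workers[victim].deque.front();
      workers[victim].deque.pop_front();
      if (claim(t)) return t;
    }
    return nullptr;                    // the steal attempt failed
  }

  // A thread can sit in a mailbox and in a deque at the same time; an atomic
  // flag ensures that only one of the copies is ever executed.
  static bool claim(Thread* t) { return !t->taken.exchange(true); }
};

A process's scheduling loop would repeatedly call next_work and run the returned thread; a nullptr result corresponds to a failed steal attempt, after which the process simply tries again.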

A number of techniques that have been suggested to improve the data locality of multithreaded programs can be realized by the locality-guided work-stealing algorithm together with an appropriate policy to determine the affinities of threads. For example, an


initial distribution of work among processes can be enforced by setting the affinity of each thread to the process to which it would be assigned at the beginning of the computation. We call this locality-guided work stealing with initial placements. Likewise, techniques that rely on hints from the programmer can be realized by setting the affinity of threads based on the hints. In the next section, we describe an implementation of locality-guided work stealing for iterative data-parallel applications. The implementation described can be modified easily to implement the other techniques mentioned.

7.1 Implementation

We built locality-guided work stealing into Hood. Hood is a multithreaded programming library with a nonblocking implementation of work stealing that delivers provably good performance under both traditional and multiprogrammed workloads [2, 10, 30].

In Hood, the programmer defines a thread as a C++ class, which we refer to as the thread definition. A thread definition has a method named run that defines the code that the thread executes. The run method is a C++ function which can call Hood library functions to create and synchronize with other threads. A rope is an object that is an instance of a thread definition class. Each time the run method of a rope is executed, it creates a new thread. A rope can have an affinity for a process, and when the Hood run-time system executes such a rope, the system passes this affinity to the thread. If the thread does not run on the process for which it has affinity, the affinity of the rope is updated to the new process.

Iterative data-parallel applications can effectively use ropes by making sure that all "corresponding" threads (threads that update the same region across different steps) are generated from the same rope. A thread will therefore always have an affinity for the process on which its corresponding thread ran in the previous step. The dashed rectangle in Figure 10, for example, represents two threads that are generated in two executions of one rope. To initialize the ropes, the programmer needs to create a tree of ropes before the first step. This tree is then used on each step when forking the threads.
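A minimal sketch of this rope idea for an iterative data-parallel application is shown below; the class and helper names (LeafRope, current_process) are invented here and do not reflect Hood's actual interface.

// One leaf region of the data, re-run on every step.  The object is created
// once, so the affinity it records in one step is available in the next step.
#include <cstddef>
#include <vector>

static int current_process() { return 0; }   // stand-in for "which process am I"

struct LeafRope {
  std::size_t lo, hi;   // the region of the array this leaf updates
  int affinity = -1;    // process that executed this leaf on the previous step

  void run(std::vector<double>& data) {
    affinity = current_process();      // remember where we actually ran
    for (std::size_t i = lo; i < hi; ++i) {
      // ... update data[i] for this step ...
    }
  }
};

On each step the application walks the same tree of such objects and forks one thread per leaf; because corresponding threads come from the same object, each one inherits the affinity recorded on the previous step.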

To implement locality-guided work stealing in Hood, we use a nonblocking queue for each mailbox. Since a thread is put into a mailbox and onto a deque, one issue is making sure that the thread is not executed twice, once from the mailbox and once from the deque. One solution is to remove the other copy of a thread when a process starts executing it. In practice, this is not efficient because it has a large synchronization overhead. In our implementation, we do this lazily: when a process starts executing a thread, it sets a flag using an atomic update operation such as test-and-set or compare-and-swap to mark the thread. When a process later encounters the other copy, it detects the mark with the same atomic update and discards that copy. The second issue comes up when one wants to reuse the thread data structures, typically those from the previous step. When a thread's structure is reused in a step, the copies from the previous step, which can be in a mailbox or a deque, need to be marked invalid. One can implement this by invalidating all the multiple copies of threads at the end of a step and synchronizing all processes before the next step starts. In multiprogrammed workloads, however, the kernel can swap a process out, preventing it from participating in the current step. Such a swapped-out process prevents all the other processes from proceeding to the next step. In our implementation, to avoid the synchronization at the end of each step, we time-stamp thread data structures so that each process closely follows the time of the computation and ignores a thread that is "out-of-date".
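In outline, the lazy scheme described above might look as follows; the field and function names are invented for illustration, and the actual Hood data structures differ.

// A copy of a thread is discarded either because another process already
// claimed it (atomic flag) or because it is left over from an earlier step
// (time stamp).
#include <atomic>
#include <cstdint>

struct ThreadRecord {
  std::atomic<bool> taken{false};  // set by the first process that executes it
  std::uint64_t step = 0;          // step in which this copy was enqueued
  void run() { /* user code */ }
};

// my_step is this process's notion of the computation's current step; it is
// advanced whenever the process sees a record from a newer step.
bool try_execute(ThreadRecord* t, std::uint64_t& my_step) {
  if (t->step < my_step) return false;        // out-of-date copy: ignore it
  if (t->step > my_step) my_step = t->step;   // catch up with the computation
  if (t->taken.exchange(true)) return false;  // some other process got it first
  t->run();
  return true;
}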

Benchmark     Work ($T_1$)   Overhead ($T_1/T_s$)   Critical Path Length ($T_\infty$)   Average Par. ($T_1/T_\infty$)
staticHeat    15.95          1.10
heat          16.25          1.12                   0.045                               361.11
lgHeat        16.37          1.12                   0.044                               372.05
ipHeat        16.37          1.12                   0.044                               372.05
staticRelax   44.15          1.08
relax         43.93          1.08                   0.039                               1126.41
lgRelax       44.22          1.08                   0.039                               1133.84
ipRelax       44.22          1.08                   0.039                               1133.84

Table 1: Measured benchmark characteristics. We compiled all applications with the Sun CC compiler using the -xarch=v8plus -O5 -dalign flags. All times are given in seconds. $T_s$ denotes the execution time of the sequential algorithm for the application; $T_s$ is 14.54 for Heat and 40.99 for Relax.

7.2 Experimental Results

In this section, we present the results of our preliminary experiments with locality-guided work stealing on two small applications.

The experiments were run on a 14-processor Sun Ultra Enterprise with 400 MHz processors and 4 Mbytes of L2 cache each, running Solaris 2.7. We used the processor_bind system call of Solaris 2.7 to bind processes to processors, to prevent the Solaris kernel from migrating a process among processors and causing the process to lose its cache state. When the number of processes is less than the number of processors, we bind one process to each processor; otherwise, we bind processes to processors such that processes are distributed among processors as evenly as possible.
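For reference, binding the calling process to a processor with the Solaris processor_bind system call looks roughly as follows; this is a sketch of standard usage of processor_bind(2), not code from the benchmark harness itself.

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <cstdio>

// Bind the calling process to the processor 'cpu'; the previous binding is
// returned through the last argument.
int bind_self_to(processorid_t cpu) {
  processorid_t previous;
  if (processor_bind(P_PID, P_MYID, cpu, &previous) != 0) {
    std::perror("processor_bind");
    return -1;
  }
  return 0;
}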

We use the applications Heat and Relax in our evaluation. Heat is a Jacobi over-relaxation that simulates heat propagation on a 2-dimensional grid for a number of steps. This benchmark was derived from similar Cilk [27] and SPLASH [35] benchmarks. The main data structures are two equal-sized arrays. The algorithm runs in steps, each of which updates the entries in one array using the data in the other array, which was updated in the previous step. Relax is a Gauss-Seidel over-relaxation algorithm that iterates over a 1-dimensional array, updating each element by a weighted average of its value and that of its two neighbors. We implemented each application with four strategies: static partitioning, work stealing, locality-guided work stealing, and locality-guided work stealing with initial placements. The static partitioning benchmarks divide the total work equally among the processes and make sure that each process accesses the same data elements in all the steps; they are implemented directly with Solaris threads. The three work-stealing strategies are all implemented in Hood. The plain work-stealing version uses threads directly, and the two locality-guided versions use ropes by building a tree of ropes at the beginning of the computation. The initial placement strategy assigns initial affinities to the ropes near the top of the tree to achieve a good initial load balance. We use the following prefixes in the names of the benchmarks: static (static partitioning), none (work stealing), lg (locality-guided work stealing), and ip (locality-guided work stealing with initial placements).
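For concreteness, one step of the Relax computation described above amounts to a loop of the following form; this is a sketch in which the weight w is an illustrative parameter, and the benchmark's actual constants and parallel decomposition are not shown.

#include <cstddef>
#include <vector>

// One step of 1-dimensional over-relaxation: each interior element is replaced
// by a weighted average of its current value and its two neighbors.  Because
// the update is in place, a[i-1] has already been updated in this step, as in
// Gauss-Seidel iteration.
void relax_step(std::vector<double>& a, double w) {
  for (std::size_t i = 1; i + 1 < a.size(); ++i) {
    a[i] = (1.0 - w) * a[i] + w * 0.5 * (a[i - 1] + a[i + 1]);
  }
}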

We ran all Heat benchmarks with the parameters -x 8K -y 128 -s 100. With these parameters each Heat benchmark allocates two arrays of double-precision floating-point numbers with 8192 columns and 128 rows and does relaxation for 100 steps. We ran all Relax benchmarks with the parameters -n 3M -s 100. With these parameters each Relax benchmark allocates one array of 3 million double-precision floating-point numbers and does relaxation for 100 steps. With the specified input parameters, a Relax


benchmark allocates 24 Megabytes and a Heat benchmark allocates 16 Megabytes of memory for the main data structures. Hence, the main data structures for the Heat benchmarks fit into the collective L2 cache space of 4 or more processes, and the data structures for the Relax benchmarks fit into that of 6 or more processes. The data for no benchmark fits into the collective L1 cache space of the Ultra Enterprise. We observe superlinear speedups with some of our benchmarks when the collective caches of the processes hold a significant amount of frequently accessed data. Table 1 shows characteristics of our benchmarks. Neither the work-stealing benchmarks nor the locality-guided work-stealing benchmarks have significant overhead compared to the serial implementation of the corresponding algorithms.

Figure 11: Speedup of the Heat benchmarks on 14 processors. (Plot: speedup versus number of processes, 0 to 50; curves for linear, heat, lgHeat, ipHeat, and staticHeat.)

Figure 12: Percentage of bad updates for the Heat benchmarks. (Plot: percentage versus number of processes; curves for heat, lgHeat, and ipHeat.)

Figure 13: Number of steals in the Heat benchmarks. (Plot: number of steals versus number of processes; curves for heat, lgHeat, and ipHeat.)

Figures 11 and 1 show the speedup of the Heat and Relax benchmarks, respectively, as a function of the number of processes. The static-partitioning benchmarks deliver superlinear speedups under traditional workloads but suffer from the performance-cliff problem and deliver poor performance under multiprogrammed workloads. The work-stealing benchmarks deliver poor performance with almost any number of processes. The locality-guided work-stealing benchmarks, with or without initial placements, however, match the static-partitioning benchmarks under traditional workloads and deliver superior performance under multiprogrammed workloads. The initial placement strategy improves the performance under traditional workloads, but it does not perform consistently better under multiprogrammed workloads. This is an artifact of binding processes to processors. The initial placement strategy distributes the load among the processes equally at the beginning of the computation, but binding creates a load imbalance between processors and increases the number of steals. Indeed, the benchmarks that employ the initial-placement strategy do worse only when the number of processes is slightly greater than the number of processors.

Locality-guided work stealing delivers good performance by achieving good data locality. To substantiate this, we counted the average number of times that an element is updated by two different processes in two consecutive steps, which we call a bad update. Figure 12 shows the percentage of bad updates in our Heat benchmarks with work stealing and locality-guided work stealing. The work-stealing benchmarks incur a high percentage of bad updates, whereas the locality-guided work-stealing benchmarks achieve a very low percentage. Figure 13 shows the number of random steals for the same benchmarks for varying numbers of processes. The graph is similar to the graph for bad updates, because it is the random steals that cause the bad updates. The figures for the Relax application are similar.

    References

[1] Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. The performance implications of thread management alternatives for shared-memory multiprocessors. IEEE Transactions on Computers, 38(12):1631-1644, December 1989.

[2] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[3] F. Bellosa and M. Steckermeier. The performance implications of locality information usage in shared memory multiprocessors. Journal of Parallel and Distributed Computing, 37(1):113-121, August 1996.

[4] Guy Blelloch and Margaret Reid-Miller. Pipelining with futures. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 249-259, Newport, RI, June 1997.

[5] Guy E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3), March 1996.

[6] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1-12, Santa Barbara, California, July 1995.

[7] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 297-308, Padua, Italy, June 1996.

[8] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55-69, August 1996.

[9] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, Santa Fe, New Mexico, November 1994.

[10] Robert D. Blumofe and Dionisios Papadopoulos. The performance of work stealing in multiprogrammed environments. Technical Report TR-98-13, The University of Texas at Austin, Department of Computer Sciences, May 1998.


[11] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, pages 187-194, Portsmouth, New Hampshire, October 1981.

[12] David Callahan and Burton Smith. A future-based parallel language for a general-purpose highly-parallel computer. In David Padua, David Gelernter, and Alexandru Nicolau, editors, Languages and Compilers for Parallel Computing, Research Monographs in Parallel and Distributed Computing, pages 95-113. MIT Press, 1990.

[13] M. C. Carlisle, A. Rogers, J. H. Reppy, and L. J. Hendren. Early experiences with OLDEN (parallel programming). In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 1-20. Springer-Verlag, August 1993.

[14] Rohit Chandra, Anoop Gupta, and John Hennessy. COOL: A Language for Parallel Programming. In David Padua, David Gelernter, and Alexandru Nicolau, editors, Languages and Compilers for Parallel Computing, Research Monographs in Parallel and Distributed Computing, pages 126-148. MIT Press, 1990.

[15] Rohit Chandra, Anoop Gupta, and John L. Hennessy. Data locality and load balancing in COOL. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 249-259, San Diego, California, May 1993.

[16] D. E. Culler and G. Arvind. Resource requirements of dataflow programs. In Proceedings of the International Symposium on Computer Architecture, pages 141-151, 1988.

[17] Dawson R. Engler, David K. Lowenthal, and Gregory R. Andrews. Shared Filaments: Efficient fine-grain parallelism on shared-memory multiprocessors. Technical Report TR 93-13a, Department of Computer Science, The University of Arizona, April 1993.

[18] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1-11, Newport, Rhode Island, June 1997.

[19] Robert H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 9-17, Austin, Texas, August 1984.

[20] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501-538, October 1985.

[21] High Performance Fortran Forum. High Performance Fortran Language Specification, May 1993.

[22] Vijay Karamcheti and Andrew A. Chien. A hierarchical load-balancing framework for dynamic multithreaded computations. In Proceedings of ACM/IEEE SC98: 10th Anniversary High Performance Networking and Computing Conference, 1998.

[23] Richard M. Karp and Yanjun Zhang. A randomized parallel branch-and-bound procedure. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing (STOC), pages 290-300, Chicago, Illinois, May 1988.

[24] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, 40(3):765-789, July 1993.

[25] David A. Krantz, Robert H. Halstead, Jr., and Eric Mohr. Mul-T: A High-Performance Parallel Lisp. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 81-90, 1989.

[26] Evangelos Markatos and Thomas LeBlanc. Locality-based scheduling for shared-memory multiprocessors. Technical Report TR-094, Institute of Computer Science, F.O.R.T.H., Crete, Greece, 1994.

[27] MIT Laboratory for Computer Science. Cilk 5.2 Reference Manual, July 1998.

[28] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264-280, July 1991.

[29] Girija J. Narlikar. Scheduling threads for low space requirement and good locality. In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1999.

[30] Dionysios Papadopoulos. Hood: A user-level thread library for multiprogrammed multiprocessors. Master's thesis, Department of Computer Sciences, University of Texas at Austin, August 1998.

[31] James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, and Kai Li. Thread scheduling for cache locality. In Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 60-71, Cambridge, Massachusetts, October 1996.

[32] Dan Stein and Devang Shah. Implementing lightweight threads. In Proceedings of the USENIX 1992 Summer Conference, San Antonio, Texas, June 1992.

[33] Jacobo Valdes. Parsing Flowcharts and Series-Parallel Graphs. PhD thesis, Stanford University, December 1978.

[34] B. Weissman. Performance counters and state sharing annotations: a unified approach to thread locality. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 262-273, October 1998.

[35] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA), pages 24-36, Santa Margherita Ligure, Italy, June 1995.

[36] Y. Zhang and A. Ortynski. The efficiency of randomized parallel backtrack search. In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, October 1994.

[37] Yanjun Zhang. Parallel Algorithms for Combinatorial Search Problems. PhD thesis, Department of Electrical Engineering and Computer Science, University of California at Berkeley, November 1989. Also: University of California at Berkeley, Computer Science Division, Technical Report UCB/CSD 89/543.
