
Task Clustering and Scheduling to Multiprocessors with Duplication

Li Guodong, Chen Daoxu, Wang Daming, Zhang Defu
Dept. of Computer Science, Nanjing University, Nanjing 210093, China

([email protected])

Abstract

Optimal task-duplication-based scheduling of tasks represented by a directed acyclic graph (DAG) onto a set of homogenous distributed-memory processors is a strong NP-hard problem. In this paper we present a clustering and scheduling algorithm with time complexity O(v³logv), where v is the number of nodes, which is able to generate optimal schedules for some specific DAGs. For arbitrary DAGs, the schedule generated is at most twice as long as the optimal one. Simulation results show that the performance of TCSD is superior to those of four renowned algorithms: PY, TDS, TCS and CPFD.

1. Introduction

Task scheduling problems are NP-complete in the general case [1]. The non-duplication task scheduling problem has been extensively studied and various heuristics have been proposed in the literature [2-4]. A comparative study and evaluation of some of these algorithms has been presented [5]. These heuristics fall into a variety of categories such as list-scheduling algorithms [2], clustering algorithms [3], and guided random search methods [4]. A few research groups have studied the task scheduling problem for heterogeneous systems [6]. Duplication-based scheduling is a relatively new approach to the scheduling problem. There are several task-duplication-based scheduling schemes [7-16] which duplicate certain tasks in an attempt to minimize communication costs. The idea behind duplication-based scheduling algorithms is to schedule a task graph by mapping some of its tasks redundantly to reduce the inter-task communication overhead. They usually assume an unbounded number of identical processors and have much higher complexity than their alternatives.

Duplication-based scheduling problems have been shown to be NP-complete [10]. Thus, many proposed algorithms are based on heuristics. These algorithms can be classified into two categories in terms of the task duplication approach used [21]: Scheduling with Partial Duplication (SPD) [9][12][13][15] and Scheduling with Full Duplication (SFD) [7][8][10][11][14][16]. SPD algorithms duplicate only a limited number of parents of a node to achieve low complexity, while SFD algorithms attempt to duplicate all the parents of a join node. When the communication cost is high, the performance of SPD algorithms is low. SFD algorithms show better performance than SPD algorithms but have a higher complexity. Table 1 summarizes the characteristics of some well-known duplication-based scheduling algorithms.

Among these algorithms, the CPFD algorithm

0-7695-1926-1/03/$17.00 (C) 2003 IEEE. Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'03)

achieves the shortest makespan and uses relatively few processors, but at each step it spends a prohibitively long time testing all candidate processors and scanning through the whole time span of each processor.

Table 1. Task duplication scheduling algorithms

This paper proposes a novel SFD algorithm called TCSD. Our simulation study shows that the proposed algorithm achieves a considerable performance improvement over existing algorithms while having lower time complexity. Theoretical analysis shows that TCSD matches the lowest schedule bound achieved by other algorithms so far. In addition, the number of processors consumed by the destination clusters is substantially decreased, and the performance degradation of TCSD as the number of available processors decreases is acceptable.

2. The Proposed TCSD Algorithm

2.1 Model and Notations

A parallel program is usually represented by a Directed Acyclic Graph (DAG), which is defined by the tuple (V, E, τ, c), where V, E, τ, c are the set of tasks, the set of edges, the set of computation costs associated with the tasks, and the set of communication costs associated with the edges, respectively. The edge e_j,i ∈ E represents the precedence constraint between the tasks v_j and v_i. τ_i is the computation cost of task v_i and c_j,i is the communication cost of edge e_j,i. When two tasks v_j and v_i are assigned to the same processor, c_j,i is assumed to be zero since the intra-processor communication cost is negligible. Multiple entry nodes or exit nodes are allowed in the DAG. Figure 1 depicts an example DAG. The underlying target architecture is assumed to be homogenous and the number of processors to be unbounded.
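As a purely illustrative reading of this model (not the authors' code), the tuple (V, E, τ, c) can be held in two dictionaries; the task names and costs below are hypothetical and are not those of Figure 1, whose labels did not survive extraction:

```python
# Hypothetical DAG in the (V, E, tau, c) model of Section 2.1.
# tau[v] = computation cost of task v; c[(u, v)] = communication
# cost of edge u -> v, counted only when u and v are assigned to
# different processors. All values are illustrative.
tau = {"v1": 2, "v2": 3, "v3": 1, "v4": 4}
c = {("v1", "v2"): 5, ("v1", "v3"): 2, ("v2", "v4"): 3, ("v3", "v4"): 6}

V = set(tau)
E = set(c)

def iparents(v):
    """Immediate parents (iparents) of v: sources of edges entering v."""
    return [u for (u, w) in E if w == v]

def comm_cost(u, v, same_processor):
    """Edge cost is zero when both tasks share a processor."""
    return 0 if same_processor else c[(u, v)]

print(sorted(iparents("v4")))        # ['v2', 'v3']
print(comm_cost("v1", "v2", True))   # 0
print(comm_cost("v1", "v2", False))  # 5
```

The zeroing of intra-processor edge costs in `comm_cost` is exactly what makes duplicating a parent onto the child's processor attractive.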

The term iparent is used to represent an immediate parent. The earliest start time, est_i, and the earliest completion time, ect_i, are the earliest times at which a task v_i starts and finishes its execution, respectively. The message arriving time from v_j to v_i, mat_j,i, is the time at which the message from v_j arrives at v_i. If v_j and v_i are scheduled on the same processor, mat_j,i becomes ect_j; otherwise, mat_j,i = ect_j + c_j,i. For a join node v_i, its arriving time mat_i = max{mat_j,i | v_j is v_i's iparent}. In addition, its critical iparent, denoted CIP(v_i), provides the largest mat to the join node. That is, v_j = CIP(v_i) if and only if mat_j,i ≥ mat_k,i for every k where v_k is an iparent of v_i and k ≠ j (if multiple nodes satisfy this constraint, arbitrarily select one). The critical iparent of an entry node is defined to be NULL. Among all of v_i's iparents residing on other processors, RIP(v_i) is the iparent of v_i whose mat is maximal (arbitrarily select one to break ties). Clearly, when CIP(v_i) and v_i are not assigned to the same processor, CIP(v_i) = RIP(v_i).

Fig. 1. Example DAG
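A minimal sketch of the mat and CIP definitions above; the ect values, costs, and processor assignments are hypothetical:

```python
# Sketch of mat_{j,i} and CIP selection for a join node v3 with two
# iparents. All numbers and assignments below are illustrative.
ect = {"v1": 4, "v2": 7}                 # earliest completion times
c = {("v1", "v3"): 6, ("v2", "v3"): 2}   # communication costs to v3
same_proc = {"v1": False, "v2": True}    # shares v3's processor?

def mat(j, i):
    # mat_{j,i} = ect_j if v_j shares v_i's processor, else ect_j + c_{j,i}
    return ect[j] if same_proc[j] else ect[j] + c[(j, i)]

def cip(i, parents):
    # Critical iparent: the parent providing the largest mat
    # (ties broken arbitrarily; Python's max keeps the first).
    return max(parents, key=lambda j: mat(j, i))

print(mat("v1", "v3"))          # 4 + 6 = 10
print(mat("v2", "v3"))          # 7 (same processor, no edge cost)
print(cip("v3", ["v1", "v2"]))  # 'v1'
```

Note how v1 is critical despite finishing earlier: the inter-processor edge cost dominates, which is precisely the situation duplication targets.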

After a task v_i is scheduled on a processor PE(v_i), the est of v_i on processor PE(v_i) is equal to the actual start time of task v_i, ast(i, PE(v_i)), written ast_i when no confusion can arise. After all tasks in a graph are scheduled, the schedule length, also called the makespan, is the largest finish time of the exit tasks. The objective of the scheduling problem


is to determine an assignment of tasks such that a minimal schedule length is obtained. A clustering of a DAG is a mapping of the nodes in V onto non-overlapping clusters, each of which contains a subset of V. A schedule S_i for a cluster C(v_i) is optimal if, for every other schedule S_i' for C(v_i), makespan(S_i) ≤ makespan(S_i'). The makespan of a cluster C(v_i) is defined as the makespan of its optimal schedule. In addition, a schedule S is optimal for a clustering Ψ if, for every other schedule S' for Ψ, makespan(S) ≤ makespan(S').

2.2 TCSD

The basic idea behind TCSD is similar to the PY algorithm and the TCS algorithm in that the clustering is constructed dynamically, but TCSD differs from them in selecting the parent nodes to be duplicated and in computing the start times of the nodes in the cluster under investigation. For each v_i ∈ V, we first compute its est_i by finding a cluster C(v_i) which allows v_i to start execution as early as possible when all nodes in C(v_i) are executed on the same processor.

The est values are computed in topological order of the nodes in V. A node is processed after all of its ancestors have been assigned est values. Consider the cluster C(v_i) and an arc e_m,n crossing C(v_i) such that v_n belongs to C(v_i) while v_m does not; obviously est_n ≥ est_m + τ_m + c_m,n. If v_m and v_n are on the critical path (in which case e_m,n is called the critical edge), then we attempt to cut the cost of e_m,n down to zero by assigning v_m and v_n to the same processor.

At each iteration, TCSD first attempts to identify the critical edge of the current cluster C(v_i), then tentatively absorbs into C(v_i) the node from which the critical edge emanates, i.e. v_m, to decrease the length of the current critical path. However, this operation may increase the cluster's makespan even if v_m starts execution at est_m. In this case, the previous absorption is canceled.

Before a critical edge e_m,n is absorbed into C(v_i), mat_n = est_m + τ_m + c_m,n. After the insertion of v_m, mat_m = max{ {est_p + τ_p | e_p,m ∈ E and v_p ∈ C(v_i)}, {est_p + τ_p + c_p,m | e_p,m ∈ E and v_p ∉ C(v_i)} }. Denote the new mat of v_n after v_m's insertion by mat_n'; obviously the constraint mat_n' ≤ mat_n must hold after v_m's insertion. Furthermore, suppose v_p = RIP(v_m); then mat_n ≥ mat_m + τ_m ≥ est_p + τ_p + c_p,m + τ_m, and it follows that est_m + c_m,n ≥ est_p + τ_p + c_p,m. In general, assume that v_1 = RIP(v_0), v_2 = RIP(v_1), ..., v_n = RIP(v_{n-1}), with v_0 ∈ C(v_i) and v_1, v_2, ..., v_n ∉ C(v_i); after v_1 is absorbed into cluster C(v_i), the nodes v_2, v_3, ..., v_k must also be absorbed into C(v_i) if the following inequality is true:

    est_1 + c_1,0 < est_k + Σ_{l=2}^{k} τ_l + c_k,k-1    (1)
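The chain-absorption test of inequality (1) can be sketched as follows; the RIP chain, release times, and costs are all hypothetical:

```python
# Sketch of inequality (1): after v_1 = RIP(v_0) is absorbed, the
# chain v_2 .. v_k must follow it into the cluster whenever
#   est_1 + c_{1,0} < est_k + sum_{l=2..k} tau_l + c_{k,k-1}.
# All numbers below are illustrative.
est = [0, 3, 2, 1]                          # est_0 .. est_3 along the chain
tau = [2, 2, 3, 4]                          # tau_0 .. tau_3
c_down = {(1, 0): 8, (2, 1): 5, (3, 2): 6}  # c_{l, l-1}

def must_absorb(k):
    """True if v_2..v_k must follow v_1 into the cluster (inequality (1))."""
    lhs = est[1] + c_down[(1, 0)]
    rhs = est[k] + sum(tau[2:k + 1]) + c_down[(k, k - 1)]
    return lhs < rhs

print(must_absorb(2))  # 3 + 8 = 11 vs 2 + 3 + 5 = 10  -> False
print(must_absorb(3))  # 11 vs 1 + (3 + 4) + 6 = 14    -> True
```

In this made-up chain only the longer prefix must be absorbed, which is why TCSD checks each k along the RIP chain rather than absorbing node by node.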

TCSD inserts these nodes into C(v_i) in one step rather than inserting them one by one, to save running time. Here we introduce a term called snapshot to model the temporary status of the obtained cluster. Formally, a snapshot of C(v_i), i.e. SP(v_i), is defined by a tuple (V_in, E_cs, V_out, T_in), where V_in, E_cs, V_out, T_in are the set of nodes already in C(v_i), the set of edges crossing C(v_i), the set of nodes from which the edges in E_cs emanate, and the ests associated with the nodes in V_in, respectively. An edge e_j,i in E_cs represents the precedence constraint between a task v_j in V_out and a task v_i in V_in.

Procedure Calculate-EST(v_i) constructs C(v_i) and calculates the value of est_i by altering SP(v_i) gradually. For all entry nodes, the ests are zero. Initially the critical path of C(v_i) includes the edge connecting v_i and its CIP v_j; thus the initial value of est_i is ect_j + c_j,i and the initial critical edge is e_j,i. This procedure iteratively absorbs the node on the current critical edge (also called the critical node) into C(v_i), and then finds a new critical edge, until the schedule length of C(v_i) cannot be improved any more. Note that at each iteration we first assume that the current critical node, i.e. v_m, is scheduled on PE(v_i) and starts executing at est_m. If such an absorption cannot result in an earlier start time of v_i, the procedure terminates and returns the set of nodes in


V_in, excluding v_m; otherwise it inserts v_m into V_in and schedules v_m to start execution at its ast. Then v_m's ancestors satisfying inequality (1) are inserted into V_in too. After all these ancestors have been identified and inserted into V_in, we re-compute their mats. Then a new critical edge is identified by calling compute_critical_path(SP(v_i)). If the updated critical path's length is less than v_i's original est, then est_i is updated to the new value.

Algorithm. Calculate-EST(v_i)
begin
  if v_i is an entry node then return est_i = 0 and C(v_i) = {v_i};
  Let v_j = CIP(v_i), est_i = ect_j + c_j,i and critical_edge = e_j,i;
  For the current SP(v_i), let V_in = {v_i}, V_out = ∅ and E_cs = ∅;
  repeat {
    Denote the current critical_edge as e_m,n;
    V_in += v_m, ast_m = est_m;  /* ast_m is v_m's release time */
    est_i' = compute_critical_path(SP(v_i));
    if est_i' > est_i then return C(v_i) and est_i;
    Let v_p = CIP(v_m) and span_sum = est_m + c_m,n;
    Initialize comp_sum to be zero, i.e. comp_sum = 0;
    repeat {
      comp_sum = comp_sum + τ_p;
      if span_sum < est_p + c_p,x + comp_sum then {
        E_cs = E_cs − {e_p,x | e_p,x ∈ E and v_x ∈ V_in};
        V_in += v_p;
        V_out = (V_out − {v_p}) ∪ {v_q | e_q,p ∈ E and v_q ∉ V_in};
        E_cs = E_cs ∪ {e_q,p | e_q,p ∈ E and v_q ∉ V_in};
        Let v_x = v_p and v_p = CIP(v_p);
      }
    } until v_p is NULL;
    est_i' = compute_critical_path(SP(v_i));
    est_i = min{est_i, est_i'};
    if est_i' ≤ est_i then C(v_i) = V_in;
  } until critical_edge = NULL;
end

Given the mat of each node in C(v_i), procedure compute_critical_path(SP(v_i)) computes the length of the critical path of C(v_i) and identifies the new critical edge. Procedure compute_node_mat(v_x, SP(v_i)) calculates the new mats of the nodes in C(v_i). Note that, when the mat of each node in C(v_i) is available, the optimal schedule of these nodes on one processor, achieving the minimal finish time of v_i, is obtained by executing the nodes in nondecreasing order of their mats.
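The observation above, executing a cluster's nodes in nondecreasing mat order on one processor, can be sketched as a short routine; the node names, mats, and computation costs are hypothetical:

```python
# Given each node's (mat, tau), schedule a cluster's nodes on one
# processor in nondecreasing mat order. A node whose incoming message
# arrives after the processor becomes free forces an idle gap; that
# node marks where the new critical edge enters the cluster.
nodes = {"a": (0, 2), "b": (1, 3), "c": (9, 1)}  # name -> (mat, tau)

def schedule_cluster(nodes):
    order = sorted(nodes, key=lambda n: nodes[n][0])  # nondecreasing mat
    ast, finish, critical = {}, 0, None
    for n in order:
        mat_n, tau_n = nodes[n]
        if mat_n > finish:            # processor idles waiting for a message
            ast[n], critical = mat_n, n
        else:
            ast[n] = finish
        finish = ast[n] + tau_n
    return ast, finish, critical

ast, length, critical = schedule_cluster(nodes)
print(ast)       # {'a': 0, 'b': 2, 'c': 9}
print(length)    # 10
print(critical)  # 'c' -- its late message delays the whole cluster
```

Here the gap before 'c' is what the main loop of TCSD would try to remove by absorbing the node on the other end of that edge.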

Procedure compute_critical_path(SP(v_i))
begin
  Unmark all nodes in the V_in of SP(v_i);
  while there is an unmarked node in V_in do {
    Select an arbitrary unmarked node v_x;
    Call compute_node_mat(v_x, SP(v_i)) to compute mat_x;
  }
  Suppose V_in contains v_1, v_2, ..., v_n (in nondecreasing order of their mats).
  Let ast_1 = mat_1 and critical_edge = NULL;
  for j = 2 to n do {
    if ast_{j-1} + τ_{j-1} < mat_j then let ast_j = mat_j and critical_edge = e_{RIP(v_j),j};
    else ast_j = ast_{j-1} + τ_{j-1};
  }
  return the critical_edge and the schedule length, which is equal to ast_n + τ_n;
end

Procedure compute_node_mat(v_x, SP(v_i))
begin
  if v_x is an entry node then mark v_x and return 0;
  if not all of v_x's iparents are marked then {
    for each unmarked iparent of v_x, i.e. v_p, compute mat_p by calling compute_node_mat(v_p, SP(v_i));
  }
  Compute v_x's message arriving time mat_x. Note that if v_p is in V_in, mat_p,x is equal to ect_p; otherwise mat_p,x = ect_p + c_p,x;
  mark v_x;
  return mat_x;
end

The running trace in figure 3 illustrates the working of TCSD for the DAG in figure 1.

The following algorithm constructs Ψ(G) by


visiting the nodes in G in reverse topological order. Initially Ψ(G) = ∅ and all exit nodes are marked. Then we add clusters into Ψ(G) in descending order of their makespans. After a cluster C(v_i) has been scheduled onto a processor, if the actual start time of a node in C(v_i), e.g. v_j, is less than the value stored in cur_ast_j, then the new value substitutes the old one. For any edge crossing C(v_i), such as e_m,n: if ast_m − est_m > makespan(Ψ) − ect_i, then mark v_m and assign C(v_m) to a new processor; otherwise, v_n receives data from the copy of v_m starting execution at ast_m, which eliminates the need to consume a new processor.
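The processor-saving test described above can be sketched as follows; all numeric values are hypothetical:

```python
# Test from the text: for an edge e_{m,n} crossing C(v_i), if
#   ast_m - est_m <= makespan - ect_i,
# then v_n can reuse the already scheduled copy of v_m instead of
# assigning a duplicate cluster to a fresh processor.
def can_reuse_copy(ast_m, est_m, makespan, ect_i):
    """True when v_m's delayed copy still cannot stretch the makespan."""
    return ast_m - est_m <= makespan - ect_i

# Illustrative values: the scheduled copy of v_m starts 3 units later
# than its est, while the consuming cluster has 5 units of slack.
print(can_reuse_copy(ast_m=7, est_m=4, makespan=20, ect_i=15))   # True
print(can_reuse_copy(ast_m=12, est_m=4, makespan=20, ect_i=15))  # False
```

The left side is the lateness of the existing copy, the right side is the slack of the consuming cluster; reuse is safe exactly when the lateness fits inside the slack.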

Algorithm. Duplication-Based-Schedule(DAG G)
begin
  Compute the est of each node by calling procedure Calculate-EST;
  Mark the exit nodes in V and unmark all other nodes in V;
  Initialize the value of each node's cur_ast to be makespan(Ψ(G));
  while there exists a marked node do {
    Select the marked node with the largest ect, i.e. v_i;
    Add C(v_i) into Ψ(G) and unmark v_i;
    for each node in C(v_i), i.e. v_j, do {
      Suppose the actual start time of v_j in C(v_i) is ast(j, PE(v_i));
      Let cur_ast_j = min{ast(j, PE(v_i)), cur_ast_j};
    }
    for each edge crossing C(v_i), i.e. e_m,n, do {
      if ast_m − est_m > makespan(Ψ) − ect_i then mark v_m;
    }
  }
end

The schedule generated by algorithm Duplication-Based-Schedule is shown in figure 4. The first cluster added into Ψ(G) is C(v_10); the next cluster follows, and v_5 and its sibling in that cluster can receive data from their iparents v_1 and v_4 in C(v_10), respectively. Finally the remaining clusters are inserted into Ψ(G).

Assume the number of nodes in G is v. Algorithm Duplication-Based-Schedule has complexity O(v²), where O(v) comes from updating the values of the cur_asts and identifying the next cluster. The complexity of procedure compute_node_mat is O(vlogv), where O(logv) comes from computing the new mat of a node in C(v_i). Procedure compute_critical_path runs in O(vlogv) time. In the worst case, v nodes are absorbed into C(v_i), making Calculate-EST run in O(v²logv) time. Hence, the overall time complexity of constructing v clusters is O(v³logv).

Lemma 1. If there are multiple nodes in cluster C(v_i) whose start times are retarded from their original mats to new mats, the incurred delay of makespan(C(v_i)), i.e. D_i, satisfies

where mat_x and mat_x' are the original value and the new value of mat_x, respectively.

Proof. The proof of this (and the following theorems) is omitted due to limited space.

Theorem 1. Provided that every edge e_m,n crossing C(v_i) satisfies ast_m − est_m ≤ makespan(Ψ) − ect_i, v_n can start execution at time ast_n while Ψ's makespan will not increase.

Theorem 1 justifies the processor-saving operation in Duplication-Based-Schedule.

2.3 Properties of the Proposed Algorithm

Theorem 2. For out-tree DAGs (i.e. ordinary trees),

the schedules generated by TCSD are optimal.

Lemma 2. Consider a one-level in-tree (i.e. an inverse ordinary tree) consisting of v_1, v_2, ..., v_{i-1}, v_i such that v_1, v_2, ..., v_{i-1} are the iparents of v_i and have individual release times est_1, est_2, ..., est_{i-1}. Provided that est_1 + τ_1 + c_1,i ≥ est_2 + τ_2 + c_2,i ≥ ... ≥ est_{i-1} + τ_{i-1} + c_{i-1,i}, TCSD generates an optimal schedule whose length is equal to max{makespan({v_1, v_2, ..., v_j}), est_{j+1} + c_{j+1,i}} + Σ_{l=j+1}^{i} τ_l, where j is such that makespan({v_1, v_2, ..., v_j}) ≤ τ_j + c_j,i and makespan({v_1, v_2, ..., v_j, v_{j+1}}) > τ_{j+1} + c_{j+1,i}.

Fig. 3. Computing the est of each node

When est_1, est_2, ..., est_{i-1} are all equal to zero, this problem reduces to the case of a general single-level in-tree, which has been investigated in previous research [11].

Theorem 3. For fork-join DAGs (diamond DAGs),

the schedules generated by TCSD are optimal.

We adopt the definition of DAG granularity given in [3][14].

Theorem 4. For coarse-grain DAGs, the schedules

generated by TCSD are optimal.

Theorem 5. For arbitrary DAGs, the schedules generated by TCSD are at most twice as long as the optimal ones. Moreover, if the granularity of the DAG is larger than (1+ε)/ε for 0 < ε ≤ 1, the schedule length generated is at most (1+ε) times that of the optimal one.

Fig. 4. Schedule generated by TCSD for the DAG depicted in Fig. 1


3. Performance Evaluation

In this section we compare the performance of TCSD with four existing scheduling algorithms, i.e. CPFD, PY, TCS and TDS. Comparisons of some of these algorithms with other algorithms, including LC, BTDH, LCTD, DSH, LWB, etc., can be found in the literature [11][16].

These four algorithms, along with TCSD, are applied to diverse sets of applications with varying characteristics. In particular, the input DAGs are generated randomly with the number of tasks ranging from 200 to 800 nodes and the number of edges varying from 200 to 4,000. Additionally, the number of predecessors and successors varies from one to 100, and the computation costs vary from one to 1,500 time units. The CCR of a DAG is defined as its average communication cost divided by its average computation cost. In the tests the CCR values used are 0.1, 1.0, 5.0, and 10.0.

We generate 2,000 random DAGs for testing. The first comparison examines the schedule lengths of the four algorithms against TCSD in terms of the normalized schedule length (NSL), which is obtained by dividing the output schedule length by the sum of the computation costs on the critical path [11][13]. Table 2 shows the average NSLs produced by the five algorithms. TDS is an SPD algorithm with relatively low time complexity, but its schedule lengths are much longer than those of its counterparts. TCS and PY behave rather similarly and achieve comparable performance in terms of schedule length. CPFD is the most time-consuming but produces better schedules than TDS, TCS and PY. However, TCSD outperforms all of these algorithms in this performance metric, and its running time is a factor of logv lower than that of CPFD.
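The two metrics used in this evaluation, NSL and CCR, can be sketched with hypothetical costs:

```python
# NSL = schedule length / sum of computation costs on a critical path;
# CCR = average communication cost / average computation cost.
# All cost values below are illustrative, not from the experiments.
cp_comp_costs = [4, 3, 5]        # computation costs along a critical path
schedule_length = 18             # makespan produced by some algorithm
comm_costs = [2, 6, 4, 8]        # all edge costs in the DAG
comp_costs = [4, 3, 5, 2, 6]     # all node costs in the DAG

nsl = schedule_length / sum(cp_comp_costs)
ccr = (sum(comm_costs) / len(comm_costs)) / (sum(comp_costs) / len(comp_costs))

print(nsl)  # 18 / 12 = 1.5
print(ccr)  # 5.0 / 4.0 = 1.25
```

An NSL of 1.0 would mean the schedule is as short as the critical-path computation alone, so values closer to 1.0 are better; larger CCR values make communication, and hence duplication decisions, dominate.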

Algorithms are generally more sensitive to the value of CCR than to the number of nodes. Table 3 shows the ratio of the makespans generated by TCS, TDS, PY and CPFD over that of TCSD. It may be noted that the differences between the performances of the various algorithms become more significant at larger values of CCR.

Table 2. Average NSLs of five algorithms for

random DAGs with various numbers of nodes

Table 3. Ratio of makespans generated by TCS, TDS, PY and CPFD over TCSD

Table 4 shows the result of the comparison between each pair of algorithms. Each entry of the table consists of three elements in the format ">x, =y, <z", which means that the algorithm in the row produces a longer schedule than the algorithm in the column in x cases, the same schedule length in y cases, and a shorter schedule in z cases. For instance, among 1,000 cases, TCSD outperforms CPFD in 503 cases and achieves the same makespan as CPFD in 352 cases. There are 145 cases in which TCSD is inferior to CPFD in terms of performance.

Table 4. Algorithm comparison in terms of

better, worse, and equal performance

Number of Nodes | TCS | TDS | PY | CPFD
... | ... | ... | ... | >488, =352, <160


4. Conclusion

This paper presents a novel duplication-based scheduling algorithm for clustering and scheduling DAGs on fully connected homogenous multiprocessors. The algorithm uses task duplication to reduce the length of the schedule. Our performance study showed that it has a relatively low complexity comparable to SFD algorithms, while outperforming both SPD algorithms and SFD algorithms by generating schedules with much shorter schedule lengths.

References

[1] V. Sarkar, "Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors", Cambridge, Mass: MIT Press, 1989.
[2] A. Radulescu and A. van Gemund, "Low-Cost Task Scheduling for Distributed-Memory Machines", IEEE Trans. Parallel and Distributed Systems, Vol. 13, No. 6, June 2002, pp. 648-658.
[3] A. Gerasoulis and T. Yang, "On the Granularity and Clustering of Directed Acyclic Task Graphs", IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 6, 1993, pp. 686-701.
[4] J. Gu, W. Shu and M.-Y. Wu, "Efficient Local Search for DAG Scheduling", IEEE Trans. Parallel and Distributed Systems, Vol. 12, No. 6, June 2001, pp. 617-627.
[5] Y.-K. Kwok and I. Ahmad, "Benchmarking and Comparison of the Task Graph Scheduling Algorithms", J. Parallel and Distributed Computing, Vol. 59, 1999, pp. 381-422.
[6] H. Topcuoglu, S. Hariri and M.-Y. Wu, "Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing", IEEE Trans. Parallel and Distributed Systems, Vol. 13, No. 3, Mar. 2002, pp. 260-274.
[7] Y.-C. Chung and S. Ranka, "Application and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory Multiprocessors", Proc. Supercomputing '92, Nov. 1992, pp. 512-521.
[8] B. Kruatrachue and T. G. Lewis, "Grain Size Determination for Parallel Processing", IEEE Software, Jan. 1988, pp. 23-32.
[9] J. Y. Colin and P. Chretienne, "C.P.M. Scheduling with Small Communication Delays and Task Duplication", Operations Research, 1991, pp. 680-684.
[10] C. Papadimitriou and M. Yannakakis, "Towards an Architecture-Independent Analysis of Parallel Algorithms", SIAM J. Computing, Vol. 19, 1990, pp. 322-328.
[11] I. Ahmad and Y.-K. Kwok, "On Exploiting Task Duplication in Parallel Program Scheduling", IEEE Trans. Parallel and Distributed Systems, Vol. 9, No. 9, Sep. 1998, pp. 872-892.
[12] S. Darbha and D. P. Agrawal, "Optimal Scheduling Algorithm for Distributed-Memory Machines", IEEE Trans. Parallel and Distributed Systems, Vol. 9, No. 1, Jan. 1998, pp. 87-95.
[13] G.-L. Park, B. Shirazi and J. Marquis, "Mapping of Parallel Tasks to Multiprocessors with Duplication", Proc. 12th International Parallel Processing Symposium (IPPS'98), 1998.
[14] M. A. Palis, J.-C. Liou and D. S. L. Wei, "Task Clustering and Scheduling for Distributed Memory Parallel Architectures", IEEE Trans. Parallel and Distributed Systems, Vol. 7, No. 1, Jan. 1996, pp. 46-55.
[15] S. Ranaweera and D. P. Agrawal, "A Scalable Task Duplication Based Scheduling Algorithm for Heterogeneous Systems", Proc. International Conference on Parallel Processing (ICPP'00), 2000.
[16] B. Shirazi, H. B. Chen and J. Marquis, "Comparative Study of Task Duplication Static Scheduling versus Clustering and Non-Clustering Techniques", Concurrency: Practice and Experience, Vol. 7, Aug. 1995, pp. 371-389.