

An Evaluation of Processor Co-Allocation for Different System Configurations and Job Structures

A.I.D. Bucur and D.H.J. Epema
Faculty of Information Technology and Systems
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands
e-mail: A.I.D.Bucur, [email protected]

Abstract

In systems consisting of multiple clusters of processors such as our Distributed ASCI* Supercomputer (DAS), jobs may request co-allocation, i.e., the simultaneous allocation of processors in different clusters. We simulate such systems ignoring communication among the tasks of jobs, and determine the response times for different types and sizes of job requests, and for different numbers and sizes of clusters. In many cases we also compute or approximate the maximum utilization. We find that the numbers and sizes of the clusters and of the job components have a strong impact on performance and that in many cases co-allocation is a viable choice.

1. Introduction

Over the last decade, clusters and distributed-memory multiprocessors consisting of hundreds or thousands of standard CPUs have become very popular. In addition, recent work in computational and data GRIDs [2, 7] enables applications to access resources in different and possibly widely dispersed locations simultaneously, that is, to employ co-allocation, to accomplish their goals, effectively creating single multicluster systems. Most of the research on processor scheduling in parallel computer systems has been dedicated to multiprocessors and single-cluster systems, but hardly any attention has been devoted to multicluster systems. In this paper we study the performance of processor co-allocation policies in multicluster systems employing space sharing for rigid jobs [3], depending on such parameters as the structure of jobs and the system configuration.

* In this paper, ASCI refers to the Advanced School for Computing and Imaging in The Netherlands, which came into existence before, and is unrelated to, the US Accelerated Strategic Computing Initiative.

Because it is a simple and yet often used and very practical scheduling strategy, we only consider rigid jobs scheduled by pure space sharing, which means that jobs require fixed numbers of processors and are executed on them exclusively until their completion. Our two performance metrics are the mean job response time as a function of the utilization, and the maximum utilization at which multicluster systems are stable.

We first deduce (approximate) analytic expressions for the maximal utilization of multiclusters under co-allocation. Then, using simulations, we find the response times for different types of job requests and numbers of job components, and for different numbers and sizes of clusters. We conclude that in many cases co-allocation yields good performance. The request type has a strong influence on the performance, as does the number of job components. For unordered jobs, which provide the easiest and most practical way of co-allocation, the best performance is obtained in systems with clusters of equal size, or with a large number of clusters relative to the number of job components.

In the most general setting, GRID resources are very heterogeneous and wide-area connections may be very slow. In this paper we restrict ourselves to homogeneous multicluster systems and ignore communication in order to isolate the aspects mentioned above. However, if we assume that the job service times in our model do include communication in a single cluster with a fast local network, one can get an indication of the performance of processor co-allocation when communication in multiclusters is taken into account in the following way. If the service times of all jobs are extended by (about) the same factor $f > 1$ due to the relatively slow intercluster communication, multiplying our response-time curves by $1/f$ and $f$ in the horizontal and vertical directions, respectively, yields the multicluster response times. In [9] it is shown that for a ratio of 100 between the intercluster and intracluster communication speeds, a value which holds for the DAS, $f$ is below 1.4 for most of the algorithms considered.
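As an illustration of this rescaling, the short sketch below (our own, not from the paper; the function name rescale_curve and the symbol f are ours, since the original factor symbol was lost in extraction) transforms a list of (utilization, response time) points measured without intercluster communication into the indicative multicluster curve.

```python
def rescale_curve(points, f):
    """Indicative multicluster response-time curve from one that ignores
    intercluster communication: if every service time is stretched by a
    factor f > 1, shrink the utilization axis by 1/f and stretch the
    response times by f."""
    return [(u / f, f * r) for (u, r) in points]

# e.g., with f = 1.4 as reported for the DAS in [9]:
# multicluster_curve = rescale_curve(single_speed_curve, 1.4)
```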

2. The Model

In this section we describe our model of multicluster systems, which is based on the DAS system.

2.1. The DAS System

The DAS [1, 6] is a wide-area computer system consisting of four clusters of identical Pentium Pro processors, one with 128 and the other three with 24 processors each. The system was designed for research on parallel and distributed computing. On single DAS clusters a local scheduler is used that allows users to request a number of processors bounded by the cluster's size, for a time interval which does not exceed an imposed limit.

Although co-allocation is possible on the DAS, so far it has not been used enough to let us obtain statistics on the sizes of the jobs' components. However, from the log of the largest cluster of the system we found that over a period of three months, the cluster was used by 20 different users who ran 30,558 jobs. The sizes of the job requests took 58 values in the interval [1, 128], for an average of 23.34 and a coefficient of variation of 1.11; their density is presented in Fig. 1. The results comply with the distributions we use for the job-component sizes in that there is an obvious preference for small numbers and powers of two.

[Figure 1. The density of the job-request sizes for the largest DAS cluster (128 processors); number of jobs versus nodes requested, with powers of 2 and other numbers shown separately]

From the jobs considered, 28,426 were recorded in the log with both starting and ending times, so we could compute their service times. Because during working hours jobs are restricted to at most 15 minutes of service, 94.45% of the recorded jobs ran less than 15 minutes. Figure 2 presents the density of the service-time values on the DAS, as obtained from the log. The average service time is 356.45 seconds and the coefficient of variation is 5.37. Still, not all jobs in the log were short: the longest one took around 15 hours to complete.

[Figure 2. The density of the service times for the largest DAS cluster (128 processors); number of jobs versus service time in seconds]

2.2. The Structure of the System

We model a multicluster system consisting of $C$ clusters of processors, cluster $i$ having $N_i$ processors, $i = 1, \ldots, C$. We assume that all processors have the same service rate.

By a job we understand a parallel application requiring some number of processors, possibly in multiple clusters (co-allocation). Jobs are rigid, so the numbers of processors requested by and allocated to a job are fixed. We call a task the part of a job that runs on a single processor. We assume that jobs only request processors, and we do not include other types of resources in the model. The system has a single central scheduler with one global queue. For the interarrival times we use exponential distributions.

2.3. The Structure of Job Requests

Jobs that require co-allocation specify the number and the sizes of their components, i.e., of the sets of tasks that have to go to the separate clusters. The distribution of the sizes of the job components is $D(q)$, defined as follows: $D(q)$ takes values on some interval $[n_1, n_2]$ with $0 < n_1 \le n_2$, and the probability of having job-component size $i$ is $p_i = q^i/Q$ if $i$ is not a power of 2 and $p_i = 3q^i/Q$ if $i$ is a power of 2, with $Q$ such that the $p_i$ sum to 1. This distribution favours small sizes, and sizes that are powers of two, which has been found to be a realistic choice [5] (a sampling sketch follows the list below). A job is represented by a tuple of $C$ values, each of which is either generated from the distribution $D(q)$ or is of size zero. We consider four cases for the structure of job requests:

1. A flexible request specifies the total number of processors needed, obtained as the sum of the values in the tuple, letting the scheduler spread the tasks over the clusters.

2. For an unordered request, by the components of the tuple the job only specifies the numbers of processors it needs in the separate clusters, allowing the scheduler to choose the clusters.


3. In an ordered request the positions of the request components in the tuple specify the clusters from which the processors must be allocated.

4. For total requests, there is a single cluster and a request specifies the single number of processors needed, again obtained as the sum of the values in the tuple.
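To make the definition of $D(q)$ concrete, here is a minimal Python sketch (our illustration; the paper's simulations used CSIM, and the function names are hypothetical). We assume, consistent with the formula, that $i = 1 = 2^0$ counts as a power of two.

```python
import random

def dq_density(q, n1, n2):
    """Density of D(q) on [n1, n2]: p_i proportional to q**i,
    tripled when i is a power of two (we treat 1 = 2**0 as one)."""
    weight = {i: (3 if i & (i - 1) == 0 else 1) * q**i
              for i in range(n1, n2 + 1)}
    total = sum(weight.values())          # the normalizing constant Q
    return {i: w / total for i, w in weight.items()}

def sample_dq(q, n1, n2):
    """Draw one job-component size from D(q) by inverting the CDF."""
    r, acc = random.random(), 0.0
    for i, p in sorted(dq_density(q, n1, n2).items()):
        acc += p
        if r <= acc:
            return i
    return n2                             # guard against rounding
```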

2.4. Scheduling Decisions

To determine whether an unordered request fits, we try to schedule its components in decreasing order of their sizes on distinct clusters. Possible ways of placement include First Fit (FF; fix an order of the clusters and pick the first one on which a job component fits) and Worst Fit (WF; pick the cluster with the largest number of idle processors), both of which we use in our simulations. For a flexible request we first determine whether there are enough idle processors in the whole system to serve the job. If so, the job is placed on the system in an arbitrary way.

In all the simulations the First Come First Served (FCFS) policy is used.
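The fit test for unordered requests can be sketched as follows (again our own Python illustration with hypothetical names; the paper only gives the policy descriptions above):

```python
def place_unordered(components, idle, policy="WF"):
    """Try to place the components of an unordered request on distinct
    clusters, in decreasing order of size; idle[i] is the number of idle
    processors in cluster i. Returns a list of (size, cluster) pairs,
    or None if the request does not fit."""
    free, used, placement = list(idle), set(), []
    for size in sorted((c for c in components if c > 0), reverse=True):
        candidates = [i for i in range(len(free))
                      if i not in used and free[i] >= size]
        if not candidates:
            return None
        if policy == "FF":            # First Fit: fixed cluster order
            cluster = candidates[0]
        else:                         # Worst Fit: most idle processors
            cluster = max(candidates, key=lambda i: free[i])
        used.add(cluster)
        free[cluster] -= size
        placement.append((size, cluster))
    return placement
```

For a flexible request, the corresponding test reduces to checking that the total request size does not exceed the total number of idle processors.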

3. The Maximal Utilization

In the model described in Sect. 2, processors may be idle while there are waiting jobs because the job at the head of the queue does not fit. As a consequence, when $\rho_m$ is the maximal utilization, that is, the utilization such that the system is stable (unstable) at utilizations $\rho$ with $\rho < \rho_m$ ($\rho > \rho_m$), in general we have $\rho_m < 1$. We define the (average) capacity loss $l$ as the average fraction of the total number of processors that are idle at the maximal utilization, so $l = 1 - \rho_m$. Capacity loss may be due to the structure of job requests and to the job-component-size distribution (e.g., we expect it to be higher for ordered than for unordered jobs, and for larger component sizes). In this section we present an expression for the average maximal Multi-Programming Level (MPL, i.e., the number of jobs being served simultaneously) in multicluster systems with ordered requests, from which of course $\rho_m$ can be derived. We also deduce an approximation for the average maximal MPL in multiclusters with unordered requests and WF component placement, which we validate with simulations. In this section we assume that the service-time distribution is exponential.

3.1. Capacity loss with ordered requests

In this section we assume that requests are ordered. Let $F$ be the (multidimensional) job-size distribution, which (allowing components of size zero) is defined on the set

$$S_O = \left( \prod_{i=1}^{C} \{0, 1, \ldots, N_i\} \right) \setminus \{(0, 0, \ldots, 0)\}.$$

We have $F(n) = P(s \le n)$ for $n \in S_O$, where $\le$ denotes component-wise (in)equality. Let $f$ be the job-size density, so $f(n)$ is the probability of having a job of size $n \in S_O$. Denoting by $G^{(i)}$ the $i$-th convolution of a distribution $G$ with itself, $F^{(i)}(N)$ and $F^{(i)}(N) - F^{(i+1)}(N)$ are the probabilities that at least and exactly $i$ random jobs fit on the multicluster, respectively, where $N = (N_1, N_2, \ldots, N_C)$. When the job-component sizes are mutually independent, we have $F^{(i)}(N) = \prod_j F_j^{(i)}(N_j)$ for $i = 1, 2, \ldots$, with $F_j$ the distribution of the $j$-th components of jobs.

In our treatment of multiclusters with ordered requests below we follow [4]. There, a queueing model of a multiprocessor with $P$ processors and $B$ memory blocks is studied. The scheduling policy is First-Come-First-Loaded, in which a job is allowed from the head of the queue into the multiprogramming set when its memory requirements, taken from some discrete distribution $F$ on $[1, B]$, can be satisfied. When the number of jobs does not exceed $P$, every job gets a processor to itself; otherwise processor sharing is employed. The service-time distribution (on a complete processor) is exponential. When $P \ge B$, and so every job has a processor of its own, this model coincides with our single-cluster model, with memory blocks assuming the role of processors. Under the assumption that there is always a sufficiently long queue, the Markov chain $V$ with state space $(z_1, \ldots, z_B)$, where the $z_i$'s are the memory sizes of the oldest $B$ jobs in the system, and the MPL, both immediately after a departure and the entailing job loadings, are studied. It turns out that the behaviour of $V$ is as if FIFO is used, and, by solving the balance equations, that the associated probabilities are as if the $z_i$ are independently drawn from $F$. In addition, the time-average maximal MPL is derived in terms of convolutions of $F$.

In our multicluster model, we also consider the sequence of the oldest jobs in the system such that it includes at least all jobs in service. Let $B$ be some upper bound on the number of jobs that can be simultaneously in service ($\sum_i N_i$ will certainly do). Let $Z = (z_1, z_2, \ldots, z_B)$ be the processor state vector, which is the (left-to-right ordered) sequence of the sizes of the oldest jobs in the system. Some first part of $Z$ describes the jobs in service, and the remainder the jobs at the head of the queue. When a job leaves, the new processor state vector is obtained by omitting the corresponding element from the current vector, shifting the rest one step to the left, and adding a new element at the end. Let $V$ be the set of processor state vectors.

Because the service-time distribution is exponential, for $v, w \in V$, the transition of the system from state $v$ to state $w$ only depends on $v$: each of the jobs in service has equal probability of completing first, and the job at the head of the queue to be added to the state is random. So in fact, $V$ is a Markov chain. The result of [4] explained above can be extended in a straightforward way to our situation: the important element is that the distribution $F$ simply determines which sets of jobs can constitute the multiprogramming set, but the underlying structure of a single or of multiple resources does not matter. So also now, the stationary probability distribution $\pi$ on $V$ satisfies

$$\pi(Z) = \prod_{i=1}^{B} f(z_i), \qquad (1)$$

which means that the distribution of the oldest $B$ jobs in the system is as if they are independently drawn from $F$. So, because the average length of the intervals with $i$ jobs in service is inversely proportional to $i$ due to the exponential service, we find for the average maximal MPL $M$:

$$M = \frac{\sum_{i=1}^{B} \left( F^{(i)}(N) - F^{(i+1)}(N) \right) \cdot (1/i) \cdot i}{\sum_{i=1}^{B} \left( F^{(i)}(N) - F^{(i+1)}(N) \right) \cdot (1/i)},$$

which can be written as

$$M = \frac{1}{1 - \sum_{i=2}^{B} F^{(i)}(N) / (i(i-1))}. \qquad (2)$$

For single clusters this expression coincides with the formula in [4], p. 468. Denoting by $s$ the average total job size, the maximal utilization is given by

$$\rho_m = \frac{M \cdot s}{\sum_i N_i}. \qquad (3)$$
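Equations (2) and (3) are straightforward to evaluate numerically. The sketch below (our illustration, not the authors' code; it assumes mutually independent, identically distributed component sizes) builds the $i$-fold convolutions of the component-size density and accumulates the sum in Eq. (2):

```python
import numpy as np

def max_mpl_ordered(component_density, cluster_sizes):
    """Average maximal MPL for ordered requests, Eq. (2), with
    F^(i)(N) = prod_j F_j^(i)(N_j); component_density[k] = P(size = k)."""
    B = sum(cluster_sizes)                  # upper bound on the MPL
    conv = np.array([1.0])                  # 0-fold convolution: mass at 0
    acc = 0.0
    for i in range(1, B + 1):
        conv = np.convolve(conv, component_density)  # i-fold convolution
        F_i = 1.0
        for N_j in cluster_sizes:
            F_i *= conv[:N_j + 1].sum()     # P(i components fit in cluster j)
        if i >= 2:
            acc += F_i / (i * (i - 1))
        if F_i == 0.0:                      # no larger job set can ever fit
            break
    return 1.0 / (1.0 - acc)

# Maximal utilization, Eq. (3), e.g. for U[1, 4] components and 4 clusters of 32:
density = np.array([0.0, 0.25, 0.25, 0.25, 0.25])
M = max_mpl_ordered(density, [32, 32, 32, 32])
rho_m = M * (4 * 2.5) / 128                 # average total job size s = 4 * 2.5
# capacity loss l = 1 - rho_m (cf. the ordered column of Table 2)
```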

In Sect. 3.2 we will validate an approximation to the maximal utilization for unordered requests with simulations, by taking the utilization in these simulations that yields an average response time of at least 1,500 time units (the average service time is 1 time unit; for more details on the simulations, see Sect. 4). We have validated this approach in turn by running simulations for single clusters and for multiclusters with ordered requests, for which we have the exact solution based on Eq. (3). In Tables 1 and 2 we show both exact and simulation results for a single cluster and for multicluster systems with ordered and flexible requests (and with unordered requests in Table 2, see Sect. 3.2), for different distributions of the job-component sizes (which are mutually independent). The exact and simulation results for the single cluster agree extremely well. The same is true for the simulations we did for the 4-cluster system (results not shown here).

3.2. Capacity loss with unordered requests

We now derive an approximation to the maximal utilization in multiclusters with unordered requests in case all clusters are of equal size $N$. For job placement, WF is used.

Table 1. The capacity loss in a single cluster (C = 1) of size 32 and in a multicluster with 4 clusters (C = 4) of size 32, for ordered and flexible requests, with job-component-size distribution D(q) on [1, 32]

  job comp.       capacity loss, C = 1       capacity loss, C = 4
  size distr. q   exact     simulation       ordered (exact)   flexible (exact)
  0.95            0.293     0.295            0.564             0.226
  0.90            0.249     0.251            0.555             0.157
  0.85            0.188     0.188            0.494             0.112
  0.80            0.134     0.135            0.416             0.085
  0.75            0.097     0.097            0.348             0.067
  0.70            0.073     0.074            0.296             0.056
  0.65            0.057     0.058            0.257             0.048
  0.60            0.046     0.047            0.227             0.042
  0.55            0.038     0.039            0.203             0.038
  0.50            0.032     0.032            0.182             0.034

The job-size distribution $F$ is now defined on

$$S_U = \{ (s_1, s_2, \ldots, s_C) \mid 1 \le s_1 \le N,\ s_{i+1} \le s_i,\ i = 1, 2, \ldots, C-1 \},$$

that is, job requests are represented by a vector with non-increasing components. Compared to the case of ordered requests, the case of unordered requests presents some difficulty for two reasons. First, given a set of unordered jobs, it is not possible to say whether they simultaneously fit on a multicluster, because that depends also on the order of arrival of the jobs. For instance, if $C = 2$ and $N = 5$, then if three jobs of sizes $(2,1), (3,1), (2,1)$ arrive in this order, they can all be accommodated, while if they arrive in the order $(2,1), (2,1), (3,1)$, only the first two jobs can run simultaneously. Second, a departure can leave the system in a state that cannot occur when it is simply filled with jobs from the empty state. If with the first sequence of arrivals above the job of size $(3,1)$ leaves, both $(2,1)$-jobs will have their largest component in cluster 1. Our result below deals with the first problem (by running over all possible arrival sequences) but not with the second, which is why it is an approximation.

Table 2. The capacity loss in a single cluster (C = 1) of size 32 and in a multicluster with 4 clusters (C = 4) of size 32, for ordered, unordered, and flexible requests, with job-component-size distribution U[n1, n2]

  job comp.    capacity loss, C = 1    capacity loss, C = 4
  n1   n2      exact     simulation    ordered (exact)   unordered (approx.)   unordered (simulation)   flexible (exact)
  1    4       0.032     0.033         0.149             0.050                 0.053                    0.038
  1    5       0.043     0.044         0.176             0.065                 0.067                    0.047
  1    13      0.139     0.139         0.345             0.187                 0.192                    0.120
  1    16      0.169     0.169         0.380             0.233                 0.239                    0.148
  4    5       0.051     0.052         0.111             0.043                 0.048                    0.043
  4    13      0.145     0.145         0.302             0.186                 0.188                    0.149
  4    16      0.174     0.175         0.337             0.250                 0.255                    0.167
  5    13      0.149     0.150         0.292             0.170                 0.175                    0.146
  5    16      0.177     0.178         0.321             0.260                 0.260                    0.186
  13   16      0.094     0.095         0.094             0.094                 0.094                    0.094

We now define the WF-convolution $F \star G$ of two distributions $F, G$ on $S_U$ in the following way. Let for any $C$-vector $s = (s_1, s_2, \ldots, s_C)$ the reversed vector $\mathrm{rev}(s)$ be defined as $\mathrm{rev}(s) = (s_C, s_{C-1}, \ldots, s_1)$, and the ordered vector $\mathrm{ord}(s)$ as the vector with the elements of $s$ permuted such that they form a non-increasing sequence. Now if $f, g$ are the densities of $F$ and $G$, respectively, $F \star G$ has density $h$ defined by

$$h(s) = \sum_{t, u \in S_U,\ s = \mathrm{ord}(t + \mathrm{rev}(u))} f(t) \cdot g(u).$$

Then, putting $F^{[2]} = F \star F$, we define inductively $F^{[i]} = F^{[i-1]} \star F$ for $i = 3, 4, \ldots$. What this amounts to is that $F^{[i]}$ is the distribution of the non-increasingly ordered numbers of processors in the clusters of the multicluster that are in use when $i$ unordered requests are put on an initially empty system with WF. Now our approximation of the maximal average MPL $M$ is (cf. Eq. (2)):

$$M = \frac{1}{1 - \sum_{i=2}^{B} F^{[i]}(N) / (i(i-1))}, \qquad (4)$$

where again $B$ is some upper bound on the number of jobs that can simultaneously be served, from which an approximation of the maximal utilization (or the capacity loss) can be derived with Eq. (3). In addition, we have again performed simulations along the lines presented in Sect. 3.1 to find the maximal utilization. The results, which are in Table 2, show that the approximation is very accurate. Also, the results for multiclusters in Table 2 show the reduction in capacity loss when going from ordered to unordered to flexible requests.
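The WF-convolution and Eq. (4) translate almost directly into code. Below is a minimal sketch (ours, with hypothetical names; densities are dictionaries mapping non-increasing C-tuples of component sizes to probabilities):

```python
from collections import defaultdict

def wf_convolve(f, g, N):
    """Density of the WF-convolution on S_U: reverse one summand before
    adding, so the largest new component lands on the least-used cluster
    (Worst Fit), then re-sort. Mass whose largest component exceeds N is
    dropped: such job sets no longer fit, and usage never shrinks when
    further jobs are added, so it cannot contribute to F^[i](N)."""
    h = defaultdict(float)
    for t, pt in f.items():
        for u, pu in g.items():
            s = tuple(sorted((a + b for a, b in zip(t, reversed(u))),
                             reverse=True))
            if s[0] <= N:
                h[s] += pt * pu
    return dict(h)

def max_mpl_unordered(job_density, C, N):
    """Approximate maximal MPL for unordered requests, Eq. (4), for C
    equal clusters of size N."""
    acc = 0.0
    usage = dict(job_density)            # usage after one job
    for i in range(2, C * N + 1):        # B = C * N certainly suffices
        usage = wf_convolve(usage, job_density, N)
        acc += sum(usage.values()) / (i * (i - 1))   # F^[i](N)
        if not usage:
            break
    return 1.0 / (1.0 - acc)
```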

Extending the results in this section to unequal cluster sizes is cumbersome, because then adding an unordered request to a multicluster depends on the numbers of idle rather than used processors. So one would have to replace the WF-convolution by a sort of convolution that depends on the cluster sizes.

4. Simulating Co-Allocation

To measure the performance of multicluster systems such as the DAS for different structures and sizes of requests, and for different numbers and sizes of clusters, we modeled the corresponding queueing systems and studied their behaviour using simulations. The simulation programs were implemented using the CSIM simulation package [8]. In all the simulations the mean service time is 1. In all the graphs in this section we include confidence intervals; they are at the 95% level. In all cases where the service times are exponential we can find (an approximation to) the maximal utilization based on our results in Sect. 3; we show these only for a limited number of cases.

In this section various multicluster systems with a total of 128 nodes are considered; job-component sizes are obtained from the distribution D(0.9) on [1, 8].
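Our simulations were written in CSIM [8]; purely for illustration, the sketch below shows the skeleton of such a simulation in Python (hypothetical names; the fits argument is a placement test such as the place_unordered sketch in Sect. 2.4): a single global FCFS queue, exponential interarrivals, and unit-mean exponential service times.

```python
import heapq, random

def simulate(cluster_sizes, arrival_rate, make_job, fits, n_jobs=100_000):
    """Returns the average response time over n_jobs completed jobs."""
    idle = list(cluster_sizes)
    queue = []                    # FCFS queue of (components, arrival time)
    events = [(random.expovariate(arrival_rate), "arrival", None)]
    total_response, done = 0.0, 0
    while done < n_jobs:
        t, kind, placement = heapq.heappop(events)
        if kind == "arrival":
            queue.append((make_job(), t))
            heapq.heappush(events,
                           (t + random.expovariate(arrival_rate),
                            "arrival", None))
        else:                     # departure: release the processors
            for size, cluster in placement:
                idle[cluster] += size
        # FCFS: only the job at the head of the queue may start
        while queue:
            components, arrived = queue[0]
            placement = fits(components, idle)
            if placement is None:
                break             # head does not fit; everyone waits
            queue.pop(0)
            for size, cluster in placement:
                idle[cluster] -= size
            service = random.expovariate(1.0)   # mean service time 1
            heapq.heappush(events, (t + service, "departure", placement))
            total_response += (t - arrived) + service
            done += 1
    return total_response / done
```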

4.1. Equal versus Different Cluster Sizes

A question we try to answer is whether a system with equal clusters provides better performance than a system where the nodes are unevenly divided among the clusters, for equal total system sizes. For this we look at two systems with 4 clusters and a total of 128 processors: a system with equal clusters of size 32, and one with uneven clusters of sizes 64, 32, 16, 16. (FF uses the clusters in this order; see Sect. 4.2 for the reverse order.) Simulations were performed for ordered and unordered requests, and for total requests in a single 64-processor cluster, to see whether only using the largest cluster in the uneven system can give better performance than co-allocation. For unordered requests both FF and WF were simulated; since the results were similar, only those for FF are shown.


[Figure 3. The performance of ordered and unordered requests in a multicluster with equal clusters (32,32,32,32) and in a multicluster with unequal clusters (64,32,16,16); average response time versus utilization, also showing total requests in a single 64-processor cluster]

Although the effect is less pronounced for unordered requests, for which the scheduler can deal well with unequal cluster sizes, Fig. 3 shows that for both ordered and unordered requests the system with equal clusters provides much better performance. This can be explained by the fact that both types of requests need nodes from all 4 clusters (the number of components is equal to the number of clusters), and once the smallest clusters get filled to the extent that they cannot accommodate components of a new job, the idle processors in the big clusters cannot be used either. Since the small clusters act as a bottleneck for the utilization, we can expect the performance to deteriorate as the difference between the clusters' capacities increases. For ordered requests in uneven clusters the performance is even worse than for total requests in a system with half the capacity (64 processors; the utilization on the horizontal axis is then still relative to the whole 128-processor system), which means that we might as well ignore the smaller clusters and only use the largest one.

4.2. Comparing Systems with Different Numbers of Clusters

For the same total size of its clusters, we can expect the performance of a system to vary with the number of clusters. We study this variation for systems with 128 nodes and at least 4 clusters, in the presence of unordered requests consisting of 4 components, which the scheduler can now put on any 4 different clusters. We compare the results for FF and WF.

Figure 4 compares two system configurations with clusters of equal sizes: a system with 4 clusters of 32 processors, and a system with 8 clusters of 16 nodes. Although in the second case the clusters are much smaller, the results show that the performance is similar at low utilizations and even slightly better at moderate and high utilizations, which suggests that the performance can be better when the number of clusters is larger than the number of request components. This can be explained by the fact that FF, which we use here, first fills up the first cluster with the largest job components. Then, when such a component does not fit in the first cluster anymore, FF in the 4-cluster case is still forced to use the first cluster for a job component. However, in the 8-cluster case FF may skip the first cluster(s) completely for a whole job once in a while, but will always try to put small job components on them. Apparently, this results in better packing when the number of clusters is larger, even if those clusters are smaller.

[Figure 4. The influence of the number and sizes of clusters for equal clusters and unordered requests, with FF component placement; average response time versus utilization for (16,16,16,16,16,16,16,16) and (32,32,32,32)]

Figure 5 shows the performance of systems with different numbers of clusters of uneven sizes; the difference between the graphs is the order in which the clusters are used by the scheduler. Taking as a reference the case (32,32,32,32) in the two graphs, we notice that performance is better when the scheduler starts with the largest clusters. For a system where the user can express a preference for certain clusters, this translates into the fact that preferring the larger clusters is a good option and improves both the response time and the maximal utilization. From Figure 5 (and Figure 4) we conclude that the performance is only good in systems with clusters of equal size and, for both orders of using the clusters with FF, in systems with clusters of unequal sizes whose numbers of clusters are considerably larger than the number of job components.

Figures 6 and 7 show the results for the same configurations when WF is used for placing the jobs. We conclude that systems with large equal clusters give much better performance. For systems with unequal clusters, we should still prefer to have many (although relatively small) clusters, since the performance will be better than for few, large clusters.

[Figure 5. The influence of the number and sizes of clusters for unordered requests; component placement uses FF and starts with the large clusters (top: (32,32,16,16,16,16), (32,32,16,16,8,8,8,8), (32,32,32,32), (64,16,16,16,16), (64,32,16,16)) or with the small clusters (bottom: (32,32,32,32), (16,16,16,16,32,32), (8,8,8,8,16,16,32,32), (16,16,16,16,64), (16,16,32,64)); average response time versus utilization]

Comparing WF with FF (large-small order), we notice that WF is much better for systems with few, equal, large clusters. Using WF instead of FF also slightly improves the performance for systems with unequal clusters when the number of clusters is relatively small compared to the number of job components. However, comparing Fig. 6 with Fig. 4 we can see that for systems with small, equal clusters and with numbers of clusters considerably larger than the number of job components, WF performs significantly worse than FF. This also holds for systems with unequal clusters, as Fig. 8 shows. Summing up, FF should be chosen as the placement policy for systems with many but relatively small clusters, while WF is to be preferred for multiclusters with few, large clusters.

[Figure 6. The influence of the number and sizes of clusters for equal clusters and unordered requests, with WF component placement; average response time versus utilization for (16,16,16,16,16,16,16,16) and (32,32,32,32)]

[Figure 7. The influence of the number and sizes of clusters for unordered requests; component placement uses WF; average response time versus utilization for (32,32,32,32), (32,32,16,16,16,16), (32,32,16,16,8,8,8,8), (64,16,16,16,16), and (64,32,16,16)]

5. Co-Allocation versus No Co-Allocation

In this section we study workloads consisting of jobs with different numbers of components. We consider a multicluster system with 4 clusters of 32 processors each and evaluate its performance for jobs having between 1 and 4 components. Mixes of jobs are also considered. This corresponds to a system in which co-allocation is combined with jobs requiring processors from a single cluster, but both single-cluster and multicluster jobs are submitted to the same central scheduler. We model unordered requests and two ways of generating components: either each component is obtained from the distribution D(0.9), which means that jobs with fewer components are in general smaller, or jobs have the same total request size and their components are obtained as a sum of values from D(0.9). In this second case a job with fewer components has in general larger components.

5.1. Jobs with Components from the Same Distribution

[Figure 8. Comparison between FF and WF for placing unordered requests in systems with many, unequal clusters; average response time versus utilization for (32,32,16,16,8,8,8,8) with FF and with WF]

Figure 9 compares the performance of the system for jobs with 1, 2, 3 and 4 components, and for mixes of such jobs in different percentages. Each component is obtained from the distribution D(0.9) and jobs are placed using FF. The performance is better for jobs with fewer components, since they are smaller and also allow more ways of placement. For mixes the performance lies somewhere between those of the participants in the mix, depending on the percentage of each of them. Figure 10 shows the performance for jobs with 1 and 4 components and for mixes of the two when WF is used. Compared to FF, WF brings visible improvements for jobs with 4 components and for mixes with high percentages of such jobs, but has little effect on jobs with 1 component or on the mix with 75% single-component jobs.

[Figure 9. Different numbers of components, the same job-component size; FF placement; average response time versus utilization for workloads of 100% 1-, 2-, 3-, and 4-component jobs, and for mixes such as 50%/50% of 1- and 2-component jobs, 50%/50% and 75%/25% of 1- and 4-component jobs, and 50/25/15/10% and 25/25/25/25% of 1/2/3/4-component jobs]

[Figure 10. Different numbers of components, the same job-component size; WF placement; average response time versus utilization for 100% 1-component jobs, 100% 4-component jobs, and 75%/25% and 50%/50% mixes of 1- and 4-component jobs]

5.2. Jobs with the Same Total Request Size

In this section each job has the same total request size. Figure 11 compares jobs with 1, 2 and 4 components and mixes of such jobs for the FF policy. For jobs with 2 components, each component is the sum of two values from D(0.9). The worst performance is displayed by jobs with one component, since that component is large compared with the clusters' sizes. Significantly better performance is obtained for jobs with four components, although they have the disadvantage of placing a component in each cluster. For this reason, the best performance is shown by systems hosting jobs with two components, since these components are still relatively small and need only two of the clusters (more possibilities for placement). For mixes of 1- and 4-component jobs the performance of the system is even worse than for jobs with 1 component, since the requirements of jobs with 1 and 4 components are opposed to each other: jobs with 4 components need to place one component in each cluster, while jobs with 1 component need enough room in one cluster to place their large component and tend to fill up individual clusters. When the mix also contains jobs with 2 components the performance of the system improves, since they are easier to fit. Figure 12 shows the results for systems with jobs consisting of 1 component, 4 components, and mixes of the two when WF is used. WF improves the performance for jobs with 4 components compared to FF, but leaves the results almost unchanged for jobs with 1 component and for mixes where such jobs predominate.

[Figure 11. Different numbers of components, the same total request size; FF placement; average response time versus utilization for 100% 1-, 2-, and 4-component jobs, and for mixes such as 50/30/20% and 20/20/60% of 1/2/4-component jobs and 75%/25% and 50%/50% of 1- and 4-component jobs]

[Figure 12. Different numbers of components, the same total request size; WF placement; average response time versus utilization for 100% 1- and 4-component jobs and for 75%/25% and 50%/50% mixes of 1- and 4-component jobs]

6. Conclusions

In this paper we have studied the performance of processor co-allocation in multicluster systems with space sharing for rigid multi-component jobs of different request types that impose various restrictions on their placement. Our main conclusions are as follows:

- Multiclusters with equal clusters provide better performance. For the same number of clusters and total number of processors, (un)ordered request types fit better in systems with equal clusters.

- In general, a system with few, large clusters should be preferred. For the same total number of processors and unordered requests, we can obtain better performance if the system is divided into only a small number of clusters, especially if jobs are placed using WF.

- For systems with few, large clusters WF should be used, while for systems with many clusters FF gives better performance. If a system has a small number of clusters compared to the number of job components, the WF policy should be chosen for placing the jobs on the clusters; conversely, for systems with many clusters FF is the better choice.

- For (un)ordered jobs the performance is better when either the cluster sizes are equal, or the number of clusters is large relative to the number of job components.

- Large jobs or jobs with large components yield lower performance. The performance is better when the jobs submitted to the system are small. For jobs with large (even if few) components compared to the cluster sizes, the performance is worse than for jobs with more but smaller components. Spreading large single-component jobs over more clusters (co-allocation) can give better results.

- Jobs with many components yield poor performance. Even when the cost of communication is not considered, the performance is lower for jobs with numbers of components close to the number of clusters in the system.

References

[1] The Distributed ASCI Supercomputer (DAS) site. http://www.cs.vu.nl/das.

[2] The Global Grid Forum. http://www.gridforum.org.

[3] K. Aida, H. Kasahara, and S. Narita. Job Scheduling Scheme for Pure Space Sharing Among Rigid Jobs. In D. Feitelson and L. Rudolph, editors, 4th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1459 of LNCS, pages 98-121. Springer-Verlag, 1998.

[4] R. Bryant. Maximum Processing Rates of Memory Bound Systems. J. of the ACM, 29:461-477, 1982.

[5] S.-H. Chiang and M. Vernon. Characteristics of a Large Shared Memory Production Workload. In D. Feitelson and L. Rudolph, editors, 7th Workshop on Job Scheduling Strategies for Parallel Processing, volume 2221 of LNCS, pages 159-187. Springer-Verlag, 2001.

[6] H. Bal et al. The Distributed ASCI Supercomputer Project. ACM Operating Systems Review, 34(4):76-96, 2000.

[7] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.

[8] Mesquite Software, Inc. The CSIM18 Simulation Engine, User's Guide.

[9] A. Plaat, H. Bal, R. Hofman, and T. Kielmann. Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects. Future Generation Computer Systems, 17:769-782, 2001.
