


Computer Physics Communications 179 (2008) 479–485


Task mapping on supercomputers with cellular networks

Yongzhi Chen, Yuefan Deng *

Department of Applied Mathematics and Statistics, Stony Brook University, USA


Article history: Received 16 December 2007; Received in revised form 31 March 2008; Accepted 11 April 2008; Available online 18 April 2008.

Keywords: Mapping model; Optimization; Differential equations; BlueGene

Several models are developed to map reasonably arbitrary application problems to computing platforms for high performance by minimizing communication and balancing computation. In our models, we assume that the underlying application is appropriately decomposed into subtasks with known computational loads and inter-subtask communication demands, and that the computing system's specifications, such as individual processor speeds and inter-processor communication costs, are given or easily measurable. The model therefore abstracts the application as a demand matrix and the computer as a network supply matrix, with the objective of minimizing the time to complete the application on the given computer. An application, the 2D wave equation, was introduced to test our models on the BG/L supercomputer. The mappings generated by our models reduced communication by 51% for the 3D-mesh and 31% for the 3D-torus over the default MPI rank order mapping.

Published by Elsevier B.V.

1. Introduction

The high expectations of parallel computers are difficult to fulfill due to algorithmic complexities rarely appearing in sequential applications, such as synchronization, deadlock avoidance, load balance, communication overlap, and network congestion and collision [1]. Tools that help exploit parallel computers fully by reducing such complexities are in great demand. Parallel compilers that assist with decomposition, mapping, and scheduling are one such tool [2]. As an important aspect of parallel compilers, mapping has long been studied.

Mapping is defined in many similar ways in the parallel computing literature [3]. For example, the procedure of mapping each of the interacting subtasks to an individual processor so as to minimize the total execution time can be considered the mapping problem [4,5]. More elaborate mappings involve both task assignment and scheduling. Talbi's definition [6] is a popular one, in which the application and the parallel system are represented as static graphs G_A and G_P, respectively. G_A = (V_A, E_A) is a graph where V_A represents subtasks and E_A measures the communication requirements among them. G_P = (V_P, E_P) is an undirected graph where V_P represents processors and E_P represents the links among processors, with weights giving the per-unit communication cost. The mapping problem can be defined as finding a mapping V_A → V_P that minimizes the objective function value associated with the mapping.
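To make this concrete, here is a minimal sketch (ours, not from the paper) that evaluates the communication cost of a candidate mapping, with the edge weights of G_A and G_P given as a demand matrix and a supply matrix:

```python
# Minimal sketch (not from the paper): communication cost of a candidate
# mapping V_A -> V_P, with graphs G_A and G_P given as weight matrices.
import numpy as np

def mapping_cost(demand, supply, assign):
    """demand[t, u]: data sent from subtask t to u (edge weights of G_A).
    supply[p, q]: per-unit communication cost between processors (G_P).
    assign[t]: processor hosting subtask t."""
    n = demand.shape[0]
    cost = 0.0
    for t in range(n):
        for u in range(n):
            cost += demand[t, u] * supply[assign[t], assign[u]]
    return cost
```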

* Corresponding author. E-mail address: [email protected] (Y. Deng).


The mapping problem is known to be NP-complete [7]. In static mapping, the communication pattern and the execution time of each subtask on each processor are static and known prior to program execution. Thus the task can be represented by a static graph, either a task precedence graph (TPG) or a task interaction graph (TIG).

Task mapping models are highly dependent on both the application and the machine architecture. Under various assumptions on problems and architectures, many approaches have been developed [8,9]. Many early papers focused on theoretical analysis for small computer systems, while recent research has targeted large systems, such as the IBM BG/L, for more realistic applications. The popular strategy has been to minimize the total point-to-point communication, with per-unit cost measured in hops or a variant thereof. IBM recently published its best timing improvements without the details of the mappings [8].

In this paper, we adopt Talbi's definition of mapping and consider the TIG case, since we assume that subtasks can be executed independently and simultaneously. An application, the 2D wave equation, was tested on several configurations of the BG/L supercomputer. Both theoretical and experimental analyses are given. A better communication cost measure is introduced, and typical near-optimal mappings are illustrated.

2. Parallel computing systems

The BlueGene/L supercomputer (BG/L) is a massively parallel computer with two relevant communication networks: a nearest-neighbor network with the topology of a 3D-torus, and a global tree. The torus network is the primary medium for point-to-point and many collective communications, while the tree network serves some other collective communications.


Fig. 1. MPI latency (μs) of a 0-byte packet on an 8 × 8 × 16 BG/L architecture. (For colors, see the web version of this article.)

Each computational node has six torus links connected to its six nearest neighbors in the ±x-, ±y-, and ±z-directions [10]. All communications between nodes must be routed over the available physical connections, and the cost of communication between nodes varies with their locations. The measured communication times for sending a 0-byte message from each of the 1024 nodes to every other node form the matrix shown in Fig. 1. The diagonal elements are 0, because self-communication costs nothing, and the matrix is symmetric.
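Such a latency matrix can be gathered with a simple ping-pong benchmark. The sketch below is a hypothetical harness in mpi4py (not the authors' measurement code) timing 0-byte round trips from rank 0; repeating it with every rank as the source fills the full matrix of Fig. 1.

```python
# Hypothetical ping-pong harness (not the authors' benchmark): measures the
# 0-byte MPI latency from rank 0 to every other rank.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
REPS = 1000
buf = bytearray(0)                    # 0-byte payload

for peer in range(1, size):
    comm.Barrier()                    # synchronize before each pair's test
    if rank == 0:
        t0 = MPI.Wtime()
        for _ in range(REPS):
            comm.Send(buf, dest=peer)
            comm.Recv(buf, source=peer)
        half_rtt = (MPI.Wtime() - t0) / (2 * REPS)
        print(f"0 -> {peer}: {half_rtt * 1e6:.2f} us")
    elif rank == peer:
        for _ in range(REPS):
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
```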

All data in this paper were collected from the BG/L systems at Argonne and Brookhaven National Laboratories. A series of predefined partitions, 32-node (4 × 4 × 2), 64-node (8 × 4 × 2), 128-node (8 × 4 × 4), 512-node (8 × 8 × 8), and 1024-node (8 × 8 × 16), were used to test the models at different application sizes and structures. The partitions with fewer than 512 nodes were connected as 3D-meshes, while the 512-node and 1024-node partitions were 3D-tori.

3. Mapping models

Two static models are proposed for mapping applications to heterogeneous systems. Although our tests were performed on BG/L with the wave equation, these models generalize to other platforms and applications.

3.1. Assumptions and notations

Let m be the number of processors and n the number of subtasks. The computational cost is quantified by a matrix L_{n×m} whose entry L(t, p) denotes the time for executing subtask t on processor p. For convenience, the application is assumed to be decomposable into n subtasks of equal computing load. The inter-subtask communication is described by the demand matrix D_{n×n} whose entry D(t, t′) denotes the amount of data transferred from subtask t to t′; D(t, t′) is determined by the underlying communication pattern of the application and is assumed to be invariant throughout the solution process. The communication cost is expressed as a supply matrix S_{m×m} whose entry S(p, p′) denotes the cost of transferring a unit of data from processor p to p′. These costs depend on the processors' locations, available buffers, network congestion, and other conditions. We assume that these matrices are known from system characteristics, by standard testing, or by simple analysis.
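As an illustration of the demand matrix for the decomposition used later in the paper, the sketch below (our construction; the per-edge volumes are the halo sizes of a five-point stencil, in grid points per iteration) builds D for an N_X × N_Y subtask grid with periodic neighbors:

```python
# Illustrative construction (our sketch) of the demand matrix D for the 2D
# wave equation decomposed into an NX x NY grid of subtasks with periodic
# boundaries: each subtask exchanges one halo edge with each of its four
# neighbors per iteration.
import numpy as np

def demand_matrix(NX, NY, xsize, ysize):
    n = NX * NY
    D = np.zeros((n, n))
    tid = lambda i, j: (i % NX) * NY + (j % NY)   # subtask id with wraparound
    for i in range(NX):
        for j in range(NY):
            t = tid(i, j)
            D[t, tid(i + 1, j)] += ysize // NY    # east halo edge
            D[t, tid(i - 1, j)] += ysize // NY    # west halo edge
            D[t, tid(i, j + 1)] += xsize // NX    # north halo edge
            D[t, tid(i, j - 1)] += xsize // NX    # south halo edge
    return D
```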

Our models use a latency matrix to quantify the communication cost, instead of the hop matrix used in earlier models [8,9], to reduce the inaccuracy of the hop measure. A linear regression of latency against hops, illustrated in Fig. 2, shows many outliers (red dots) that could mislead the optimization. Most of them result from the torus in the Z-dimension of the BG/L system. For each hop count there are many different latency values, corresponding to different cases such as relative node locations. The hop measure fails to distinguish these multiple states, while the latency measure handles such subtleties more reliably.
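The fit of Fig. 2 and the outlier screening can be reproduced in outline as follows (a sketch only: the flat arrays of hop counts and latencies over node pairs, and the 3σ residual threshold, are our assumptions):

```python
# Sketch (assumed data layout) of the regression T_L = a*Hop + b and the
# flagging of outlier node pairs whose latency the hop count mispredicts.
import numpy as np

def fit_and_flag(hops, latency, k=3.0):
    """hops, latency: 1D arrays over node pairs; flags |residual| > k*sigma."""
    a, b = np.polyfit(hops, latency, 1)
    resid = latency - (a * hops + b)
    outliers = np.abs(resid) > k * resid.std()
    return a, b, outliers
```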

Let {t_i} be the set of subtasks and {p_i} the set of processors to which the subtasks will be assigned. We assume n ≥ m.

3.2. Basic model

We introduce two Boolean decision variables, x_tp and y_tt′pp′:

$$x_{tp} = \begin{cases} 1, & \text{if subtask } t \text{ is assigned to processor } p;\\ 0, & \text{otherwise}; \end{cases}$$

$$y_{tt'pp'} = \begin{cases} 1, & \text{if subtasks } t \text{ and } t' \text{ are assigned to processors } p \text{ and } p', \text{ respectively};\\ 0, & \text{otherwise}. \end{cases}$$

The mapping problem is formulated as a 0–1 programming problem:

$$\min \left\{ \sum_{t=1}^{n} \sum_{p=1}^{m} L(t,p)\, x_{tp} + \sum_{t=1}^{n} \sum_{t'=1}^{n} D(t,t') \left( \sum_{p=1}^{m} \sum_{p'=1}^{m} S(p,p')\, y_{tt'pp'} \right) \right\} \tag{1}$$

subject to:

$$\begin{cases} \sum_{p=1}^{m} x_{tp} = 1 & (t = 1,\dots,n),\\ \sum_{t=1}^{n} x_{tp} \geq 1 & (p = 1,\dots,m),\\ \sum_{t=1}^{n} x_{tp} \leq \left\lceil \frac{A_p}{T} \times n \right\rceil & (p = 1,\dots,m),\\ x_{tp} + x_{t'p'} \leq 1 + y_{tt'pp'} & (t,t' = 1,\dots,n,\ t < t';\ p,p' = 1,\dots,m). \end{cases}$$

Fig. 2. Linear regression for T_L = a · Hop + b (95% confidence level).

Achieving minimal execution time with this model requires balancing the computational load and minimizing the inter-processor communication. The first term represents the total computation time and the second the inter-task communication time. The constraints require each subtask to be assigned to exactly one processor, and each processor to be assigned at least one subtask, to maintain load balance. A_p measures the computing speed of the pth processor and T the total speed of the given processor set. We set both lower and upper bounds on the computing load of each processor to achieve load balance. The mapping problem is thus formulated as an integer linear programming (ILP) problem, so an ILP solver can handle instances of moderate size. Both terms need evaluation only if the system is heterogeneous in both computation and communication. For mapping equal subtasks onto the BG/L system, only the second term affects the optimization, since the first term, the computing time, is invariant during optimization. This was demonstrated by experiments on BG/L with the applications SAGE and UMT2000 [8].
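For small instances, the 0–1 program (1) can be handed directly to an off-the-shelf ILP solver. The sketch below is our illustration using the PuLP library (not the solver used in the paper); it drops the computation term and the A_p/T load bound, which for equal subtasks on a homogeneous machine such as BG/L do not affect the optimum. With n²m² y-variables, only tiny instances are tractable this way, which motivates the heuristics discussed below.

```python
# Sketch of the 0-1 program (1) in PuLP (our choice of library, not the
# paper's solver). The computation term and the A_p/T load bound are omitted:
# for equal subtasks on a homogeneous machine they do not affect the optimum.
import pulp

def solve_mapping(D, S):
    n, m = len(D), len(S)
    prob = pulp.LpProblem("task_mapping", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (range(n), range(m)), cat="Binary")
    y = pulp.LpVariable.dicts("y", (range(n), range(n), range(m), range(m)),
                              cat="Binary")
    # Objective: total inter-processor communication cost.
    prob += pulp.lpSum(D[t][u] * S[p][q] * y[t][u][p][q]
                       for t in range(n) for u in range(n)
                       for p in range(m) for q in range(m))
    for t in range(n):        # each subtask on exactly one processor
        prob += pulp.lpSum(x[t][p] for p in range(m)) == 1
    for p in range(m):        # each processor hosts at least one subtask
        prob += pulp.lpSum(x[t][p] for t in range(n)) >= 1
    # Linearization: y must be 1 whenever both assignments hold.
    for t in range(n):
        for u in range(n):
            for p in range(m):
                for q in range(m):
                    prob += x[t][p] + x[u][q] <= 1 + y[t][u][p][q]
    prob.solve()
    return [next(p for p in range(m) if x[t][p].value() > 0.5)
            for t in range(n)]
```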

This model is easy to implement, since the demand matrix can be obtained by simple tracing. However, it does not account for communication overlap, network congestion, or collisions, and can therefore produce misleading mappings in some subtle cases.

3.3. Enhanced model

An enhanced model was proposed as follows:

$$\min \left\{ \sum_{t=1}^{n} \sum_{p=1}^{m} L(t,p)\, x_{tp} + \sum_{i=1}^{k} (D_i S)_{\max} \right\} \tag{2}$$

subject to:

$$\begin{cases} \sum_{p=1}^{m} x_{tp} = 1 & (t = 1,\dots,n),\\ \sum_{t=1}^{n} x_{tp} \geq 1 & (p = 1,\dots,m),\\ \sum_{t=1}^{n} x_{tp} \leq \left\lceil \frac{A_p}{T} \times n \right\rceil & (p = 1,\dots,m), \end{cases}$$

where k ≤ n² is the total number of communication batches, and the term (D_i S)_max represents the maximum value in the ith communication batch. With this more detailed communication information, the objective function focuses on minimizing only the sum of the dominant element of each batch. By accounting for communication overlap, the new model represents more realistic communication scenarios and is thus more accurate. However, the overlap information is difficult to obtain in many cases.
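Given such overlap information, the communication term of (2) is straightforward to evaluate; a sketch follows, assuming the trace has already been grouped into batches of concurrent transfers (the grouping itself is the hard part, as noted above):

```python
# Sketch of the enhanced objective's communication term: within each batch of
# concurrent transfers, only the slowest (dominant) transfer contributes.
def batched_comm_cost(batches, S, assign):
    """batches: list of batches, each a list of (t, u, volume) transfers that
    overlap in time; S[p][q]: per-unit cost; assign[t]: processor of t."""
    total = 0.0
    for batch in batches:
        total += max(vol * S[assign[t]][assign[u]] for t, u, vol in batch)
    return total
```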

Most applications of realistic size are too big for ILP solvers due to memory limits, while heuristic techniques, such as simulated annealing (SA), are practical for obtaining near-optimal solutions [11].
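A generic swap-move SA for this problem might look like the following sketch (the paper cites SA [11] but does not specify a schedule; the temperature parameters here are placeholders of ours). Swapping two entries preserves the multiset of assignments, so an initially feasible assignment stays feasible.

```python
# Generic simulated-annealing sketch for near-optimal mappings; the cooling
# schedule and step count are our placeholders, not the paper's settings.
import math
import random

def anneal(cost, assign, steps=100_000, T0=1.0, alpha=0.99995):
    cur = cost(assign)
    best, best_assign = cur, list(assign)
    T = T0
    for _ in range(steps):
        i, j = random.sample(range(len(assign)), 2)
        assign[i], assign[j] = assign[j], assign[i]       # propose a swap
        new = cost(assign)
        if new <= cur or random.random() < math.exp((cur - new) / T):
            cur = new                                     # accept
            if cur < best:
                best, best_assign = cur, list(assign)
        else:
            assign[i], assign[j] = assign[j], assign[i]   # reject: undo swap
        T *= alpha                                        # geometric cooling
    return best_assign, best
```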

4. Experiments: model assessment on applications

We consider the numerical solution of a 2D wave equation:

$$\frac{\partial^2 u}{\partial t^2}(x,y,t) = \alpha^2 \left( \frac{\partial^2 u}{\partial x^2}(x,y,t) + \frac{\partial^2 u}{\partial y^2}(x,y,t) \right), \quad 0 < x < l_x,\ 0 < y < l_y,\ t > 0, \tag{3}$$

with periodic boundary conditions and initial conditions

$$\begin{cases} u(x,y,0) = \sin(2\pi x)\sin(2\pi y), & 0 \leq x \leq l_x,\ 0 \leq y \leq l_y,\\ \frac{\partial u}{\partial t}(x,y,0) = 0, & 0 \leq x \leq l_x,\ 0 \leq y \leq l_y, \end{cases}$$

where α, l_x, and l_y are constants. Discretizing Eq. (3), we get the difference equation

$$w_{i,j}^{k+1} = 2\left(1 - \lambda_x^2 - \lambda_y^2\right) w_{i,j}^{k} + \lambda_x^2 \left(w_{i+1,j}^{k} + w_{i-1,j}^{k}\right) + \lambda_y^2 \left(w_{i,j+1}^{k} + w_{i,j-1}^{k}\right) - w_{i,j}^{k-1}, \tag{4}$$

whose communication pattern is illustrated in Fig. 3 when the application is decomposed into 32 × 4 subtasks.
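As an illustration, update (4) with periodic boundaries transcribes directly into a few lines of numpy (our sketch; λ_x² and λ_y² are passed in as constants), which makes the four-neighbor halo pattern behind Fig. 3 explicit:

```python
# Direct numpy transcription of update (4) with periodic wraparound; each
# subtask owning a block of w needs only the halo edges of its 4 neighbors.
import numpy as np

def step(w_prev, w, lx2, ly2):
    """One time step of (4); lx2 = lambda_x**2, ly2 = lambda_y**2."""
    return (2 * (1 - lx2 - ly2) * w
            + lx2 * (np.roll(w, -1, axis=0) + np.roll(w, 1, axis=0))
            + ly2 * (np.roll(w, -1, axis=1) + np.roll(w, 1, axis=1))
            - w_prev)
```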

When assigning 18 × 12 subtasks onto a 3 × 3 × 3 mesh, our models produce the mapping shown in Fig. 4.

5. Experimental results

The models were examined by solving the 2D wave equation on both 3D-mesh and 3D-torus platforms.


Fig. 3. The communication pattern of the 2D wave equation decomposed into 32 × 4 subtasks.

Fig. 4. The model mapping of 18 × 12 2D wave equation subtasks to a 3 × 3 × 3 computer mesh. (For colors, see the web version of this article.)

In fact, solutions of the 2D wave equation require careful mapping because of the communication pattern; mapping a 3D wave equation to a 3D computer architecture is relatively easier.

5.1. Mapping 2D wave equation to 3D-mesh computer

The application was evenly decomposed into N_X × N_Y subtasks defined on an XSize × YSize computing grid. Tables 1, 2, 3, and 4 show the point-to-point communication times of running 1000 time iterations of the wave equation on 16-, 32-, 64-, and 128-node BG/L computers, respectively. The corresponding cost of each mapping is estimated by the objective functions (1) and (2) with the demand matrix and the published latency matrix. The communication time is measured as total wall clock time. The last column shows the communication efficiency gain over the default rank order mapping. In our comparisons, we focus on the communication time, as other researchers have done [8,9].


Table 1. Communication time (ms) for running 1000 iterations of the 2D wave equation on a 4 × 4 × 1 BG/L mesh.

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 23.27)  MPI rank (Cost = 12.80)  Model (Cost = 8.39)  Gain (%)
(8, 2)      (600, 600)      152                    142                      133                  6
(8, 2)      (800, 800)      159                    153                      142                  7
(8, 2)      (1600, 800)     177                    163                      152                  7
(8, 2)      (4800, 800)     292                    267                      254                  5

Table 2. Communication time (ms) for running 1000 iterations of the 2D wave equation on a 4 × 4 × 2 BG/L mesh.

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 53.87)  MPI rank (Cost = 26.24)  Model (Cost = 15.96)  Gain (%)
(8, 4)      (176, 176)      36                     26                       26                    0
(8, 4)      (336, 336)      57                     33                       31                    6
(8, 4)      (1000, 1000)    167                    141                      130                   8
(8, 4)      (2000, 2000)    285                    253                      231                   9
(8, 4)      (2400, 2400)    297                    264                      237                   10

Table 3. Communication time (ms) for running 1000 iterations of the 2D wave equation on an 8 × 4 × 2 BG/L mesh.

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 148.24)  MPI rank (Cost = 43.06)  Model (Cost = 34.44)  Gain (%)
(8, 8)      (600, 600)      98                      58                       58                    0
(8, 8)      (1000, 1000)    129                     66                       65                    2
(8, 8)      (1600, 1600)    160                     69                       68                    1
(8, 8)      (2400, 2400)    376                     242                      239                   1

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 150.23)  MPI rank (Cost = 76.86)  Model (Cost = 31.16)  Gain (%)
(16, 4)     (1200, 300)     103                     71                       55                    23
(16, 4)     (2000, 500)     136                     88                       63                    28
(16, 4)     (3200, 800)     170                     102                      66                    35
(16, 4)     (4800, 1200)    378                     306                      237                   23

Table 4. Communication time (ms) for running 1000 iterations of the 2D wave equation on an 8 × 4 × 4 BG/L mesh.

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 357.22)  MPI rank (Cost = 111.17)  Model (Cost = 71.30)  Gain (%)
(16, 8)     (800, 800)      79                      36                        33                    8
(16, 8)     (1600, 1600)    130                     47                        40                    15
(16, 8)     (2000, 2000)    230                     142                       128                   10
(16, 8)     (2400, 2400)    236                     145                       128                   12

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 319.40)  MPI rank (Cost = 155.12)  Model (Cost = 63.41)  Gain (%)
(32, 4)     (1600, 400)     77                      53                        31                    42
(32, 4)     (3200, 800)     126                     80                        39                    51
(32, 4)     (4800, 1200)    230                     203                       129                   36

It would have been more desirable to measure the total time, including both computation and communication, but the much longer computation time in our test examples diminishes the signal of the communication efficiency gains. Many communication-dominant applications should benefit from our mapping schemes.

The model mapping illustrated in Fig. 5, which assigns 32 × 4 subtasks to an 8 × 4 × 4 partition, improved point-to-point communication by up to 51%. The four columns are distinguished by colors. The subtasks in the same column are assigned to the processors marked by the dotted straight line of the same color, while the dotted arcs show the continuation of the assignment.

Table 5. Communication time (ms) for running 1000 iterations of the 2D wave equation on an 8 × 8 × 8 BG/L torus.

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 1654.29)  MPI rank (Cost = 540.55)  Model (Cost = 199.52)  Gain (%)
(16, 32)    (1600, 1600)    94                       37                        34                     8
(16, 32)    (1600, 3200)    113                      45                        40                     11
(16, 32)    (2400, 4800)    156                      54                        49                     9

Table 6. Communication time (ms) for running 1000 iterations of the 2D wave equation on an 8 × 8 × 16 BG/L torus.

(N_X, N_Y)  (XSize, YSize)  Random (Cost = 3880.33)  MPI rank (Cost = 570.69)  Model (Cost = 333.53)  Gain (%)
(64, 16)    (6400, 1600)    130                      44                        34                     23
(64, 16)    (6400, 3200)    182                      58                        40                     31
(64, 16)    (5120, 3840)    182                      58                        40                     31
(64, 16)    (3200, 6400)    277                      172                       135                    22

5.2. Mapping 2D wave equation to 3D-torus computer

Tables 5 and 6 show the point-to-point communication timing results of running 1000 time iterations of the 2D wave equation on 512- and 1024-node BG/L torus networks.

The model mapping illustrated in Fig. 6, which assigns 64 × 16 subtasks to an 8 × 8 × 16 partition, improved point-to-point communication by up to 31%. Different columns are distinguished by colors. The subtasks in the same column are mapped to the processors marked by the dotted straight line of the same color, while the colored arcs represent the torus connections along the Y-dimension and the black arcs the torus connections along the Z-dimension.

6. Performance analysis

The basic questions to ask are: (1) does the model always produce mappings with higher performance than the default mapping, and (2) how much improvement does it yield? By analyzing the experimental results, we observed the following.

First, in every case the model produced mappings requiring shorter communication time than either a random mapping or the MPI rank order mapping. In many cases even the basic model worked well, because the network congestion it neglects was implicitly accounted for.

Second, the improvement of the model mapping over the default one depends on the communication pattern, the network diameter, the machine size, the message sizes, etc. For example, many partitions (from 64-node to 1024-node) have a structure 8 × Y × Z. When the application was decomposed into N_X × N_Y subtasks with N_Y = 8, the 3D machine network matched the 2D application pattern precisely, leaving little difference between the default mapping and the model mapping, as evident in Table 3. For the 512- and 1024-node cases with reduced network diameter, we achieved smaller improvements.

Third, our models can always produce mappings of 2D applications to 3D computers with significant communication gains.

The hop count, a convenient but rough measure used as the per-unit communication cost in other models, has two obvious drawbacks. First, it cannot distinguish the better mapping among multiple patterns that produce the same objective function value. Second, timing results from the latency test reveal further inaccuracy in the hop measure.

When 64 × 16 subtasks were mapped onto an 8 × 8 × 16 torus, our mapping models generated excellent improvements.


Fig. 5. Mapping 32 × 4 subtasks onto an 8 × 4 × 4 mesh. (For colors, see the web version of this article.)

Fig. 6. Mapping 64 × 16 subtasks onto an 8 × 8 × 16 torus. (For colors, see the web version of this article.)

The 8 × 8 × 16 torus can be treated as 16 sheets (along the Z-dimension), each an 8 × 8 2D-torus in the X-Y plane. In each such sheet, the assignment zigzags, winding the 2D-torus into a 1D-torus and yielding a precise mapping that exactly matches the 2D-torus communication pattern.
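This per-sheet winding can be pictured as a boustrophedon (snake) traversal; the sketch below lists the nodes of one 8 × 8 sheet in such an order (the exact orientation used in Fig. 6 is our guess, shown only to illustrate the idea):

```python
# Illustrative snake (boustrophedon) order over one 8 x 8 sheet: winding the
# 2D sheet into a 1D chain so consecutive subtask columns land on neighboring
# nodes. The orientation in Fig. 6 may differ; this sketches the idea only.
def snake_order(nx=8, ny=8):
    order = []
    for x in range(nx):
        ys = range(ny) if x % 2 == 0 else range(ny - 1, -1, -1)
        order.extend((x, y) for y in ys)   # alternate sweep direction per row
    return order
```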

7. Conclusions

Task mapping is complex due to the sophistication required in manipulating both the application and the machine topologies. We have introduced a basic static mapping model and an enhanced model to optimize the mapping of an application onto a large heterogeneous parallel system. Unlike the fixed default mapping strategy, these models, sensing the inherent structures of the application and the computer, achieved a significant gain over conventional manual mappings.

Our models were verified by achieving near-optimal values in solving a 2D hyperbolic equation on the BG/L 3D-mesh and 3D-torus networks. The improvement over the default mapping depends mainly on how well the structure of the communication pattern matches the computer network. Although the experiments were conducted only on this chosen application and platform, our models extend conveniently to other applications and platforms by substituting the demand matrix of the given application and, similarly, the supply matrix of the computer.

Acknowledgements

This research utilized resources at the New York Center for Computational Sciences at Stony Brook University/Brookhaven National Laboratory, which is supported by the U.S. Department of Energy under Contract No. DE-AC02-98CH10886 and by the State of New York. We gratefully acknowledge use of "BGL", a 1024-node IBM Blue Gene/L system operated by the Argonne Leadership Computing Facility at Argonne National Laboratory.

References

[1] L.V. Kalé, B. Ramkumar, A.B. Sinha, V.A. Saletore, The CHARM parallel programming language and system. Part II. The runtime system, Parallel Programming Laboratory Technical Report #95-03, 1994.

[2] L.V. Kalé, B. Ramkumar, A.B. Sinha, A. Gursoy, The CHARM parallel programming language and system. Part I. Description of language features, Parallel Programming Laboratory Technical Report #95-02, 1994.

[3] M.G. Norman, P. Thanisch, Models of machines and computation for mapping in multicomputers, ACM Comput. Surv. 25 (3) (1993) 263–302.

[4] T. Bultan, C. Aykanat, A new mapping heuristic based on mean field annealing, J. Parallel Distrib. Comput. 16 (4) (1992) 292–305.

[5] S.H. Bokhari, On the mapping problem, IEEE Trans. Comput. 30 (3) (1981) 207–214.

[6] E.-G. Talbi, T. Muntean, General heuristics for the mapping problem, in: World Transputer Conf., 1993.

[7] W.-K. Chen, E.F. Gehringer, A graph-oriented mapping strategy for a hypercube, in: Proc. of the Third Conference on Hypercube Concurrent Computers and Applications: Architecture, Software, Computer Systems, and General Issues, 1988, pp. 200–209.

[8] G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J.C. Sexton, R. Walkup, Optimizing task layout on the BlueGene/L supercomputer, IBM Journal of Research and Development 49 (2–3) (2005) 489–500.

[9] T. Agarwal, A. Sharma, L.V. Kalé, Topology-aware task mapping for reducing communication contention on large parallel machines, in: Proc. of IEEE International Parallel and Distributed Processing Symposium, April 2006, pp. 25–29.

[10] The BlueGene/L Team, An overview of the BlueGene/L supercomputer, in: Proc. of Supercomputing 2002, 2002.

[11] M. Affenzeller, R. Mayrhofer, Generic heuristics for combinatorial optimization problems, in: Proc. of the 9th International Conference on Operational Research, 2002, pp. 83–92.