
Broadcasting on Meshes with Worm-Hole Routing*

Michael Barnett
Department of Computer Science
University of Idaho
Moscow, Idaho

David G. Payne
Supercomputer Systems Division
Intel Corporation
15201 N.W. Greenbrier Pkwy
Beaverton, Oregon

R. A. van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712

Jerrell Watts
Center for High Performance Computing
The University of Texas System
Austin, Texas 78758

2, 1993

*This research was sponsored in part by Intel Supercomputing Systems Division.

Abstract

We address the problem of broadcasting on mesh architectures with arbitrary (non-power-two) dimensions. It is assumed that such mesh architectures employ cut-through or worm-hole routing. The main results are an algorithm for performing an optimal minimum-spanning tree broadcast when messages are not pipelined, a pipelined algorithm that is similar to Ho and Johnsson's EDST algorithm for hypercubes, and a novel scatter-collect approach that is a natural choice for communication libraries due to its simplicity. Results obtained on the Intel Touchstone Delta system are included.

1 Introduction

In this paper, we discuss the design of general purpose broadcast routines for mesh architectures like the Symult S2010 and the Intel Touchstone Delta and Paragon systems. These systems consist of a number of processing nodes connected by a communication network that employs worm-hole routing, thereby allowing a programming model that assumes all nodes are directly connected under contention-free conditions.

For most users of these machines, broadcasting a message from one node to all others is a matter of calling the appropriate library routine. On hypercubes, such a library routine often embeds a minimum spanning tree in the network, allowing the broadcast to complete in a time proportional to $\log_2(p)\,n$, where $n$ is the vector length and $p$ is the number of nodes in the network. For hypercubes, better algorithms exist. In particular, if communication latency is ignored, asymptotically the cost of a broadcast can be reduced to be proportional to $n$, independent of $p$, by using Ho and Johnsson's EDST algorithm [5]. This algorithm is not widely used, probably because its complexity makes it less attractive for a library and difficult to modify for special cases.

The two-dimensional (2D) mesh architecture with wormhole routing is an attractive interconnection architecture for distributed-memory multicomputers. A mesh can be scaled to arbitrarily large configurations while retaining high link bandwidth. Moreover, the number of nodes in a mesh does not inherently need to be a power of two, in contrast with the hypercube. However, the advent of the mesh architecture has produced some difficulties. Although the programming model for both hypercubes and meshes with worm-hole routing allows the user to assume total connectivity, many communication algorithms that do not incur network conflicts on hypercubes do incur such conflicts on meshes.

In this paper we demonstrate how to implement a minimum spanning tree broadcast as well as an asymptotically optimal broadcast on a 2D mesh. Finally, a novel approach is introduced that scatters and then collects the data, a natural choice for libraries due to its simplicity and performance.

2 Assumptions

The target architectures for our algorithms are distributed-memory mesh multicomputers, including Multiple Instruction Multiple Data (MIMD) machines such as the Symult S2010 and the Intel Touchstone Delta and Paragon systems.

For our theoretical analysis, we use the following model:

1. The multicomputer consists of $p$ nodes, labeled $P_0, \ldots, P_{p-1}$.
2. The nodes physically form an $r \times c$ two-dimensional grid.
3. Communication is with only one node at a time; i.e., multi-cast communication is not implemented in hardware.
4. A node can send and receive simultaneously to and from the same or different nodes with no penalty.
5. If no network conflicts occur, exchanging messages of length $n$ between two nodes requires time $\alpha + n\beta$, where $\alpha$ and $\beta$ represent the communication startup time and the per item (byte) transfer time, respectively.
6. A message between two nodes occupies the entire path between the two nodes. The logical path from node $P_i$ to $P_j$ is denoted by $\langle i, j \rangle$. The physical path is denoted by $[i, \ldots, j]$, listing the physical nodes the path is routed on. On a linear array or sub-mesh, the logical path and the physical path coincide: $\langle i, j \rangle$ refers to both of them.

7. Links between nodes can carry one message at a time in each direction. If more than one message traverses a link in the same direction, they time-share that connection.
8. Splitting and concatenation of vectors does not consume any time.

3 Minimum Spanning Tree Broadcasting

The most popular broadcast algorithm is based on embedding a minimum spanning tree from the node that originates the broadcast (the root) to all other nodes. Assuming both $r$ and $c$ (and hence $p$) are powers of two, and that $P_i$ is the root, the broadcast forwards messages according to the algorithm in Fig. 1. The notation

    send from i to j

is shorthand for

    if me = i then
        send message to j
    else if me = j then
        receive message from i
    else
        skip

The algorithm consists of $\log_2(p)$ steps; at each step, the current set of nodes is divided into two partitions. The root sends a message to the corresponding node in the other partition, after which both act as roots in their partitions, which independently perform the broadcast. Note that the algorithm operates as if the nodes are in a linear array. Under our assumptions, this algorithm completes execution in time $\log_2(p)\alpha + \log_2(p)\,n\beta$ on a hypercube. It is optimal in the sense that the constant multiplying $\alpha$ is the smallest possible, making it a good algorithm for short messages.

In earlier work [1], we showed how this exact algorithm can be implemented on mesh architectures: both linear arrays and two dimensional meshes. On hypercubes, the order in which bits are toggled is immaterial, but on a linear array, network conflicts occur unless the bits are toggled from most significant to least significant. When the number of nodes is a power of two and the mapping of the linear array is done by row-major ordering, column-major ordering, or serpentine ordering, then each partition is a rectangular sub-mesh. As long as messages are routed such that a message from $P_i$ to $P_j$ takes the shortest route, the links used lie within the rectangle defined by $i$ and $j$, and no network conflicts will occur.

But meshes, unlike hypercubes, are not restricted to contain a number of nodes that is a power of two. The obvious extension to the algorithm in Fig. 1 is to divide the nodes in half as closely as possible and then proceed as before. While this extension works fine on a linear array, it can lead to network conflicts on a mesh.
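The effect of the bit order on a linear array can be checked with a small simulation. The following is a minimal Python sketch, not the authors' implementation: the names `mst_schedule` and `link_conflicts` are hypothetical, and it assumes that $p$ is a power of two, that nodes are laid out as a linear array in index order, and that two messages conflict when they use a common link in the same direction (Assumption 7).

```python
def mst_schedule(p, root, msb_first=True):
    """Return the broadcast steps as lists of (src, dst) pairs.

    Assumes p is a power of two. With msb_first=True the bits are toggled
    from most significant to least significant, as in Fig. 1.
    """
    d = p.bit_length() - 1
    assert p >= 1 and 1 << d == p, "this sketch assumes p is a power of two"
    bits = range(d - 1, -1, -1) if msb_first else range(d)
    holders = [root]                      # nodes that already hold the message
    steps = []
    for i in bits:
        # every holder sends to the node whose index differs in bit i
        step = [(src, src ^ (1 << i)) for src in holders]
        holders += [dst for _, dst in step]
        steps.append(step)
    return steps


def link_conflicts(step):
    """Count message pairs that share a directed link on a linear array."""
    count = 0
    for a in range(len(step)):
        for b in range(a + 1, len(step)):
            (s1, d1), (s2, d2) = step[a], step[b]
            same_direction = (d1 > s1) == (d2 > s2)
            lo = max(min(s1, d1), min(s2, d2))
            hi = min(max(s1, d1), max(s2, d2))
            if same_direction and lo < hi:   # index intervals overlap in a link
                count += 1
    return count


if __name__ == "__main__":
    for order in (True, False):
        steps = mst_schedule(8, 0, msb_first=order)
        print("msb_first" if order else "lsb_first",
              steps, [link_conflicts(s) for s in steps])
```

Under these assumptions the most-significant-first order keeps every message inside its own partition, while the least-significant-first order produces overlapping messages from the second step onward.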

MSTbcast(message, root)
begin
    for i = log2(p) - 1, 0, -1
        if (root >> i) MOD 2 = 1 then
            send from root to root - (1 << i)
            if (me >> i) MOD 2 = 1 then
                root := root
            else
                root := root - (1 << i)
        else
            send from root to root + (1 << i)
            if (me >> i) MOD 2 = 1 then
                root := root + (1 << i)
            else
                root := root
end

Figure 1: Minimum Spanning Tree Broadcast. Note that the bits are examined from most-significant to least-significant.

Consider the possible communication patterns generated by such an algorithm. All messages are in disjoint partitions (of the logical linear array), so no two messages overlap. Within the same step, any pair of messages is between nodes $i$, $j$, $k$, and $l$ such that $i < j < k < l$. There are four possibilities:

    $\langle i, j \rangle$ and $\langle k, l \rangle$    (1)
    $\langle i, j \rangle$ and $\langle l, k \rangle$    (2)
    $\langle j, i \rangle$ and $\langle l, k \rangle$    (3)
    $\langle j, i \rangle$ and $\langle k, l \rangle$    (4)

While it may seem counter-intuitive, given the simple routing schemes discussed in Lemmas 1 and 2, it is the pair of messages $\langle j, i \rangle$ and $\langle k, l \rangle$, messages that are moving away from each other in the logical array, that can conflict. This is demonstrated in Figure 2 for an x-direction first routing scheme. This leaves two options. The first option sidesteps the issue and takes advantage of the fact that no conflict occurs within a linear array: the broadcast could first operate within a row, after which it is executed independently within each column. However, the number of steps required is $\lceil \log_2(r) \rceil + \lceil \log_2(c) \rceil$, and this may be one step more than the optimal number of steps, $\lceil \log_2(rc) \rceil$. In general, if the mesh is $d$-dimensional, performing the broadcast in this fashion may incur $d - 1$ extra steps.

Instead, we show that the other three patterns (1)-(3) do not create any conflict under reasonable assumptions about the routing algorithm. This is summarized by the following lemmas, which use the same variables as in (1)-(3).

[Figure 2 diagram: (a) nodes 0 to 5 as a linear array; (b) the same nodes mapped to a mesh with rows 0 1 2 and 3 4 5.]

Figure 2: Example of the creation of conflict dependent on the routing algorithm and the mapping of the linear array to the mesh. (a) The paths $\langle 1, 0 \rangle$ and $\langle 2, 3 \rangle$ do not conflict. (b) The logical path $\langle 1, 0 \rangle$ is routed on the physical path $[1, 0]$, while the logical path $\langle 2, 3 \rangle$ is routed on the physical path $[2, 1, 0, 3]$, inducing conflict on the link between 0 and 1.

Lemma 1 Assume the routing algorithm for the network is such that $\langle i, j \rangle$ takes the shortest path, changes direction at most once, and traverses the network in counter-clockwise direction. Then the physical paths used for the logical paths (1)-(3) do not conflict.

Proof: See Appendix A.1. □

Lemma 2 Assume the routing algorithm for the network is such that $\langle i, j \rangle$ takes the shortest path, changing direction at most once, x-direction first. Then the physical paths used for the logical paths (1)-(3) do not conflict.

Proof: See Appendix A.2. □

Thus, an algorithm that creates only the patterns described by (1)-(3) can be used to extend the algorithm in Fig. 1 to arbitrary 2D meshes without incurring network contention. The algorithm is presented in Fig. 3; it uses two auxiliary algorithms, which are shown in Fig. 4. The algorithm proceeds by taking the current partition and splitting it into two approximately equal subpartitions. The root sends the message to the right-most (left-most) node of the subpartition that does not contain the root, if the new partition is to the right (left) of the root. Within the subpartition that contains the root, this procedure proceeds recursively. In the other subpartition, the node that received the message becomes the root of a minimum spanning tree broadcast that sends messages only to the left (right), by invoking RS_right (RS_left). An illustration of the algorithm is shown in Figure 5.

At each stage, no message crosses between any two partitions and any pair of messages satisfies (1)-(3); hence no network conflicts occur under the conditions of Lemma 1 or 2. Specifically, any pair of messages within the left partition corresponds to pattern (1), any pair within the right partition corresponds to pattern (3), and any pair in which one message is the message in the middle partition corresponds to one of the patterns (1)-(3). We prove this in the following lemma.

Lemma 3 In each step of RSbcast, all pairs of messages are between nodes $P_i$, $P_j$, $P_k$, and $P_l$, such that $i < j < k < l$, and the only (logical) paths that are used are (1)-(3).

Proof: See Appendix A.3. □

RSbcast(message, root, left, right)
begin
    if left = right then return
    middle := (right - left + 1) div 2 + left
    if root < middle then
        send from root to right
        if me < middle then
            RSbcast(message, root, left, middle - 1)
        else
            RS_right(message, middle, right)
    else
        send from root to left
        if me < middle then
            RS_left(message, left, middle - 1)
        else
            RSbcast(message, root, middle, right)
end

Figure 3: Recursive Splitting Broadcast with general root.

We summarize our results in the following theorem.

Theorem 4 If the message routing algorithm is described by either Lemma 1 or 2, then the RSbcast algorithm proceeds without network conflicts, and the broadcast completes in $\lceil \log_2(p) \rceil$ steps.

For short messages, latency is often a dominating factor, which means an algorithm with the fewest steps (and thus the fewest messages) is appropriate. For a message of length $n$ bytes, the total time for RSbcast is:

$$T_{RSbcast} = \lceil \log_2(p) \rceil (\alpha + n\beta) \qquad (5)$$

(a)
RS_left(msg, left, right)
begin
    if left = right then return
    let left < middle <= right
    send from left to middle
    if me < middle then
        RS_left(msg, left, middle - 1)
    else
        RS_left(msg, middle, right)
end

(b)
RS_right(msg, left, right)
begin
    if left = right then return
    let left < middle <= right
    send from right to middle - 1
    if me < middle then
        RS_right(msg, left, middle - 1)
    else
        RS_right(msg, middle, right)
end

Figure 4: Recursive Splitting Broadcasts: (a) with left node as root, (b) with right node as root.
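The recursion in Figs. 3 and 4 can be unrolled into a global schedule and checked by brute force. The sketch below is an illustration under stated assumptions rather than the authors' code: `rs_schedule` and `has_pattern4` are hypothetical names, nodes are numbered 0 to $p-1$ in the logical linear array, and the unspecified `middle` in RS_left and RS_right is chosen with the same formula as in RSbcast.

```python
def rs_schedule(p, root):
    """Return the RSbcast steps as lists of (src, dst) pairs for p nodes."""
    steps = {}

    def emit(depth, src, dst):
        steps.setdefault(depth, []).append((src, dst))

    def split(left, right):
        # Same split as Fig. 3; assumed here for RS_left/RS_right as well.
        return (right - left + 1) // 2 + left

    def rs_left(left, right, depth):       # root is the left-most node
        if left == right:
            return
        middle = split(left, right)
        emit(depth, left, middle)
        rs_left(left, middle - 1, depth + 1)
        rs_left(middle, right, depth + 1)

    def rs_right(left, right, depth):      # root is the right-most node
        if left == right:
            return
        middle = split(left, right)
        emit(depth, right, middle - 1)
        rs_right(left, middle - 1, depth + 1)
        rs_right(middle, right, depth + 1)

    def rs_bcast(root, left, right, depth):
        if left == right:
            return
        middle = split(left, right)
        if root < middle:
            emit(depth, root, right)
            rs_bcast(root, left, middle - 1, depth + 1)
            rs_right(middle, right, depth + 1)
        else:
            emit(depth, root, left)
            rs_left(left, middle - 1, depth + 1)
            rs_bcast(root, middle, right, depth + 1)

    rs_bcast(root, 0, p - 1, 0)
    return [steps[d] for d in sorted(steps)]


def has_pattern4(step):
    """True if some pair of messages forms pattern (4): <j,i> and <k,l>."""
    for s1, d1 in step:
        for s2, d2 in step:
            if max(s1, d1) < min(s2, d2) and d1 < s1 and s2 < d2:
                return True
    return False


if __name__ == "__main__":
    for number, step in enumerate(rs_schedule(12, 5)):
        print(number, step, has_pattern4(step))
```

For the 12-node example with root 5, the first step of this schedule is the single message from node 5 to node 11, and the second step contains the message from node 5 to node 0, matching the description accompanying Figure 5.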

Figure 5: An illustration of RSbcast. There are 12 nodes in the logical array, numbered from 0 to 11. We make no assumption on how the physical mesh is laid out, other than that the 12 nodes are in a rectangular sub-mesh. In the top picture, the algorithm begins with node 5 as the root of the broadcast. Since node 6 is the middle, node 5 sends the message to node 11. After this message, the nodes are partitioned into two groups: nodes 0 through 5 and nodes 6 through 11. All nodes in the latter group execute RS_right (which always uses the right node as the root), while those in the former group continue to execute RSbcast. For those nodes, in the next step node 3 is the middle, so node 5 sends the message to node 0. In the third step, nodes 0 through 2 execute RS_left, while RSbcast is continued among nodes 3 through 5. Notice that the partition that had been executing RS_right has split into two partitions, both of which are executing RS_right. The final picture shows the last messages being sent: there is a set of partitions executing RS_left, then a single partition executing RSbcast, and then a set of partitions executing RS_right. Thus, all partitions to the left (right) of the partition executing RSbcast send messages to the right (left); the message sent by the root in the RSbcast partition may travel either to the left or to the right.

Pipelining can trade off the $\alpha$ and $\beta$ terms, as we shall see next.

4 Pipelined Broadcast

When the $p$ nodes are arranged as a linear array, a broadcast from $P_0$ can be accomplished by partitioning the message into $k$ equal packets which are pipelined along the array [10]. In other words, the first packet is sent from the root to the next node. After this next node passes it to its neighbor, the second packet is sent by the root, etc. The time for completion becomes:

$$T_{pipe} = (p - 1)\left(\alpha + \frac{n}{k}\beta\right) + (k - 1)\left(\alpha + \frac{n}{k}\beta\right) \qquad (6)$$

The first term reflects the time for the first packet to reach the end of the array; the second term is the time for receiving the remainder of the packets. The optimal $k$ is equal to:

$$k_{opt} = \begin{cases} 1 & \text{if } \sqrt{(p-2)n\beta/\alpha} < 1 \\ n & \text{if } \sqrt{(p-2)n\beta/\alpha} > n \\ \sqrt{(p-2)n\beta/\alpha} & \text{otherwise} \end{cases}$$

Of course, a noninteger number of packets is not feasible; in practice, we use either $\lfloor k_{opt} \rfloor$ or $\lceil k_{opt} \rceil$. As discussed in Section 3, the linear array can be embedded in a mesh. In this case, the algorithm has the same execution time. The row major ordering of the nodes combined with the conditions of Lemma 1 and Lemma 2 guarantees that there will not be network conflicts in the pipe.

Unfortunately, the pipe is too long, making the number of startups incurred in this scheme prohibitively large. For reasonable $\alpha$ and $\beta$, the break even point is for messages of unreasonable length. This is overcome on hypercubes by embedding $\log_2(p)$ edge-disjoint minimum spanning trees, with successive messages alternating between the disjoint trees, yielding the EDST broadcast [5]. The resulting effective pipe length becomes $\log_2(p) + 1$.

When designing pipelined algorithms, it is important to restrict communication to nearest neighbors in the physical mesh, in order to avoid undue network conflicts. Indeed, the EDST broadcast inherently incurs communication conflicts when implemented on a mesh. Moreover, the EDST broadcast cannot be generalized to arbitrary meshes. We now propose an alternative algorithm for meshes that uses pipelines.

For simplicity, assume the root of the broadcast to be node $P_0$. The nodes are colored in checkerboard fashion; the protocol is that black nodes send east and south during even and odd steps, respectively, while white nodes alternate in the opposite order. We embed two edge-disjoint "fences" as illustrated in Fig. 6.

The root alternates between sending packets east and south, filling two pipelines of length $r + c$. Total execution time becomes:

$$T_{edf} = (k + r + c)\left(\alpha + \frac{n}{k}\beta\right) \qquad (7)$$

Figure 6: Pipelined broadcast on Edge Disjoint Fences. Two fences are embedded as given above. The root alternates sending packets using the top and bottom fences. The notation $i, i+2, \ldots$ is used to indicate that the first packet sent through the given fence arrives at the node at time $i$, followed by another packet every two time steps.
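The packet-count trade-off in Equations 6 and 7 can be explored numerically. Both costs have the form $(k + L)(\alpha + n\beta/k)$, with $L = p - 2$ for the linear pipe and $L = r + c$ for the fences, and setting the derivative with respect to $k$ to zero gives $k = \sqrt{L n \beta / \alpha}$. The Python sketch below is only an illustration: the function names are hypothetical, and the constants in the usage lines are the approximate Delta values quoted in Section 6.

```python
from math import sqrt, floor, ceil


def pipelined_time(k, length, n, alpha, beta):
    """Cost (k + length) * (alpha + n*beta/k) of a k-packet pipeline.

    length = p - 2 reproduces Equation 6; length = r + c reproduces Equation 7.
    """
    return (k + length) * (alpha + n * beta / k)


def k_opt(length, n, alpha, beta):
    """Real-valued optimum sqrt(length*n*beta/alpha), clamped to [1, n]."""
    return min(max(sqrt(length * n * beta / alpha), 1.0), float(n))


def best_integer_k(length, n, alpha, beta, k_max):
    """Brute-force search for the best integer packet count in 1..k_max."""
    return min(range(1, k_max + 1),
               key=lambda k: pipelined_time(k, length, n, alpha, beta))


if __name__ == "__main__":
    alpha, beta = 150e-6, 0.05e-6      # seconds; assumed from Section 6
    r, c, n = 16, 32, 2 ** 20          # 16 x 32 mesh, 1 Mbyte message
    k = k_opt(r + c, n, alpha, beta)
    print(k, best_integer_k(r + c, n, alpha, beta, 2000))
```

Because the cost is convex in $k$, the best integer packet count found by the search is either the floor or the ceiling of the real-valued optimum.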

The optimal number of packets for Equation 7 is:

$$k_{opt} = \begin{cases} 1 & \text{if } \sqrt{(r+c)n\beta/\alpha} < 1 \\ n & \text{if } \sqrt{(r+c)n\beta/\alpha} > n \\ \sqrt{(r+c)n\beta/\alpha} & \text{otherwise} \end{cases}$$

(Even though the pipelines are of length $r + c$, synchronization needed during the algorithm forces two "empty" steps, resulting in $r + c + k$ steps instead of $r + c + k - 2$ steps.)

For a general root, the worm-hole routing property can be used to create a logical torus, in which case the algorithm proceeds with the origin of the torus shifted to the actual root. That is, since all sends are either east or south in a single dimension, the wrap-arounds move west or north, respectively, and do not conflict with other messages.

The effective length of the pipe is within a constant of optimal for a uni-directional torus, since a message must traverse a minimum of $r + c - 1$ links to get from the root to the most distant node.

5 Alternative Algorithm: Scatter-Collect

Ideas from our previous work on the global combine can be used to obtain a further tradeoff between the startup cost and the transfer cost [3]. We first present a simple hybrid algorithm for one-dimensional meshes, and then extend it to the two-dimensional case.

5.1 1-D scatter-collect

The RSbcast algorithm can be modified by splitting the vector in half at each step of the algorithm (the "scatter"). This leaves the vector distributed across the nodes, with each node possessing a piece of the original vector. A ring is then logically embedded in the nodes, and the pieces are circulated until all of the nodes possess all of the original vector (the "collect"). The algorithm is depicted in Figure 7. It shows the initial state with node 0 as the source for the broadcast, followed by $\lceil \log_2(p) \rceil$ steps for the scatter phase and $p - 1$ steps for the collect phase. Some of the communications during the collect phase are redundant in that nodes receive pieces of the vector they already possess; this is certainly true for the root of the broadcast. But since there are pieces that must travel $p - 1$ hops to arrive at all of the nodes, we keep the algorithm symmetric by having all pieces circulate to all nodes during this phase. The resulting formula for $p = 2^d$ and $n$ a multiple of $p$ is:

$$\begin{aligned} T_{sb1} &= \sum_{i=0}^{d-1}\left[\alpha + \frac{n}{2^{i+1}}\beta\right] + \sum_{i=1}^{p-1}\left[\alpha + \frac{n}{2^{d}}\beta\right] \\ &= (d + p - 1)\alpha + 2\,\frac{p-1}{p}\,n\beta \\ &= (p + \log_2(p) - 1)\alpha + 2\,\frac{p-1}{p}\,n\beta \end{aligned} \qquad (8)$$

(The formula for general $p$ and $n$ is more complicated. We present the simplified version for clarity.) Compared to RSbcast, an extra $p - 1$ startups are incurred, but the coefficient of $n\beta$ in the transfer time has been reduced from $\log_2(p)$ to $2\,\frac{p-1}{p}$.
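To see where the trade-off between Equations 5 and 8 tips, the two cost models can be evaluated directly. This is a minimal sketch with hypothetical function names; it assumes $p$ is a power of two (so that $\lceil \log_2 p \rceil = \log_2 p$), and the constants in the usage lines are the approximate Delta values quoted in Section 6. The crossover length follows from equating the two formulas and solving for $n$.

```python
from math import log2


def t_rsbcast(p, n, alpha, beta):
    """Equation 5 for p a power of two."""
    return log2(p) * (alpha + n * beta)


def t_sb1(p, n, alpha, beta):
    """Equation 8: 1-D scatter-collect, p a power of two, n a multiple of p."""
    return (p + log2(p) - 1) * alpha + 2 * (p - 1) / p * n * beta


def crossover_length(p, alpha, beta):
    """Message length at which Equations 5 and 8 predict equal time.

    Returns None when scatter-collect never wins under the model.
    """
    gain = log2(p) - 2 * (p - 1) / p      # reduction in the n*beta coefficient
    if gain <= 0:
        return None
    return (p - 1) * alpha / (gain * beta)


if __name__ == "__main__":
    alpha, beta = 150e-6, 0.05e-6          # seconds; assumed from Section 6
    for p in (4, 64, 512):
        print(p, crossover_length(p, alpha, beta))
```

Below the crossover length the extra $p - 1$ startups dominate and RSbcast is predicted to be faster; above it the reduced transfer coefficient of scatter-collect wins.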

Figure 7: Scatter-collect for $p = 4$ with source 0.
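The two phases can be traced with a small simulation that tracks which pieces each node holds. The following Python sketch makes several simplifying assumptions that are not taken from the paper: $p$ is a power of two, the root is node 0, the scatter uses an RS_left-style split in which the root of each partition sends the upper half of its pieces to the middle node, and the vector consists of $p$ unit pieces. The name `scatter_collect_1d` is hypothetical.

```python
def scatter_collect_1d(p):
    """Simulate 1-D scatter-collect on p nodes (p a power of two, root 0).

    Returns (held, scatter_steps, collect_steps, volume), where held[i] is
    the set of piece indices at node i and volume is the number of pieces
    transferred along the critical path (one message per step).
    """
    assert p >= 1 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    held = [set() for _ in range(p)]
    held[0] = set(range(p))                  # the root starts with all pieces
    ranges = [(0, p)]                        # partitions [lo, hi), root at lo
    scatter_steps = 0
    volume = 0

    # Scatter: each partition root keeps the lower half, sends the upper half.
    while ranges[0][1] - ranges[0][0] > 1:
        next_ranges = []
        for lo, hi in ranges:
            mid = (lo + hi) // 2
            upper = {x for x in held[lo] if x >= mid}
            held[lo] -= upper
            held[mid] |= upper
            next_ranges += [(lo, mid), (mid, hi)]
        volume += (ranges[0][1] - ranges[0][0]) // 2
        ranges = next_ranges
        scatter_steps += 1

    # Collect: pass pieces around a logical ring for p - 1 steps.
    assert all(len(h) == 1 for h in held)
    in_flight = [next(iter(h)) for h in held]
    collect_steps = 0
    for _ in range(p - 1):
        in_flight = [in_flight[(i - 1) % p] for i in range(p)]
        for i in range(p):
            held[i].add(in_flight[i])
        volume += 1
        collect_steps += 1
    return held, scatter_steps, collect_steps, volume


if __name__ == "__main__":
    held, s, c, v = scatter_collect_1d(4)
    print(s, c, v, held)
```

With pieces of size $n/p$, the critical-path volume of $2(p-1)$ pieces returned by this simulation corresponds to the $2\,\frac{p-1}{p}\,n\beta$ transfer term of Equation 8, and the step counts correspond to its $\log_2(p) + p - 1$ startups.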

5.2 2-D scatter-collect

For a two dimensional mesh, the 1-D version of scatter-collect can be used, but the ring of nodes that passes the buckets around during the collect phase has a length of $p - 1$, which can be shortened by performing the algorithm in each dimension separately:

1. scatter in columns: Perform the scatter in the root node's column. At the end of this phase, the original vector is split into $r$ pieces distributed among the nodes in the column.
2. scatter in rows: Each row performs a scatter independently. Each node in the root node's column is the root for its row. At the end of this phase, the piece that "belongs" to this row is split into $c$ pieces distributed among the nodes in the row. Over the whole mesh, the original vector has been split into $rc = p$ pieces and distributed across the mesh.
3. buckets in rows: Each row independently forms a logical ring and circulates the pieces until every node possesses the entire piece belonging to that row.
4. buckets in columns: At this point, each column independently forms a logical ring and circulates the pieces until every node possesses the entire vector.

We model the time for the algorithm, with $r = 2^{d_1}$ and $c = 2^{d_2}$, as:

$$\begin{aligned} T_{sb2} &= \sum_{i=0}^{d_1-1}\left[\alpha + \frac{n}{2^{i+1}}\beta\right] + \sum_{i=0}^{d_2-1}\left[\alpha + \frac{n}{2^{d_1}}\frac{1}{2^{i+1}}\beta\right] + \sum_{i=1}^{c-1}\left[\alpha + \frac{n}{2^{d_1}2^{d_2}}\beta\right] + \sum_{i=1}^{r-1}\left[\alpha + \frac{n}{2^{d_1}}\beta\right] \\ &= (d_1 + d_2 + r + c - 2)\alpha + 2\,\frac{p-1}{p}\,n\beta \\ &= (\log_2(p) + r + c - 2)\alpha + 2\,\frac{p-1}{p}\,n\beta \end{aligned} \qquad (9)$$

(Again, we have simplified the equation by assuming that $p$ is a power of two and $n$ is an integer multiple of $p$.) Instead of $p - 1$ extra startups (compared to RSbcast), there are $r + c - 2$ extra startups. When $r = c$ (i.e., the mesh is a square), this becomes approximately $2\sqrt{p}$.

6 Comparison of Algorithms

In this section, we compare the performance of the different algorithms. First, we examine the performance under the idealized model given in Section 2. Next, we adjust the model to more closely fit the Touchstone Delta, the architecture available to us that resembles our model. Finally, we compare the predicted execution times under this new model with the times observed on the Delta.

6.1 Theoretical Comparison

In Figure 8, we report the predicted execution times of the algorithms on an idealized architecture that satisfies Assumptions 1-8 in Section 2. The machine constants are fixed to correspond approximately to those of the Delta, with $\alpha = 150\,\mu$sec and $\beta = 0.05\,\mu$sec. In this graph, naive indicates the naive hypercube MSTbcast that incurs network conflicts. For this algorithm, and $2^{d_r} \times 2^{d_c}$ meshes, the following formula can be easily derived:

$$\begin{aligned} T_{naive} &= \sum_{i=0}^{d_r-1}\left[\alpha + 2^{i}n\beta\right] + \sum_{i=0}^{d_c-1}\left[\alpha + 2^{i}n\beta\right] \\ &= \log_2(p)\alpha + (r + c - 2)n\beta \end{aligned} \qquad (10)$$

For the other algorithms, Equations 5, 7 and 9 were used.

On such an idealized architecture, all but the naive algorithm show merit for a region of vector lengths.

[Figure 8 plots: (a) 16x32 grid, time in sec. vs. message length in bytes; (b) message length = 1 Mbyte, time in sec. vs. grid size. Curves: naive, rs, sc, edf.]

Figure 8: Predicted time for the various algorithms on an idealized machine: performance as a function of message length (a) and of grid size (b). The grid sizes were chosen to equal $i \times j$, where $i = 2, \ldots, 16$ and $j = i, 2i$.

6.2 Adjustments Necessary to Model the Delta

The closest architecture available to us to check our theoretical results is the Intel Touchstone Delta [7, 8]. The Delta routing scheme uses the x-direction first, as in Lemma 2, then the y-direction. It conforms to all of the Assumptions 1-8 except Assumption 4: the interconnection network on the Delta is bidirectional, but, in effect, a Delta node can only send or receive one message at a time. As a result, for our implementations on the Delta, we must adjust the model somewhat.

We now describe the implemented algorithms and the adjusted predicted execution times:

1. The naive minimum spanning tree broadcast, which embeds the tree by toggling least significant bit to most significant bit, as one would likely do on a hypercube. In this algorithm, at any given time, each node only sends or receives, but not both. However, at each step of the algorithm, the number of communications that interfere doubles, leading to a predicted time of

$$\sum_{i=0}^{d-1}\left[\alpha + 2^{i}n\beta\right] = \log_2(p)\alpha + (p - 1)n\beta$$

on a linear array with $p = 2^d$ processors. On an $r \times c$ mesh, with $r = 2^{d_r}$ and $c = 2^{d_c}$, the broadcast proceeds automatically first within one dimension, followed by the other dimension, giving the predicted time in Equation 10.

2. The recursive splitting broadcast, RSbcast. In this algorithm, at any given time, each node only sends or receives, but not both. The predicted time on the Delta is given by Equation 5.

3. The pipelined EDF broadcast. At each step in this algorithm, most nodes either send, or send and receive simultaneously. Due to the timing of the messages that wrap around, we have observed on a network simulator that for some nodes, two messages arrive in the same step. Moreover, the simulator also shows that the resulting interference creates "bubbles" in the process, leading to further degradation of performance. As a result, a better estimate of the time for transferring an item (byte) is $3\beta$. Moreover, the latency is approximately doubled, leading to a predicted time of

$$T_{edf} = (k_{opt} + r + c)\left(2\alpha + 3\,\frac{n}{k_{opt}}\beta\right) \qquad (11)$$

with:

$$k_{opt} = \begin{cases} 1 & \text{if } \sqrt{3(r+c)n\beta/(2\alpha)} < 1 \\ n & \text{if } \sqrt{3(r+c)n\beta/(2\alpha)} > n \\ \sqrt{3(r+c)n\beta/(2\alpha)} & \text{otherwise} \end{cases}$$

When the number of packets is small enough that the wrapping does not interfere with subsequent messages, the 3 in these equations should be replaced by a 2. In our estimates we simply use Equation 11.

4. The scatter-collect broadcast. During the scatter, each node only receives or sends. During the collect, each node receives and sends simultaneously. As a result, the effective cost per item (byte) during this second stage is $2\beta$ and the latency is $2\alpha$, yielding the formula

$$(\log_2(p) + 2(r + c - 2))\alpha + 3\,\frac{p-1}{p}\,n\beta$$

Our implementations used forced messages, which means that the receiver is assumed to be ready for the message when it arrives. This doubles the bandwidth between nodes, but also doubles the latency, since a synchronization message must be sent. As a result, $\alpha = 150\,\mu$sec and $\beta = 0.05\,\mu$sec for the fast NX operating system kernel.¹

¹For the more robust standard kernel, all times must be approximately doubled.

In Figures 9, 10 and 11, we report the observed versus predicted times for the various algorithms, for a broadcast rooted at node 0. The predicted and observed timings agree well enough to claim that the models are useful. Figure 12 (a) reports the observed time for the EDF algorithm as a function of the number of packets, for a fixed message size. The very erratic behavior leads to the conclusion that the interference due to contention of messages arriving at and leaving a node has a very adverse effect on this algorithm. As a result, in Figures 9, 11, and 12 (b) we report the best observed time for the EDF algorithm when a reasonable range of numbers of packets is tried, rather than using the theoretical optimal $k_{opt}$. In Figure 12 (b), we report the observed time as a function of the root node of the broadcast. The naive broadcast is very dependent on the root due to network conflicts, but the other algorithms are not noticeably affected.

Some interesting observations can be made about the data:

• It should be noted that we did not get very accurate timings for very small messages. For example, the figures show the observed timings for short messages to be much better than predicted. Because several broadcasts are timed in order to get an average time, different iterations become pipelined, yielding better than expected results.
• The odd data points for the EDF algorithm in Figures 9, 10, and 11 are due to the erratic behavior of the EDF algorithm.
• In practice the scatter-collect outperforms the theoretically better EDF algorithm.

7 Related Work

As mentioned previously, the state of the art in broadcasting on hypercubes is [5, 6]. Our approach to edge-disjoint fences is closely related to the work in [4], where the embedding of edge-disjoint trees in wraparound (tori) meshes is discussed. The true wraparound links provide a mesh that has roughly half the diameter of the worm-hole meshes we consider. Their trees are much more complicated than the ones presented here; it is not clear whether their construction would lead to network conflict in a worm-hole mesh. In [11], a broadcast is presented that has somewhat of a flavor of our "scatter-collect". In essence, the author followed our suggestion that the broadcast can be implemented as a modified global summation and used some of the techniques for such algorithms developed in [1, 2, 3, 12]. The resulting algorithms are not asymptotically optimal, but do avoid network conflicts. They are limited to meshes that contain a power-of-two number of nodes, with extensions for general meshes that double the cost of the algorithms.

[Figure 9 plots: four panels (a)-(d), each for a 16x32 grid, showing time in sec.; panels (a) and (b) use linear axes, panels (c) and (d) use logarithmic axes. Curves: naive, rs, sc, edf.]

Figure 9: Observed time vs. predicted time for the Delta, as a function of vector length.

[Figure 10 plot: 15x31 grid, time in sec. Curves: rs, sc, edf.]

Figure 10: Observed time for an odd-sized partition of the Delta, as a function of vector length. Notice the observed behavior is very similar to that of the slightly larger, power-two partition reported in the previous graphs.

[Figure 11 plots: panels (a) and (b), time in sec. vs. grid size. Curves: naive, rs, sc, edf.]

Figure 11: Observed time vs. predicted time for the Delta, as a function of grid size.

[Figure 12 plots: (a) 16x32 grid, EDF for message length = 1 Mbyte, time in sec. vs. number of packets; (b) 16x32 grid, message length = 1 Mbyte, time in sec. vs. root node. Curves in (b): naive, rs, sc, edf.]

Figure 12: (a) Time for EDF algorithm as a function of the number of packets. (b) Time as a function of the root node of the broadcast.

8 Conclusions

Our work makes clear that efficient broadcast algorithms are possible for mesh architectures. Their non-recursive nature, compared to hypercubes, does require more careful analysis in order to arrive at efficient implementations. While the idealized model provides insight, a more detailed model is also presented to more closely fit the specific architecture on which we performed our experiments.

The conclusions that we can draw from this work are the following:

• For short vectors, broadcasting on a mesh is as efficient as on a hypercube.
• Asymptotically, for long vectors, in theory one can broadcast in essentially the same time on a mesh as on a hypercube.
• In practice, we can conclude that as a general approach, this kind of pipelining is extremely architecture dependent, its performance is very erratic and unpredictable (see Figure 12 (a)), and it is an extremely difficult algorithm to implement efficiently.
• For long vectors, the scatter-collect algorithm has much nicer properties:
  - It is within a factor of two of optimal.

{ It is very predictable.{ The details of how the scatter and collect are implemented is architecture speci�c, but not thegeneral approach. Any scatter and collect that does not incur network con icts will su�ce, at thepotential expense of additional latency overhead.Ultimately, hybrid algorithms that combine the algorithm that is best for short vectors with an e�cientalgorithm for long vectors will need to be developed. We are currently investigating such hybrids.AcknowledgementsThis research was performed in part using the Intel Touchstone Delta System operated by the CaliforniaInstitute of Technology on behalf of the Concurrent Supercomputing Consortium. Access to this facility wasprovided by Intel Supercomputer Systems Division and the California Institute of Technology.References[1] M. Barnett, D. Payne, and R. van de Geijn. Optimal broadcasting in mesh-connected architectures.Technical Report TR-91-38, Department of Computer Sciences, The University of Texas at Austin, Dec.1991.[2] M. Barnett, R. Little�eld, D.G. Payne, and R. van de Geijn. E�cient Communication Primitives onMesh Architectures with Hardware Routing. Sixth SIAM Conf. on Par. Proc. for Sci. Comp., Norfolk,Virginia, March 22-24, 1993.[3] M. Barnett, R. Little�eld, D.G. Payne, and R. van de Geijn, Global Combine on Mesh Architectures withWormhole Routing, 7th International Parallel Processing Symposium, pages 156{162, IEEE ComputerSociety Press, Newport Beach, CA, April 13-16, 1993.[4] J.-C. Bermond, P. Michallon, and D. Trystram. Broadcasting in wraparound meshes with parallel monodi-rectional links. Parallel Computing, 18:639{648, 1992.[5] C.-T. Ho and S. L. Johnsson, Distributed Routing Algorithms for Broadcasting and Personalized Com-munication in Hypercubes. Proceedings of the 1986 International Conference on Parallel Processing,pages 640{648, IEEE, 1986.[6] C.-T. Ho and M.T. Raghunath. E�cient communication primitives on hypercubes. Technical Report RJ7932 (72915), IBM, Jan. 1991. 19

[7] S.L. Lillevik. The Touchstone 30 Gigaflop Delta Prototype. In Sixth Distributed Memory Computing Conference Proceedings, pages 671-677. IEEE Computer Society Press, 1991.
[8] R. Littlefield. Characterizing and Tuning Communications Performance on the Touchstone Delta and iPSC/860. Proceedings of the 1992 Intel User's Group Meeting, Dallas, TX, October 4-7.
[9] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. IEEE Computer, 26(2):62-76, Feb. 1993.
[10] Y. Saad and M. H. Schultz. Data Communication in Parallel Architectures. Yale University Research Report YALEU/DCS/RR-461, 857-873, 1986.
[11] S. R. Seidel. Broadcasting on Linear Arrays and Meshes. Oak Ridge National Laboratory Technical Report ORNL/TM-12356, Mar. 1993.
[12] R. A. van de Geijn. Efficient Global Combine Operations. In Sixth Distributed Memory Computing Conference Proceedings, pages 291-294. IEEE Computer Society Press, 1991.


A Proofs of the Lemmas

For Lemmas 1 and 2, assume $i < j < k < l$ and let $(x_r, x_c)$ denote the row and column indices of node $x$ ($x \in \{i, j, k, l\}$). Row-major ordering means that $i < j \equiv ((i_r = j_r \wedge i_c < j_c) \vee i_r < j_r)$. Therefore $i < j < k < l$ implies $i_r \le j_r \le k_r \le l_r$. If $j_r < k_r$, then the messages will be routed in disjoint sets of links, due to the minimum length path property (each pair is in a separate rectangular sub-mesh). Without loss of generality, assume $j_r = k_r$, which implies $j_c < k_c$. In each proof, the left column is for the pair of paths $\langle i, j \rangle$ and $\langle k, l \rangle$, the middle column is for the pair $\langle i, j \rangle$ and $\langle l, k \rangle$, and the right column is for the pair $\langle j, i \rangle$ and $\langle l, k \rangle$. Note that the pictures compress certain cases. For instance, the first row of pictures shows $i_r < j_r$ and $k_r < l_r$, but the pictures are also valid for $i_r = j_r$ or $k_r = l_r$.
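The row-major ordering used in these proofs can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function names and the parameter `cols` (the number of mesh columns) are assumptions, and nodes are assumed to be numbered from 0.

```python
from itertools import combinations

def to_row_col(x, cols):
    """Map a row-major node index to its (row, column) pair."""
    return divmod(x, cols)

def precedes(a, b, cols):
    """Ordering predicate from the text: same row and smaller column, or smaller row."""
    (ar, ac), (br, bc) = to_row_col(a, cols), to_row_col(b, cols)
    return (ar == br and ac < bc) or ar < br

def rows_nondecreasing(rows, cols):
    """Check that i < j < k < l implies i_r <= j_r <= k_r <= l_r for every quadruple."""
    for quad in combinations(range(rows * cols), 4):  # tuples come out sorted
        r = [to_row_col(x, cols)[0] for x in quad]
        if not (r[0] <= r[1] <= r[2] <= r[3]):
            return False
    return True
```

On a small mesh, `precedes(a, b, cols)` agrees with the plain integer comparison `a < b`, which is the sense in which the proofs treat the index order and the row-major order as the same thing.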


A.1 Proof of Lemma 1

Assume counter-clockwise routing.

• $i_c \le j_c$ and $k_c \le l_c$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]
• $i_c > j_c$ and $k_c \le l_c$. Note: $i_c > j_c \Rightarrow i_r < j_r$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]
• $i_c > j_c$ and $k_c > l_c$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]
• $i_c \le j_c$ and $k_c > l_c$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]

A.2 Proof of Lemma 2

Assume x-direction first routing.

• $i_c \le j_c$ and $k_c \le l_c$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]
• $i_c > j_c$ and $k_c \le l_c$. Note: $i_c > j_c \Rightarrow i_r < j_r$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]
• $i_c > j_c$ and $k_c > l_c$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]
• $i_c \le j_c$ and $k_c > l_c$.
  [Figure: left, middle, and right path diagrams on rows $i_r$, $j_r = k_r$, $l_r$ and columns $i_c$, $j_c$, $k_c$, $l_c$.]
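The case analysis for x-direction first routing can be cross-checked by brute force on a small mesh. The Python sketch below is not from the paper and rests on two assumptions: "x-direction first" is read as moving along the row before moving along the column, and the two directions of a physical link are treated as separate channels, so a conflict means sharing a directed link. All names are hypothetical.

```python
from itertools import combinations

def xy_path(src, dst, cols):
    """Directed links used when routing src -> dst along the row first, then the column.

    Nodes are row-major indices; a link is a pair of (row, col) tuples.
    """
    (r, c), (dr, dc) = divmod(src, cols), divmod(dst, cols)
    links = set()
    while c != dc:                       # x-direction: move within the row
        nc = c + (1 if dc > c else -1)
        links.add(((r, c), (r, nc)))
        c = nc
    while r != dr:                       # then move within the column
        nr = r + (1 if dr > r else -1)
        links.add(((r, c), (nr, c)))
        r = nr
    return links

def count_conflicts(rows, cols, pattern):
    """Count quadruples i < j < k < l whose two paths share a directed link."""
    conflicts = 0
    for i, j, k, l in combinations(range(rows * cols), 4):
        (a, b), (c, d) = pattern(i, j, k, l)
        if xy_path(a, b, cols) & xy_path(c, d, cols):
            conflicts += 1
    return conflicts

# The three path pairs covered by the left, middle, and right columns of the proof.
PATTERNS = {
    "<i,j>,<k,l>": lambda i, j, k, l: ((i, j), (k, l)),
    "<i,j>,<l,k>": lambda i, j, k, l: ((i, j), (l, k)),
    "<j,i>,<l,k>": lambda i, j, k, l: ((j, i), (l, k)),
}
```

Under these assumptions, `count_conflicts(4, 5, p)` is zero for each of the three patterns. The remaining combination, $\langle j, i \rangle$ with $\langle k, l \rangle$, is not covered by the proof, and the same check does find conflicts for it.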

A.3 Proof of Lemma 3

First, we add two auxiliary variables to RSbcast, LP and RP, to represent the boundaries of the partitions. The algorithm has the following invariant:

$$LP \le RP \;\wedge\; \bigl(\text{send from } i \text{ to } j \Rightarrow (0 \le i < j < LP) \vee (LP \le i, j < RP) \vee (RP \le j < i < p)\bigr)$$

The invariant is initialized by assigning 0 to LP and $p - 1$ to RP. At the end of each step of RSbcast, if the current root is to the left of middle then RP is set to middle, else LP is set to middle. When RP is changed, the right partition increases in size; the algorithm RS right is executed between middle and (the current value of) right. Similarly, when LP is changed, the left partition increases in size, and RS left is executed between (the current value of) left and middle - 1. Clearly, within the left (right) partition, messages move only to the right (left). At each step, there is exactly one message in the middle partition: this is the message that the root sends. The message is to another node in the middle partition, after which the partition boundaries are adjusted. Thus, any pair of messages within the left partition corresponds to pattern (1), any pair within the right partition corresponds to pattern (3), and any pair with one being the message in the middle partition corresponds to pattern (2). The algorithm terminates when LP = RP.
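The invariant can be written as a small predicate, which may help when tracing the proof by hand. This Python sketch is an illustration only: it does not implement RSbcast, the function name is hypothetical, and it reads the middle clause as requiring both $i$ and $j$ to lie in the half-open range from LP up to RP.

```python
def invariant_holds(LP, RP, p, sends):
    """Check the stated invariant for a list of (i, j) sends on a linear array of p nodes."""
    def allowed(i, j):
        left = 0 <= i < j < LP                    # left partition: messages move right
        middle = LP <= i < RP and LP <= j < RP    # both endpoints in the middle partition
        right = RP <= j < i < p                   # right partition: messages move left
        return left or middle or right
    return LP <= RP and all(allowed(i, j) for i, j in sends)
```

For example, with `LP = 3`, `RP = 6`, and `p = 8`, the sends `(0, 2)`, `(4, 5)`, and `(7, 6)` satisfy the predicate, while a leftward send `(2, 0)` inside the left partition does not.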
