
Low Diameter Interconnections for Routing in High-Performance Parallel Systems

Rami Melhem, Fellow, IEEE

Abstract—A new class of Low Diameter Interconnections (LDI) is proposed for high-performance computer systems that are augmented with circuit switching networks. In these systems, the network is configured to match the communication patterns of applications, when these patterns exhibit temporal locality, and to embed a logical topology to route traffic that does not exhibit locality. The new LDI topology is a surprisingly simple directed graph which minimizes the network diameter for a given node degree and number of nodes. It can be easily embedded in circuit switching networks to route random traffic with high bandwidth and low latency.

Index Terms—Interconnection networks, circuit switching, fixed diameter graphs, directed graphs, deterministic routing, low diameter networks.


1 INTRODUCTION AND MOTIVATION

NETWORK interconnections for massively parallel systems have been extensively studied and many routing algorithms have been designed to efficiently route messages on these networks. The quality of routing algorithms is affected by two important characteristics of the network: its diameter and its node degree. These two measures are tightly related in the sense that a small node degree implies a large network diameter and a small diameter leads to a large node degree. Specifically, if the maximum degree of the nodes in a directed graph is $S$, then any node cannot reach more than $S$ other nodes in one hop, $S + S^2$ other nodes in at most two hops, and, in general, $S + S^2 + \cdots + S^h$ other nodes in at most $h$ hops. Hence, to be able to reach all $M - 1$ other nodes in at most $h$ hops, the value of $h$ should satisfy $S + S^2 + \cdots + S^h \ge M - 1$. In other words, for a given number of nodes, the relation between the node degree, $S$, and the diameter, $h$, is given by $(S^{h+1} - 1)/(S - 1) \ge M$, which is called the Moore bound. It has been proven [2] that, except for the cases of $S = 1$ or $h = 1$, there exists no directed graph that satisfies this bound with equality. Asymptotically, however, the Moore bound indicates that, in order to achieve a fixed diameter of $h$, the node degree should be at least equal to $\sqrt[h]{M}$.
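As a quick illustration of how the Moore bound constrains the design space, the following sketch (illustrative Python, not from the paper) computes the smallest diameter that any degree-$S$ digraph with $M$ nodes could possibly achieve.

def moore_reach(S: int, h: int) -> int:
    """Nodes reachable within h hops from one node, itself included:
    1 + S + S^2 + ... + S^h."""
    return sum(S ** i for i in range(h + 1))

def min_diameter(M: int, S: int) -> int:
    """Smallest h with moore_reach(S, h) >= M, i.e., the best diameter any
    degree-S digraph with M nodes could possibly achieve."""
    h = 0
    while moore_reach(S, h) < M:
        h += 1
    return h

if __name__ == "__main__":
    # For M = 4,096 nodes, the bound dictates diameters of at least
    # 12, 6, 4, and 3 for node degrees 2, 4, 8, and 16, respectively.
    for S in (2, 4, 8, 16):
        print(S, min_diameter(4096, S))

For $M$ = 4,096, these lower bounds coincide with the diameters that the LDI construction described later actually attains, since 4,096 is a power of 2.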

In many parallel applications, scalability is affected by communication latency and, consequently, by the diameter of the network. Hence, bounding the number of hops required to communicate between any pair of nodes in the network leads to better scalability. In this paper, we introduce a low diameter directed graph that minimizes the node degree. This work is motivated by the recent interest in using circuit switching in scalable high-performance computing systems [1], [17], [18], [21]. Specifically, optical switches were recently proposed for establishing direct connections among processors to match the communication requirements of high-performance applications [1], [17]. This amounts to embedding the communication graph of the application into the switching network and is ideal for applications that exhibit communication locality.

In order to clarify the concept, consider a 9-node system connected by three 9×9 crossbar switches (called switching planes). If the communication pattern of an application requires the mesh-like interconnection shown in Fig. 1a, then the crossbars can be set to match that pattern, as long as the number of crossbars (switching planes) in the system is at least equal to the degree of the communication pattern. If the pattern for another application (or in another phase of the same application) requires the ring interconnection shown in Fig. 1b, then that pattern can be realized by appropriately setting the crossbar switches. Note that time-division multiplexing of a single switch can be used to realize multiple switch settings, as proposed in [6], [16]. Also note that other networks, such as fat trees or multistage networks, can be used instead of the crossbars as switching planes to establish direct connections between the nodes in the system.

Establishing direct connections between communicating nodes increases the communication bandwidth and decreases its latency, compared to packet or wormhole switching. However, due to the large overhead of establishing connections, circuit switching is not suitable for communication that exhibits poor locality. The best way to deal with such communication in circuit switching networks is to use the switches to embed a suitable topology which allows multihop routing of messages on this topology. Given a specific node degree, implied by the number of available switching planes, the Low Diameter Interconnection (LDI) introduced in this paper provides a means for finding the topology which minimizes the maximum number of hops for routing a message between any two nodes in the network.

The main result of this paper is to show that, for a given number of nodes, $M$, if $M$ can be decomposed as $M = S^{h-1}G$, where $1 < G \le S$, then it is possible to build a directed graph with $M$ nodes whose diameter is $h$ and whose node degree is $S$, which, as shown earlier, is asymptotically optimal. In addition, the class of Low Diameter Interconnections can be easily embedded in circuit switched architectures and yields a simple routing algorithm which routes messages in at most $h$ hops. The practical impact of the restriction on the decomposability of the number of nodes is minimal since, in most high-performance computing systems, the number of nodes is a power of 2, which always allows for the above-mentioned decomposition.

Directed graphs have been extensively studied in the literature. For instance, the Kautz graphs and the DeBruijn graphs are the best-known classes of graphs for maximizing the number of nodes for given node degrees and diameters [3], [5], [14], [15]. Both the Kautz and the DeBruijn graphs may be generated through line digraph iterations on complete directed graphs (without and with self-loops, respectively). Specifically, a node in a Kautz/DeBruijn graph is created for each link in the complete graph and a link in the Kautz/DeBruijn graph is created for each path of length 2 in the complete graph. Fig. 2a and Fig. 2b show examples of the two graphs when the node degree is 2 and the diameter is 3. Although optimally generated, the Kautz and DeBruijn graphs are not useful for finding the graph with the smallest diameter, given the number of nodes and the node degree. The partial line digraph technique proposed in [11] adds the flexibility of removing edges from the complete graphs before applying the line digraph iterations, thus obtaining graphs with small diameters for an arbitrary number of nodes. The construction of the LDI graphs introduced in this paper is simpler than the partial line digraph technique. Moreover, the routing algorithm for LDI graphs is uniform and depends only on the parameters of the graph. Finally, there is an explicit formula for decomposing the links in LDI graphs such that their embedding in circuit switched networks is straightforward. In Section 2, it will be shown that, for the special case when the number of nodes is $M = S^h$, the LDI network is equivalent to the DeBruijn graph.


Fig. 1. A 9-node system connected by three crossbars. The crossbars are configured to realize the shown interconnection topology. The circles on the left of the switches represent the output NICs (network interface cards) of the nodes and those on the right of the switches represent the input NICs of the nodes.

Fig. 2. Example graphs with node degree 2 and diameter 3. (a) DeBruijn graph. (b) Kautz graph. (c) ShuffleNet.


Both directional and bidirectional ShuffleNets [12], [13] have been proposed as interconnection networks. They are based on a generalization of the Perfect Shuffle connection patterns [19]. Specifically, a ShuffleNet with node degree $S$ and parameter $k$ has $M = kS^k$ nodes arranged in $k$ columns of $S^k$ nodes each, and the connections between the columns form a Perfect Shuffle (see Fig. 2c for an example with $k = S = 2$, where nodes 1, 2, 3, and 4 are duplicated for clarity). This leads to a network diameter of $2k - 1$, which is larger than the diameter of the LDI network with the same number of nodes and node degree. For example, for $S = 4$ and $M$ = 1,024 nodes, the diameter of the ShuffleNet is 7 and that of the LDI is 5.

This paper is organized as follows: In the next section, the LDI graphs are defined and, in Section 3, their embeddings in circuit switched networks are demonstrated. In Section 4, two simple cases of LDI networks are presented to clarify the topologies and routing algorithms for these networks. In Section 5, the routing algorithm for a general class of LDI networks is derived. Concluding remarks are given in Section 6.

2 THE LDI TOPOLOGY

For any $S > 1$, $h > 1$, and $S^{h-1} < M \le S^h$, define the LDI$(M, S)$ topology as a directed graph, $(V, E)$, where $V = \{0, \ldots, M-1\}$ is a set of $M$ nodes and $E$ is the set of $MS$ directed edges (links), given by

$$E = \{\langle n, (Sn + L) \bmod M\rangle : n = 0, \ldots, M-1 \text{ and } L = 0, \ldots, S-1\}, \qquad (1)$$

where $\langle u, v\rangle$ denotes a link from node $u$ to node $v$.

The link $\langle n, (Sn + L) \bmod M\rangle$ will be called the $L$th link of node $n$. More descriptively, each node, $n$, in LDI$(M, S)$ has $S$ output links connecting it to nodes $(Sn) \bmod M, (Sn + 1) \bmod M, \ldots, (Sn + S - 1) \bmod M$.
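The following short Python sketch (an illustration, not code from the paper) builds the link set exactly as prescribed by definition (1).

def ldi_successors(n: int, M: int, S: int) -> list[int]:
    """Destinations of the S output links of node n in LDI(M, S)."""
    return [(S * n + L) % M for L in range(S)]

def ldi_edges(M: int, S: int) -> list[tuple[int, int]]:
    """The M*S directed links <n, (S*n + L) mod M> of LDI(M, S)."""
    return [(n, dest) for n in range(M) for dest in ldi_successors(n, M, S)]

# Example: in LDI(9, 3), node 2 is linked to nodes 6, 7, and 8 (cf. Fig. 3b).
assert ldi_successors(2, M=9, S=3) == [6, 7, 8]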

Fig. 3 shows LDI(7, 2) and LDI(9, 3) as two examples of LDI networks with node degrees 2 and 3, respectively. Note that the definition in (1) includes links directed from a node to itself. These links are removed in Fig. 3a and Fig. 3b. Fig. 3c and Fig. 3d show the unfolded LDI graphs, where the circle labeled $n$ on the left of the graph and the circle labeled $n$ on the right of the graph represent the same node (or represent the output Network Interface Card, NIC, and the input NIC of node $n$, respectively). These unfolded graphs demonstrate the regularity of the connections. Further unfolding of the graphs, as in Fig. 3e and Fig. 3f, shows that any destination in LDI(7, 2) can be reached from any source in three hops and that any destination in LDI(9, 3) can be reached from any source in two hops.

It is straightforward to argue that the diameter of LDI$(M, S)$ is $h$ if $S^{h-1} < M \le S^h$. Specifically, from any node $n$, definition (1) implies that we can reach $S$ consecutive nodes from $n$ in one hop, where two nodes $u$ and $v$ are said to be consecutive if $v = (u + 1) \bmod M$. In two hops from $n$, we can thus reach $S^2$ consecutive nodes and, in general, in $h$ hops, we can reach up to $S^h$ consecutive nodes. Given that $M \le S^h$, any of the $M$ nodes can be reached in at most $h$ hops. For example, from Fig. 3c, it can be seen that node 2 can reach nodes 4 and 5 in one hop, nodes 1, 2, 3, and 4 in two hops, and all seven nodes in three hops.
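The reachability argument above is easy to confirm by brute force on small instances; the sketch below (illustrative only) runs a breadth-first search from every node and reports the worst-case hop count.

from collections import deque

def ldi_diameter(M: int, S: int) -> int:
    """Worst-case number of hops between any two nodes of LDI(M, S)."""
    succ = [[(S * n + L) % M for L in range(S)] for n in range(M)]
    worst = 0
    for src in range(M):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

assert ldi_diameter(7, 2) == 3   # matches Fig. 3e
assert ldi_diameter(9, 3) == 2   # matches Fig. 3f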

Fig. 3. Examples of low diameter interconnections. (a) LDI(7, 2). (b) LDI(9, 3). (c) LDI(7, 2) unfolded. (d) LDI(9, 3) unfolded. (e) LDI(7, 2) unfolded three times. (f) LDI(9, 3) unfolded twice.

As indicated in the introduction, when $M = S^h$, LDI$(M, S)$ is equivalent to the DeBruijn graph. Specifically, for any $h$ and $S$, the DeBruijn graph with $S^h$ nodes may be constructed by labeling the nodes in the graph using the $S^h$ strings of $h$ characters over an alphabet of $S$ symbols, say $A = \{a_0, \ldots, a_{S-1}\}$. That is, nodes may be labeled with strings of the form "$x_{h-1}, \ldots, x_0$," where each $x_i$, $i = 0, \ldots, h-1$, is in $A$. Edges are then added from each node $X$, labeled by "$x_{h-1}, \ldots, x_0$," to every node $Y$ labeled by "$x_{h-2}, \ldots, x_0, y$," where $y$ is in $A$. That is, $Y$ is obtained by shifting the label of $X$ one position to the left and adding any of the symbols of $A$ at the rightmost position. The equivalence to the LDI network is obtained by taking $A$ as the set of integers $\{0, \ldots, S-1\}$ and interpreting the label "$x_{h-1}, \ldots, x_0$" of a node $X$ as the integer $n = \sum_{i=0}^{h-1} x_i S^i$ (here, $S^i$ is $S$ raised to the power $i$; that is, $n$ is the integer $x_{h-1} \ldots x_0$ in the base-$S$ number system). Hence, the label "$x_{h-2}, \ldots, x_0, y$" of node $Y$ is interpreted as the integer $n' = \sum_{i=1}^{h-1} x_{i-1} S^i + y$. Simple arithmetic shows that $n' = (Sn + y) \bmod S^h = (Sn + y) \bmod M$, which is the same relation governing the connectivity in LDI.
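The equivalence can also be checked mechanically for small parameters; the sketch below (an illustration under the integer-label interpretation just described, not the paper's code) compares the shift-and-append DeBruijn edges with the LDI links.

from itertools import product

def debruijn_edges(S: int, h: int) -> set[tuple[int, int]]:
    """Edges of the DeBruijn graph with S^h nodes, with labels read as base-S integers."""
    edges = set()
    for label in product(range(S), repeat=h):          # label = (x_{h-1}, ..., x_0)
        n = sum(x * S**i for i, x in enumerate(reversed(label)))
        for y in range(S):
            shifted = label[1:] + (y,)                  # x_{h-2}, ..., x_0, y
            n2 = sum(x * S**i for i, x in enumerate(reversed(shifted)))
            edges.add((n, n2))
    return edges

def ldi_edges(M: int, S: int) -> set[tuple[int, int]]:
    return {(n, (S * n + L) % M) for n in range(M) for L in range(S)}

assert debruijn_edges(S=3, h=2) == ldi_edges(M=9, S=3)
assert debruijn_edges(S=2, h=3) == ldi_edges(M=8, S=2)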

3 DECOMPOSITION OF LDI INTO PERMUTATIONS

The links in LDI$(M, S)$ can be grouped into $S$ sets, $\pi_y$, $y = 0, \ldots, S-1$, defined as follows:

$$\pi_y = \{\langle n, (Sn + L) \bmod M\rangle : n = 0, \ldots, M-1 \text{ and } L \text{ satisfies } y = (n \,\mathrm{div}\, S + L) \bmod S\}. \qquad (2)$$

Clearly, each set, $\pi_y$, contains $M$ links. Given that $0 \le L < S$, any particular set, $\pi_y$, contains exactly one link from a given node, $n$, since, for the same value of $n$, two different values of $L$ give two different values of $(n \,\mathrm{div}\, S + L) \bmod S$.

Next, it will be shown that any two links in the same set $\pi_y$ terminate at two different nodes. For this, define $Dest(n, y)$ as the destination of the link in $\pi_y$ whose source node is $n$. It can be shown that

$$Dest(n, y) = (Sn + (y - n \,\mathrm{div}\, S) \bmod S) \bmod M. \qquad (3)$$

Specifically, if link $\langle n, (Sn + L) \bmod M\rangle$ is in $\pi_y$, then $y = (n \,\mathrm{div}\, S + L) \bmod S$. Substituting this value of $y$ in (3) and using Rule 1 from the Appendix, we get

$$Dest(n, y) = [Sn + ((n \,\mathrm{div}\, S + L) \bmod S - n \,\mathrm{div}\, S) \bmod S] \bmod M = [Sn + (n \,\mathrm{div}\, S + L - n \,\mathrm{div}\, S) \bmod S] \bmod M = [Sn + L \bmod S] \bmod M = (Sn + L) \bmod M.$$

Now, noting that $0 \le (y - n \,\mathrm{div}\, S) \bmod S < S$, we conclude from (3) that, if $n' \ne n$, then $Dest(n', y)$ cannot be equal to $Dest(n, y)$.

Proving that the $M$ links in each set $\pi_y$ have different source nodes and different destination nodes has two consequences. First, it proves that both the in-degree and the out-degree of each node in the LDI graph are equal to $S$ and, second, it shows that, because the sources and destinations of the links in each of the $S$ sets $\pi_y$, $y = 0, \ldots, S-1$, form a permutation, the realization of the LDI connectivity through any $S$ nonblocking switches is straightforward.

In addition to demonstrating the network topology and the routing algorithm, we use the example of $M = 9$ and $S = 3$ (see Fig. 3) to illustrate the decomposition of the links in the LDI topology into $S = 3$ sets as specified in (2). Specifically,

$\pi_0 = \{\langle 0,0\rangle, \langle 1,3\rangle, \langle 2,6\rangle, \langle 3,2\rangle, \langle 4,5\rangle, \langle 5,8\rangle, \langle 6,1\rangle, \langle 7,4\rangle, \langle 8,7\rangle\}$
$\pi_1 = \{\langle 0,1\rangle, \langle 1,4\rangle, \langle 2,7\rangle, \langle 3,0\rangle, \langle 4,3\rangle, \langle 5,6\rangle, \langle 6,2\rangle, \langle 7,5\rangle, \langle 8,8\rangle\}$
$\pi_2 = \{\langle 0,2\rangle, \langle 1,5\rangle, \langle 2,8\rangle, \langle 3,1\rangle, \langle 4,4\rangle, \langle 5,7\rangle, \langle 6,0\rangle, \langle 7,3\rangle, \langle 8,6\rangle\}.$

It is straightforward to check that each set is a permutation and, thus, the LDI connectivity can be accomplished by appropriately setting three crossbar switches, as shown in Fig. 4.
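The decomposition of equation (2) is equally easy to generate programmatically; the following sketch (illustrative only) reproduces the three sets above and checks that each one is a permutation, i.e., a legal setting for one nonblocking crossbar plane.

def link_sets(M: int, S: int) -> list[dict[int, int]]:
    """sets[y] maps each source n to the destination of its link in set pi_y."""
    sets = [dict() for _ in range(S)]
    for n in range(M):
        for L in range(S):
            y = (n // S + L) % S          # set index from equation (2)
            sets[y][n] = (S * n + L) % M
    return sets

for y, pi in enumerate(link_sets(M=9, S=3)):
    # One link per source and all destinations distinct: a permutation.
    assert len(pi) == 9 and len(set(pi.values())) == 9
    print(f"pi_{y}:", sorted(pi.items()))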

Although it was shown that the diameter of LDI$(M, S)$ is $h$ if $S^{h-1} < M \le S^h$, only the case where $M = S^{h-1}G$ for some $0 < G \le S$ results in a simple and regular routing algorithm. In the next section, we first illustrate the routing algorithm and prove its correctness for the special case that leads to two-hop routing.

4 TWO-HOP ROUTING IN LDI$(M, S)$ FOR $M = SG$ AND $G \le S$

Let us revisit LDI(9, 3). From Fig. 3d and Fig. 3f, it is easy to check that there is a two-hop path between any two nodes. Specifically, the nine nodes are divided into three groups (nodes 0, 1, 2 are one group, nodes 3, 4, 5 are another group, and nodes 6, 7, 8 are a third group). Each group can connect to all nine nodes. The first node in any group can reach nodes 0, 1, 2, the second node in any group can reach nodes 3, 4, 5, and the third node in any group can reach nodes 6, 7, 8. Given that any node, $n$, is connected to all the nodes in some group, say group $g$, the first routing step (hop) for a connection from $n$ to a destination, $d$, is to reach the particular node in $g$ that is connected to the destination $d$, namely, a node $n^{(1)}$ that satisfies $n^{(1)} \bmod S = d \,\mathrm{div}\, S$. The second routing step (hop) is to reach the destination. This argument is formalized and generalized in the following theorem:


Fig. 4. Switch setting to realize LDI(9, 3).


Theorem 1. Assuming that $M = S^2$, then, for any source $0 \le n < M$ and destination $0 \le d < M$, a node $n^{(1)}$ can be found such that links $\langle n, n^{(1)}\rangle$ and $\langle n^{(1)}, d\rangle$ are in LDI$(M, S)$.

Proof. We will find two integers, $0 \le \ell^{(0)} < S$ and $0 \le \ell^{(1)} < S$, such that

$$(Sn + \ell^{(0)}) \bmod M = n^{(1)}, \qquad (4)$$

$$(Sn^{(1)} + \ell^{(1)}) \bmod M = d. \qquad (5)$$

Let $\ell^{(0)} = d \,\mathrm{div}\, S$ and substitute this value in (4) to get $n^{(1)} = (Sn + d \,\mathrm{div}\, S) \bmod M$. Then, substituting $n^{(1)}$, along with $\ell^{(1)} = d \bmod S$, in the left-hand side of (5), we get

$$(Sn^{(1)} + \ell^{(1)}) \bmod M = [S((Sn + d \,\mathrm{div}\, S) \bmod M) + d \bmod S] \bmod M = [S^2 n + S(d \,\mathrm{div}\, S) + d \bmod S] \bmod M = d.$$

In the above simplification, we used Rule 1 from the Appendix and the fact that $S(d \,\mathrm{div}\, S) + d \bmod S = d$. Clearly, (4) specifies that $\langle n, n^{(1)}\rangle$ is in LDI$(M, S)$ and (5) specifies that $\langle n^{(1)}, d\rangle$ is in LDI$(M, S)$. □

Theorem 1 specifies a path of length 2 in LDI$(S^2, S)$ between any source node $n$ and destination node $d$. Thus, it provides a simple routing algorithm from $n$ to $d$. Namely,

1. From node $n$, use the $L$th link, where $L = d \,\mathrm{div}\, S$, to reach node $n^{(1)} = (Sn + d \,\mathrm{div}\, S) \bmod M$.
2. From node $n^{(1)}$, use the $L$th link, where $L = d \bmod S$, to reach node $d$.

Note that the above routing always takes two hops, even if there is a direct link from $n$ to $d$. In order to take advantage of single-hop routes, a test should be performed at $n$ before the first hop to check whether $(d - Sn) \bmod M < S$ and, if so, link $L$, where $L = (d - Sn) \bmod M$, should be used to route directly to $d$. Moreover, if $n = n^{(1)}$, then the first routing step can be eliminated.
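A compact sketch of this two-hop routing rule, including the single-hop shortcut, is shown below (illustrative Python, not the author's implementation).

def route_two_hop(n: int, d: int, S: int) -> list[int]:
    """Node sequence from n to d in LDI(S*S, S)."""
    M = S * S
    if (d - S * n) % M < S:                     # d is a direct successor of n
        return [n, d]
    n1 = (S * n + d // S) % M                   # hop 1: link L = d div S
    path = [n, n1] if n1 != n else [n]          # skip hop 1 if n already qualifies
    path.append((S * n1 + d % S) % M)           # hop 2: link L = d mod S
    assert path[-1] == d
    return path

# Example in LDI(9, 3): route from node 2 to node 4 (cf. Fig. 3d and Fig. 3f).
print(route_two_hop(2, 4, S=3))   # -> [2, 7, 4]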

After considering the case of $M = S^2$, the more general case of $M = SG$, for $G \le S$, is considered. Fig. 5 shows the unfolded connections for LDI(12, 4), which contains $M = 12$ nodes. The node degree for this case is $S = 4$ and the value of $G$ is 3. The unfolded graph shows that any node can reach any other node in a maximum of two steps. Specifically, the nodes are divided into three groups of four nodes each and any node can be reached from any of the groups. In fact, because there are fewer groups than nodes within a group, some nodes can be reached from a given group through more than one intermediate node, resulting in multiple paths between some sources and destinations. For example, two paths from node 5 to node 9 are shown in bold in Fig. 5.

The proof of the following theorem is similar to that of Theorem 1. It is omitted because it is a special case of the more general theorem proved in the next section.

Theorem 2. Assuming that $M = SG$, for $G \le S$, then, for any source $0 \le n < M$ and destination $0 \le d < M$, a node $n^{(1)}$ can be found such that links $\langle n, n^{(1)}\rangle$ and $\langle n^{(1)}, d\rangle$ are in LDI$(M, S)$.

The routing algorithm from any source node $n$ to any destination node $d$ is as follows:

1. From node $n$, use the $L$th link, where $L$ is an integer that satisfies $(Sn + L) \bmod G = d \,\mathrm{div}\, S$, to reach node $n^{(1)}$.
2. From node $n^{(1)}$, use the $L$th link, where $L = d \bmod S$, to reach node $d$.

Note that, when $G < S$, there may be more than one value of $L$ that satisfies $(Sn + L) \bmod G = d \,\mathrm{div}\, S$ in the first routing step, thus leading to the multiple path property.
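The sketch below (illustrative only) enumerates every value of $L$ admitted by the first routing step and therefore every two-hop path; for LDI(12, 4) it reproduces the two paths from node 5 to node 9 highlighted in Fig. 5.

def two_hop_paths(n: int, d: int, S: int, G: int) -> list[list[int]]:
    """All two-hop paths from n to d in LDI(S*G, S) allowed by step 1."""
    M = S * G
    paths = []
    for L in range(S):
        if (S * n + L) % G == d // S:                     # step 1 condition
            n1 = (S * n + L) % M
            paths.append([n, n1, (S * n1 + d % S) % M])   # step 2: link d mod S
    return paths

# LDI(12, 4): the two paths from node 5 to node 9 shown in bold in Fig. 5.
print(two_hop_paths(5, 9, S=4, G=3))   # -> [[5, 8, 9], [5, 11, 9]]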

Given the equivalence established in Section 2 between LDI$(S^2, S)$ and DeBruijn graphs and using previously established results about DeBruijn graphs [8], [20], we can conclude that LDI$(S^2, S)$ remains connected in the presence of any $S - 2$ faulty nodes and that the diameter of the faulty network only increases from $h = 2$ to $h = 3$.

Fig. 5. The unfolded graph for LDI(12, 4) showing multiple paths from node 5 to node 9.

Although it seems that the multiple path property of LDI when $G < S$ may enhance its fault tolerance, it can be easily shown that $S - \lceil S/G\rceil$ faulty nodes can partition LDI$(SG, S)$ and, clearly, $\lceil S/G\rceil$ can be equal to, and even larger than, 2. Specifically, the connectivity of LDI$(SG, S)$ for $G < S$ implies that there is at least one group of $S$ nodes such that $\lceil S/G\rceil$ nodes in that group are only connected to nodes in the same group. Hence, those nodes can be isolated from the rest of the network if the other $S - \lceil S/G\rceil$ nodes in the group are faulty. For example, consider LDI(15, 5) shown in Fig. 6a, which is composed of $G = 3$ groups with $S = 5$ nodes in each group. In this case, $\lceil S/G\rceil = 2$ and, as shown in Fig. 6b, $S - 2 = 3$ faults (namely, in nodes 1, 2, and 4) can partition the network such that nodes 0 and 3 are isolated from the other nodes. It can be shown, however, that, in the presence of any $S - \lceil S/G\rceil - 1 = 2$ faulty nodes, LDI(15, 5) remains connected and its diameter increases from $h = 2$ to $h = 3$ (see Fig. 6c for an example). At this point, it should be mentioned that, when nonblocking switches are used to embed an LDI network, as described in Section 3, the importance of network connectivity to fault tolerance diminishes. This is because of the capability to reconfigure the switches to restore connectivity. For example, in Fig. 6d, the connectivity of the network of Fig. 6b is restored by the addition of links from nodes 0 and 3 to nodes 5, 6, 7, and 8. In fact, for any number of faults, $f$, in LDI$(M, S)$, it is always possible to restore connectivity by reconfiguring the switches to embed LDI$(M - f, S)$.

5 ROUTING IN THE GENERAL LDI$(S^{h-1}G, S)$ FOR $G \le S$

In this section, Theorem 2 is generalized. Specifically, an $h$-hop route in LDI$(S^{h-1}G, S)$ will be found between any source node, $n$, and destination node, $d$. In order to simplify the notation, $n$ and $d$ will be denoted by $n^{(0)}$ and $n^{(h)}$, respectively.

Theorem 3. Assuming that $M = S^{h-1}G$ for some $G \le S$ and some $h > 1$, then, for any source node, $0 \le n^{(0)} < M$, and destination node, $0 \le n^{(h)} < M$, there are $h - 1$ nodes, $0 \le n^{(i)} < M$, $i = 1, \ldots, h-1$, such that links $\langle n^{(i)}, n^{(i+1)}\rangle$, $i = 0, \ldots, h-1$, are in LDI$(M, S)$.

Proof. We will prove the theorem by finding explicit expressions for the links on an $h$-hop path from $n$ to $d$. That is, we find $h - 1$ integers between 0 and $M - 1$, namely, $n^{(1)}, \ldots, n^{(h-1)}$, and $h$ integers between 0 and $S - 1$, namely, $\ell^{(0)}, \ldots, \ell^{(h-1)}$, that satisfy the following:

$$(Sn^{(i)} + \ell^{(i)}) \bmod M = n^{(i+1)} \quad \text{for } i = 0, \ldots, h-1, \qquad (6)$$

which proves that $\langle n^{(i)}, n^{(i+1)}\rangle$, $i = 0, \ldots, h-1$, are links in LDI$(M, S)$ forming an $h$-hop path from $n = n^{(0)}$ to $d = n^{(h)}$. In other words, we will solve (6) for $\ell^{(0)}, \ldots, \ell^{(h-1)}$ and $n^{(1)}, \ldots, n^{(h-1)}$ in terms of $n^{(0)}$ and $n^{(h)}$.

First, we divide both sides of (6) by $S$ to obtain

$$[(Sn^{(i)} + \ell^{(i)}) \bmod M] \,\mathrm{div}\, S = n^{(i+1)} \,\mathrm{div}\, S \quad \text{for } i = 0, \ldots, h-1.$$

Using Rule 3 from the Appendix and noting that $M = S^{h-1}G$, we get the following relations between the candidate nodes on the path from $n$ to $d$:

$$n^{(i)} \bmod (S^{h-2}G) = n^{(i+1)} \,\mathrm{div}\, S \quad \text{for } i = 0, \ldots, h-1. \qquad (7)$$

The proof of the theorem follows directly from the proofs of the following two lemmas:

Lemma 1. The values of $n^{(i)}$ that satisfy (7) also satisfy the following equations:

$$n^{(i)} \bmod (S^{i-1}G) = n^{(h)} \,\mathrm{div}\, S^{h-i} \quad \text{for } i = 1, \ldots, h-1. \qquad (8)$$

Proof. It will be shown that (8) is true if (7) is true by backward induction on $i$. Clearly, for $i = h-1$, (8) is equivalent to (7), which is true by hypothesis. Next, assuming that (8) is true for $i = a$, where $1 < a \le h-1$, we will prove that it is true for $i = a-1$. Specifically, the induction hypothesis is obtained by using $i = a$ in (8). That is,

$$n^{(a)} \bmod (S^{a-1}G) = n^{(h)} \,\mathrm{div}\, S^{h-a}.$$

By dividing both sides of the equation by $S$ (i.e., taking div $S$), we get

$$[n^{(a)} \bmod (S^{a-1}G)] \,\mathrm{div}\, S = n^{(h)} \,\mathrm{div}\, S^{h-a+1}. \qquad (9)$$

Now, applying Rule 3 from the Appendix to the LHS of (9) and using (7) gives

$$[n^{(a)} \bmod (S^{a-1}G)] \,\mathrm{div}\, S = [n^{(a)} \,\mathrm{div}\, S] \bmod (S^{a-2}G) = [n^{(a-1)} \bmod (S^{h-2}G)] \bmod (S^{a-2}G) = n^{(a-1)} \bmod (S^{a-2}G).$$

Substituting back in (9) gives

$$n^{(a-1)} \bmod (S^{a-2}G) = n^{(h)} \,\mathrm{div}\, S^{h-a+1},$$

which shows that (8) is true for $i = a-1$ and, thus, completes the proof of Lemma 1. □

Fig. 6. LDI(15, 5): (a) with no faults, (b) with three faults that partition the network, (c) with two faults that do not partition the network, and (d) after restoring connectivity for case (b).

Lemma 1 specifies the relation that should hold between each of the intermediate nodes $n^{(1)}, \ldots, n^{(h-1)}$ and the destination $n^{(h)}$. The following lemma finds the links on the path connecting these nodes:

Lemma 2. With $n^{(1)}, \ldots, n^{(h-1)}$ given by (8), the values of $\ell^{(0)}, \ldots, \ell^{(h-1)}$ that solve (6) satisfy the following:

$$(Sn^{(0)} + \ell^{(0)}) \bmod G = n^{(h)} \,\mathrm{div}\, S^{h-1}, \qquad (10.a)$$

$$\ell^{(i)} = (n^{(h)} \,\mathrm{div}\, S^{h-i-1}) \bmod S \quad \text{for } i = 1, \ldots, h-1. \qquad (10.b)$$

Proof. From (6), we get

$$[(Sn^{(i)} + \ell^{(i)}) \bmod M] \bmod (S^i G) = n^{(i+1)} \bmod (S^i G) \quad \text{for } i = 0, \ldots, h-1.$$

By applying Rule 2 from the Appendix to the LHS, we get

$$(Sn^{(i)} + \ell^{(i)}) \bmod (S^i G) = n^{(i+1)} \bmod (S^i G) \quad \text{for } i = 0, \ldots, h-1.$$

Applying Lemma 1 to the RHS of the above equations (for $i = h-1$, the RHS is simply $n^{(h)} \bmod M = n^{(h)} = n^{(h)} \,\mathrm{div}\, S^0$) gives

$$(Sn^{(i)} + \ell^{(i)}) \bmod (S^i G) = n^{(h)} \,\mathrm{div}\, S^{h-i-1} \quad \text{for } i = 0, \ldots, h-1.$$

Finally, leaving the equation for $i = 0$ intact and taking mod $S$ of both sides of the equations for $i = 1, \ldots, h-1$ and applying Rule 2 gives

$$(Sn^{(0)} + \ell^{(0)}) \bmod G = n^{(h)} \,\mathrm{div}\, S^{h-1},$$

$$(Sn^{(i)} + \ell^{(i)}) \bmod S = (n^{(h)} \,\mathrm{div}\, S^{h-i-1}) \bmod S \quad \text{for } i = 1, \ldots, h-1.$$

Since $(Sn^{(i)} + \ell^{(i)}) \bmod S = \ell^{(i)}$, the proof of Lemma 2 follows directly. □

To complete the proof of Theorem 3, we have to argue that there exist values of $\ell^{(0)}, \ldots, \ell^{(h-1)}$ between 0 and $S - 1$ that satisfy (10). Clearly, the value obtained from (10.b) falls in that range. Moreover, given that $G \le S$, there exists at least one value of $\ell^{(0)}$ which is between 0 and $S - 1$ and satisfies (10.a), thus completing the proof of Theorem 3. □

In addition to showing that an $h$-hop path exists from any node $n$ to any other node $d$ in LDI$(S^{h-1}G, S)$, Theorem 3 specifies the actual route and, thus, a routing algorithm. Specifically,

The routing algorithm from $n = n^{(0)}$ to $d = n^{(h)}$:

1) From the source node, $n^{(0)}$, send the message to the next node, $n^{(1)}$, on link $\ell^{(0)}$, where $\ell^{(0)}$ is an integer between 0 and $S - 1$ which satisfies $(Sn^{(0)} + \ell^{(0)}) \bmod G = d \,\mathrm{div}\, S^{h-1}$.
2) For $i = 1, \ldots, h-1$, from the current node, $n^{(i)}$, send the message to the next node on link $\ell^{(i)} = (d \,\mathrm{div}\, S^{h-i-1}) \bmod S$.

Fig. 7. Three-hop routing in LDI(18, 3), for which S = 3 and G = 2.

To clarify the routing algorithm with an example, consider the unfolded LDI(18, 3) graph shown in Fig. 7 and apply the above algorithm for routing from node 7 to node 14. Given that, for this example, $S = 3$ and $G = 2$, the first routing step should be on link $\ell^{(0)}$, which satisfies $(21 + \ell^{(0)}) \bmod 2 = 14 \,\mathrm{div}\, 9$. That is, $\ell^{(0)} = 0$ or 2, which indicates that, from node 7, we can either route on link 0 to node $(3 \times 7 + 0) \bmod 18 = 3$ or on link 2 to node $(3 \times 7 + 2) \bmod 18 = 5$. From either node, the second routing step should be on link $\ell^{(1)} = (14 \,\mathrm{div}\, 3) \bmod 3 = 1$ (to either node 10 or node 16, respectively) and the third routing step should be on link $\ell^{(2)} = (14 \,\mathrm{div}\, 1) \bmod 3 = 2$ to the destination, 14.
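The general algorithm and the example above can be expressed in a few lines; the following sketch (an illustrative reimplementation, not the author's code) enumerates all routes allowed by the choice of the first link and reproduces the two three-hop routes from node 7 to node 14 in LDI(18, 3).

def route(n: int, d: int, S: int, G: int, h: int) -> list[list[int]]:
    """All h-hop paths from n to d in LDI(S^(h-1)*G, S) allowed by step 1."""
    M = S ** (h - 1) * G
    paths = []
    for l0 in range(S):                                   # step 1: any valid first link
        if (S * n + l0) % G != d // S ** (h - 1):
            continue
        path, cur = [n], (S * n + l0) % M
        path.append(cur)
        for i in range(1, h):                             # steps 2..h: link is forced
            li = (d // S ** (h - i - 1)) % S
            cur = (S * cur + li) % M
            path.append(cur)
        assert cur == d
        paths.append(path)
    return paths

# LDI(18, 3): S = 3, G = 2, h = 3; the two routes from node 7 to node 14.
print(route(7, 14, S=3, G=2, h=3))   # -> [[7, 3, 10, 14], [7, 5, 16, 14]]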

For the special case of $S = G$, that is, $M = S^h$, the formula in the first routing step (10.a) can be simplified, resulting in $\ell^{(0)} = d \,\mathrm{div}\, S^{h-1}$. Noting that $d \,\mathrm{div}\, S^{h-1} < S$, the routing algorithm may be rewritten as follows to specify the unique path between the source and destination:

The simplified routing algorithm for the case of $M = S^h$:

For $i = 0, \ldots, h-1$, from the current node, $n^{(i)}$, send the message to the next node on link $\ell^{(i)} = (d \,\mathrm{div}\, S^{h-i-1}) \bmod S$.

In order to show the flexibility of the LDI topology, consider a system with 4,096 nodes connected by multiple planes of circuit switches. For a specific number of switch planes, $S$, the switches can be configured to minimize the number of routing hops. Table 1 shows the maximum number of routing hops that result from embedding the LDI topology in the given number of switch planes, $S$.

Note that, in an $h$-hop routing, a message traverses the network $h$ times. Hence, assuming that the maximum aggregate bandwidth of each switching plane is $B$, the maximum aggregate effective bandwidth of using $S$ switching planes with $h$-hop routing is $BS/h$. In other words, increasing the number of switching planes results in a larger than linear increase in the maximum bandwidth as well as a decrease in the maximum number of routing steps. For example, if a 4,096-node system has four switching planes, then the embedding of LDI(4,096, 4) results in a diameter of 6 and a maximum aggregate bandwidth of $4B/6 \approx 0.67B$. Increasing the number of planes to eight allows the embedding of LDI(4,096, 8), which reduces the diameter to 4 while increasing the maximum bandwidth to $8B/4 = 2B$. Although this analysis is based on the diameter rather than the average number of hops, it is fairly accurate, especially in large systems where the maximum number of hops is not much larger than the average number of hops. This point is made in Table 1 by showing the average number of hops (obtained from simulations) for LDI networks with 4,096 nodes.
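The diameter/bandwidth trade-off discussed above is easy to tabulate; the sketch below (illustrative only, recomputing the figures rather than reproducing Table 1) prints the LDI diameter and the relative effective bandwidth $SB/h$ for a 4,096-node system.

M = 4096
for S in (2, 4, 8, 16, 64):
    h = 1
    while S ** h < M:          # smallest h with S^h >= M, so S^(h-1) < M <= S^h
        h += 1
    print(f"S={S:3d}  diameter h={h:2d}  effective bandwidth = {S / h:.2f} * B")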

The flexibility of the circuit switching architecture is also demonstrated by the ability to simultaneously embed predetermined connections to take advantage of communication locality and an LDI for routing random traffic. For example, with 4,096 nodes and eight switching planes, it is possible to dedicate four planes to establishing 2D torus connections and use the remaining four planes for embedding LDI(4,096, 4) for routing random traffic with a maximum of six hops. Alternatively, it is possible to double the bandwidth for the torus connections by embedding two 2D tori in the eight planes, while routing the random traffic on the tori with a maximum of 32 hops. The choice depends on the ratio of mesh traffic to random traffic and on the delay tolerance of each type of traffic.

6 CONCLUDING REMARKS

The goal of the LDI directed graphs introduced in this paper is to minimize the network diameter for a given number of nodes and a given node degree. Specifically, for a given node degree, $S$, the LDI graph with $M$ nodes, where $S^{h-1} < M \le S^{h-1}G$ and $1 < G \le S$, has a diameter of $h$, which is asymptotically optimal. The construction of the LDI networks is modular and surprisingly simple. Moreover, for $M = S^{h-1}G$, the algorithm for routing between any two nodes in the LDI graph is straightforward. Deadlock-free routing in an LDI network with diameter $h$ is guaranteed if $h$ virtual channels are used [7].

The main advantage of LDI graphs is their straightforward embedding in parallel computer systems that are interconnected by circuit switching networks. However, in addition to their applicability to routing random traffic in circuit switched networks, the LDI graphs may be applied whenever low diameter directed graphs are needed. Examples of such applications are overlay networks [4], WDM routing in lightwave networks [13], and the prescheduling of collective communication patterns [9], [10]. It is worth noting that routing in LDI is performed using modular arithmetic operations, which leads to efficient hardware implementations.

TABLE 1. The Diameter of LDI(M, S) for M = 4,096 Nodes and Different Node Degrees, S.

Given that circuit switching networks are most beneficial for establishing connections that match an application's regular communication patterns and that LDI allows those networks to efficiently route random traffic, the next step is to study the efficient simultaneous routing of both regular and random traffic. Specifically, given a number, $S$, of switching planes in an architecture, if $K$ planes are dedicated to regular traffic (e.g., mesh connections), an interesting problem is to find the configurations of the remaining $S - K$ planes such that random traffic is routed on all $S$ planes with the minimum number of hops.

Fault tolerance is a very important issue in the design of interconnection networks for high-performance systems. Using previously known results about DeBruijn graphs, we can conclude that, when $M = S^h$, LDI$(M, S)$ remains connected in the presence of any $S - 2$ faulty nodes and that the diameter of the faulty network only increases from $h$ to $h + 1$ [8], [20]. A detailed study of the fault tolerance capabilities of the LDI network when $G < S$ is left for future work.

APPENDIX

In this appendix, we prove a few rules of modular arithmetic used in the paper.

Rule 1. $(a(b \bmod k) + c) \bmod k = (ab + c) \bmod k$.

Proof. Let $b = xk + y$, where $0 \le y < k$. Hence, LHS $= (ay + c) \bmod k$ and RHS $= (axk + ay + c) \bmod k = (ay + c) \bmod k$. □

Rule 2. If $M = kg$, then $(a \bmod M + b) \bmod g = (a + b) \bmod g$.

Proof. Let $a = xM + y$, where $0 \le y < M$. Hence, LHS $= (y + b) \bmod g$ and RHS $= (xkg + y + b) \bmod g = (y + b) \bmod g$. □

Rule 3. $(a \bmod (kg)) \,\mathrm{div}\, k = (a \,\mathrm{div}\, k) \bmod g$.

Proof. Let $a = xkg + yk + z$, where $0 \le y < g$ and $0 \le z < k$. Hence, LHS $= (yk + z) \,\mathrm{div}\, k = y$ and RHS $= (xg + y) \bmod g = y$. □
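These rules are also easy to validate numerically; the following randomized check (illustrative only) exercises Rules 1-3 on arbitrary nonnegative integers.

import random

for _ in range(10000):
    k, g = random.randint(1, 50), random.randint(1, 50)
    a, b, c = (random.randint(0, 10**6) for _ in range(3))
    M = k * g
    assert (a * (b % k) + c) % k == (a * b + c) % k          # Rule 1
    assert (a % M + b) % g == (a + b) % g                    # Rule 2
    assert (a % (k * g)) // k == (a // k) % g                # Rule 3
print("Rules 1-3 hold on all sampled values.")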

ACKNOWLEDGMENTS

The author would like to thank Alex Jones, Ray Hoare, Eugen Schenfeld, and Zhu Ding for discussing and commenting on the work presented in the paper. He would also like to thank Seth Hornes for simulating the routing algorithms and demonstrating their correctness before the actual proofs were obtained. This work was partially supported by the IBM PERCS project as part of the US Defense Advanced Research Projects Agency's HPCS program under contract NBCH3039004.

REFERENCES

[1] K. Baker, A. Benner, R. Hoare, A. Hoisie, A. Jones, D. Kerbyson, D. Li, R. Melhem, R. Rajamony, E. Schenfeld, G. Stunkel, and P. Walker, "On the Feasibility of Optical Circuit Switching for High Performance Computing Systems," Proc. Supercomputing '05, Nov. 2005.
[2] W. Bridges and S. Toueg, "The Impossibility of Directed Moore Graphs," J. Combinatorial Theory, vol. 29, no. 3, pp. 339-341, 1980.
[3] F. Comellas and M. Fiol, "Vertex Symmetric Digraphs with Small Diameters," Discrete Applied Math., vol. 58, pp. 1-11, 1995.
[4] V. Dalagiannis, A. Mauthe, and R. Steinmetz, "Overlay Design Mechanisms for Heterogeneous, Large Scale, Dynamic P2P Systems," J. Networks and System Management, vol. 12, no. 3, pp. 371-395, 2004.
[5] N. De Bruijn, "A Combinatorial Problem," Proc. Akademie van Wetenschappen, vol. 49, part 2, pp. 758-764, 1946.
[6] Z. Ding, R. Hoare, A. Jones, D. Li, S. Shao, S. Tung, J. Zheng, and R. Melhem, "Switch Design to Enable Predictive Multiplexed Switching in Multiprocessor Networks," Proc. Int'l Conf. Parallel and Distributed Processing (IPDPS), Apr. 2005.
[7] J. Duato, S. Yalmanchili, and L. Ni, Interconnection Networks: An Engineering Approach. IEEE CS Press, 1997.
[8] A. Esfahanian and S. Hakimi, "Fault-Tolerant Routing in De Bruijn Communication Networks," IEEE Trans. Computers, vol. 34, no. 9, pp. 777-788, Sept. 1985.
[9] A. Faraj and X. Yuan, "Message Scheduling for All-to-All Personalized Communication on Ethernet Switched Clusters," Proc. 19th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS), Apr. 2005.
[10] A. Faraj and X. Yuan, "Automatic Generation and Tuning of MPI Collective Communication Routines," Proc. 19th ACM Int'l Conf. Supercomputing (ICS '05), June 2005.
[11] M. Fiol and A. Llado, "The Partial Line Digraph Technique in the Design of Large Interconnection Networks," IEEE Trans. Computers, vol. 41, no. 7, pp. 848-857, July 1992.
[12] M. Gerla, E. Leonardi, F. Neri, and P. Palnati, "Routing in the Bidirectional ShuffleNet," IEEE Trans. Networking, vol. 9, no. 1, pp. 91-102, 2001.
[13] M. Hluchyi and M. Karol, "ShuffleNet: An Application of Generalized Perfect Shuffle to Multihop Lightwave Networks," Proc. INFOCOM '88, Mar. 1988.
[14] M. Imase and M. Itoh, "A Design for Directed Graphs with Minimum Diameter," IEEE Trans. Computers, vol. 32, no. 8, pp. 782-784, Aug. 1983.
[15] W. Kautz, "Bounds on Directed (d, k) Graphs," Theory of Cellular Logic Networks and Machines, AFCRL-68-0668 final report, pp. 20-28, 1968.
[16] C. Qiao and R. Melhem, "Reconfiguration with Time Division Multiplexed MINs for Multiprocessor Communications," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 4, pp. 337-352, Apr. 1994.
[17] J. Shalf, S. Kamil, L. Oliker, and D. Skinner, "Analyzing Ultra-Scale Application Communication Requirements for a Reconfigurable Hybrid Interconnect," Proc. Supercomputing '05, Nov. 2005.
[18] L. Smarr, A. Chien, T. DeFanti, J. Leigh, and P. Papadopoulos, "The OptIPuter," Comm. ACM, vol. 46, no. 11, pp. 58-67, 2003.
[19] H. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Trans. Computers, vol. 20, no. 2, pp. 153-161, Feb. 1971.
[20] M. Sridhar and C. Ragavendra, "Fault-Tolerant Networks Based on the De Bruijn Graph," IEEE Trans. Computers, vol. 40, no. 10, pp. 1167-1174, Oct. 1991.
[21] M. Veeraraghavan, X. Zheng, H. Lee, M. Gardner, and W. Feng, "CHEETAH: Circuit-Switched High-Speed End-to-End Transport Architecture," Proc. SPIE/IEEE Optical Networking and Computer Comm. Conf. (OptiComm), Oct. 2003.

Rami Melhem received the BE degree in electrical engineering from Cairo University in 1976, the MA degree in mathematics and the MS degree in computer science from the University of Pittsburgh in 1981, and the PhD degree in computer science from the University of Pittsburgh in 1983. He was an assistant professor at Purdue University prior to joining the faculty of the University of Pittsburgh in 1986, where he is currently a professor of computer science and electrical engineering and the chair of the Computer Science Department. His research interests include real-time and fault-tolerant systems, optical networks, high-performance computing, and parallel computer architectures. Dr. Melhem has served on the program committees of numerous conferences and workshops. He was on the editorial board of the IEEE Transactions on Computers and the IEEE Transactions on Parallel and Distributed Systems. He is serving on the advisory boards of the IEEE Technical Committee on Computer Architecture. He is the editor for the Springer Book Series on Computer Science and is on the editorial board of the Computer Architecture Letters, the International Journal of Embedded Systems, and the Journal of Parallel and Distributed Computing. Dr. Melhem is a fellow of the IEEE and a member of the IEEE Computer Society and the ACM.
