


VLSI Design 1995, Vol. 2, No. 4, pp. 375-388
Reprints available directly from the publisher
Photocopying permitted by license only

(C) 1995 OPA (Overseas Publishers Association) Amsterdam BV. Published under license by Gordon and Breach Science Publishers SA

Printed in Malaysia

Designing Interconnection Networks for Multi-level Packaging

M.T. RAGHUNATH* and ABHIRAM RANADE*
Computer Science Division, University of California at Berkeley, Berkeley, CA 94720

A central problem in building large scale parallel machines is the design of the interconnection network. Interconnection network design is largely constrained by packaging technology. We start with a generic set of packaging restrictions and evaluate different network organizations under a random traffic model. Our results indicate that customizing the network topology to the packaging constraints is useful. Some of the general principles that arise out of this study are: 1) Making the networks denser at the lower levels of the packaging hierarchy has a significant positive impact on global communication performance, 2) It is better to organize a fixed amount of communication bandwidth as a smaller number of high bandwidth channels, 3) Providing the processors with the ability to tolerate latencies (by using multithreading) is very useful in improving performance.

Key Words: Parallel processing, interprocessor communication, network topology, packaging hierarchy

1 INTRODUCTION

Interconnection network design is a central issue in building large scale general purpose parallel computers. The network takes up a significant fraction of the total cost, is often the hardest part of the system to engineer, and for many applications determines the final performance. The issues in designing a network range from the very high level (what topology should we use?) to the most mundane (what should be the widths of the different channels?). But each of these questions can make or break a network design. Each question is also more complex than it appears superficially: should we use hybrid network topologies (e.g. mesh at board level, and butterfly between boards)? Is it useful to have variable channel widths in different parts of the machine? Is it useful to duplicate (also called dilate) channels?

The interconnection network critically depends upon the communication behavior of the applications running on the parallel computer. For instance, if the application consists of multiplying dense matrices, it is conceivable that a two-dimensional grid will be an appropriate interconnection network. However, our goal is to design networks to support general purpose parallel programming, so our design cannot be unduly influenced by the characteristics of a single program. For modeling general applications, it is customary to assume random communication patterns. In particular, we assume that every time a processor accesses shared data, the location accessed is randomly distributed over the entire address space. Similar models have been used by many studies of network design [9, 16, 18].

The choice of the interconnection network also depends upon the packaging technology available. In order to evaluate different network design alternatives, we need a model of packaging technology. A well explored packaging technology is VLSI, and it is possible to evaluate different networks laid out on a VLSI chip using Thompson's grid model [26]. The principal cost measure in this model is the layout area. It has been shown that for fixed layout area, most of the common networks such as hypercubes, multidimensional grids, butterflies, etc. (with channel widths adjusted to equalize area requirements) have the same bandwidth for delivering messages with random destinations. Dally [9] argues that lower dimensional networks have better latencies.

*This work is supported by the Air Force Office of Scientific Research (AFSC/JSEP) under contract F49620-90-C-0029 and National Science Foundation Grant Number CCR-9005448. M.T. Raghunath is also supported by a fellowship from IBM Corp. This research was also supported in part by the National Science Foundation under Infrastructure Grant Number CDA8722788.




The VLSI model is not adequate for large machines, since large parallel machines typically employ several levels of packaging, e.g. racks, boards and VLSI chips. To compare different network designs, we need a model that takes packaging hierarchies into account. Developing such a model is difficult because cost is a function of a large number of parameters, such as the number of wires at each level of the hierarchy, the size of the printed circuit boards, the number of boards, the number of connectors, etc. In addition to the costs associated with these parameters, there may also be technology specific constraints, e.g. standard printed circuit boards may be limited to a pin-count of 500, standard IC packages may have pin-counts of up to 250, etc. Further, advances in technology can significantly change the cost characteristics of each level of the hierarchy, or even the number of levels.

We would like to identify interconnection networks that are suitable for multilevel packaging technologies. Although the precise characteristics of existing or emerging packaging technologies are hard to model, a common characteristic of most technologies is a progressive increase in costs and decrease in capacities as we go up the packaging hierarchy. For instance, it is clear that as we proceed up the hierarchy, wires usually get longer with larger inter-wire spacings. Metal wires on VLSI chips may be a few millimeters long and spaced about 1 micron apart, whereas wires connecting different boards may be as long as 1 meter and the spacing between wires on a cable connector may be as much as 1 mm. This results in a higher cost per wire at the higher levels of the hierarchy, and the higher cost translates into a reduction in the number of wires available at the higher levels. Further, the rate at which data can be transferred on a wire usually decreases as we go up the hierarchy. In other words, packaging hierarchies exhibit the property of packaging locality, i.e. the connections within any unit of packaging are cheaper, denser, and faster than connections at higher levels. This in turn means that the design of the interconnection network is more constrained by higher level restrictions than by those at lower levels. This suggests that interconnection networks should be designed level-by-level, highest level first.

In this paper we consider a generic packaging technology with two levels, in which the processors in the parallel computer are organized into a number of modules. A larger number of levels might be more accurate (considered in [22]), but we feel that significant insights can be obtained with just two levels. We define a cost model for networks built using our two level technology, and then go on to define and compare a number of different interconnection networks. We compare the performance of different networks using a random traffic model (section 3). First we fix the high level (inter-module) network. As will be seen, this can be done analytically (section 4). Next, we consider possible lower level (intra-module) networks (section 5), and these we compare using simulation. Our simulations are based on a shared-memory model, but we believe that our observations also apply to other models requiring global communication.

Our results show that the networks used to connect modules must be substantially different from the ones considered customarily, e.g. multidimensional grids, hypercubes, butterflies, etc. We identify a new family of network topologies called n-hop networks which are important for the inter-module network. As it turns out, these topologies are also important in the case of three-level packaging hierarchies, which we consider elsewhere [22]. As far as the overall network is concerned, we find that a hybrid network (consisting of different networks at different levels, or having different channel widths at different levels) is likely to give better performance for fixed cost. Hybrid networks have so far received little attention in the literature.

Section 4.2 has a comparison of our work with other network design studies. Section 6 presents results of simulations for the case of N = 1024 processors and M = 32 modules. Section 7 gives our conclusions.

2 A TWO-LEVEL MODEL OF PACKAGING

In our model, a parallel computer consists of M packaging modules. Each module holds N/M processors (for a total of N), the corresponding fraction of total system memory, as well as some communication hardware. It is important to point out that we use the term packaging module in an abstract sense. What a packaging module physically corresponds to will depend on the scale of the machine; it could be a board, a rack or even a cabinet consisting of multiple racks. Further, a module itself will typically have a hierarchical structure, which we ignore for simplicity.

As mentioned earlier, it is difficult to model costs precisely, except in VLSI, where the layout area is a widely accepted metric. Informally, however, there is general agreement among hardware designers as to what the main costs are at higher packaging levels. In this paper we consider the cost to be a function of the following metrics, which most hardware designers would believe are among the most significant ones^a:

1. Module Pin-count: This is the number of wires leaving or entering each module.

2. Number of Wire Bundles: The wires connecting modules can be grouped into bundles, such that wires in each bundle connect the same pair of modules. The total number of bundles in a network is another indication of the difficulty of building the network: it is easier to assemble a network with a given number of wires if the wires are organized into a small number of fat bundles, as opposed to a large number of skinny ones.

3. Bisection Width: This is the minimum number of wires that need to be cut in order to divide the network into two equal parts.

We may consider using a cost function that is a linear combination of the above metrics, where the relative weights of the metrics are determined according to the precise characteristics of the underlying technology. However, in many cases, the cost function may be strongly non-linear; at the extreme, there may even be hard constraints which cannot be overcome however much one is willing to pay, e.g. it may just be impossible, given a technology, to have a module pin-count of 2000. The constraints and non-linearities may make it difficult to obtain an accurate estimate of the dollar cost of a particular design. We defer further discussion of cost estimation to section 4.

In principle, these cost metrics could be used for the network at each level. In this paper we mainly focus on the cost of the inter-module network. For the intra-module network, we consider several candidates with widely varying costs. However, we feel that many of these candidates will be interesting, since we expect the cost of the intra-module network to be dominated by the inter-module cost.

3 TRAFFIC MODEL

We evaluate networks using a traffic model in which each processor sends messages to randomly chosen destinations. For simplicity, we assume that the communication primitives are based on a shared-memory model, but we believe that our results apply qualitatively to message passing models as well, so long as message destinations are random. We assume that in each cycle, a processor either executes a local operation or issues a remote access to a random location. The average interval between remote accesses is a measure of the data locality in the computation; longer inter-access intervals imply greater locality. We also assume that accesses made by different processors are independent of each other. Several researchers have used similar models to evaluate network performance [1, 3, 9, 11, 15, 16, 18, 24].

^a This is by no means an exhaustive list of possible cost measures. For instance, a network with long wires is harder to engineer electrically than a network with short wires. Many modern parallel computers use asynchronous differential signaling with several bits being in transit on the same pair of wires at the same time. This overcomes some of the limitations imposed by long wires. Also see the paper by Scott and Goodman [25].

In general, the distribution of accesses made by a processor is highly dependent on the application. While some applications, such as those based on dense matrix computations, may have extremely regular communication traffic, there is a large class of applications for which the communication traffic tends to be highly irregular and dynamically varying with time. Sparse matrix computations and graph algorithms are examples of such applications. Further, many programs that initially had regular communication patterns tend to lose this property when algorithmic optimizations are performed. For instance, the O(n^2) version of the N-body program can be written so that processor i only has to communicate with processors (i + 1) and (i - 1). However, when a better algorithm such as the O(n log n) Barnes and Hut algorithm [6] or the O(n) Greengard and Rokhlin algorithm [13] is used, the communication traffic ceases to be as regular as in the O(n^2) case. Further, for many programs, the regularity in communication patterns is lost when the size of the problem is much larger than the number of processors available. For instance, an FFT computation where the number of processors is equal to the number of elements in the FFT has a communication pattern that maps well onto a hypercube, but if the number of elements in the FFT is substantially greater than the number of processors, the communication traffic involves every processor communicating equally often with every other processor [8]. In addition, many of the recently developed models of parallel programming also assume that each processor can communicate with all processors with equal ease [4, 8].

Access non-uniformities and hot-spots due to repeated access to synchronization variables were topics of active research in the recent past. Pfister and Norton [21] demonstrated that buffered networks were susceptible to tree-saturation in the presence of hot-spot traffic. Solutions to the hot-spot problem include hardware message-combining [12, 20, 23] and software techniques such as the ones proposed in [27] and [19]. We assume that hot-spots are prevented by using some software mechanism, and do not deal with them explicitly in this paper.

Additional details regarding our evaluation methodology and the assumptions made for the purposes of simulation are provided in section 6.

4 DESIGNING THE INTER-MODULE NETWORK

We first consider the design of the network at the top level of the hierarchy, namely the inter-module network. We would like to compare several networks in terms of both cost and performance, identify promising ones, and rule out those that are not competitive. We enumerated three cost metrics in section 2: bisection width, pin-count, and number of bundles, but we also observed that it was difficult to combine the three metrics to yield a dollar cost. Nevertheless, we can compare the costs of different networks by comparing the three metrics separately, i.e., we can think of cost as a three dimensional quantity. If we can show that all three dimensions of network N1 are smaller than the corresponding three dimensions of N2, then we can conclude that N1 has lower cost than N2. If N1 also has equal or greater performance than N2, we can clearly rule out N2. In addition, suppose we can argue that no network can have greater performance than N1 unless it has a greater cost than N1 in at least one of the three dimensions. Then we would consider N1 to be a promising network. We would like to explore the three dimensional design space and identify such promising networks and study them.
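Stated as code, the pruning rule is just a Pareto-dominance test over the cost metrics at a fixed bisection width. The following sketch (Python; the class and function names are ours, introduced only for illustration) captures the argument above:

```python
from dataclasses import dataclass

@dataclass
class NetworkCost:
    """Cost of an inter-module network at a fixed bisection width B."""
    name: str
    pin_count: float   # wires entering or leaving each module
    bundles: float     # total number of wire bundles

def dominated(a: NetworkCost, b: NetworkCost) -> bool:
    """True if `a` is ruled out by `b`: b is no worse on either cost metric
    and strictly better on at least one, while both provide the same
    bisection width (hence the same bandwidth under random traffic)."""
    return (b.pin_count <= a.pin_count and b.bundles <= a.bundles
            and (b.pin_count < a.pin_count or b.bundles < a.bundles))

def promising(candidates):
    """Return the candidates not dominated by any other candidate."""
    return [a for a in candidates
            if not any(dominated(a, b) for b in candidates if b is not a)]
```

Applied to the M = 32 numbers tabulated below (Table 2), this test eliminates the butterfly, which is dominated by the 2-dimensional toroid, but it cannot separate, for example, the complete graph from the 2-hop network; that residual trade-off is exactly what the rest of this section discusses.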

There is a simple observation that enables us to reduce the size of the design space to two dimensions. Suppose network N1 has a greater bisection width than N2. Then, with reasonable confidence, we can say that the performance of N1 will be greater than the performance of N2 under random traffic. Under random traffic, half the traffic from each processor must cross the bisection of the machine. The rate at which this traffic can be delivered is directly proportional to the bisection width. Therefore, in order to determine which network has a lower cost for a fixed performance, it is sufficient to compare networks with the same bisection width.

Table 1 shows a comparison of standard and non-standard topologies (n-hop topologies, described in section 5.2), keeping the bisection width constant^b. The entries in the table are derived as follows. First, the number of bundles per module is the same as the node degree of the graph topology. For example, the number of bundles per module in a complete graph topology is M - 1. We then compute the number of bundles (edges) that cross the bisection, which is M^2/4 for a complete graph. Since the bisection width is B, each bundle must correspond to 4B/M^2 wires. The pin-count per module is the product of the number of bundles per module and the number of wires per bundle, which is 4B(M - 1)/M^2 for a complete graph.
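The arithmetic behind the table rows is easy to reproduce. The sketch below (Python; the function names are ours) computes pin-count, bundles per module, and total bundles for the complete graph and for an n-hop network, assuming M is a perfect n-th power. For the 32-module, B = 8192 configuration evaluated below, the complete-graph row reproduces the Table 2 entries exactly; the 2-hop and 3-hop rows of Table 2 differ slightly because they use the nearest power-of-2 partitions.

```python
def complete_graph(B, M):
    """Complete graph K_M between modules: M*(M-1)/2 bundles in total,
    M**2/4 of them cross the bisection, so each bundle is 4B/M**2 wires."""
    wires_per_bundle = 4 * B / M**2
    bundles_per_module = M - 1
    pin_count = wires_per_bundle * bundles_per_module
    return pin_count, bundles_per_module, M * (M - 1) // 2

def n_hop(B, M, n):
    """n-fold product of complete graphs of size M**(1/n)
    (assumes M is a perfect n-th power)."""
    k = round(M ** (1 / n))                # modules per dimension
    bundles_per_module = n * (k - 1)
    # Cutting one dimension in half: M/k copies of K_k, each contributing
    # (k/2)**2 crossing bundles.
    edges_across = (M // k) * k * k / 4
    wires_per_bundle = B / edges_across
    pin_count = wires_per_bundle * bundles_per_module
    return pin_count, bundles_per_module, M * n * (k - 1) // 2

if __name__ == "__main__":
    B, M = 8192, 32
    print(complete_graph(B, M))    # (992.0, 31, 496), matching Table 2
    print(n_hop(B, M, 1))          # a 1-hop network is the complete graph
```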

Unfortunately, Table 1 does not show a clear winner: the different topologies represent different tradeoffs between module pin-count and the number of bundles. Substituting absolute numbers for B and M can give us better insight. Consider for example that we are designing a 1024 processor machine. A reasonable value of B for such a machine is 8192, corresponding to 8 bits of remote memory bandwidth per processor per cycle. We believe a reasonable value for M to be 32. Obviously, our conclusions will be different for dramatically different values of M. We discuss the issue of choosing M at more length in section 4.1. Table 2 gives the comparison. Note that for M = 32, networks such as toroids or the 2- or 3-hop networks will not have the same number of modules in each dimension because 32 is not a perfect square or a perfect cube. The numbers in the table for these networks correspond to the nearest power-of-2 partitions.

As we go down the table we see that the pin-count increases while the number of bundles decreases. Only in some cases can we rule out topologies. For example, when we compare the butterfly and the 2-dimensional toroid, we see that both have the same number of bundles, but the toroid has a lower pin-count. Therefore, in this case, we can rule out the butterfly. But, asymptotically, the butterfly is better than the toroid, and for larger values of M, the butterfly will have a lower pin-count than the toroid. In the majority of the cases, however, we cannot demonstrate such a domination in terms of both cost metrics. We need to know the relative weights of the two cost metrics to determine which network has lower cost. But the weights are highly technology dependent. We are interested in obtaining qualitative results that are applicable over a wide range of technologies.

^b All logarithms are to base 2.



TABLE I
Characteristics of Topologies Used to Connect M Packaging Modules

Inter-Module      Bisection   Module                        Bundles         Total
Topology          Width       Pin-count                     Per Module      # Bundles
Complete graph    B           4B(M-1)/M^2                   M-1             M(M-1)/2
2-hop Network     B           (8B/M)(√M-1)/√M               2(√M-1)         M(√M-1)
3-hop Network     B           (12B/M)(M^(1/3)-1)/M^(1/3)    3(M^(1/3)-1)    3M(M^(1/3)-1)/2
n-hop Network     B           (4nB/M)(M^(1/n)-1)/M^(1/n)    n(M^(1/n)-1)    nM(M^(1/n)-1)/2
Butterfly         B           4B log M / M                  4               2M
CCC               B           6B log M / M                  3               3M/2
Hypercube         B           2B log M / M                  log M           M log M / 2
d-dim Toroid      B           dB/M^((d-1)/d)                2d              Md
3-dim Toroid      B           3B/M^(2/3)                    6               3M
2-dim Toroid      B           2B/√M                         4               2M
Ring              B           B                             2               M

We believe that at the higher levels of the packaging hierarchy the number of pins per module has a greater effect on cost than the total number of bundles. Therefore, we consider the top rows of the table for further investigation. We first consider the complete graph, since it achieves the minimum pin-count for a given bisection. However, as the table indicates, the complete graph also uses the maximum number of bundles. Some technologies may have constraints which prohibit such a large number of bundles. In such cases, we must use a topology that has fewer bundles; therefore, we also consider the 2-hop network.

4.1 Scalability

It is somewhat surprising that we chose the complete graph as a promising candidate; complete graphs are generally considered not to be scalable as an interprocessor network. The rationale for this belief, as we discussed earlier, is that they have a very large number, O(M^2), of bundles. We note, however, that we wish the network design to be scalable for large values of N rather than M. The number of modules is not directly related to the number of processors: as we build larger machines, we have to build them using a taller hierarchy, involving larger modules at the top level. If we are building a network for 100,000 processors, it is unlikely that the top level of the packaging hierarchy will consist of modules that contain 10 processors each. We feel that the number of modules, M, will depend upon the technology, but not substantially on the number N of processors.

4.2 Comparison to Previous Work

Most of the previous comparative studies of networks deal with uniform networks, i.e. a single network topology is used to connect together processors rather than customizing the topology to suit the different levels of the hierarchy. Further, almost none of the previous work has considered the possibility of having communication channels of different widths at different levels of the hierarchy. As our results demonstrate, customizing network topologies and channel widths to the packaging technology significantly improves performance.

Several of the existing and proposed parallel machines are based on hybrid networks that attempt to match the network to the packaging technology.

TABLE II
Characteristics of Different Topologies Used to Interconnect 32 Packaging Modules

Inter-Module      Bisection   Module      Bundles       Total
Topology          Width       Pin-count   Per Module    # Bundles
Complete graph    8192        992         31            496
2-hop Network     8192        1664        10            160
3-hop Network     8192        2048        7             112
Butterfly         8192        5120        4             64
CCC               8192        7680        3             48
Hypercube         8192        2560        5             80
3-dim Toroid      8192        2560        6             96
2-dim Toroid      8192        3072        4             64
Ring              8192        8192        2             32



For instance, the CM-1 uses denser networks to connect the 16 bit-serial processors on a chip, while the entire machine can be considered a hypercube over these chips. The CEDAR system likewise organizes the machine into Omega-network-connected clusters, with each cluster consisting of several processors connected by a cross-bar network. Estimates of the communication performance of the CM family can be found in [7] and performance measurements of the CEDAR can be found in [2]. These measurements evaluate the performance of only a single design point. In this paper we provide a comprehensive evaluation of a variety of design choices, albeit in an abstract setting.

The work closest to ours is that by Dandamudi [10]. He evaluated the performance of a number of hierarchical networks using the number of bundles as a cost measure. His motivation and approach are however very different from ours. His work is driven by the goal of trading off system performance to reduce the cost of the highest level connections ("inter-module" connections in our terminology). For this he considers different interconnection networks for the highest level and evaluates their cost-performance trade-offs. On the other hand, as will become clear soon, the bulk of our study is concerned with how to design a lower level network (the "intra-module" network) to efficiently exploit the maximum bandwidth provided by the inter-module interconnect.

Many researchers have examined the networks corresponding to the lower rows of Table 1. Dally [9] has analyzed the class of multi-dimensional toroids using bisection width as the cost measure. Abraham and Padmanabhan [1] have analyzed the same class of networks (k-ary n-cubes) based on a constant pin-count restriction, and Agarwal [3] has analyzed the same class of networks under three different restrictions: constant bisection, constant channel width and constant pin-count. There are also a variety of studies analyzing the performance of multibutterflies [18] and many other networks for random communication patterns. All of these studies primarily deal with uniform networks, where the topology is fixed before packaging is considered.

5 INTRA-MODULE NETWORKS

5.1 Inter-Module Topology: Complete Graph

The only standard N processor network known to us that has a partitioning into M modules such that the interconnection between the modules is a complete graph is the Butterfly network. This occurs when M and N are powers of 2 and M ≤ √N. Figure 1(a) shows a butterfly network connecting 16 processors to 16 memories, and Figure 1(b) shows how this network can be partitioned into 4 modules, such that each module has 4 processors and 4 memories. Also note that the nodes in columns 2 and 3 have been reordered to make the connections between these columns local to the module. The numbers on the processors and memories are shown primarily to illustrate the reordering and do not carry any special significance.

Since we need to route access requests from the processors to the memories and the corresponding replies back from the memories to the processors, we need two separate butterfly networks. Therefore, each link in the figure should be interpreted as a pair of directed links. Note that the bundle of wires going between any two modules, A and B, actually consists of 4 different channels, two in each direction. One channel carries access requests from the processors in module A to memories in module B, another carries requests from B to A. Similarly, two other channels carry responses from A to B and from B to A respectively.

FIGURE 1  Butterfly Network and its partition: (a) 16 processor Butterfly; (b) Partitioning into 4 modules; (c) Intra-Module connections.



Although we started with a dance-hall type network, it is useful from a practical point of view to associate each memory with a processor. In this case, the implementation is simplified because each processor could use a single memory that is partitioned into private memory and shared memory. While the private memory would be accessed directly by the processor, the shared memory would be accessible to all the processors via the network. We represent this in Figure 1(c) by drawing the processors and their associated memories adjacent to each other.

5.1.1 Basic Butterfly Network

The first network we consider is a butterfly connecting 1024 processors to 1024 memories, partitioned into 32 modules in the manner of Figure 1(b). A bisection width of B = 8192 implies that the pin-count per module is 992 and that each bundle has 32 wires. Since each bundle has 4 channels, each channel is 8 bits wide. Since the inter-module channels are 8 bits wide, we get a uniform network if we make the intra-module channels also 8 bits wide. We call this network Butterfly.8.8. The first suffix stands for the intra-module channel width and the second suffix for the inter-module channel width.

5.1.2 Widening Channels

The previous network used a uniform topology that did not take advantage of the differences in the packaging hierarchy. In general, at the lower levels of the packaging hierarchy we can have more wires and also transfer bits across these wires at greater speeds. Both these characteristics permit us to increase the capacity of the networks at the lower levels of the hierarchy. Instead of separately modeling increased wire density and clock rates, we consider networks that keep the clock rate constant and just use higher wire densities. Accordingly, we consider networks with wider intra-module channels: Butterfly.16.8 and Butterfly.32.8. Both networks have the same topology as Butterfly.8.8, but they have 16 and 32 bit wide intra-module channels respectively.

5.1.3 Dilation

Another method for utilizing more local wires is dilation. We consider the networks Butterfly.8-2dil.8, Butterfly.8-4dil.8, and Butterfly.16-2dil.8. The first network has 2 8-bit wide channels between nodes, the second has 4 8-bit wide channels, and the third has 2 16-bit wide channels. If many messages need to go out in the same direction, they can be simultaneously sent along the dilated channels. The first network requires the same number of wires per routing node as Butterfly.16.8, while the other two use the same as Butterfly.32.8.

In the case of network dilation, we do not dilate the channels connecting into the memories or the processors. If we dilated the channels connecting to the memories, we would have to use multi-ported memories or add some complicated circuitry to prevent multiple accesses from being made simultaneously.
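The wire-count equivalences mentioned above can be checked with one line of arithmetic. The sketch below (Python; the function name and the assumption of radix-2 routing nodes with two inputs and two outputs are ours, not from the paper) counts the wires attached to one routing node for a given channel width and dilation factor.

```python
def wires_per_routing_node(width_bits, dilation, fan=2):
    """Wires attached to one radix-`fan` butterfly routing node: `fan`
    incoming and `fan` outgoing channels, each replicated `dilation`
    times and `width_bits` wide."""
    return 2 * fan * dilation * width_bits

# Butterfly.8-2dil.8 matches Butterfly.16.8, Butterfly.8-4dil.8 matches Butterfly.32.8:
assert wires_per_routing_node(8, 2) == wires_per_routing_node(16, 1) == 64
assert wires_per_routing_node(8, 4) == wires_per_routing_node(32, 1) == 128
```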

5.1.4 Cross-Bar

To establish an upper bound on the best performance achievable, we also consider the networks Xbar.8.8 and Xbar.32.8. Both networks connect the processors and memories to the inter-module channels using 32 by 32 cross-bars. In the first case both the input and output channels of the cross-bar are 8 bits wide. In the latter case the channels that connect to the processors and memories are 32 bits wide. The inter-module channels remain 8 bits wide as before.

5.2 Inter-Module Topology: 2-Hop Network

A 2-hop network is a product graph of two complete graphs^c. The nodes in the graph can be thought of as labeled by a pair of numbers (i, j), where each number ranges over 0..√M - 1. There are complete graphs between all nodes of the form (*, j) for every j, and also between all nodes of the form (i, *) for every i. We call this a 2-hop network because every pair of nodes is guaranteed to be connected by a shortest path of length at most 2. In general, an n-hop network is a product of n complete graphs, each of size M^(1/n). A complete graph interconnect is a 1-hop network in this sense. When the number of hops is log M, the resulting topology is a hypercube.
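A minimal construction of this family, written out for concreteness (Python; the helper names are ours), builds the n-fold product of complete graphs directly and checks the two properties used in this section: node degree n(k - 1) for k-node component graphs, and diameter at most n.

```python
from itertools import product
from collections import deque

def n_hop_network(n, k):
    """n-fold product of complete graphs K_k: nodes are n-tuples over
    0..k-1, and two nodes are adjacent iff they differ in exactly one
    coordinate (a complete graph along each dimension)."""
    nodes = list(product(range(k), repeat=n))
    return {u: [v for v in nodes
                if sum(a != b for a, b in zip(u, v)) == 1]
            for u in nodes}

def diameter(adj):
    """Longest shortest path, found by BFS from every node."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

# A 2-hop network on 64 modules (product of two K_8s):
adj = n_hop_network(2, 8)
assert all(len(neigh) == 2 * (8 - 1) for neigh in adj.values())  # degree 2(k-1)
assert diameter(adj) == 2                                        # at most 2 hops
```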

Interestingly, it turns out that the starting point for generating a 2-hop inter-module network is also a butterfly. In fact, any butterfly network of size N = 2^((n+1)x) can be partitioned into 2^(nx) modules so that the interconnection between the modules is an n-hop network.

^c Let G1 = (V1, E1) and G2 = (V2, E2) be two graphs. The product of G1 and G2, denoted G1 × G2, is the graph (V, E) where V = V1 × V2 and E = {((u1, u2), (v1, v2)) : (u1 = v1 and (u2, v2) ∈ E2) or (u2 = v2 and (u1, v1) ∈ E1)}.



A 2^((n+1)x) butterfly has (n + 1)x columns. We separate these columns into (n + 1) groups of x columns each and shuffle the rows such that the connections within each group form butterflies of size 2^x. Next we partition the network into 2^(nx) modules such that each module has 2^x rows and (n + 1)x columns. The edges that connect between the modules form an n-hop network. Note that this construction only applies when the number of processors is of the form 2^((n+1)x) and the number of modules is 2^(nx). The construction can easily be extended to the case where the number of processors is not of the form 2^((n+1)x), by building a network of size 2^((n+1)x), partitioning it into 2^(nx) modules and connecting multiple processors to each network input.

We can also build an n-hop network when the number of modules M is not of the form 2^(nx), but then the numbers in Table 1 do not apply directly. The table assumes that the n-hop network topology is an n-fold product of complete graphs of size M^(1/n). If the n-th root of M is not an integer, the sizes of the complete graphs that form the product will not be equal. The number of bundles and the pin-count will differ slightly from that shown in the table.

In our present case, we need to build a 2-hop network on 32 modules. Since 32 is not a perfect square, we use the topology K4 × K8, where Kg denotes a complete graph on g nodes and × denotes graph product. Figure 2 shows this graph. For the sake of simplicity, the figure only shows one of the 4 K8s. The actual topology will have 3 more K8s, one at each of the remaining nodes of the K4s.

As in the one-hop case, each edge in the figure corresponds to 2 pairs of channels, one in each direction. Of each pair of links, one carries processor-to-memory traffic, and the other memory-to-processor traffic. In order to achieve a bisection width of 8192, the edges of the K4 will have to be 256 bits wide (4 64-bit wide channels), and the edges of the K8 will be 128 bits wide (4 32-bit wide channels).
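These widths can be sanity-checked with a short calculation. The sketch below (our own helper, under the assumption that the bisection cuts one dimension of the K4 × K8 product evenly) counts the bundles crossing each of the two natural cuts.

```python
def bisection_bits(k_cut, k_other, bundle_bits):
    """Cut the K_k_cut dimension of a K_k_cut x K_k_other product in half:
    each of the k_other copies of K_k_cut contributes (k_cut/2)**2
    crossing bundles of `bundle_bits` wires each."""
    crossing_bundles = k_other * (k_cut // 2) ** 2
    return crossing_bundles * bundle_bits

# K4 edges: 4 channels x 64 bits = 256 wires; K8 edges: 4 x 32 = 128 wires.
assert bisection_bits(4, 8, 256) == 8192   # cutting across the K4 dimension
assert bisection_bits(8, 4, 128) == 8192   # cutting across the K8 dimension
```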

5.2.1 Basic 2-Hop Network

The intra-module network of the basic 2-hop network is shown in Figure 3. The processors are connected to the nodes on the left. Since there are 32 processors per module and only 8 inputs, we have to connect 4 processors to each input. This is done by building a 2-level tree, as shown on the top node. The memories are also connected in a similar fashion. Processors and memories are paired as in the earlier network topologies, but this is not illustrated pictorially.

The channels between columns 2 and 3 correspond to K4. They connect between modules that differ in their least significant 2 bits. The channels between columns 5 and 6 correspond to K8 and connect between modules that differ in their most significant 3 bits. All channels are 32 bits wide except those between columns 2 and 3, which are 64 bits wide. As per our naming convention we call this network 2hop.32.64-32, since the intra-module channels are 32 bits wide and there are two kinds of inter-module channels, which are 64 and 32 bits wide.

5.2.2 Widening Channels and Dilation

Similar to the networks defined for the complete graph inter-module topology, we define the network 2hop.64.64-32. This network is the same as the previous network except that all the intra-module channels are 64 bits wide, including those that connect to the processors and memories. We also define the network 2hop.32-2dil.64-32, with a 2-way dilation of the intra-module network.

FIGURE 2  Inter-Module 2-hop Network (only one of the 4 complete graphs on 8 nodes is shown)

5.3 Routing Algorithms

FIGURE 3  Intra-Module Network: Basic 2-hop network

Our routing algorithms use cut-through routing [14], with adequate buffer space in each routing node. Buffer space is allocated on a per-message basis. We assume that all the wires in the network can transfer one bit per clock cycle and that each routing node imposes a single cycle pipeline delay on any message that does not suffer link or queue-space contention. When a routing node's outgoing channels are wider than its incoming channels, the node cannot send messages in a pipelined fashion. We assume that these nodes use packet-switched routing, i.e., they wait for the entire message to come in before sending out any flits. For brevity, we omit a detailed description of the routing nodes.

For the cross-bar networks discussed earlier, we treat the entire cross-bar as a single routing node that imposes a single cycle pipeline delay. This assumption is overly optimistic, but it is justified since we study the cross-bar networks only to establish an upper bound on performance.
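To make these timing rules concrete, here is a minimal sketch of the zero-contention latency they imply for a single message. The per-hop accounting (one pipeline cycle per node, plus a full store-and-forward penalty wherever a channel widens) is our simplified reading of the model, not code from the paper.

```python
from math import ceil

def zero_load_latency(msg_bits, widths):
    """Latency, in cycles, of one message crossing a path of channels with
    the given widths (widths[i] feeds routing node i), with no contention.
    Each node adds one pipeline cycle; a node whose outgoing channel is
    wider than its incoming one buffers the entire message (packet
    switching) before forwarding, otherwise flits cut through."""
    head = 0
    for i, w in enumerate(widths):
        head += 1                              # per-node pipeline delay
        if i + 1 < len(widths) and widths[i + 1] > w:
            head += ceil(msg_bits / w)         # wait for the whole message
    return head + ceil(msg_bits / widths[-1])  # drain the tail at the destination

# A 128-bit request over five 8-bit channels: 5 + 16 = 21 cycles.
print(zero_load_latency(128, [8] * 5))
# The same request when the last channel widens to 32 bits pays an extra
# store-and-forward delay at the widening node.
print(zero_load_latency(128, [8, 8, 8, 8, 32]))
```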

6 SIMULATIONS

We simulated the networks described above under two kinds of random traffic models. In both models, we assumed that each access is sent over the network to the memory where the data resides; the access is performed and a response is returned to the processor. For simplicity we assumed that both the access requests and the responses are 128 bits long, inclusive of the data, memory/network addressing information, and control information.

6.1 Open-Network Model

The first model we consider is the commonly used open-network model [17]. In this model, in each cycle of execution, each processor generates a message with a fixed probability A, independent of other processors and of previous cycles of execution. The interval between accesses is geometrically distributed with mean 1/A.

Models similar to the open-network model have been used by many researchers to evaluate network performance, either using queueing theory (where messages are usually assumed to be Poisson streams) or simulation [1, 3, 9, 11, 15, 16, 18, 24].
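For reference, a minimal sketch of the open-network drive (Python, hypothetical names; this is only the traffic generation, the network itself is simulated separately) and the conversion from access probability to the normalized load plotted in the figures below:

```python
import random

def open_network_traffic(num_procs, prob_access, cycles, seed=0):
    """Each processor independently issues a remote access in every cycle
    with probability prob_access, to a uniformly random destination, so
    inter-access gaps are geometric with mean 1/prob_access."""
    rng = random.Random(seed)
    return [(t, p, rng.randrange(num_procs))
            for t in range(cycles)
            for p in range(num_procs)
            if rng.random() < prob_access]

# 128-bit accesses against 8 bits of bisection bandwidth per processor per
# cycle: one access every 16 cycles is 100% load, every 32 cycles is 50%.
events = open_network_traffic(num_procs=1024, prob_access=1 / 32, cycles=1000)
offered = len(events) * 128 / (1024 * 1000 * 8)
print(f"offered load = {offered:.0%} of the bisection bandwidth")
```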

Figure 4 shows the results for the butterfly networks. The latencies are round-trip latencies, i.e., the time taken for the access request to go from the processor to the memory, make the access and return to the processor. The x-axis is the normalized load expressed as a percentage of the peak load. The curves extend along the x-axis until the point at which the corresponding generation rate no longer yielded a stable latency.

Figure 5 shows the results for the cross-bar networks. The graphs corresponding to Butterfly.8.8, Butterfly.8-2dil.8 and Butterfly.32.8 are included for comparison.

Figure 6 shows the results for the networks with a 2-hop network between modules. The graphs corresponding to Butterfly.8.8 and Butterfly.32.8 are included to establish the relative performances.

We defer detailed discussion of the results to Section 6.3.

6.2 Multithreading Model

The preceding model is unrealistic in that it allows each processor to have an unbounded number of outstanding requests. Instead, in the multithreading model, each processor runs a fixed number of threads. Each thread is a sequence of instructions that can either be local operations or remote accesses. When a thread issues a remote access, the processor suspends the thread until the access completes. Each thread can have at most one outstanding access at any time. We assume a simple round-robin scheduling of threads, and also assume that the processor stalls when the previous access made by the next thread (in round-robin order) has not completed. We assume a single cycle thread-switch time, in the manner of [5].

We consider two inter-access interval distributions for the multithreading workload: geometric and periodic. In the geometric model the inter-access times are geometrically distributed, similar to the open-network model.



FIGURE 4  Open-Network Model; Butterfly Networks (round-trip latency versus normalized load)

In the periodic model, the inter-access interval is a constant. If the processor is executing a tight inner loop, the interval between remote accesses is likely to be a constant, and the periodic model attempts to capture this effect. In reality the distribution of accesses is likely to be somewhere in between.

FIGURE 5  Open-Network Model; Cross-bar networks (round-trip latency versus normalized load)

Note that in both the periodic and geometric models, the accesses are directed to random addresses and therefore will follow random paths in the network.

The performance metric under this workload model is the average processor utilization, as a function of the number of threads and of the rate at which threads make memory accesses. Based on the bisection width of the networks we can see that the peak bandwidth that each network can support is 8 bits per processor per cycle. This corresponds to a mean access rate of one every 16 cycles, since each access is 128 bits long. We considered mean access intervals of both 16 and 32 cycles.

A mean access interval of 16 cycles can potentially saturate the bisection and corresponds to running the network at 100% load. Since message insertion is linked with the delivery of responses, trying to run the network at 100% load does not pose a problem. The processors themselves merely slow down until the injection rate matches the response rate. As the number of threads per processor increases, the processors become increasingly latency tolerant and are able to get more out of the network.

A mean access interval of 32 cycles corresponds to running the network at 50% load. In this case, since the network is lightly loaded, the processors should be able to achieve almost 100% utilization if they can successfully hide the network latency using multithreading.

Figure 7 shows the processor utilizations for the different networks when the mean access interval is 16 cycles. The processor utilization is shown on the x-axis and the different networks are shown on the y-axis. For each network, there are five data points along each horizontal line, corresponding to multithreading factors of 1, 2, 4, 8, and 16. In order to make these data points easy to read, line segments connect points corresponding to the same multithreading factor. Figure 7(a) shows processor utilizations under the geometric model and Figure 7(b) corresponds to the periodic access model. Figure 8 shows similar curves for a mean access interval of 32 cycles.
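The multithreading workload itself is simple to state in code. The sketch below (Python, hypothetical names) abstracts the whole network as a fixed round-trip latency, which is a deliberate simplification: the utilizations in Figures 7 and 8 come from simulating the actual networks. The sketch only illustrates the round-robin scheduling, the single outstanding access per thread, and the single-cycle thread switch assumed here.

```python
import random

def geometric(rng, mean):
    """Number of Bernoulli(1/mean) trials up to and including the first
    success, so the result has the given mean."""
    k = 1
    while rng.random() >= 1.0 / mean:
        k += 1
    return k

def utilization(num_threads, mean_interval, network_latency,
                total_cycles=200_000, periodic=False, seed=1):
    """Round-robin multithreading with at most one outstanding remote access
    per thread and a single-cycle thread switch.  Returns the fraction of
    cycles spent on local (useful) work."""
    rng = random.Random(seed)
    gap = (lambda: mean_interval) if periodic else (lambda: geometric(rng, mean_interval))
    ready_at = [0] * num_threads          # cycle at which each thread's access returns
    now = useful = current = 0
    while now < total_cycles:
        now = max(now, ready_at[current])  # stall if this thread's access is outstanding
        work = gap()                       # local operations before the next remote access
        useful += work
        now += work
        ready_at[current] = now + network_latency   # issue the access
        now += 1                                    # single-cycle thread switch
        current = (current + 1) % num_threads
    return useful / now

if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, "threads:", round(utilization(k, mean_interval=32,
                                               network_latency=100), 2))
```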

6.3 Discussion of Results

We see that in all the open-network graphs, as the applied load increases, the latency initially increases gradually, and as the network nears saturation, the latency increases rapidly.



FIGURE 6  Open-Network Model; 2-hop networks (round-trip latency versus normalized load)

FIGURE 7  Processor utilization, mean access interval 16 cycles: (a) Geometric access model; (b) Periodic access model

The graphs for the multithreading model under both geometric and periodic access patterns are similar to each other, except that the utilizations for the periodic model are higher. This is to be expected because, under the geometric model, there is a greater variance in inter-access times, and if many short inter-access intervals occur together, the instantaneous load increases and the processor utilization is reduced.

Dilation versus Wider Channels

The data corresponding to Butterfly.8.8, Butterfly.8-2dil.8 and Butterfly.8-4dil.8 in Figures 4 and 7 indicate that dilation of channels helps in improving performance. This improvement results from the reduced contention for the intra-module channels. Increasing the dilation from 2 to 4 does not seem to have much of an effect, indicating that a dilation factor of 2 is sufficient to eliminate most of the intra-module contention.

As opposed to dilation, widening the channels within a module affects the network performance in three different ways. First, it reduces the contention for intra-module channels. Secondly, it increases the effective memory service rate. Dilation does not increase the memory service rate because we do not dilate the channels to the memory units. Thirdly, it increases message latency because of the non-uniformity in channel widths^d. It is clear that the first effect has a positive influence on network performance. The second effect also helps improve performance because conflicts between accesses at the memories have less of an effect. The third effect hurts network performance because it increases the latency.

When the load is small (in the open-network case) or there are few threads (multithreading), the performance of the networks with wider channels is slightly worse than Butterfly.8.8. This is because of the third effect mentioned above. But, when the load on the network is increased, the effects of reduced contention and the higher memory service rate become more predominant. In the long run we can expect all memories to receive approximately the same number of requests because the access distribution is random, but in the short run the access distribution can be uneven. A higher memory service rate helps to reduce the effect of this non-uniformity.

^d When a message transits through a node whose output channel is wider than its input channel, the message suffers a larger delay because it must be entirely buffered before it can be forwarded.



FIGURE 8  Processor utilization, mean access interval 32 cycles: (a) Geometric access model; (b) Periodic access model

Cross-bar

The intra-module cross-bar network completely eliminates the possible contention for links within a module. It also has a lower minimum latency because we assume that it imposes only a single cycle pipeline delay, as opposed to one cycle per stage in a multi-stage network.

In Figure 5 we see that the graph for Xbar.8.8 is approximately parallel to that of Butterfly.8-2dil.8, indicating that both networks have similar performance with respect to contention. The graph for the cross-bar network is lower because of its lower minimum latency. This difference does not show up under the multithreading workload (Figure 7), indicating that the effect of this difference in latency is small. The comparison between Xbar.32.8 and Butterfly.32.8 is similar. This indicates that providing extra local bandwidth, either through dilation or wider channels, is successful in eliminating almost all the contention for the intra-module channels.

2-hop Networks

The results for the 2-hop networks indicate relative behavior similar to the Butterfly.*.* networks; dilation improves the performance over the base network, and widening channels is even better. However, the difference between dilation and wider channels is not as pronounced, since the network initially has 32-bit wide memory channels. Increasing the channel width from 32 to 64 leads to a smaller marginal increase in performance.

Since the 2-hop and complete graph networks use different pin-counts and numbers of bundles, it is not really fair to compare them. However, since both networks are based on the butterfly topology, the results indicate that, for a given bandwidth, it is better to build a smaller butterfly with wider channels and load each input of the butterfly with more processors. When there are fewer channels that are wider, it is easier to balance the load on them and achieve better utilization.

On the other hand, if we compare the two networks based solely on pin-count rather than bisection width, we should consider a complete graph network with double the pin-count, since a 2-hop network uses approximately twice the number of pins (Table 1). In this case, the complete graph networks would have twice the bisection bandwidth. Therefore, it is reasonable to compare the performance of the 2-hop networks at 100% load with that of the complete graph networks at 50% load. Under this comparison, we can clearly see that the complete graph networks out-perform the 2-hop networks, because the complete graph is better than other inter-module topologies when pin-count is the only limitation.

Multithreading

It can be seen that all networks achieve a significant improvement in processor utilization as the number of threads is increased. This demonstrates that multithreading is successful in masking the access latencies. Increasing the number of threads per processor corresponds to operating the network farther down the x-axis on the load-latency graphs of the open-network model. Though this increases the latency experienced by the messages, the processor is able to better overlap the latency with useful computation. It is important to note that even when the number of threads is small (2-4), there is a significant improvement in processor utilization in relation to the utilization corresponding to a single thread. As we increase the number of threads further, the processor utilizations improve, but the marginal increase per added thread diminishes, because increasing the number of threads also increases the loading of the network, which increases latency.

Performance at Low Network Loads

The graphs corresponding to an inter-access time of 32 cycles (Figure 8) show that at high levels of multithreading, it is possible to get processor utilizations of almost 100%. Since the load never gets so high as to make the latencies increase dramatically, multithreading can completely hide the network latency. For the networks we evaluated we can see that about 4 threads can achieve this. Naturally, as we scale the machine to more processors we expect that we will need more threads, but we expect that the number of threads will not grow faster than log N, since the network latency under cut-through routing grows more slowly than log N.

We also see that at low levels of multithreading, the effects due to non-uniform channel widths are more pronounced. Since the network load is small, the increase in latency due to non-uniform channel widths dominates the effects due to reduced contention and increased memory service rates.

7 CONCLUSIONS

We considered interconnection network design froma packaging point of view and developed a costmodel for the highest level of the packaging hierar-chy. We adopt a hierarchical network design strategyin which we design different networks for each levelof the hierarchy starting at the top level and pro-ceeding down to the lower levels.We have identified a family of networks which are

promising candidates for interconnecting the top lev-els of the hierarchy since they exhibit a reasonabletrade-off between pin-count and number of bundles.We call this family n-hop networks. A n-hop networktopology is a n-fold product of complete graphs.Our hierarchical design strategy yields hybrid net-

work architectures which exploit packaging localitybetter than network architectures which start with auniform topology that is later partitioned into a pack-aging hierarchy. Our results show that we can getsignificant improvements in performance by exploit-ing packaging locality. We observed that increasingthe connections local to a module by using eitherdilation or wider channels improved performance. Itwas better to increase the channel widths rather thanuse dilation, especially if our processors were capa-ble of a high level of multithreading. With 8 threadsper processor (geometric distribution of access inter-vals), we observed that increasing the intra-modulechannel width from 8 to 32 improved the processorutilization of the butterfly network from around 55%to almost 80%. Also the performance of the networkwith 32 bit wide intra-module channels came veryclose to that of the best possible cross-bar network.

All our results indicate that the performance improvement achieved by multithreading the processors is quite significant. Processor utilizations increased by about a factor of 2 with 4 threads and almost 3 with 8 threads when compared to the utilization with a single thread. Designing processors capable of running multiple threads and rapidly switching between threads is therefore useful for large scale parallel computers.


Biographies

M.T. RAGHUNATH is a member of the Development Staff at the Highly Parallel Supercomputing Systems Laboratory of IBM in Kingston, NY. He received his B.Tech. degree in Computer Science and Engineering from the Indian Institute of Technology, Madras, India in 1987, and his M.S. and Ph.D. degrees in Computer Science from the University of California at Berkeley, in 1988 and 1993 respectively. His research interests include architectural and engineering issues associated with the development of parallel computers, interconnection network design, and software support for interprocessor communication.

ABHIRAM RANADE received the B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Bombay, India in 1981 and the doctorate in Computer Science from Yale University, New Haven, CT in 1989.

Currently he is an Assistant Professor of Electrical Engineering and Computer Science at the University of California, Berkeley. His research interests include parallel architectures and algorithms, parallel programming techniques and data structures.

Dr. Ranade is a member of the ACM.
