[ieee 2009 annual ieee india conference - ahmedabad, india (2009.12.18-2009.12.20)] 2009 annual ieee...

Conjoined Irregular Topology and Routing Table Generation for Network-on-Chip

Naveen Choudhary Department of Computer Science & Engineering

College of Engineering and Technology Udaipur, India

[email protected]

Virendra Singh SERC

Indian Institute of Science Bangalore, India

[email protected]

M.S.Gaur, V. Laxmi Department of Computer Engineering

Malaviya National Institute of Technology Jaipur, India

[email protected], [email protected]

Abstract—Scalable Networks on Chips (NoCs) are needed to match the ever-increasing communication demands of large-scale Multi-Processor Systems-on-Chip (MPSoCs) for multimedia communication applications. The heterogeneous nature of application-specific on-chip cores, along with the specific communication requirements among the cores, calls for the design of application-specific NoCs for improved performance in terms of communication energy, latency, and throughput.

In this work, we propose a methodology for the design of customized irregular networks-on-chip. The proposed method exploits a priori knowledge of the application's communication characteristics to generate an optimized network topology and corresponding routing tables.

Keywords-NP-hard; Network-on-Chip; Core; Optimization; Performance

I. INTRODUCTION

The International Technology Roadmap for Semiconductors [3] predicts that future generations of high-end SoC architectures will be implemented in technologies below 50 nm and clocked in the 10-20 GHz range. NoC has been proposed as a solution for the communication challenges in the nanoscale regime [1, 2]. This trend is accompanied by revolutionary changes in the employed design methodology, where a paradigm shift from a computation-centric view to a communication-centric view becomes evident. In order to tackle design complexity and to facilitate reuse, systems are typically required to be built from pre-designed and pre-verified heterogeneous building blocks such as programmable RISC cores, DSPs, and memory blocks. Some of the most important phases in designing the NoC are the design of the topology or structure of the network and the setting of various design parameters (such as frequency of operation and link width). Several early works [2, 13, 14] favored the use of standard topologies such as meshes, tori, k-ary n-cubes, or fat trees under the assumption that the wires can be well structured in such topologies. These topologies are adequate for general-purpose systems where the traffic characteristics of the system cannot be predicted statically, as in homogeneous chip multiprocessors.

Regular interconnection architectures assume that all cores are of the same size, which may not be the case for a custom SoC. Regular topologies are known to support design reuse; however, even in custom topologies the router architecture itself is regular and can be easily parameterized (in the number of ports, the width of physical links, etc.). In addition, custom topologies offer better scalability and incremental expansion capabilities for practically any number of processing nodes.

The popular deadlock-free, topology-agnostic routing algorithms are up*/down* [4], L-turn [5], and down/up [6]. These algorithms are based on turn prohibition [10], a methodology that avoids deadlock by prohibiting a subset of all turns in the network. In the proposed work, a genetic algorithm based methodology has been developed for the design of customized irregular networks-on-chip and the corresponding routing tables. The irregular NoC communication model and architecture are defined in Section II. Our contribution, a genetic algorithm based irregular NoC topology generation methodology, is presented in Section III. Section IV summarizes experimental results, followed by a brief conclusion of our work.

II. IRREGULAR NOC COMMUNICATION MODEL & ARCHITECTURE

In the following paragraphs, the communication model and the NoC architecture are explained.

Task graphs are generally used to model the behavior of complex SoC applications on an abstract level [7]. The tasks T are mapped to a set of IP cores V, which communicate through unidirectional point-to-point abstract channels. In our work, it is assumed that the mapping of tasks to IP cores has already been carried out by a separate process. The core graph is considered the primary input for our application-specific irregular NoC topology generation methodology. The connectivity and the available link bandwidth of the irregular NoC are represented by the NoC topology graph. A generic communication model is shown in Figure 1.

978-1-4244-4859-3/09/$25.00 ©2009

Figure 1. Modelling application specific communication

Definition 1 The core graph is a directed graph $G(V, E)$ with each vertex $v_i \in V$ representing an IP core and the directed edge $(v_i, v_j)$, denoted $e_{i,j} \in E$, representing the communication between the cores $v_i$ and $v_j$. The weight of the edge $e_{i,j}$, denoted $bw_{i,j}$, represents the desired bandwidth of the communication from $v_i$ to $v_j$.

Definition 2 The NoC topology graph is a directed graph $N(U, F)$ with each vertex $u_i \in U$ representing a node/tile in the topology and the directed edge $(u_i, u_j)$, denoted $f_{i,j} \in F$, representing a direct communication link/channel between the vertices $u_i$ and $u_j$. The weight of the edge $f_{i,j}$, denoted $Abw_{i,j}$, represents the available link/channel bandwidth across the edge $f_{i,j}$.
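The two definitions can be pictured as a pair of small data structures. The following is only a minimal sketch: the class names, method names, and the choice to model one physical channel as two directed links are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class CoreGraph:
    """Directed graph G(V, E); bw[(i, j)] is the desired bandwidth from core i to core j."""
    cores: set = field(default_factory=set)
    bw: dict = field(default_factory=dict)

    def add_flow(self, src, dst, bandwidth):
        self.cores.update((src, dst))
        self.bw[(src, dst)] = bandwidth


@dataclass
class TopologyGraph:
    """Directed graph N(U, F); abw[(i, j)] is the available bandwidth of link i -> j."""
    nodes: set = field(default_factory=set)
    abw: dict = field(default_factory=dict)

    def add_channel(self, a, b, bandwidth):
        # Assumption: one physical channel is modelled as two directed links.
        self.nodes.update((a, b))
        self.abw[(a, b)] = bandwidth
        self.abw[(b, a)] = bandwidth

    def degree(self, node):
        # Number of outgoing links, i.e. ports used at this node.
        return sum(1 for (a, _b) in self.abw if a == node)


# Hypothetical three-core example.
core_graph = CoreGraph()
core_graph.add_flow(0, 1, 120)
core_graph.add_flow(1, 2, 40)

topology = TopologyGraph()
topology.add_channel(0, 1, 200)
topology.add_channel(1, 2, 200)
```

Keeping the desired bandwidth and the available bandwidth in separate graphs mirrors the split between application demand and network capacity in the two definitions.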

A discrete-event, cycle-accurate simulator called IrNIRGAM, which supports up*/down* [4] routing with wormhole switching, is deployed for the customized irregular NoC. IrNIRGAM is an extended version of NIRGAM [11, 12], a simulator for regular-topology NoCs, that supports irregular topologies with table-based routing. For chip layout, we assume a tile-based floorplan in which each IP core is mapped to one of the equally sized rectangular tiles of unit length. The tiles are organized in a regular grid structure, which is independent of the chosen interconnect topology. It is worth mentioning that this assumption is not an essential requirement of our approach. Because floorplanning is not our prime concern, the assumption simply makes the proposed network generation technique easier to present.

III. IRREGULAR NOC TOPOLOGY GENERATION METHODOLOGY

As shown in Figure 2, floorplanning with the objective of minimizing area can be done as the first step. Non-slicing floorplanners such as B*-Trees [9] can be used to floorplan the chip layout. The irregular topology construction starts by creating a Breadth First Search spanning tree using Prim's algorithm [15], based on the Manhattan distance among the IP cores. The permitted node degree (nd_treemax), i.e., the number of allowed ports per IP core, is kept at this stage below the actual permitted node degree (ndmax). The initial tree topology is strongly connected and thus provides a path between every pair of nodes; this property is retained throughout the topology generation process. Based on the constructed minimum spanning tree and using Dijkstra's shortest path algorithm [15], the routing table entries for the routers of the NoC are generated for each edge (source-destination pair) of the core graph. At this stage the traffic load is assigned to these tree paths according to the bandwidth requirement in the core graph, i.e., for <source($v_i$), destination($v_j$)>, the basic tree path for $(v_i, v_j)$ in the NoC topology graph is assigned the traffic load $bw_{i,j}$, and each edge of the path $(v_i, v_j)$ is assigned the sum of its previously assigned traffic load and $bw_{i,j}$.
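The first phase can be sketched as follows, under several assumptions: the tree is grown greedily in the style of Prim's algorithm with a simple per-node degree cap, tile coordinates are given as (x, y) pairs, and the unique tree path is found by depth-first search instead of Dijkstra's algorithm (in a tree both give the same path). All function names and the four-core example are hypothetical.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def degree_capped_spanning_tree(pos, max_degree):
    """Prim-style growth: always attach the closest outside node to a tree
    node that still has a free port (fewer than max_degree tree links)."""
    nodes = sorted(pos)
    in_tree = {nodes[0]}
    degree = {n: 0 for n in nodes}
    edges = []
    while len(in_tree) < len(nodes):
        candidates = [
            (manhattan(pos[u], pos[v]), u, v)
            for u in in_tree if degree[u] < max_degree
            for v in nodes if v not in in_tree
        ]
        if not candidates:
            raise ValueError("degree cap too small to span all cores")
        _, u, v = min(candidates)
        edges.append((u, v))
        degree[u] += 1
        degree[v] += 1
        in_tree.add(v)
    return edges


def tree_path(edges, src, dst):
    """Unique path from src to dst over the (undirected) tree edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    stack, seen = [(src, [src])], {src}
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    raise ValueError("no path between the given nodes")


def assign_tree_loads(edges, flows):
    """Add each flow's bandwidth to every directed link on its tree path."""
    load = {}
    for (src, dst), bw in flows.items():
        path = tree_path(edges, src, dst)
        for a, b in zip(path, path[1:]):
            load[(a, b)] = load.get((a, b), 0) + bw
    return load


# Hypothetical 2x2 tile layout and two core-graph flows.
positions = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
flows = {(0, 3): 100, (2, 3): 50}
tree = degree_capped_spanning_tree(positions, max_degree=2)
loads = assign_tree_loads(tree, flows)
```

In this example both flows share the links 0 -> 1 and 1 -> 3, so those links accumulate the sum of the two bandwidths, which is exactly the kind of hot spot the later shortcut insertion is meant to relieve.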

Once the first phase is completed, the next phase of the methodology uses a genetic algorithm [16] based heuristic to design customized irregular networks-on-chip with optimized bandwidth load distribution. We expect lower energy requirements in the bandwidth-balanced configuration. The proposed genetic algorithm formulation of the problem is presented in the following paragraphs.

A. Solution Representation

Each chromosome is represented by an array of genes, with the maximum size of the gene array equal to the number of edges in the core graph. The chromosome set C can be represented as follows:

$$C = \{\, e_{i,j} \in E \mid \forall\, v_i, v_j \in V \,\}$$

where E is the set of edges in the core graph and V is the set of vertices in the core graph.

Each gene contains the information regarding the various possible paths in the NoC topology graph between the <source($v_i$), destination($v_j$)> pair of the core-graph edge $e_{i,j}$ that the gene represents. A gene is permitted to hold at most n paths, where n is a configurable parameter. Of these n paths, at least one is the shortest path that uses only edges of the minimum spanning tree; the rest are generated by adding shortcuts. Keeping at least one path from the minimum spanning tree guarantees connectivity between the source-destination pair of the gene. A random number of shortcuts are added to the chromosomes of the initial population to bring gene-pool variety into the population.
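One possible way to hold this representation in code is sketched below. The `Gene` class, its field names, and the `make_chromosome` helper are illustrative assumptions; the paper does not specify a concrete data layout.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Candidate paths for one core-graph edge (src -> dst)."""
    src: int
    dst: int
    tree_path: list  # path over spanning-tree links only; never dropped
    shortcut_paths: list = field(default_factory=list)

    def paths(self):
        return [self.tree_path] + self.shortcut_paths

    def add_path(self, path, max_paths):
        """Add an alternative path unless the gene already holds max_paths paths."""
        if len(self.paths()) >= max_paths or path in self.paths():
            return False
        self.shortcut_paths.append(path)
        return True


def make_chromosome(core_edges, tree_paths):
    """One gene per core-graph edge, seeded with its spanning-tree path."""
    return [Gene(s, d, tree_paths[(s, d)]) for (s, d) in core_edges]


# Hypothetical example: two flows, at most n = 2 paths per gene.
core_edges = [(0, 3), (2, 3)]
tree_paths = {(0, 3): [0, 1, 3], (2, 3): [2, 0, 1, 3]}
chromosome = make_chromosome(core_edges, tree_paths)
added = chromosome[1].add_path([2, 3], max_paths=2)        # uses a shortcut 2-3
rejected = chromosome[1].add_path([2, 0, 3], max_paths=2)  # gene is already full
```

Storing the tree path in its own field makes the connectivity guarantee explicit: mutations may add or drop shortcut paths, but the tree path always remains available.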

B. Mutation Operators

Three mutation operations, called Topology-Extension, Topology-Reduction, and Energy-Reduction, are applied in equal proportion in each generation of the genetic algorithm.

For Topology-Extension mutation, a random number of genes are picked from the selected chromosomes and their paths are checked for the traffic load assigned to them. If any of the edges/channels of a path is heavily loaded, a suitable shortcut channel is inserted in the topology. The added shortcut is constrained by the maximum permitted channel length emax, which is imposed by the physical signaling delay and prevents the algorithm from inserting wires that span long distances across the chip. Similarly, a shortcut is not added between two IP cores if it would exceed the maximum permitted node degree ndmax of either its source or its target core. This constraint prevents the algorithm from instantiating slow routers with a large number of I/O channels, which would decrease the achievable clock frequency due to the internal routing and scheduling delay of the router.
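The two constraints on a candidate shortcut can be checked with a few lines of code. This is a sketch under assumptions: channels are stored as undirected pairs, channel length is measured as Manhattan distance between tile coordinates, and the function and parameter names (`shortcut_allowed`, `e_max`, `nd_max`) are chosen here for illustration.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def shortcut_allowed(links, pos, a, b, e_max, nd_max):
    """Return True if an undirected channel a-b may be added to the topology."""
    if a == b or (a, b) in links or (b, a) in links:
        return False  # no self-loops or duplicate channels
    if manhattan(pos[a], pos[b]) > e_max:
        return False  # wire would be longer than the permitted channel length

    def degree(n):
        return sum(1 for link in links if n in link)

    # Both end points need a free port so that neither exceeds nd_max afterwards.
    return degree(a) < nd_max and degree(b) < nd_max


# Hypothetical 2x2 layout with the tree 2-0-1-3 already in place.
positions = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
links = {(0, 1), (0, 2), (1, 3)}
ok_short = shortcut_allowed(links, positions, 2, 3, e_max=2, nd_max=3)
too_many_ports = shortcut_allowed(links, positions, 0, 3, e_max=2, nd_max=2)
too_long = shortcut_allowed(links, positions, 0, 3, e_max=1, nd_max=3)
```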

Figure 2. Network construction flow using genetic algorithm

A new deadlock-free path that includes the added shortcut channel is formed using Dijkstra's shortest path algorithm in combination with the routing rules of up*/down* routing [4], and the desired bandwidth load is distributed among these paths. The excess load of the selected path is transferred to the channels of the new path if this does not overload the new path's channels; otherwise the shortcut is rejected. The mutation is accepted only if it improves the load distribution, i.e., reduces the required average channel bandwidth overflow in the corresponding NoC topology graph of the chromosome.
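A shortest path that respects the up*/down* rule can be found by running Dijkstra's algorithm over (node, phase) states, where the phase records whether a down link has already been taken. The sketch below assumes unit link costs, levels taken from a breadth-first search from a chosen root, and ties between equal levels broken by node id; the function names are hypothetical and the code is not taken from IrNIRGAM.

```python
import heapq
from collections import deque


def bfs_levels(adj, root):
    """Distance of every node from the root of the spanning tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in level:
                level[nxt] = level[node] + 1
                queue.append(nxt)
    return level


def is_up(level, a, b):
    """Link a -> b is 'up' if b is closer to the root (ties broken by lower id)."""
    return (level[b], b) < (level[a], a)


def shortest_updown_path(adj, level, src, dst):
    """Dijkstra over (node, gone_down) states; an up link may never follow a down link."""
    heap = [(0, src, False, [src])]
    done = set()
    while heap:
        cost, node, gone_down, path = heapq.heappop(heap)
        if node == dst:
            return path
        if (node, gone_down) in done:
            continue
        done.add((node, gone_down))
        for nxt in adj[node]:
            up = is_up(level, node, nxt)
            if gone_down and up:
                continue  # prohibited down -> up turn
            heapq.heappush(heap, (cost + 1, nxt, gone_down or not up, path + [nxt]))
    return None


# Hypothetical topology: tree 2-0-1-3 rooted at node 0, plus a shortcut 2-3.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
level = bfs_levels(adj, root=0)
via_shortcut = shortest_updown_path(adj, level, 2, 3)
via_root = shortest_updown_path(adj, level, 1, 2)
```

In the example, the route from node 1 to node 2 goes through the root because the alternative 1 -> 3 -> 2 would take an up link after a down link, which is the turn that up*/down* routing prohibits.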

The Topology-Reduction mutation tries to remove channels from the topology that are very lightly loaded and used by a single path. The load of the path to be removed is transferred to the existing path of the gene that has the minimum load on its channels. This mutation operation is not allowed to remove channels that are part of the minimum spanning tree.
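Selecting the candidate channels for removal can be sketched as a simple filter. The threshold that defines "very lightly loaded" is not given in the text, so `light_threshold` is an assumed parameter, as are the function name and the representation of each gene as a list of paths.

```python
def removable_channels(paths_by_gene, load, tree_links, light_threshold):
    """Directed links that are lightly loaded, used by exactly one path,
    and not part of the spanning tree."""
    users = {}
    for gene_paths in paths_by_gene:
        for path in gene_paths:
            for a, b in zip(path, path[1:]):
                users.setdefault((a, b), []).append(path)
    return sorted(
        link for link, using in users.items()
        if len(using) == 1
        and load.get(link, 0) <= light_threshold
        and link not in tree_links
    )


# Hypothetical example: the shortcut 2 -> 3 carries little traffic.
paths_by_gene = [[[0, 1, 3]], [[2, 0, 1, 3], [2, 3]]]
load = {(0, 1): 150, (1, 3): 150, (2, 0): 40, (2, 3): 10}
tree_links = {(0, 1), (1, 3), (2, 0)}
candidates = removable_channels(paths_by_gene, load, tree_links, light_threshold=20)
```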

Energy-Reduction mutation is applied to randomly selected chromosomes, with a bias towards the best class of the chromosome population in each generation. In this mutation, each path of every gene of the chromosome is traversed and we try to find a shorter replacement path by adding a suitable shortcut. This mutation helps to reduce both the energy requirement [7] of the topology and the latency of the traffic. The change is accepted only if it improves the cost.

C. Crossover Operator

Crossover is applied to a large part of the population, with a bias towards the best class of the chromosome population. To perform crossover, two chromosomes and a random crossover point are selected, and the genes of these two chromosomes are then mixed at the crossover point to produce new chromosomes. The crossover is accepted only if it improves the cost and the permitted node degree ndmax of the constructed NoC topology is not violated. Moreover, the new chromosome must have valid channels available to satisfy all the paths/routes of the chromosome.
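A single-point crossover with the acceptance test described above might look like the following sketch. Several points are assumptions: each gene is reduced to a plain list of paths, the topology is derived from the channels those paths use, "improves cost" is read as "costs less than the better parent", and total hop count stands in for the real cost function.

```python
import random


def topology_links(chromosome):
    """Undirected channels used by any path of any gene."""
    links = set()
    for gene_paths in chromosome:
        for path in gene_paths:
            for a, b in zip(path, path[1:]):
                links.add((min(a, b), max(a, b)))
    return links


def max_degree(links):
    degree = {}
    for a, b in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return max(degree.values(), default=0)


def crossover(parent_a, parent_b, cost, nd_max, rng):
    """Head of parent_a joined to the tail of parent_b; None if the child is rejected."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents need the same number of genes (at least two)")
    cut = rng.randrange(1, len(parent_a))
    child = parent_a[:cut] + parent_b[cut:]
    within_degree = max_degree(topology_links(child)) <= nd_max
    # Assumption: the child must beat the better of its two parents.
    improves = cost(child) < min(cost(parent_a), cost(parent_b))
    return child if within_degree and improves else None


def hop_cost(chromosome):
    # Stand-in cost for the example only: total number of hops over all paths.
    return sum(len(path) - 1 for gene in chromosome for path in gene)


# Two genes, so the only possible cut point is 1 and the example is deterministic.
parent_a = [[[0, 3]], [[2, 0, 1, 3]]]
parent_b = [[[0, 1, 3]], [[2, 3]]]
child = crossover(parent_a, parent_b, hop_cost, nd_max=4, rng=random.Random(0))
blocked = crossover(parent_a, parent_b, hop_cost, nd_max=1, rng=random.Random(0))
```

Because the sketch derives the topology from the paths themselves, every path of the child automatically has its channels available; an implementation that stores the topology separately would need an explicit check for that condition.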

D. Measure of Fitness & Output

The fitness measure essentially has two components for the traffic on the customized topology: (1) the average bandwidth requirement overflow and (2) the dynamic energy requirement. Based on the fitness requirement, the cost function is formulated as follows.

Let $X_1$ be the maximum chromosome energy requirement among all the chromosomes in the population, $X_2$ the maximum possible bandwidth requirement from a channel of the NoC topology graph among all chromosomes in the population, $Ec_i$ the energy requirement of chromosome $c_i$, and $Bc_i$ the average bandwidth requirement overflow per channel of the NoC topology graph of chromosome $c_i$. The cost of chromosome $c_i$ can then be formulated as follows:

$$Cost_i = \alpha \times (Ec_i / X_1) + \beta \times (Bc_i / X_2)$$

where $\alpha$ and $\beta$ are two empirically determined constants. Through exhaustive experimentation, suitable values of $\alpha$ and $\beta$ were fixed at 0.25 and 0.75, respectively. The fitness of a chromosome is regarded as high if its cost is close to 0.0. Note that the best 10% of the chromosomes in any generation are transferred directly to the next generation, so that the solution does not degrade between generations.
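The cost evaluation and the elitism step can be illustrated with a short sketch. It assumes that the first constant (0.25) weights the normalized energy term and the second (0.75) weights the normalized overflow term, named `alpha` and `beta` here; the function names and the numbers in the example are made up for illustration.

```python
def chromosome_costs(energy, overflow, x2, alpha=0.25, beta=0.75):
    """Cost per chromosome: alpha * (energy / X1) + beta * (overflow / X2).

    X1 is the largest energy requirement in the population; X2 (the largest
    possible per-channel bandwidth requirement) is supplied by the caller.
    """
    x1 = max(energy)
    return [alpha * e / x1 + beta * b / x2 for e, b in zip(energy, overflow)]


def elite(population, cost_values, fraction=0.10):
    """The lowest-cost fraction of the population, copied unchanged to the next generation."""
    keep = max(1, int(len(population) * fraction))
    ranked = sorted(range(len(population)), key=lambda i: cost_values[i])
    return [population[i] for i in ranked[:keep]]


# Hypothetical population of three chromosomes.
population = ["c0", "c1", "c2"]
energy = [80.0, 100.0, 60.0]    # energy requirement per chromosome
overflow = [10.0, 0.0, 30.0]    # average bandwidth overflow per channel
cost_values = chromosome_costs(energy, overflow, x2=200.0)
survivors = elite(population, cost_values)
```

In this example the chromosome with the lowest energy (`c2`) is not the fittest, because its overflow term is weighted three times as heavily as its energy term.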

IV. EXPERIMENTAL RESULTS

To evaluate the suitability of our generated application-specific topology for use in heterogeneous NoC platforms, we carried out various performance comparisons. We consider communication to be highly irregular because of the diversity of hardware components in such platforms. In order to obtain a broad range of different irregular traffic scenarios, we randomly generated multiple core graphs using TGFF [8] with diverse bandwidth requirements of the IP cores. In this way the heterogeneity of the cores is taken into account, with communication intensity varying between different pairs of cores.

For performance comparison, the NoC simulator IrNIRGAM, which supports irregular topologies, is deployed. Each core injects one flit every 2 clock cycles into the network, i.e., the flit interval is kept at 2 clock cycles. IrNIRGAM was run for 1000 clock cycles, and network throughput in flits and average flit latency were used as the parameters for comparison. Network throughput is the number of flits received by the cores of the NoC during the simulation run. The flit latency is the number of clock cycles a flit takes from entering the network until its reception at the target node. It is worth mentioning that these latency values do not include the source queuing time, which is the time span a flit waits at the source node in case of a congested out-channel. Therefore, latency and throughput are expected to saturate towards high injection loads. All data queues in the network routers are sized to buffer eight flits per channel.
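The two metrics can be computed from per-flit records as in the sketch below. The record layout (`created`, `entered`, `received` cycle stamps) is an assumption made for illustration and is not the IrNIRGAM log format.

```python
def summarize(flits):
    """Throughput (flits received) and average latency in clock cycles.

    Latency runs from the cycle a flit enters the network to the cycle it is
    received, so the time spent queuing at the source is deliberately excluded.
    """
    delivered = [f for f in flits if f["received"] is not None]
    throughput = len(delivered)
    if throughput == 0:
        return 0, 0.0
    total = sum(f["received"] - f["entered"] for f in delivered)
    return throughput, total / throughput


# Hypothetical records; the third flit is still in flight when the run ends.
flits = [
    {"created": 0, "entered": 0, "received": 12},
    {"created": 2, "entered": 5, "received": 20},
    {"created": 4, "entered": 8, "received": None},
]
throughput, avg_latency = summarize(flits)
```

The second flit waits three cycles at its source before entering the network; that wait does not count towards its latency of 15 cycles.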

The proposed genetic algorithm was run for 1000 generations with a population size of 200 to obtain the customized irregular topology. Mutations are applied to 15% of the population and crossover to 30% of the population in each generation. Each generated irregular topology was optimized with respect to the applied traffic pattern, with the maximum channel length set to twice the length of a tile. Along with the topology, the genetic algorithm was used to generate the routing tables and the required load assignment for each path. The proposed algorithm optimizes the load distribution by trying to keep each channel load at 75% to 100% of the available bandwidth.

Figure 3. Average performance comparison of IrNoC with 2-D Mesh topology with X-Y and OE routing

Figure 3 summarizes the performance results averaged over 50 generated irregular topologies (IrNoC) with a permitted node/core degree of 4 and the number of cores varying between 16 and 81, compared with 2-D meshes of the same number of cores using X-Y (dimension-order) routing and OE (odd-even) routing. For IrNoC, table-based routing was used. Figure 3 shows that the optimized IrNoC sustains a higher throughput and lower transmission latency in all cases. IrNoC with a permitted node degree of 4 achieves 12.3% and 16.8% more throughput on average, with a decrease in average flit latency of 17.9 and 58.2 clock cycles, in comparison to the corresponding 2-D mesh with X-Y and OE routing, respectively. Figure 4 shows the throughput and latency comparison of IrNoC and the 2-D mesh with X-Y and OE routing for varying packet injection intervals in clock cycles.

Figure 4. Average performance comparison of IrNoC with 2-D Mesh topology with X-Y and OE routing with varying packet injection interval

V. CONCLUSION

The genetic algorithm based methodology was implemented to tailor the network topology to the requirements of the application, especially where traffic workloads can be characterized to a large extent. The generated topology achieves improved load distribution among the channels, leading to better throughput and improved average latency as congestion is reduced. All the routes generated through the presented methodology are kept deadlock free for efficient traffic flow. The methodology proved to be better than traditional approaches in terms of performance and, even more so, in terms of scalability. We believe that the combined treatment of the routing algorithm and topology generation offers a huge potential for optimization in future application-specific NoC architectures.

REFERENCES

[1] W. J. Dally, B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” in IEEE Proceedings of the 38th Design Automation Conference (DAC), pp. 684–689, 2001.

[2] L. Benini, G. De Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, January 2002.

[3] International Technology Roadmap for Semiconductors website. [Online]. Available: http://public.itrs.net/, 2004.

[4] M. D. Schroeder et al., “Autonet: A High-Speed Self-Configuring Local Area Network Using Point-to-Point Links,” Journal on Selected Areas in Communications, vol. 9, Oct. 1991.

[5] A. Jouraku, A. Funahashi, H. Amano, M. Koibuchi, “L-turn routing: An Adaptive Routing in Irregular Networks,” in International Conference on Parallel Processing, pp. 374-383, Sep. 2001.

[6] Y.M. Sun, C.H. Yang, Y.C. Chung, T.Y. Hang, “An Efficient Deadlock-Free Tree-Based Routing Algorithm for Irregular Wormhole-Routed Networks Based on Turn Model,” in International Conference on Parallel Processing, vol. 1, pp. 343-352, Aug. 2004.

[7] J. Hu, R. Marculescu, “Energy-Aware Mapping for Tile-based NOC Architectures Under Performance Constraints,” ASP-DAC 2003, Jan. 2003.

[8] R. P. Dick, D. L. Rhodes, W. Wolf, “TGFF: task graphs for free,” in Proc. Intl. Workshop on Hardware/Software Codesign, March 1998.

[9] Y. C. Chang, Y. W. Chang, G. M. Wu and S. W. Wu, “B*-Trees: A New Representation for Non-Slicing Floorplans,” in Proc. 37th Design Automation Conference, pp. 458-463, 2000.

[10] C. Glass and L. Ni, “The Turn Model for Adaptive Routing,” in Proc. 19th International Symposium on Computer Architecture, pp. 278–287, May 1992.

[11] Lavina Jain, B.M. Al-Hashimi, M.S. Gaur, V. Laxmi, A. Narayanan, “NIRGAM: A Simulator for NoC Interconnect Routing and Application Modelling,” DATE 2007, 2007.

[12] http://www.nirgam.ecs.soton.ac.uk (Last viewed on August 30, 2009)

[13] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, “A Network on Chip Architecture and Design Methodology,” in Proceedings of VLSI Annual Symposium (ISVLSI 2002), pp. 105–112, 2002.

[14] L. Natvig, “High-level Architectural Simulation of the Torus Routing Chip,” in Proceedings of the International Verilog HDL Conference, California, pp. 48–55, Mar. 1997.

[15] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, Prentice Hall International, 1990.

[16] T. Mitchell, Machine Learning, McGraw Hill International Editions, 1997.
