Design of a Network-On-Chip platform for MPSoCs using TLM 2.0 standard and FPGA implementation
Fernando Adolfo Escobar Juzga
Electric and Electronic Engineering Department
APPROVED:
Antonio García Rozo, Ph.D.
Mauricio Guerrero, MSc.
Alain Gauthier, Ph.D., Dean of Faculty
to my
MOTHER, FATHER and SISTERS
with love
Design of a Network-On-Chip platform for MPSoCs using TLM 2.0 standard and FPGA implementation
by
Fernando Adolfo Escobar Juzga
Thesis
Presented to the Academic Faculty of the Graduate School of
Universidad de los Andes, Bogotá
in Partial Fulfilment
of the Requirements
for the Degree of
Master Of Electronic Engineering
Electric and Electronic Engineering Department
Universidad de Los Andes
January 2011
Acknowledgements
I wish to thank my advisers Antonio García and Mauricio Guerrero for their guidance and
support throughout the project; it was their experience and knowledge that helped me
choose, and come to love, this research area years before. To my parents and sisters, who have
unconditionally supported me at all times and without whom I wouldn't have got here.
Additionally, I want to thank all my friends, who continuously inspire me and show me
how far one can go with hard work, dedication and passion.
This thesis wouldn't have been possible without the support of the OSCI TLM working
group and all its members. Finally, I want to thank the CMUA group for providing me with
the resources and tools that were required.
Abstract
Complex systems that integrate a great variety of modules on the same die require higher
level design techniques that yield accurate models suitable for testing hardware as
well as software at early stages; Multiprocessor Systems-on-Chip (MPSoCs) are scaling
to levels where it is possible to embed tens, and up to hundreds, of cores on the same chip.
Such architectures cannot be integrated with traditional bus structures, as these are not
scalable; as a solution, a new paradigm called Network on Chip (NoC) has gained
strength.
SystemC, an IEEE standard for electronic system level (ESL) design, is used here to build a
NoC functional model; to abstract hardware details and speed up simulations, the new
Transaction Level Modelling standard (TLM 2.0) is also adopted. Under different
design constraints, variables such as router and network interface architectures, routing
algorithms, message and flit size, etc., are evaluated.
At a final stage, a VHDL synthesis is performed and compared with other implementations.
The results show this design flow to be adequate and helpful for this kind of system, given
its size and complexity.
Table of Contents
Page
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Networks On Chip Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Parallel Computing Memory Model . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Networks On Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Physical Layer: Topology . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Data Link Layer: Flow Control . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Network Layer: Switching Policy and Routing Algorithm . . . . . . 10
1.2.4 Transport Layer: Network Interface Card . . . . . . . . . . . . . . . 13
1.3 SystemC and Transaction Level Modelling TLM 2.0 . . . . . . . . . . . . . 17
1.4 Open Core Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 NoC Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Flit and Message structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Router Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Router TLM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2 Traffic Evaluation and Routing Algorithm Testing . . . . . . . . . . 33
2.2.3 Router VHDL Model . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3 Network Interface Card Architecture . . . . . . . . . . . . . . . . . . . . . 47
2.3.1 Network Interface TLM Model . . . . . . . . . . . . . . . . . . . . . 49
2.3.2 Network Interface VHDL Model . . . . . . . . . . . . . . . . . . . . 55
2.4 Software Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.1 4× 4 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . 59
2.4.2 8× 8 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . 60
3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1 Significance of the Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
List of Tables
1.1 Flow Control Techniques for NoCs . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Generic Payload Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Basic OCP Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4 Burst OCP Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1 Flit fields explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Router Arbitration Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 TLM 2.0.1 Phases Interpretation for Routers . . . . . . . . . . . . . . . . . 31
2.4 Router Area Consumption on Virtex5 (XC5VFX30T-1FF665) . . . . . . . 49
2.5 VHDL-SystemC equivalence of NIC blocks . . . . . . . . . . . . . . . . . . 56
List of Figures
1 OSI Protocol Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Shared Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Distributed Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Common Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Ad-Hoc Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Packet Switching on NoCs . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Guidelines for selecting a Routing Algorithm . . . . . . . . . . . . . . . . . 12
1.7 Turn model for Adaptive Routing . . . . . . . . . . . . . . . . . . . . . . . 14
1.8 West First Routing Examples . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.9 North Last Routing Examples . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.10 Negative First Routing Examples . . . . . . . . . . . . . . . . . . . . . . . 15
1.11 Transaction Level Modelling Use Cases, Coding Styles and Mechanisms . . 18
1.12 TLM Transaction Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.13 TLM Base Protocol Phases . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.14 OCP Read Transaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.15 OCP Burst Write Transaction . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1 Head Flit Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Message Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Torus Topology NoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Block diagram of a Router with Virtual Channels . . . . . . . . . . . . . . 29
2.5 Virtual Channel connections to Router . . . . . . . . . . . . . . . . . . . . 30
2.6 General Router Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Link Utilisation for Hotspot 10 %. Traffic going Right - Down . . . . . . . 38
2.8 Link Utilisation for Hotspot 10 %. Traffic going Left Up . . . . . . . . . . 39
2.9 Timing statistics for Hotspot 10 %. Traffic . . . . . . . . . . . . . . . . . . 40
2.10 Link Utilisation for Matrix Transpose Traffic going Right - Down . . . . . 41
2.11 Link Utilisation for Matrix Transpose Traffic going Left Up . . . . . . . . . 42
2.12 Timing statistics for Matrix Transpose Traffic . . . . . . . . . . . . . . . . 43
2.13 VHDL Router Black Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.14 VHDL Block Diagram for the XY Routing Module . . . . . . . . . . . . . 46
2.15 VHDL Block Diagram for Input Port Module . . . . . . . . . . . . . . . . 46
2.16 VHDL Block Diagram for Output Port Module . . . . . . . . . . . . . . . 47
2.17 VHDL Block Diagram for the Router . . . . . . . . . . . . . . . . . . . . . 48
2.18 TLM Phases in a NIC Read Operation . . . . . . . . . . . . . . . . . . . . 53
2.19 TLM Phases in a NIC Write Operation . . . . . . . . . . . . . . . . . . . . 54
2.20 VHDL Block Diagram for Network Interface Card . . . . . . . . . . . . . . 57
2.21 State Machine for the Handshaking Control . . . . . . . . . . . . . . . . . 58
2.22 State Machine for the NIC End to End Flow Control . . . . . . . . . . . . 59
2.23 NoC Performance for a 4× 4 Matrix Multiplication . . . . . . . . . . . . . 60
2.24 NoC Performance for a 8× 8 Matrix Multiplication . . . . . . . . . . . . . 61
2.25 Total simulation time at each node with North Last Routing . . . . . . . . 61
2.26 Total simulation time at each node with West First Routing . . . . . . . . 62
Introduction
Multiprocessor systems on chip (MPSoCs) are becoming ubiquitous platforms in current
devices; large integration of modules has permitted the successful creation of multi-core
architectures (2 to 20 cores) in the past, and now provides the means and technology for
developing the so-called many-core ones (hundreds of processors). These platforms, however,
require better design practices in order for both hardware and software entities to be
ready for release on time.
One of the most recent proposals for creating complex embedded systems and many-core
platforms is known as Networks on Chip (NoC); instead of utilizing bus systems, interconnections
between components are made through routers and Network Interface Cards (NICs);
whilst routers are in charge of transporting data throughout the chip, NICs gather information
from/to end-modules (ports, memories, cores, etc.) and send it to the router network
for delivery. In spite of being hardware, due to its complexity, a NoC can be better
designed from higher levels of abstraction rather than traditional RTL/HDL; new languages
such as SystemC are more appropriate for this task, as they can be used to rapidly create
software representations of hardware (known as virtual platforms) [3].
SystemC is currently an IEEE standard for high level modelling with which C++ descriptions
of hardware platforms can be made; it includes a release called Transaction Level
Modelling (TLM 2.0) that was designed to speed up the development of embedded platforms,
as it omits unnecessary physical communication details such as clocks or pin-out
specifications; in addition to all C++ features, SystemC and TLM 2.0 provide libraries
that ease the emulation of the real platform with as much timing detail as needed. To this
date, however, most NoC designs are conceived from lower abstraction levels; according
to [1], simulation and synthesis are the most common ways to evaluate them; only a few
have been synthesised as ASICs and the rest are implemented on FPGAs. High level models
of NoCs have also been proposed: in [6] a 4x4 mesh network is built and simulated with
SystemC; [7] creates C++ libraries that make up a simulator for NoCs; reference [8] designs
a 6x6 mesh network and tests it with SystemC and VHDL before implementing it on an
FPGA; a basic, low-level SystemC description of a scalable NoC is presented in [9] and is
validated with an MPEG encoder. Many more proposals can be found in [1] and [2], yet
none of them use the TLM 2.0 standard for design.
As suggested by some of the previous references, and especially by [5], higher abstraction
levels are needed when constructing these architectures; to see this, consider the
specifications for network design defined by the OSI reference model [4]; although not all
of OSI's layers have a direct equivalence in NoCs, most of its principles can be extrapolated
to this field. In Figure 1, the OSI protocol stack is shown; for this work, the three upper
layers can be joined into a single one. From the figure, the dependency between hardware and
software is evident when reading it from top to bottom; although it is possible to test
each group separately, SystemC descriptions and a correct application of the TLM standard can
help co-design the NoC as a whole and iteratively improve it on both aspects.
Low level layers of the protocol define everything related to routers: architecture, routing
algorithms, flow control, switching techniques, etc.; higher ones determine the NIC's structure.
Because bus systems are the medium through which processors and most peripherals
transfer information, NICs use them to interchange data with end modules; given the
great variety of bus specifications, this work considers only the most common ones, that is,
AMBA from ARM [11] and OCP [10] from the OCP-IP group; the latter was selected for
its simplicity and strong support from the ESL community.
On the other hand, and as previously stated, the top layers of the OSI model can be
summarized into a greater group that refers to the software models necessary to access the network;
there are mainly two approaches in the field of multi-processor programming: shared or
distributed memory. As its name indicates, shared memory implies that processing units can
access the same physical or logical memory spaces at any time; a well-known API implementation
of this is called OpenMP [12], and it is led by a non-profit corporation, composed
of several companies and researchers, named the OpenMP ARB. The distributed model, on
Figure 1: OSI Protocol Stack. Networks are usually defined according to the layers shown.
the contrary, assigns separate physical memory sectors to each unit, and data is shared
between modules through message passing. One of the first API implementations
of this protocol is called MPI [13] and is still developed by the MPI Forum.
Message passing has seen wide application in computer networks and appears better suited
for them, as computers don't share the same memory. For the purposes of this work, some
of the MPI specifications were adopted for the NIC design.
The following sections provide a better insight into all the topics already mentioned;
a functional TLM 2.0 model of a Network On Chip is proposed and validated through
simulations; traffic patterns and other NoC parameters are analysed with the design, and
finally a VHDL synthesis is presented to evaluate area consumption.
Chapter 1
Networks On Chip Design
To properly approach NoC design, several aspects need to be defined in terms of the
aforementioned OSI model. Through the evaluation of each layer, all aspects needed for our
high level model will be defined.
1.1 Parallel Computing Memory Model
Parallel computing has faced big challenges since its creation: task dependencies, race
conditions, mutual exclusion and parallel slowdown are concerns that can't be omitted.
Whether memory is shared or distributed, these are software aspects that have to be solved
at a high level, and so they represent an additional task for the programmer.
Because none of the previous issues makes any difference to the choice of memory model,
it is necessary to consider elements that clearly affect this decision: portability and
scalability. A shared memory configuration is shown in Figure 1.1; if a program is to be
run on such a platform, either a software compiler aware of all system resources has to be
provided along with the hardware, or the programmer has to know low level details
to write an application for it. Apart from that, if the number of cores is modified, the
cost at the software level shouldn't be expensive at all, and again, it will depend either on the
compiler or the programmer.
The distributed memory model is shown in Figure 1.2; as indicated, processors interchange
information through messages. In contrast to the previous approach, no additional compiler or deep
knowledge of the hardware is needed; only network access methods are required. In
case the number of cores changes, a correctly parametrized software description would solve
the problem.
Many more pros and cons of each configuration could be mentioned, but that goes
beyond the scope of this work; it suffices to state that the distributed memory model
better suits the NoC's behaviour and is the one implemented here.
Figure 1.1: Shared Memory Model. All processing elements share a big memory area; each core may have as many caches as desired, yet the main memory is common to all of them.
Figure 1.2: Distributed Memory Model. Interconnection between cores is done through a network; if data has to be shared, it is sent via message passing.
1.2 Networks On Chip
Designing Networks On Chip is a process that requires consideration of several variables
in order to separate communication from computation. The OSI model shown in Figure 1
can be taken as a reference for these systems; to understand this association, each level of
the stack can be defined as follows:
1. Physical Layer: Defines voltage levels, length and width of wires, timing details
and topology, among others.
2. Data Link Layer: It is in charge of safe data delivery; it specifies flow control
mechanisms between hardware modules.
3. Network Layer: Controls message delivery from one node to another. It's responsible
for storing data and implementing routing algorithms.
4. Transport Layer: In charge of establishing connections between end-nodes and
providing the information for them. This module packages (unpackages) data and sends
(receives) it to (from) the routers.
5. Session, Presentation and Application Layers: Can be condensed into a single
Application group for NoCs; it refers to higher level aspects of the communication
such as software.
By following the above scheme, it is possible to define a functional and
synthesizable NoC model considering all its aspects. Although a high level SystemC model of
the network is constructed, hardware details are considered for future implementation.
The following sections show the specifications at each layer for the model developed.
1.2.1 Physical Layer: Topology
Some aspects of the physical layer depend on the technology to be used for fabrication and
can't be specified from the beginning; operating frequency and voltage levels are examples
of such limitations; because synthesis is not the main target of this work, the previous items
were discarded. The bus width was selected to match most standard processors nowadays,
that is, 32-bit ones.
Another important issue, and perhaps the most relevant at this layer, is the topology;
contrary to computer networks, NoCs have a fixed structure that cannot be modified for
the rest of the chip's lifetime. On this subject, several configurations have been proposed:
Figure 1.3 illustrates the most common topologies for networks; SPIN [14], Mesh, Torus,
Folded Torus, Octagon and Trees are a few examples. According to [1], Mesh and Torus
topologies constitute 62% of the overall designs; trees represent 12% and the rest, smaller
percentages. There are also specific ad-hoc implementations, which can be seen in Figure
1.4; the addition of links and the combination of basic structures constitute the differences; despite
reducing worst case paths or improving latency, the cost in area consumption and in the
creation of new routing algorithms might be too high.
The guidelines for picking a topology were its scalability and the availability of routing
algorithms. As mentioned before, Mesh and Torus structures are used by the majority
of researchers, mainly because of their scalability: the cost of adding one or two
cores to a grid is quite low, as it doesn't critically change the structure; additionally,
routing algorithms need not be modified. The difference between them lies in the turnaround links,
which can significantly reduce some of the worst case conditions. In reference [18], both
structures show similar behaviour in power consumption, throughput and saturation, but
Torus topologies perform better with adaptive routing algorithms which, as will be seen in
the next section, are needed. After considering all the previous restrictions and the results of
the cited references, the Torus topology was selected for this work.
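The effect of the turnaround links can be made concrete with a small sketch. Assuming a row of n nodes per dimension (the function names and layout here are illustrative, not the thesis's implementation), the wrap-around link of a Torus roughly halves the Mesh's worst-case hop count in each dimension:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Per-dimension hop count between two node coordinates.
// A mesh row offers only the direct path; a torus row adds a wrap-around
// link, so the router may take whichever direction is shorter.
int mesh_hops(int src, int dst) {
    return std::abs(dst - src);
}

int torus_hops(int src, int dst, int n) {
    int direct = std::abs(dst - src);
    return std::min(direct, n - direct);  // wrap-around path has n - direct hops
}
```

For an 8-node row, the mesh worst case is 7 hops while the torus worst case is 4, which is the reduction in worst-case paths mentioned above.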
1.2.2 Data Link Layer: Flow Control
Routers are complex modules that use simple handshaking protocols to transfer data.
Whether interacting with another router or a network interface card (NIC), the mechanism
is the same. Again, some differences with respect to computer networks exist; modules
inside the same chip transmit data much more reliably than physically separated
ones, so it suffices to control when to send and receive information, assuming that it is
properly transmitted. Some router implementations such as Æthereal [19] or MANGO [20]
Figure 1.3: Common Network Topologies. For both computer networks and NoCs, the most common structures are shown in the graphic.
Figure 1.4: Ad-Hoc Network Topologies. Academic proposals for NoC topologies: Mesh Connected Crossbars [16] (left), Spidergon [17] (center), and Diagonal Mesh [15] (right).
offer Quality of Service (QoS) guarantees, but that requires highly specialized work that
goes beyond the scope of this project.
Flow control techniques are shown in Table 1.1; most implementations use the Credit
Based approach; STALL/GO has never been implemented, and the rest of the literature
uses handshaking and ACK/NACK-like solutions. The handshaking approach is adopted
for our design.
It is important to note that flow control in the NoC's SystemC model is abstracted
with the TLM 2.0 standard and may correspond to any of the available techniques when
ready for synthesis.
Table 1.1: Flow Control Techniques for NoCs [2].

Credit Based: Every router keeps an internal counter of the spaces available for data storage (credits); once a space is freed, a credit is sent back to inform of its availability.

Handshaking Signal Based: A VALID signal is sent whenever a flit is transmitted. The receiver acknowledges by asserting a VALID signal after consuming it.

ACK/NACK: A copy of a data packet is kept in a buffer until an ACK is received; if asserted, the flit is deleted. If a NACK signal is asserted, the flit is scheduled for retransmission.

STALL/GO: Two wires are used for flow control; when a buffer space is available, a GO signal is activated. When no space is available, a STALL signal is asserted.
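As an illustration of the Credit Based entry above, the sender side reduces to a single counter. This is a hedged sketch with invented names (CreditSender, try_send), not code from the thesis:

```cpp
#include <cassert>
#include <cstddef>

// Sender-side view of credit-based flow control: the counter mirrors the
// number of free buffer slots at the downstream router.
class CreditSender {
public:
    explicit CreditSender(std::size_t buffer_slots) : credits_(buffer_slots) {}

    // Consume one credit and send; refuse when no free slot is known.
    bool try_send() {
        if (credits_ == 0) return false;  // receiver buffer full: stall
        --credits_;
        return true;
    }

    // The receiver freed a slot and returned a credit.
    void credit_returned() { ++credits_; }

    std::size_t credits() const { return credits_; }

private:
    std::size_t credits_;
};
```

The same counter discipline applies per virtual channel when those are present.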
1.2.3 Network Layer: Switching Policy and Routing Algorithm
The switching policy determines the way information is transmitted; it can be either packet or
circuit switched. Circuit switching is the least implemented: it states that a path from
source to destination must be reserved before transmitting data and shall only be released
after the message has been fully delivered. This policy is time expensive and may increase
network congestion, because messages can be blocked for a long time if the data is big; such
a situation can easily lead to deadlock issues.
Packet switching is widely used in both computer networks and on-chip ones; it can
be implemented in any of the following three versions:
1. Wormhole: Packets are split into smaller units called flits (Flow Control Units).
The head flit contains the address information, which each router uses to forward it towards the
destination; body flits follow it in a worm-like way. Only a 1-flit space is necessary
at each router input for implementation.
2. Store and Forward: Routers accept and send data when there is enough capacity
to fully store the packet. A minimum space equal to the packet's maximum length
is required per router.
3. Virtual Cut Through: Data is transmitted per flit but is only accepted when
there is enough buffer space to save the whole packet; all routers must be able to
store at least the maximum packet length.
Figure 1.5 illustrates how information is transmitted with packet switching techniques;
around 80% of proposed NoCs implement the wormhole one because of its low area
requirements; wormhole switching was also selected for this work given those advantages.
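The wormhole scheme just described can be pictured as a simple packetizer. The field names below (FlitKind, payload) are illustrative assumptions; the actual flit layout used in this work is defined in Chapter 2:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split one packet into wormhole flits: a head flit carrying routing
// information, followed by body flits and a closing tail flit.
enum class FlitKind : std::uint8_t { Head, Body, Tail };

struct Flit {
    FlitKind kind;
    std::uint32_t payload;  // head: destination address; body/tail: data word
};

std::vector<Flit> packetize(std::uint32_t dest,
                            const std::vector<std::uint32_t>& data) {
    std::vector<Flit> flits;
    flits.push_back({FlitKind::Head, dest});  // head flit: routing info only
    for (std::size_t i = 0; i < data.size(); ++i) {
        FlitKind kind = (i + 1 == data.size()) ? FlitKind::Tail : FlitKind::Body;
        flits.push_back({kind, data[i]});     // body/tail flits follow the head
    }
    return flits;
}
```

Because routers forward flit by flit, only the head flit needs to be inspected for routing; the rest simply follow the path it reserved.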
Another item addressed by the Network Layer that highly affects the platform's performance
is the routing algorithm; because of the Torus's resemblance to Mesh arrangements,
Figure 1.5: Packet Switching on NoCs: Wormhole (left), Store and Forward (center) and Virtual Cut Through (right) [19]. Only the wormhole technique significantly reduces area consumption.
most algorithms that work for the latter can also operate on Torus networks with minor
modifications.
A good guideline for selecting an appropriate algorithm, irrespective of the structure,
is the scheme shown in Figure 1.6; several router implementation details can be established from
that graph: router complexity increases with the number of destinations it can deliver
information to. Due to area restrictions and the possibility of solving it at the software
level, multicast routing is discarded for the current work.
Routing decisions also determine the chip's design: centralized routing requires a
controlling entity, aware of all nodes and of the traffic throughout the network, to decide how
the information should traverse it; source routing might increase the packet's size for long paths;
and multiphase routing implies some of the previous problems as well. Distributed
routing is by far the best suited for NoCs and facilitates the adoption of the algorithms
proposed.
As for implementation, both lookup tables and FSMs are feasible to adopt; the area cost
of both options is similar and doesn't affect the design drastically; one variable that could
determine which to choose is whether the algorithm is deterministic (always the same path
between two nodes) or adaptive (dependent on network congestion). Thanks to the fact that a
high level model of the network will be created, tests are to be carried out with deterministic
and adaptive algorithms; adaptive ones can be backtracking (fault tolerant), mis-routing
(able to route away from the destination if necessary) or partial (not considering all possible
routing paths).
Figure 1.6: Guidelines for selecting a Routing Algorithm [2].
For grid-like structures, the most common deterministic algorithm is the XY one, where
information travels along the X direction until it reaches the X coordinate of the destination;
it then travels along the Y direction. Adaptive routing is more complex, as it attempts to send
data through less congested paths that aren't always minimal; because of that, two
conditions usually restrict an algorithm's adaptability: deadlock, where several
messages block each other's paths, preventing any of them from ever advancing, and livelock,
where data keeps travelling throughout the chip without ever reaching the target.
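The XY decision just described fits in a few lines. This is a minimal sketch; the port names and the convention that X grows eastwards and Y grows northwards are assumptions for illustration, not the thesis's router interface:

```cpp
#include <cassert>

// One XY routing decision at a router located at (x, y), for a packet
// addressed to (dest_x, dest_y): correct X fully, then correct Y.
enum class Port { East, West, North, South, Local };

Port xy_route(int x, int y, int dest_x, int dest_y) {
    if (dest_x > x) return Port::East;   // travel in X first
    if (dest_x < x) return Port::West;
    if (dest_y > y) return Port::North;  // X matches: travel in Y
    if (dest_y < y) return Port::South;
    return Port::Local;                  // arrived: deliver to the attached NIC
}
```

Because every packet corrects X before Y, the only turns taken are X-to-Y ones, which is why plain XY routing is deadlock free on a mesh.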
A few semi-adaptive, deadlock- and livelock-free algorithms that are widely adopted are known
as turn model solutions [21], [22]; of all possible 90° turns, 2 are prohibited in order
to avoid deadlock. Figure 1.7 shows three algorithms derived from this theory. To better
understand each one, a brief explanation, taken from [23], is presented:
• West-First: Packets first travel west if necessary; they are then adaptively routed
south, east and north. The prohibited turns are the two into the west direction. Figure
1.8 shows some path examples with this algorithm.
• North-Last: When going north, packets cannot turn anywhere else; the only way for
packets to go northwards is when north is the last direction to take. Examples are shown
in Figure 1.9.
• Negative-First: The prohibited turns are the two from a positive direction to a negative
one; if a packet has to travel in a negative direction, it must start in that direction. Figure
1.10 exemplifies this behaviour.
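The forced/adaptive split that these rules produce can be sketched for a minimal-path variant of West-First (a hedged illustration; names and the coordinate convention are assumptions, and the non-minimal freedom of the full algorithm is not modelled):

```cpp
#include <cassert>
#include <vector>

enum Dir { EAST, WEST, NORTH, SOUTH };

// West-First candidate outputs: if the destination lies to the west, the
// packet must go west first (turns into WEST are the two forbidden ones);
// otherwise it may choose adaptively among the productive directions.
std::vector<Dir> west_first(int cur_x, int cur_y, int dst_x, int dst_y) {
    std::vector<Dir> out;
    if (dst_x < cur_x) { out.push_back(WEST); return out; }  // forced phase
    if (dst_x > cur_x) out.push_back(EAST);                  // adaptive phase
    if (dst_y > cur_y) out.push_back(NORTH);
    if (dst_y < cur_y) out.push_back(SOUTH);
    return out;  // empty vector: the flit has arrived
}
```

An arbiter can then pick any of the returned directions, e.g. the least congested one, which is where the adaptiveness comes from.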
Any of the aforementioned algorithms can be used with the SystemC model of the
network, since describing them does not require much development time; studies presented
in [18] show that no significant difference exists among them.
Other algorithms have been proposed in [24], [25], [26], [27] and many more references,
but they are left for future work.
1.2.4 Transport Layer: Network Interface Card
Up to this point, most design specifications affected the router’s final structure; this
layer, however, has more implications for the Network Interface Card. The problems to be
solved at this level are end-to-end flow control and the (un)packing of information.
In order to control packet injection into the network, our NIC design is based on the
message passing model previously mentioned; the way processors communicate with
each other can be summarized in two activities, sending and receiving data: for each
message transmitted by a core (write operation), another core should be expecting it (read
operation).
Figure 1.7: Turn model for Adaptive Routing: two turns are prohibited on each model to avoid deadlocks; minimal and non-minimal paths are possible for all options [22].
Figure 1.8: West First Routing Examples [23].
Figure 1.9: North Last Routing Examples [23].
Figure 1.10: Negative First Routing Examples [23].
It is clear that the processors won’t be synchronized at all times, and at some point two
or more cores could send messages to another one that isn’t ready yet; this would only
increase network congestion, require retransmission protocols and message-discard support,
and might also lead to a high-level deadlock if not properly handled.
Considering these problems, and especially the area constraints, the proposed Network
Interface Card implements end-to-end flow control with the following protocol: when
a core requests data, that is, performs a read operation, it sends a 1-flit-size packet to the
core that is supposed to write to it; upon reception, the second NIC sends the information
only if its core has a pending write transaction that matches the requester’s address;
if the second core does not expect that specific request, the NIC discards it, and the first
core has to retry after some time. On the other hand, when a NIC receives a write transaction,
it starts packing data, so that when a request arrives, most if not all of the information is
ready to be transmitted; if an application is properly written, the number of read requests
should match the number of write statements.
The cost of this implementation is that for every read/write pair, at least one flit has
to be sent between the two nodes in order to “establish” a connection; this is nonetheless
far more efficient than allowing all cores to send their packets at any time and obliging
NICs to constantly delete those that don’t correspond to expected transactions.
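The request/discard/retry exchange described above could be modelled roughly as follows (all names are illustrative; the real NIC operates at flit level and over OCP signalling, both omitted here):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Sketch of the end-to-end flow control: each NIC keeps packed write
// messages indexed by the id of the core expected to read them.
struct NicModel {
    // destination core id -> packed message waiting to be requested
    std::map<int, std::vector<uint32_t>> pending_writes;

    // Core-side write: pack the data ahead of time so it is ready on request.
    void write(int dest_core, const std::vector<uint32_t>& data) {
        pending_writes[dest_core] = data;
    }

    // Network-side request (the 1-flit read packet). Returns true and the
    // data when a matching write is pending; otherwise the request is
    // discarded and the requester must retry later.
    bool request(int requester_core, std::vector<uint32_t>& out) {
        auto it = pending_writes.find(requester_core);
        if (it == pending_writes.end()) return false;  // discard; retry later
        out = it->second;
        pending_writes.erase(it);
        return true;
    }
};
```

The single map lookup stands in for the address-matching step; a retry timer on the requesting side would complete the protocol.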
Other important items regarding NIC end-to-end flow control behaviour are:
1. No read transactions requested by a processing element are accepted by the NIC while
another read is in progress; violation of the algorithm sequence can lead to incorrect
results.
2. If data is being transmitted, the NIC can accept a read transaction from the processor
but won’t send the request until the previous transaction has terminated.
3. If a NIC is receiving packets from the network (read transaction), a write transaction
can be started from processor to NIC; data can be stored at a send buffer but won’t
be sent until a request from the correct module is received.
4. A write transaction starts when a processing element sends data to the NIC for
transmission. For the processing element, it ends when all the information has been
transferred to the NIC; for the latter, when all flits have been injected into the
network.
5. A read transaction starts when a processing element requests data from the NIC; it
ends when all the information requested is successfully delivered from the NIC to the
processing element.
6. Regardless of the type of transaction a NIC is performing, under no circumstances
can it skip the execution order when another read/write transaction is received.
7. The buffer size for storing incoming and outgoing transactions was defined to be 64
words. Separate buffers are implemented to improve performance.
As stated before, the protocol used for communication between the NIC and the processing
elements is the OCP-IP one; it is explained in Section 1.4 rather than here.
1.3 SystemC and Transaction Level Modelling TLM 2.0
Transaction Level Modelling (TLM) is a standard developed by the Open SystemC Initiative
(OSCI) which provides tools to rapidly create virtual descriptions of embedded platforms;
its main objective is to decouple computation from communication at a high abstraction
level so that complex systems can be modelled. According to the OSCI group [28],
simulations run from 10X up to 1000X faster than the corresponding HDL descriptions.
The TLM 2.0 standard allows two coding styles: loosely timed (LT) and approximately
timed (AT). When a quick, lightly detailed model of a design is required, the loosely
timed approach can be adopted; LT transactions are modelled as a single function call
(read or write) that either returns after some delay, or returns immediately with an additional
delay argument so that the caller reacts after that time. AT descriptions, on the contrary,
provide mechanisms for specifying as much timing detail as desired, so they are better suited
for architectural analysis and hardware verification. The Network-on-Chip model developed
here only uses AT descriptions, and the emphasis of this explanation is therefore on them.
Figure 1.11 shows the broader context in which it is worth applying TLM 2.0.1 descriptions.
The basic unit in all TLM transactions is the object interchanged, the generic payload;
it is a C++ class whose members include the minimum elements needed to execute a transaction:
command, address and data; apart from those, additional variables such as byte
enables, streaming width, bus width and response status are included to model more
complex protocols. Generic payload objects also support user-defined extensions that can
carry an unlimited number of attributes if required. Table 1.2 explains the basic attributes
just mentioned.
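A simplified stand-in for the generic payload, mirroring the attributes of Table 1.2, might look as follows (the field names follow the table, not the actual tlm::tlm_generic_payload headers, so this is a sketch rather than the real API):

```cpp
#include <cassert>
#include <cstdint>

// Simplified mirror of the generic payload attributes of Table 1.2.
enum Command { READ, WRITE };
enum ResponseStatus { INCOMPLETE, OK_RESPONSE, GENERIC_ERROR };

struct GenericPayloadSketch {
    Command        command = READ;
    uint64_t       address = 0;
    uint8_t*       data_ptr = nullptr;        // data is read/written here
    unsigned       data_length = 0;           // bytes to transfer
    uint8_t*       byte_enable_ptr = nullptr; // enables specific bytes
    unsigned       byte_enable_length = 0;
    unsigned       streaming_width = 0;       // words per burst transfer
    bool           dmi_allowed = false;       // Direct Memory Interface usable
    ResponseStatus response_status = INCOMPLETE;
};

// Typical initiator-side setup of a write transaction.
GenericPayloadSketch make_write(uint64_t addr, uint8_t* data, unsigned len) {
    GenericPayloadSketch t;
    t.command = WRITE;
    t.address = addr;
    t.data_ptr = data;
    t.data_length = len;
    t.streaming_width = len;  // no streaming: width equals the full length
    return t;
}
```

The target is expected to overwrite response_status before the transaction completes, which is how the initiator learns whether the access succeeded.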
All TLM 2.0.1 transactions are carried out between an Initiator and at least one Target;
the channel through which they communicate is called a socket, and the only module
allowed to start transactions is the Initiator; Target modules can only reply to in-progress
transactions. Interconnect modules (such as routers or buses) can also be integrated with
Figure 1.11: Transaction Level Modelling Use Cases, Coding Styles and Mechanisms [28].
the previous ones. Figure 1.12 shows an example with one Initiator, one Interconnect
component and one Target.
AT transactions can be split into four phases, as shown in Figure 1.13; communication
takes place through the functions non-blocking forward transport (nb forward) and
non-blocking backward transport (nb backward); both functions have three parameters:
1. Trans: Pointer to the generic payload object.
2. Phase: Current transaction phase; it can be any of the phases shown in Figure 1.13.
3. Delay: Time that a module has to wait before responding to a transaction.
Initiators call nb transport forward, with BEGIN REQ as phase argument, to start
transmitting data; they use phase END RESP to conclude a transaction. Targets call
Table 1.2: Generic Payload Attributes according to [29].
Generic Payload Attribute Meaning
Command Can be either Write or Read.
Address Target address to execute transaction.
Data Pointer Pointer to the data array. Data should be read or written
to this variable.
Data Length Length of the data to be transferred, computed as
BUSWIDTH/4.
Byte Enable Pointer Used to enable access to specific data bytes.
Byte Enable Length To specify the number of valid elements of the byte enable
pointer.
Streaming Width States the number of words per burst transfer.
DMI Allowed Marks whether the Direct Memory Interface can be used
or not.
Response Status Used for storing the status of the transaction.
nb transport backward with phase END REQ to acknowledge the reception of a transaction,
and use phase BEGIN RESP to indicate its correct execution, regardless of whether
it is a read or a write.
At some points it might be unnecessary to use all four phases to model a platform’s
behaviour, e.g. when a write transaction is performed: an initiator (CPU) sends data to a
target (memory) which can execute the order immediately; in this case, the target can reply
to the initiator with a phase update, changing it from BEGIN REQ to BEGIN RESP and
adding some delay. The way each agent becomes aware of such status updates is by checking
the return value of an nb transport call. Return values can be either TLM ACCEPTED (no
change in phase), TLM UPDATED (phase updated) or TLM COMPLETED (transaction
executed).
Specific rules concerning each module’s permission to modify the generic payload attributes,
the possible return values of each nb transport call, and a detailed explanation of
the whole standard can be found in [29].
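The early-completion shortcut just described can be illustrated with a minimal sketch (the enum values mimic, but are not, the real TLM 2.0 symbols; the payload itself is omitted):

```cpp
#include <cassert>

// Illustrative phase and return-status enums for the AT base protocol.
enum Phase { BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP };
enum SyncStatus { ACCEPTED, UPDATED, COMPLETED };

// A target that can execute a write immediately replies by updating the
// phase from BEGIN_REQ straight to BEGIN_RESP and returning UPDATED; the
// caller must then re-check the (in/out) phase argument.
SyncStatus target_nb_transport_fw(Phase& phase, bool can_execute_now) {
    if (phase == BEGIN_REQ && can_execute_now) {
        phase = BEGIN_RESP;   // skip the explicit END_REQ phase
        return UPDATED;
    }
    if (phase == BEGIN_REQ) return ACCEPTED;  // END_REQ will come later,
                                              // on the backward path
    return COMPLETED;         // END_RESP: the transaction is over
}
```

This is the mechanism by which a four-phase exchange collapses into two phases when the target needs no extra time.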
Figure 1.12: TLM Transaction Flow [28]. The generic payload object is created by the Initiator but is only referenced by interconnection modules or targets. Socket arrows indicate how the information flows.
As previously mentioned, extensions can be added to the generic payload object for
routing purposes; they can be either global or instance specific, that is, each module can
add attributes to the transaction object and be the only one able to access them. This
work adds two extensions to the generic payload object: a global one for end-to-end
verification purposes, and an instance-specific one for router operation. The next chapter
shows more details about this.
Figure 1.13: TLM Base Protocol Phases [28]; the initiator is the vertical line on the left and the target the one on the right.
1.4 Open Core Protocol
The Open Core Protocol International Partnership (OCP-IP) is a community in charge of
“proliferating a common standard for intellectual property core interfaces, or sockets that
facilitate “plug and play” System-on-Chip design” [10]. Their specification for interconnecting
modules is a bus model as complete as ARM’s AMBA-AXI and can be fully
described with OSCI’s TLM 2.0.1 standard.
Because of the amount of detail the OCP has, a light version of it is used in this
work; all basic signals shown in Table 1.3 are used, and burst support is added.
The standard OCP burst extension requires 8 additional signals, of which all but MBurstLength
can be skipped; to see why, consider Table 1.4: MAtomicLength is used
when the length of the data is bigger than the word size, which is not the case here; MBurstPrecise
indicates that the length of the burst is known at the start of the transmission, as it always
is in our design; MBurstSeq specifies how the addresses of the burst are emitted, which in
this work are assumed to be incrementing; MBurstSingleReq implies that only one request
is made per burst transfer; MDataLast, MReqLast and SRespLast are unnecessary, as each
module keeps track of the number of data words transferred.
Table 1.3: Basic OCP Signals extracted from [10]. Signal MDataValid is skipped in our implementation. Width measured in bits.
Name Width Driver Function
Clk 1 varies OCP Clock
MAddr configurable master Transfer address
MCmd 3 master Transfer command
MData configurable master Write data
MDataValid 1 master Write data valid
MRespAccept 1 master Master accepts response
SCmdAccept 1 slave Slave accepts transfer
SData configurable slave Read data
SDataAccept 1 slave Slave accepts write data
SResp 2 slave Transfer response
Table 1.4: Burst OCP Signals [10]. Only MBurstLength is needed for this work’s NIC.
Name Width Driver Function
MAtomicLength configurable master Length of atomic burst.
MBurstLength configurable master Burst Length.
MBurstPrecise 1 master Burst length precise.
MBurstSeq 3 master Address sequence.
MBurstSingleReq 1 master Single request/multiple
data protocol
MDataLast 1 master Last data in burst.
MReqLast 1 master Last request in burst.
SRespLast configurable slave Last response in burst.
To better understand how transfers with the OCP protocol work, consider Figure 1.14;
only signal MRespAccept is missing from the diagram, yet the behaviour is practically the
same. Figure 1.15 shows a scenario for burst transfers; handshaking is carried out the same
way.
Figure 1.14: OCP Read Transaction [10]; signal behaviour when performing a read request: when the master issues the command it has to wait for SCmdAccept to assert before changing the MCmd line. After some time the slave indicates valid data on the SData bus by issuing a Data Valid command on the SResp line.
Figure 1.15: OCP Burst Write Transaction [10]; signals MBurstSeq and MBurstPrecise never change. Handshaking between master and slave is basically the same as in the previous non-burst example.
Chapter 2
NoC Implementation
Once the implementation details and design flow have been clarified, as in the previous
chapter, it is now possible to describe the router and the NIC at any level of abstraction.
Although most items regarding each structure are well defined, some aspects still lack
specification and will be analysed hereafter. The code of each description can be found in
the Appendix.
2.1 Flit and Message structure
In order to determine the structure of both the router and the NIC, it is necessary to define
the units they deal with: flits and messages. Messages are composed of
one or more flits, which are the units injected into the router network; because wormhole
routing is used, one of the flits must include information about the origin and destination
of the whole message; the NICs, however, require additional data fields to properly
implement end-to-end flow control. As a starting point, the explanations provided in Section
1.2.4 and the constraints mentioned in Section 2.3 are needed to define all the fields.
If more TCP-like control parameters are needed for high-level control, those parameters
must be set by the processing elements and transmitted to the NIC as ordinary data;
NICs only support the minimum number of control fields needed to ensure correct functionality.
The head flit structure is displayed in Figure 2.1, the message structure in Figure 2.2,
and the flit fields are explained in Table 2.1.
Figure 2.1: Head Flit Structure
Figure 2.2: Message Structure. Payload can be up to 64 bytes long.
Table 2.1: Flit fields explanation.
Field Use
Type Flits can be either: Head, Body, Tail or Single; single
flits are used to ask for data and for barrier operations.
Source X Flit’s origin X coordinate.
Source Y Flit’s origin Y coordinate.
Destination X Flit’s destination X coordinate.
Destination Y Flit’s destination Y coordinate.
Length Message length. Maximum 64 words.
Single Indicates whether flit is a single-flit transaction or not.
Message Number Message number stated by source module.
Broadcast States whether the message is broadcast or not.
BarrierID Stores a BarrierID according to the source.
ReadWrite If the message is single-flit, this bit is set when it is a barrier
write.
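One possible packing of these fields into a head flit is sketched below; the bit widths are assumptions for a small (e.g. 4x4) torus and do not reflect the thesis’ actual layout:

```cpp
#include <cassert>
#include <cstdint>

// The four flit types used by the platform.
enum FlitType : uint8_t { HEAD, BODY, TAIL, SINGLE };

// Illustrative bit-field packing of the head-flit fields of Table 2.1.
struct HeadFlit {
    FlitType type       : 2;  // Head, Body, Tail or Single
    uint8_t  src_x      : 2;  // 2 bits address a 4-column torus
    uint8_t  src_y      : 2;
    uint8_t  dst_x      : 2;
    uint8_t  dst_y      : 2;
    uint8_t  length     : 7;  // message length, up to 64 words
    uint8_t  msg_num    : 4;  // message number stated by the source
    bool     broadcast  : 1;  // broadcast message or not
    uint8_t  barrier_id : 4;  // BarrierID according to the source
    bool     read_write : 1;  // set on single-flit barrier writes
};
```

Widening the coordinate fields is all that is needed to scale the same layout to a larger torus.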
Through SystemC descriptions and simulations it was possible to establish the correct
behaviour of the platform. Now that the units needed by both router and NIC are defined,
their designs can be presented.
2.2 Router Architecture
Studies presented in Chapter 1 yielded the following conclusions regarding the router
implementation:
• Topology: Torus, displayed in Figure 2.3 (taken from [30]).
• Switching Policy: Wormhole packet switching.
• Flow Control Technique: Handshaking signals.
• Routing Algorithms: Deterministic XY and adaptive turn model.
Only two aspects of the router’s structure remain undefined: the arbitration technique
and the number of Virtual Channels. When two or more inputs attempt to use a
router output, a mechanism is needed to assign output control. Table 2.2
lists the usual solutions to this problem; most implementations listed in [2] use Round-Robin
or First Come - First Served techniques for best-effort routers, and priority approaches for
Guaranteed Traffic (GT) ones such as [19] and [14].
Specialized routers are required when GT services are to be provided, and just a few NoCs
like the ones cited have implemented them. Best-effort Round-Robin arbitration is used
in this work.
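A minimal round-robin arbiter of the kind adopted here can be sketched as follows (an illustration, not the thesis code):

```cpp
#include <cassert>
#include <vector>

// Round-robin arbiter: starting from the input after the last winner, grant
// the output to the first input with a pending request, so that all inputs
// are served equally over time.
struct RoundRobinArbiter {
    int last_grant = -1;

    // requests[i] is true when input i wants the output; returns the granted
    // input index, or -1 when nobody is requesting.
    int arbitrate(const std::vector<bool>& requests) {
        int n = static_cast<int>(requests.size());
        for (int k = 1; k <= n; ++k) {
            int i = (last_grant + k) % n;
            if (requests[i]) { last_grant = i; return i; }
        }
        return -1;
    }
};
```

Because the search starts just past the previous winner, no input can monopolize the output, which is the fairness property that makes round-robin attractive for best-effort routers.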
On the other hand, Virtual Channels (VCs) are buffers added to the router’s inputs
(or outputs) to alleviate congestion in the network; despite using the same physical
paths, the additional buffers decrease the probability of deadlock and improve performance,
as delayed messages can wait inside routers and still advance to their destinations. Area is
the main cost of adding Virtual Channels and is also one of the most critical issues in
Figure 2.3: Torus Topology NoC
Table 2.2: Router Arbitration Techniques [2].
Arbitration Technique Policy
Round Robin Output is assigned equally starting from the
first element.
First Come - First Served Output control is assigned in request order.
Priority Based All packets are assigned a priority and get
output control according to their importance.
Priority Based Round Robin Round Robin is implemented, but a priority
proportional to the frequency of usage is assigned.
embedded system design; because of that, an optimal placement and integration of buffers
is required. Figure 2.4 shows a router with input VCs which are, in principle, connected to
all possible outputs. Studies in [31] show that for unicast routing, having one VC per output
at each input can reduce area consumption significantly; with this result, the router and
VC integration can be seen in Figure 2.5.
Figure 2.4: Block diagram of a Router with Virtual Channels. Area constraints have to be considered to choose an appropriate number of buffers.
Now that all specifications related to the router’s behaviour are defined, a high-level
block diagram can be constructed; no major implementation details are shown, for it is
an abstraction of the real hardware and all functional blocks are described in software. Figure
2.6 shows the general block diagram that will be used to describe the TLM model of the
router.
Figure 2.5: Virtual Channel connections to the Router. A single VC per output is available at each input so as to decrease area consumption. Extracted from [31].
Figure 2.6: General Router Block Diagram. Four virtual channels at each input are placed to reach all possible outputs; no packets are routed back through the same input.
2.2.1 Router TLM Model
SystemC’s Transaction Level Modelling is a standard for decoupling communication from
computation in high-level designs; most mechanisms offered by the standard are easily
abstracted to bus models, because buses are the traditional way to interconnect Systems On
Chip (SoCs). As routers use different flow control techniques from traditional bus
systems, a different interpretation of the TLM 2.0.1 phases is required for the model
to stay faithful to the hardware. Table 2.3 explains the meaning of each phase for inter-router,
packet-based communication.
Table 2.3: TLM 2.0.1 Phases Interpretation for Routers.
Phase Flow Direction Meaning
BEGIN REQ Init. Router To Target Router Flit is being transmitted.
END REQ Target Router To Init. Router Flit is stored, can be erased
on initiator.
BEGIN RESP Target Router To Init. Router A new space is free. Can
send more flits.
END RESP Init. Router To Target Router Final reply.
Another addition to the TLM 2.0.1 base protocol described in the previous chapter
is the routing extensions; as mentioned before, extensions can be accessed locally or globally.
The proposed model uses both, for debugging and verification purposes; a local extension
is created for every transaction as it traverses a router, each router adding its own
instance to the transaction. It has the following fields:
(a) Port: Stores the number of the incoming port through which the transaction entered.
(b) Port VC: Stores the number of the outgoing port through which the transaction will
go out.
(c) TimesBlocked: Counter increased by one unit every time a transmission attempt of
the transaction is blocked. This allows recognizing deadlock situations.
The global extension is created by the initiator, can be accessed by all modules, and
adds the following information to the transaction object:
(a) MainInitiator: Stores the ID of the module that first issued the transaction.
(b) FinalTarget: Stores the ID of the module where the transaction is to be delivered.
(c) TransID: Records the transaction number for debugging purposes.
(d) FlitType: Stores the type of flit of the current transaction.
(e) TransCounter: Incremented every time a transaction passes through a router.
(f) TransPath: Array for storing the path the flit goes. Used for debugging.
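Both extensions could be sketched as plain structs (the field names follow the lists above; the binding to the actual TLM extension mechanism is omitted):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Global extension: created by the initiator, visible to all modules.
struct GlobalRoutingExtension {
    int main_initiator = -1;      // module that first issued the transaction
    int final_target = -1;        // module where it must be delivered
    uint32_t trans_id = 0;        // transaction number, for debugging
    int flit_type = 0;            // type of flit of the current transaction
    int trans_counter = 0;        // incremented at every router traversed
    std::vector<int> trans_path;  // routers visited, for debugging
};

// Instance-specific extension: added and read only by one router.
struct RouterInstanceExtension {
    int port = -1;           // incoming port of this router
    int port_vc = -1;        // outgoing port / virtual channel chosen
    int times_blocked = 0;   // blocked attempts: helps spotting deadlock
};
```

In the real model each router would attach its own RouterInstanceExtension to the payload, while the single GlobalRoutingExtension travels with the transaction end to end.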
At this point it is necessary to clarify that there are four types of flits: head flits, which
contain routing information; body flits, which carry the data itself; tail flits, which mark the
end of a packet (and may or may not contain data); and full flits, which are single-flit messages
used for (a) sending read requests from one core to another (end-to-end flow control) and
(b) single-flit writes used for barrier operations.
SystemC implementation of the router is composed of five functions that act on each
port:
I Non-blocking Transport Forward: A standard, mandatory function that receives
three parameters: a transaction pointer, a TLM phase argument and a time value
called delay. When a module wants to send a flit, it calls this function with those
parameters and BEGIN REQ as the phase argument; the delay is the time after
which the target has to react to the call. The function checks the type of flit
and the space availability, computes the output port, returns TLM ACCEPTED and tells the
simulator to execute Forward Payload Event Queue at the time indicated by the delay.
The detailed behaviour of this method is shown in Algorithm 2.1.
II Non-blocking Transport Backward: Also a mandatory function that receives
the same three parameters, but the correct phase arguments are either END REQ or BEGIN
RESP. On END REQ, the method Backward Payload Event Queue is notified
for execution after the delay time; on BEGIN RESP, the method Transaction
Update is notified.
III Forward Payload Event Queue: Function invoked by nb transport forward; it
takes the transaction object, stores it in the corresponding Virtual Channel and notifies
the Transaction Update method to execute after an internal delay. It also
returns phase END REQ to the initiator to acknowledge the correct storage of
the transaction.
IV Backward Payload Event Queue: Function invoked by nb transport backward, in
charge of double-checking that the transaction is correct. It notifies the Transaction
Update method for immediate execution.
V Transaction Update: Considered the brain of the router; it starts transactions previously
stored in the VCs, deletes transactions already sent, notifies modules of newly
available spaces, and implements output arbitration. Algorithm 2.2 describes
the method thoroughly.
2.2.2 Traffic Evaluation and Routing Algorithm Testing
MPSoC platforms are generic systems that can implement any algorithm, and their inter-module
traffic can only be known once task partitioning is done; because it is uncertain which
application will be executed on such platforms, it is necessary to test synthetic traffic
patterns on the chip to establish its performance under random circumstances. There
Algorithm 2.1 Non-blocking Transport Forward.
Require: Transaction object, phase, delay
1: if phase = BEGIN REQ then
2: if (FlitType = Head) or (FlitType = Full) then
3: OutPort = Value returned by Routing Algorithm.
4: if (VC Empty) then
5: Reserve Virtual Channel
6: Set response status to TLM OK RESPONSE
7: if (OutPort Free) then
8: Take control of OutPort
9: end if
10: else
11: Set response status to TLM GENERIC ERROR RESPONSE
12: Return TLM ACCEPTED
13: end if
14: Notify Forward Payload Event Queue to execute after delay time
15: Decrease VC Space
16: Return TLM ACCEPTED
17: else if (VC has space) then
18: Set response status to TLM OK RESPONSE
19: Decrease VC Space
20: Notify Forward Payload Event Queue to execute after delay time
21: Return TLM ACCEPTED
22: else
23: Set response status to TLM GENERIC ERROR RESPONSE
24: Return TLM ACCEPTED
25: end if
26: else if phase = END RESP then
27: Return TLM ACCEPTED
28: else
29: Abort Execution
30: end if
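For one input port, Algorithm 2.1 condenses to roughly the following C++ (a sketch: the routing call and the event-queue notifications are reduced to comments and flags, and all names are illustrative):

```cpp
#include <cassert>

enum Phase { BEGIN_REQ, END_RESP, OTHER_PHASE };
enum Status { ACCEPTED, ABORT };
enum FlitKind { HEAD, BODY, TAIL, FULL };

struct RouterPort {
    bool vc_empty = true;
    int  vc_space = 4;         // assumed virtual-channel depth
    bool out_port_free = true;
    bool ok_response = false;  // stands in for the TLM response status
    bool queued = false;       // Forward Payload Event Queue notified

    // Condensed version of Algorithm 2.1.
    Status nb_transport_fw(Phase phase, FlitKind kind) {
        if (phase == END_RESP) return ACCEPTED;
        if (phase != BEGIN_REQ) return ABORT;          // abort execution
        if (kind == HEAD || kind == FULL) {
            // the routing algorithm would compute the output port here
            if (!vc_empty) { ok_response = false; return ACCEPTED; }
            vc_empty = false;                          // reserve the VC
            ok_response = true;
            if (out_port_free) out_port_free = false;  // take the output
        } else {                                       // BODY or TAIL
            if (vc_space == 0) { ok_response = false; return ACCEPTED; }
            ok_response = true;
        }
        queued = true;  // notify Forward Payload Event Queue after delay
        --vc_space;
        return ACCEPTED;
    }
};
```

Note that, as in the algorithm, the call always returns TLM ACCEPTED on a valid phase; success or failure is conveyed through the response status, not the return value.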
Algorithm 2.2 Transaction Update Method Implemented by Routers
Require: Virtual Channel to Update
Require: InputPort, OutputPort
1: if A transaction on VC can be started then
2: Call non-blocking forward method on the next module with phase BEGIN REQ.
3: end if
4: for i = 0 to V CSize do
5: if A Transaction can be freed then
6: Delete transaction.
7: Increase VC space.
8: Call non-blocking backward method on the previous module with phase BEGIN RESP
to indicate that a new space is available.
9: if Transaction is type “Tail” then
10: Free Virtual Channel.
11: Stop controlling Output Port.
12: end if
13: end if
14: end for
15: if Output Port is not busy then
16: for i = 0 to Number of Router Inputs do
17: NewInput = (InputPort + i + 1) mod Number of Router Inputs
18: if NewInput is ready to use OutputPort then
19: Give NewInput control of OutputPort.
20: Execute again from the start.
21: end if
22: end for
23: end if
are a few typical tests conducted on NoC designs that help assess routing algorithm
performance:
Uniform Traffic
Nodes communicate with each other with the same probability.
Matrix Transpose Traffic
Each node sends messages only to the node whose address is its own address with
the upper and lower halves swapped.
Hotspot Traffic
Each node sends messages to other nodes with an equal probability except for a
specific node (called Hotspot) which receives messages with a greater probability.
The percentage of additional messages that a Hotspot node receives compared to the
other nodes is indicated after the Hotspot name, e.g. Hotspot 15%.
Complement Traffic
Each node sends messages only to a node corresponding to the one’s complement of
its own address.
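For reference, the destination computations for Matrix Transpose and Complement traffic reduce to simple bit manipulations; the sketch below assumes a 16-node network with 4-bit node addresses:

```cpp
#include <cassert>
#include <cstdint>

// Matrix transpose: swap the upper and lower halves of the 4-bit address.
uint8_t transpose_dest(uint8_t node) {
    return static_cast<uint8_t>(((node & 0x3u) << 2) | (node >> 2));
}

// Complement: one's complement of the address within 4 bits.
uint8_t complement_dest(uint8_t node) {
    return static_cast<uint8_t>(~node & 0xFu);
}
```

Uniform and Hotspot traffic, in contrast, pick destinations with a random-number generator, biased toward the hotspot node in the latter case.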
Several scenarios were tested under some of these traffic conditions, and three
routing algorithms were implemented: West-First (adaptive), North-Last (adaptive) and
XY (deterministic). Additionally, an aspect not studied so far, VC depth, was
also considered; the results can be seen in the following figures.
Figures 2.7 and 2.8 show link utilisation under Hotspot 10% (on node 7) traffic conditions;
two groups of figures are provided, as all links transmit information in both directions
(up-down or right-left); for a better discrimination of link congestion, the plots were drawn
separately. XY routing in Figure 2.7(a) has 4 high-traffic links (higher bars); West First
in Figure 2.7(b) presents only two congested links, and North Last in Figure 2.7(c) just
one. In the other direction, Figure 2.8(a) shows XY with 2 congested links, Figure 2.8(b)
presents West-First behaviour with 1 high-traffic link, and Figure 2.8(c) has 2 congested
links with North-Last routing.
From the previous figures, West-First and North-Last routing apparently spread traffic
better along the network, as they have only 3 high-traffic links in their corresponding
graphs. No turnaround links show significant utilisation, despite the adaptiveness of those
algorithms.
In order to check the overall behaviour of all routing algorithms under this traffic pat-
tern, plots for average flit latency and total simulation time are shown in Figure 2.9. From
2.9(a) it can be seen that the more Virtual Channels there are, the higher the flit latency
on the network; that is because NICs can inject more packets into the network at any given
time. The adaptive algorithms show lower values than XY's, indicating that information
is forwarded faster with them. Figure 2.9(b) gives more information about routing perfor-
mance: for large messages, XY and North-Last perform better than West-First regardless
of the Virtual Channel depth, while for shorter transmissions West-First decreases simula-
tion time. In general, the results for this traffic pattern are very close to each other and a
deeper analysis might be needed to make a routing decision.
A second pattern was studied under the same conditions as before; Matrix Transpose
traffic was implemented and the results are shown in Figures 2.10, 2.11 and 2.12. From
graphs 2.10(a) and 2.11(a), 8 congested links can be distinguished when using XY routing;
West-First, shown in 2.10(b) and 2.11(b), presents only 2 high-traffic links, as does
North-Last routing in 2.10(c) and 2.11(c). Although, as before, the adaptive algorithms
attempt to distribute traffic better throughout all available paths, the results in 2.12 demon-
strate that long messages with low Virtual Channel depth reach their destination faster
with XY routing, which also has the lowest flit latency; medium-size messages are better
suited to West-First routing under this pattern.
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.7: Link Utilisation for Hotspot 10 %. Traffic going Right - Down
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.8: Link Utilisation for Hotspot 10%. Traffic going Left - Up
(a) Average Flit Latency
(b) Total Simulation Time
Figure 2.9: Timing statistics for Hotspot 10% traffic. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.10: Link Utilisation for Matrix Transpose Traffic going Right - Down
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.11: Link Utilisation for Matrix Transpose Traffic going Left - Up
(a) Average Flit Latency
(b) Total Simulation Time
Figure 2.12: Timing statistics for Matrix Transpose traffic. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.
Although more traffic patterns could have been evaluated, for the scope of this study
the above is enough to show the capabilities of the constructed model. Each of the
timing graphs (3D plots) considered 8 message sizes and 8 Virtual Channel depths, that is,
64 simulations in total. Each run used 100 messages on each of the 16 nodes, for a total of
1600 messages, which amounted to between 6400 and 51200 flits transported along the NoC.
2.2.3 Router VHDL Model
Once the router's behaviour was validated with the high-level model presented, a detailed
HDL design was implemented; three main blocks compose this design: Input Port, Output
Port and Multiplexers. Input ports are in charge of data-reception control, routing and
Virtual Channel storage; output ports control data transmission and round-robin channel
arbitration; multiplexers interconnect all input buffers with the router's outputs.
Because the flit size is 34 bits, there are 34 lines for data transmission and 34
for data reception; additionally, two lines are used for handshaking transmission control, Tx and
Tx_Ack, and two lines for reception control, Rx and Rx_Ack. In summary, each port has 36
inputs and 36 outputs. Figure 2.13 shows the router's black box.
The Input Port module is composed of 4 FIFOs, a routing unit, one multiplexer and one
de-multiplexer; deterministic XY routing was chosen for the state-machine implementation.
Flow control was implemented according to the flow-control studies presented earlier,
where handshaking signals were selected. The control module receives incoming requests from
external modules through the Rx input; it sends a request signal to the routing unit, which
replies when the output has been computed; after getting a response from the routing
module, the flit is stored at the corresponding queue, if space is available. When no space
is available, the Rx_Ack line remains de-asserted.
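The reception rule above can be condensed into a few lines. The sketch below is a behavioural illustration only (the real design is a VHDL state machine): a flit is accepted, and Rx_Ack asserted, only when the routing unit has answered and the selected virtual-channel queue has room; the function name and the use of `std::queue` as the FIFO are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>

// The VHDL design later uses 16-flit FIFOs for the Virtual Channels.
const std::size_t kVcDepth = 16;

// Returns true when the flit is stored (Rx_Ack asserted), false when
// the routing result is not ready yet or the queue is full
// (Rx_Ack stays de-asserted and the sender must retry).
bool try_accept(std::queue<std::uint64_t>& vc, std::uint64_t flit,
                bool route_ready) {
    if (!route_ready || vc.size() >= kVcDepth)
        return false;              // Rx_Ack stays de-asserted
    vc.push(flit);                 // store at the selected queue
    return true;                   // Rx_Ack asserted
}
```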
The routing module was designed with a state machine that uses external comparators
to determine whether the coordinates of the destination are larger or smaller than the
router's. Depending on the results of all comparisons, an output is computed and stored into
Figure 2.13: VHDL Router Black Box. Five ports are needed for torus or (most routers in) mesh configurations.
an internal register; a block diagram of this module is presented in Figure 2.14, and that
of the whole Input Port is shown in Figure 2.15.
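The comparator-driven decision of the XY routing module can be sketched in software. This is a behavioural illustration under stated assumptions: the port encoding is hypothetical, Y is assumed to grow downward, and the torus turnaround-link selection (choosing the shorter wrap-around direction) is omitted for brevity.

```cpp
#include <cassert>

// Hypothetical port encoding for the 5-port router.
enum Port { LOCAL, NORTH, SOUTH, EAST, WEST };

// Deterministic XY routing: correct the X coordinate first; only
// when dest_x == my_x is the Y coordinate considered. These are the
// same comparator results the VHDL state machine consumes.
Port xy_route(int my_x, int my_y, int dest_x, int dest_y) {
    if (dest_x > my_x) return EAST;
    if (dest_x < my_x) return WEST;
    if (dest_y > my_y) return SOUTH;   // assuming Y grows downward
    if (dest_y < my_y) return NORTH;
    return LOCAL;                      // flit reached its destination
}
```

Note that restricting this function so it never selects a turnaround link is exactly the torus-to-mesh change discussed in the concluding chapter.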
The Output Port module is composed of a control unit in charge of arbitrating outputs and
negotiating data transmission with another router or NIC; two de-multiplexers, a flit
decoder and a multiplexer are also included in this block. The flit decoder is in charge of
notifying the control unit when a tail or single flit has been transmitted, so that the
output can be assigned to another input. A block diagram of this block is shown in Figure 2.16.
Finally, the full block diagram of the router is shown in Figure 2.17. The VHDL code is
attached at the end, in the Appendix section. Due to the number of inputs/outputs of
this module, full implementation may only be possible on an ASIC; however, FPGA synthesis
allowed us to gather information about area consumption. The study in [32] lists
statistics about the number of slices consumed on a Virtex-II FPGA, where a 5-input/output
router consumed 397 slices; likewise, [8] reports a consumption of 1762 CLBs on a Virtex-II-
8000 FPGA.
Figure 2.14: VHDL Block Diagram for the XY Routing Module.
Figure 2.15: VHDL Block Diagram for the Input Port Module. Four FIFOs are needed to route information to each output port.
Figure 2.16: VHDL Block Diagram for Output Port Module.
Our router model was synthesised on a Virtex-5. Virtual Channels were generated as
FIFO memories with Xilinx's IP Core Generator, with 16-flit depth. Resource utilisation
is shown in Table 2.4. Because low-level detailed design was not the objective of this
work, HDL simulations are skipped in this document, yet the VHDL code is attached in the
Appendices section.
It is important to note that the FPGA pin-out was not enough for synthesizing a single
router (note the 103% IOB utilisation in Table 2.4); however, if several of these modules
are embedded together, a small network of them can be constructed, and the chip could be
connected to external processors through the FPGA outputs.
2.3 Network Interface Card Architecture
The Network Interface design is intended to support and validate message-passing transactions,
which are composed of two tasks for communication, send and receive, and one for syn-
Figure 2.17: VHDL Block Diagram for the Router. The multiplexers shown in the diagram are the same as those in Figure 2.16 for data selection.
Table 2.4: Router Area Consumption on Virtex 5 (XC5VFX30T-1FF665)

Logic Utilisation                      Used   Available   Utilisation
Number of Slice Registers               730       20480        3 %
Number of Slice LUTs                    846       20480        4 %
Number of fully used LUT-FF pairs       230        1346       17 %
Number of bonded IOBs                   372         360      103 %
Number of Block RAM/FIFOs                10          68       14 %
Number of BUFG/BUFGCTRLs                  1          32        3 %
chronization, called barrier. These functions were taken from the MPI standard and suffice
for the functionality required.
2.3.1 Network Interface TLM Model
The SystemC TLM model of the NIC has one target socket for receiving the core's transactions,
one initiator socket for sending data to the local router and another target socket to get
data from it; for each target socket there is a corresponding nb_transport_fw function,
and for the initiator socket an nb_transport_bw method is provided.
On sockets connected to routers, TLM phases are interpreted the same way as stated
in Table 2.3; however, on sockets connected to processing elements (end-modules), phases
are considered as specified by the standard in Section 1.3.
In order for the system to react at the appropriate time (because of transaction delays),
there are three payload event queues, one linked to each nb_transport function. Other
methods are in charge of standard operations such as storing data in the send or receive
buffers, arbitrating output control, replying to processing elements, etc. A list of all the
NIC methods' functionality is presented next.
I. BuildHeadFlit: Method in charge of creating a transaction’s header flit. It stores
the message number, flit type, and the initiator and target addresses in a single word.
II. GetHeaderInfo: In charge of extracting all header information from a head flit.
III. CheckIfExpectedTransaction: Method invoked when a new request arrives; it
establishes whether the request corresponds to a write transaction started by the
local processor. If the transaction doesn't match the expected one, it is stored
in an incoming-requests buffer.
IV. StoreAtSendBuffer: Function used for storing flits in the send buffer (for a write
transaction) or in the send-request buffer (for a read transaction).
V. StoreAtReceiveBuffer: In charge of storing flits in the receive buffer (for a read
transaction) or in the receive-request buffer (for a write transaction) when the request
doesn't match the processor's request.
VI. RESPONSE_TransactionUpdate: Dynamic-event-triggered method used for sending
phase BEGIN_RESP back to the router when a tail flit has been received; it also
sends BEGIN_RESP to the local processor to indicate that data is ready to be
transmitted.
VII. REQUEST_TransactionUpdate: Method invoked to send a new write request
when the timer has expired.
VIII. RRESPONSE_TransactionUpdate: In charge of returning phase BEGIN_RESP
back to the router when a read request has been received.
IX. SEND_TransactionUpdate: Acts as a central control unit for the NIC module;
this method checks whether the IncomingRequestQueue has a valid transaction that
matches the one specified by the processor and, if so, grants output access to the send
queue. It then sends the first flit in the send buffer and updates debugging
information; afterwards, it frees already-transmitted flits from the queue and checks if
it was a tail flit and, if so, releases the output port. More tasks are performed by this
function; they are better described in Algorithm 2.3.
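The single-word header handled by BuildHeadFlit and GetHeaderInfo (items I and II) can be sketched as a bit-packing routine. The thesis states that the 34-bit head flit carries the flit type, the message number and the initiator/target addresses; the exact field widths below are assumptions for illustration only.

```cpp
#include <cstdint>

// Assumed layout of the 34-bit head flit:
//   [33:32] flit type (e.g. 00 head, 01 body, 10 tail, 11 single)
//   [31:16] message number
//   [15: 8] initiator address
//   [ 7: 0] target address
using Flit = std::uint64_t;

Flit build_head_flit(unsigned type, unsigned msg,
                     unsigned src, unsigned dst) {
    return (Flit(type & 0x3) << 32) | (Flit(msg & 0xFFFF) << 16)
         | (Flit(src & 0xFF) << 8)  |  Flit(dst & 0xFF);
}

// GetHeaderInfo counterpart: unpack the same fields from a head flit.
void get_header_info(Flit f, unsigned& type, unsigned& msg,
                     unsigned& src, unsigned& dst) {
    type = unsigned(f >> 32) & 0x3;
    msg  = unsigned(f >> 16) & 0xFFFF;
    src  = unsigned(f >> 8)  & 0xFF;
    dst  = unsigned(f)       & 0xFF;
}
```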
Apart from reading and writing, all cores are capable of executing barrier operations for
synchronization. Depending on the core's ID, a barrier is implemented differently: there
must always be one master core and one or more slave cores; the master waits for the slaves
to send a barrier message and, once everyone's has arrived, issues a command for all of them
to resume executing their tasks. Because it is necessary to address all nodes when issuing
barrier transactions from the master core, and because routers are unaware of this,
NICs were designed to support a broadcast command that sends the same data to every
node. This functionality is also useful when processors need to share information stored at
one of them; however, a node will not start transferring data until it gets request flits
from all the others. This approach might prove useful in some scenarios but can
also decrease overall performance in others.
In order to improve performance and reduce processor computation, the NIC implements
barrier operations as follows:
Slave cores : Send a normal write-request transaction to the master core and expect a
one-flit write.
Master core : Builds a single-flit write transaction and stores it in the send buffer; when
requests from all modules have been received, it sends that flit to all the modules. When all
flits have been transmitted, the NIC replies back to the core.
The mechanical computation implied by the barrier function is done at the NIC so that the
core can perform other operations; the cost is an increase in area consumption.
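The master-side bookkeeping just described amounts to counting slave arrivals and releasing everyone with a single broadcast flit. A minimal behavioural sketch (class and method names are illustrative, not taken from the VHDL/SystemC code):

```cpp
// Sketch of the master NIC's barrier bookkeeping: count incoming
// barrier requests from the slave nodes and report when the release
// broadcast (the stored single-flit write) may be sent.
class BarrierMaster {
public:
    explicit BarrierMaster(int slaves) : expected_(slaves), arrived_(0) {}

    // Called when a slave's barrier write-request reaches the NIC.
    // Returns true when every slave has checked in (release time).
    bool slave_arrived() {
        ++arrived_;
        if (arrived_ < expected_) return false;
        arrived_ = 0;              // re-arm for the next barrier
        return true;               // broadcast the single release flit
    }

private:
    int expected_;  // number of slave cores
    int arrived_;   // requests received so far in this barrier epoch
};
```

Keeping this counter in the NIC rather than in software is precisely what frees the master core for other work, at the cost of the extra area noted above.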
TLM phases for read operations can be seen in Figure 2.18. These transactions take
a long time to complete because, once the NIC is notified of a read transaction, it sends
a request-data flit to the appropriate module and has to wait for the information to come;
Algorithm 2.3 Transaction Update pseudo-algorithm implemented by NICs.
1: if Write Pending and Not Read In-progress then
2: Check Request queue.
3: if Transaction Requested is expected then
4: Give Output control to Send Queue
5: end if
6: end if
7: if Send Queue controls Output then
8: Send first flit on Queue
9: if Flit accepted then
10: Mark Flit as accepted.
11: Notify method for later execution to delete Flit.
12: end if
13: if Write is Unicast then
14: Delete transmitted flits in Send Queue.
15: end if
16: if Write is broadcast then
17: if Write is Burst and Burst Completed then
18: Create new Head Flit.
19: Notify method to later start transmission to the next node.
20: Reset Transmission counters
21: end if
22: end if
23: if Write is Burst and Not all data packed then
24: Store next flit at Send Queue
25: end if
26: end if
27: if All data is transmitted then
28: Send phase BEGIN RESP back to Initiator to release transactions.
29: end if
it is only after getting all packets that the processing element is notified of the data
availability and the transaction is concluded. On the other hand, write transactions between
the NIC and the processing element can be finalised faster. A phase diagram for write
transactions can be seen in Figure 2.19.
Figure 2.18: TLM Phases in a NIC Read Operation. CPUs ask for data; NICs send a request to the corresponding module and wait for data to arrive. After all information is received, phase BEGIN_RESP is issued to the CPU to indicate the end of the transaction.
Figure 2.19: TLM Phases in a NIC Write Operation. Processing elements send all data to the NICs and finalize the transaction after transmitting all the information. NICs await a read request and send packets when the corresponding one is received.
2.3.2 Network Interface Hardware Design
The Network Interface design was extrapolated from the SystemC high-level description;
high HDL complexity was found in this module, as it has to implement part of the router's
functionality, solve end-to-end flow control and communicate with the processing element
through the OCP-IP bus model. Several control units were necessary for this design to
support all the features implemented in SystemC and listed in the previous section; because
of space constraints, only a general block diagram of the overall module is shown in Figure
2.20; control signal paths are shown in red and data paths in yellow.
To better understand the figure, a correspondence between the TLM 2.0 model and the
VHDL one is presented in Table 2.5; although the equivalence is not exact, it matches
the main aspects. The functions shown in the table are also described in the previous section.
One of the most complex modules of the NIC was the OCP-Handshaking Control, which
required careful design in order to support transactions and respect their execution
order; from the diagram in Figure 2.20 it can be seen that another control unit (End-to-End
Flow Control) was necessary. State diagrams of both are shown in Figures 2.21 and 2.22.
When a new transaction is started from the processing element, the handshaking control
verifies whether it is possible to initiate it internally; if so, an appropriate header
flit is stored in the corresponding queue and the information is packed (for write transactions).
The end-to-end flow control is notified of the operation in progress and commands the
transmission and reception units to carry on with the transaction: if a read is to be
performed, a request flit is sent to the corresponding module; if a write is requested,
the reception control must report when a request matching the write address is received.
The VHDL implementation of the NIC is left for future work, as it does not constitute a
common test metric in the NoC field.
Table 2.5: VHDL-SystemC equivalence of NIC blocks

VHDL Block                 TLM Methods/Objects                   Function
OCP-Handshaking Control    nb_transport_fw,                      Transfer data from (to)
                           RESPONSE_TransactionUpdate            processing elements
End-to-End Flow Control    Target Payload Event Queue,           Execute transactions tidily
                           CheckIfExpectedTransaction
Tx Control                 nb_fw_router,                         Initiate transactions with
                           SEND_TransactionUpdate                the router
Rx Control                 nb_transport_bw_router,               Receive transactions from
                           StoreAtReceiveBuffer                  the router
FIFO DataIN and Requests   Double-ended queue                    Store read data and
                                                                 incoming requests
Bank DataOUT               Double-ended queue                    Store output data and
                                                                 read requests
Rest of Blocks             BuildHeadFlit, GetHeaderInfo          Set and retrieve head-flit
                                                                 information
Figure 2.20: VHDL Block Diagram for Network Interface Card
Figure 2.21: State Machine for the Handshaking Control
Figure 2.22: State Machine for the End-to-End Flow Control
2.4 Software Performance Results
After validating both the router and NIC TLM models, software applications were pro-
grammed to analyse performance results on the whole NoC. Matrix multiplication was
implemented for its straightforward parallelization; the previous performance graphs could
also be obtained but are not shown due to space constraints.
2.4.1 4× 4 Matrix Multiplication
The first test scenario was a 4 × 4 matrix multiplication split across 16 cores (1 master, 15
slaves), where each one performed a row-column product and returned its result to the
master. MPI directives such as MPI_Send, MPI_Receive and MPI_Broadcast were used for
data sharing between modules.
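The partitioning used above assigns one row-column product per node. A host-side sketch of that split (the MPI-like transport is abstracted away; the mapping of node k to element C[k/4][k%4] is an illustrative assumption consistent with one product per core):

```cpp
const int N = 4;  // 4 x 4 matrices, 16 products, one per node

// The value node k would compute and send back to the master:
// element C[k/N][k%N] of the product C = A * B.
int row_col_product(const int A[N][N], const int B[N][N], int k) {
    int row = k / N, col = k % N, acc = 0;
    for (int i = 0; i < N; ++i)
        acc += A[row][i] * B[i][col];
    return acc;
}
```

In the simulated platform, the master broadcasts A and B, each node evaluates its single product, and 16 one-element results travel back, so network traffic rather than arithmetic dominates the run time.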
Figure 2.23 shows the full Network-on-Chip performance regarding operation time
for the three routing algorithms and several Virtual Channel depths. In concordance with
previous simulations, XY routing had the worst behaviour while West-First had the best;
increasing buffer storage produced an exponential-like decrease in timing measures.
Figure 2.23: NoC Performance for a 4× 4 Matrix Multiplication
2.4.2 8× 8 Matrix Multiplication
In order to increase data transfer, a second test was performed with an 8 × 8 matrix
multiplication. Each core (including the master) computed 4 row-column products and
sent its result back to the master; the platform behaviour is shown in Figure 2.24. Compared
to the first scenario, the decrease is not exponential but linear with both adaptive routing
algorithms, and it does not improve with buffer increases for XY routing.
Because of the way MPI_Broadcast was implemented, that is, if the master core is N,
data is sent first to node N + 1 and so on, cores are expected to finish their operation
in the same order; this was actually verified and is presented in Figures 2.25 and
2.26.
Results obtained from the SystemC Network-on-Chip model demonstrate the advan-
tages of having a high-level abstraction of the hardware platform and the richness of the statistics
Figure 2.24: NoC Performance for a 8× 8 Matrix Multiplication
Figure 2.25: Total simulation time at each node with North Last Routing
Figure 2.26: Total simulation time at each node with West First Routing
extracted from it. Further studies and better tuning of the NoC can lead to concise and
robust design decisions when continuing with the design flow.
Chapter 3
Concluding Remarks
A co-design methodology was successfully validated with the adoption of the new IEEE
standard for high-level modelling, SystemC; it was possible to specify both hardware and
software constraints from the start, and therefore a clearer and more concrete approach was
possible. As in any design, unspecified conditions or unexpected behaviour were encoun-
tered throughout the design process, yet correction and validation were much faster with
the virtual platform constructed.
Contrary to traditional hardware design, a virtual, portable platform of the real hard-
ware is available for quick software development and testing, and it only requires a C++
compiler to run; no specialized or licensed software is needed to start writing applications
for this system, which makes it versatile. If more statistics or performance metrics about
hardware or software behaviour are required, it suffices to add a few lines to the libraries
provided; code that does not make calls to the simulator will not change or modify the
platform statistics.
Ideas borrowed from software theories for distributed programming were extrapolated
to create a hardware platform capable of running MPI-like applications; despite being an
old standard, most of its premises are used nowadays for non-shared-memory architectures
and are still under development. Thanks to this approach, an end-to-end flow-control
technique was proposed to reduce undesired traffic inside the network and to avoid
wasting processing time in end-modules, thereby improving performance.
The traffic simulations carried out showed that, with the proposed NoC structure, 4x4 torus
topologies have close performance metrics with adaptive algorithms as well as with deter-
ministic ones; two turn-model [23] algorithms for mesh topologies were implemented and
adapted to a torus, but no significant difference between them was found under the
traffic patterns implemented. Real software applications are needed to select a suitable
routing algorithm that matches the required performance.
Most design decisions presented in Chapter 1 were taken to cope with the desired func-
tionality but also to reduce hardware complexity (area constraints); because of that, many
control tasks were left for high-level implementation, either at the NIC or at the processor
level. Contrary to the studies shown in [31] and others, which allow multicast-capable routers,
the model developed here can only process unicast transactions at low level (routers) or
broadcast transactions at a higher level (NIC). If multicast support is required, additional
logic can be integrated inside the NICs; router modifications are far more complex but can
also be added.
Implementation statistics of the VHDL router model suggest that it is possible to syn-
thesize more than one router on FPGAs, though they still consume a large area, especially
RAM blocks; reducing the Virtual Channel depth would allow including more routers on the
same chip so that a real, considerably sized NoC can be implemented.
3.1 Significance of the Result
The intent of this work was mainly to develop and validate a high-level model of a NoC to
allow the implementation of parallel algorithms targeted at this platform; although pure C++
code can be used for software development, there are other open-source tools on the market
that can also be integrated with the libraries created here. Among the most common,
Open Virtual Platforms (OVP) from Imperas is a collection of high-level descriptions of
processors and hardware modules that can be easily integrated with TLM models such as
the one created; more detailed, almost cycle-accurate simulations are possible with OVP
if needed.
Even though TLM modelling is becoming a common design practice, to the best of our
knowledge, no TLM 2.0 model of an entire NoC has been proposed; there are approaches
such as Noxim [33] that use SystemC to simulate a customizable mesh network of routers,
but they do not implement the TLM 2.0 standard. Another advantage of this platform is that
it provides both NICs and routers, which makes it an immediate option for embedded
software development.
Versatility and high reconfiguration capability are also important characteristics of
our NoC model. Default timing parameters can be changed with minor effort and can be
set to match real hardware constraints if needed. Apart from that, if the MPI approach
doesn't suit a design's specifications, a new NIC model can be integrated with the router
network, provided it follows the indications shown in Section 2.2.1. A final remarkable aspect
of this work is that more complex statistics, such as average message blocking time, most-used
paths, throughput, etc., can be extracted from the model with minor additions to the source code.
The router developed was designed for torus-topology NoCs and therefore has 5 in-
put/output ports; nonetheless, mesh topologies are straightforward to obtain by modifying
the routing algorithms: setting them to avoid sending data beyond the limits of the mesh, i.e.
impeding the use of turnaround links, will automatically change the network from torus to
mesh. If another topology like those presented in Section 1.2.1 is required, our base router
can always be taken as a starting point.
Finally, another contribution to the state of the art regarding NoC, and specifically NIC,
design is that, through an MPI abstraction, a hardware module was created for end-to-end
flow control; approaches using MPI as a software layer for NoC programming, such as [34],
do not synthesize it into the NIC module but implement it at a high level. Which approach is
better remains undetermined, and benchmarks need to be conducted to gain better insight
into this.
3.2 Future Work
From the beginning of this work, it was stated that when creating complex platforms such
as NoCs, it is necessary to be able to co-develop software applications to validate
their correct behaviour; here only a high-level TLM 2.0 model and a VHDL model of a NoC
were constructed, and software applications integrated with the SystemC description are
still lacking. Although typical traffic and load-balancing studies were applied, only real
software implementations can demonstrate the validity of the results shown.
Software engineering faced the problem of distributed computing years ago and has been
dealing with it for a long time; thanks to that, concepts such as shared or distributed
memory models are now well known and have been addressed through APIs like MPI or OpenMP.
Hardware engineers are starting to raise abstraction levels for embedded design and are
facing the same problems as software ones; this means that a deeper integration
between both branches can lead to better design strategies, such as the ones required for this
kind of platform. This work's objective was to implement MPI support; however, MPI is
a standard created more than 10 years ago, which leads one to think that far better solutions
are currently available but remain unknown to the hardware community.
A router VHDL model was provided along with this work, and a NIC design was mostly described as well; future work should include a hardware implementation of this design to verify its correctness at that level, and integration with 32-bit compatible processors would complete such verification.
References
[1] Salminen et al., “Survey of Network on Chip Proposals,” OCP-IP White Paper, OCP-IP, 2008.
[2] A. Agarwal, C. Iskander and R. Shankar, “Survey of Network on Chip Architectures
and Contributions,” Journal of Engineering, Computing and Architectures. Volume 3,
Issue 1, 2009.
[3] K. Popovici and A. Jerraya, “Virtual Platforms in System On Chip Design,” 47th Design Automation Conference. http://webadmin.dac.com/knowledgecenter/2010/documents/POPOVICI-VP-ABK-FINAL.pdf
[4] International Organization for Standardization, “Information technology - Open Systems Interconnection - Basic Reference Model: The Basic Model,” ISO/IEC 7498-1, Second Edition. http://standards.iso.org/ittf/licence.html
[5] T. Kogel, R. Leupers and H. Meyr, “Integrated System-Level Modeling of Network-On-
Chip enabled Multi-Processor Platforms,” Chapter 4: System Level Design Principles
Springer, 2007
[6] S. Chai, C. Wu, Y. Li and Z. Yang, “A NoC Simulation and Verification Platform
based on SystemC,” 2008 International Conference on Computer Science and Software
Engineering. IEEE Computer Society, 2008, pp. 423–426.
[7] D. Wiklund, S. Sathe and D. Liu, “Network on Chip Simulations for Benchmarking,” Linköping University, Sweden.
[8] P. T. Wolkotte, P. K.F. Holzenspies and G. J.N. Smit, “Fast Accurate and Detailed
NoC Simulations,” Proceedings of the first International Symposium on Networks on
Chip. IEEE Computer Society, 2007
[9] A. Portero, R. Pla and J. Carrabina, “SystemC Implementation of a NoC,” Escuela Técnica Superior de Ingenierías, España.
[10] OCP-IP Organization “Open Core Protocol Specifications,”
http://www.ocpip.org/the complete socket.php
[11] ARM, “AMBA 4 AXI4-lite and AXI4-Stream Protocol Assertions User Guide,”
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.amba/index.html
[12] OpenMP Architecture Review Board, “OpenMP Application Program Interface,” Ver-
sion 3.0, May 2008. http://www.openmp.org/mp-documents/spec30.pdf
[13] Message Passing Interface Forum “MPI: A Message-Passing Interface Standard,” Ver-
sion 2.2, September 2009 http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
[14] A. Adriahantenaina et al., “SPIN: a Scalable, Packet Switched, On-Chip Micro-network,” Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, IEEE Computer Society, 2003.
[15] W-H. Hu, S. E. Lee and N. Bagherzadeh, “DMesh: a Diagonally-Linked
Mesh Network-On-Chip Architecture,” University of California, Irvine.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.4454&rep=rep1&type=pdf
[16] A. Tavakkol, R. Moraveji and H. Sarbazi-Azad, “Mesh Connected Crossbars: A Novel NoC Topology with Scalable Communication Bandwidth,” 2008 International Symposium on Parallel and Distributed Processing with Applications, IEEE Computer Society, 2008, pp. 319 – 326.
[17] A. Zitouni, M. Zid, S. Badrouchi and R. Tourki, “A Generic and Extensible Spidergon
NoC,” World Academy of Science, Engineering and Technology 2007, pp. 14 – 19.
[18] M. Mirza-Aghatabar, S. Koohi, S. Hessabi and M. Pedram, “An Empirical Investi-
gation of Mesh and Torus NoC Topologies Under Different Routing Algorithms and
Traffic Models,” 10th Euromicro Conference on Digital System Design Architectures,
Methods and Tools DSD 2007 IEEE Computer Society, 2007.
[19] K. Goossens, J. Dielissen and A. Radulescu, “AEthereal Network on Chip: Concepts,
Architectures and Implementations,” IEEE Design & Test of Computers IEEE 2005,
pp. 414 – 421
[20] T. Bjerregaard and J. Sparso, “Implementation of Guaranteed Services in the MANGO Clockless Network-on-Chip,” IEE Proceedings - Computers and Digital Techniques, Vol. 153, No. 4, July 2006.
[21] A. Y. Seydim, “Wormhole Routing in Parallel Computers,” Southern Methodist Uni-
versity.
[22] L. M. Ni and P. K. McKinley, “A Survey of Wormhole Routing Techniques in Direct
Networks,” Michigan State University, COMPUTER 1993, pp. 62 – 76.
[23] C. J. Glass and L. M. Ni, “The Turn Model for Adaptive Routing,” 1992 ACM 0-
89791-509-7 1992, pp. 278 – 287.
[24] Z. Xiaohu, C. Yang and W. Liwei, “A Novel Routing Algorithm for Networks On
Chip,” 2007 IEEE 1-4244-1312-5 2007, pp. 1877 – 1879.
[25] E. Behrouzian-Nezhad and A. Khademzadeh, “BIOS: A New Efficient Routing Algo-
rithm for Network On Chip,” Contemporary Engineering Sciences Vol. 2, 2009. No 1.
pp. 37 – 46.
[26] W. Zhang et al, “Comparison Research between XY and Odd-Even Routing Algo-
rithm of a 2-Dimension 3x3 Mesh Topology Network On Chip,” Global Congress on
Intelligent Systems IEEE Computer Society 2009, pp. 329 – 333.
[27] T. Schonwald et al, “Fully Adaptive Fault-Tolerant Routing Algorithm for Network
on Chip Architectures,” 10th Euromicro Conference on Digital System Design Archi-
tectures, Methods and Tools IEEE Computer Society 2007.
[28] Open SystemC Initiative, “The Transaction Level Modelling standard of the Open
SystemC Initiative,” 2007 - 2009 OSCI Group.
[29] Open SystemC Initiative, “OSCI TLM-2.0 Language Reference Manual,” July 2009.
[30] S. V. Tota et al., “MEDEA: A Hybrid Shared-memory/Message-passing Multiprocessor NoC-based Architecture,” Design, Automation and Test in Europe (DATE), 2010.
[31] B. Yin, “Design and Implementation of a Wormhole Router Supporting Multicast for Networks On Chip,” Master of Science Thesis, Stockholm, Sweden, 2005.
[32] U. T. Ogras et al, “Challenges and promising results in NoC Prototyping using FP-
GAs,” IEEE MICRO 2007, IEEE Computer Society September - October 2007, pp. 86
– 95
[33] M. Palesi, D. Patti and F. Fazzino, “NOXIM, the NoC Simulator User Guide,” University of Catania, 2005 - 2010.
[34] J. Joven et al, “xENoC - An eXperimental Network-On-Chip Environment for Par-
allel Distributed Computing on NoC-based MPSoC architectures,” 16th Euromicro
Conference on Parallel, Distributed and Network-Based Processing. 2008, pp. 141 –
148