Design of a Network-On-Chip platform for MPSoCs using TLM 2.0 standard and FPGA implementation
Fernando Adolfo Escobar Juzga
Electric and Electronic Engineering Department
APPROVED:
Antonio García Rozo, Ph.D.
Mauricio Guerrero, MSc.
Alain Gauthier, Ph.D., Dean of Faculty
to my
MOTHER, FATHER and SISTERS
with love
Design of a Network-On-Chip platform for MPSoCs using TLM 2.0 standard and FPGA implementation
by
Fernando Adolfo Escobar Juzga
Thesis
Presented to the Academic Faculty of the Graduate School of
Universidad de los Andes, Bogotá
in Partial Fulfilment
of the Requirements
for the Degree of
Master Of Electronic Engineering
Electric and Electronic Engineering Department
Universidad de Los Andes
January 2011
Acknowledgements
I wish to thank my advisers Antonio García and Mauricio Guerrero for their guidance and
support throughout the project; it was their experience and knowledge that helped me
choose, and come to love, this research area years before. To my parents and sisters, who have
unconditionally supported me at all times and without whom I wouldn't have got here.
Additionally, I want to thank all my friends, who continuously inspire me and show me
how far one can go with hard work, dedication and passion.
This thesis wouldn't have been possible without the support of the OSCI TLM working
group and all its members. Finally, I want to thank the CMUA group for providing me with
the resources and tools that were required.
Abstract
Complex systems that integrate a great variety of modules on the same die require higher
level design techniques that yield accurate models suitable for testing hardware as
well as software at early stages; Multiprocessor Systems-on-Chip (MPSoCs) are scaling
to levels where it is possible to embed tens, and up to hundreds, of cores on the same chip.
Such architectures cannot be integrated with traditional bus structures, as these are not
scalable; as a solution, a new paradigm called Network on Chip (NoC) has gained
strength.
SystemC, an IEEE standard for electronic system level (ESL) design, is used here to build a
NoC functional model; to abstract hardware details and speed up simulations, the new
Transaction Level Modelling standard (TLM 2.0) is also adopted. Under different
design constraints, variables such as router and network interface architectures, routing
algorithms, message and flit size, etc., are evaluated.
At a final stage, a VHDL synthesis is performed and compared with other implementations.
The results show this design flow to be adequate and helpful for this kind of system, given
its size and complexity.
Table of Contents
Page
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Networks On Chip Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Parallel Computing Memory Model . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Networks On Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Physical Layer: Topology . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Data Link Layer: Flow Control . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Network Layer: Switching Policy and Routing Algorithm . . . . . . 10
1.2.4 Transport Layer: Network Interface Card . . . . . . . . . . . . . . . 13
1.3 SystemC and Transaction Level Modelling TLM 2.0 . . . . . . . . . . . . . 17
1.4 Open Core Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 NoC Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Flit and Message structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Router Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Router TLM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2 Traffic Evaluation and Routing Algorithm Testing . . . . . . . . . . 33
2.2.3 Router VHDL Model . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3 Network Interface Card Architecture . . . . . . . . . . . . . . . . . . . . . 47
2.3.1 Network Interface TLM Model . . . . . . . . . . . . . . . . . . . . . 49
2.3.2 Network Interface VHDL Model . . . . . . . . . . . . . . . . . . . . 55
2.4 Software Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.1 4× 4 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . 59
2.4.2 8× 8 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . 60
3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1 Significance of the Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
List of Tables
1.1 Flow Control Techniques for NoCs . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Generic Payload Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Basic OCP Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4 Burst OCP Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1 Flit fields explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Router Arbitration Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 TLM 2.0.1 Phases Interpretation for Routers . . . . . . . . . . . . . . . . . 31
2.4 Router Area Consumption on Virtex5 (XC5VFX30T-1FF665) . . . . . . . 49
2.5 VHDL-SystemC equivalence of NIC blocks . . . . . . . . . . . . . . . . . . 56
List of Figures
1 OSI Protocol Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Shared Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Distributed Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Common Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Ad-Hoc Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Packet Switching on NoCs . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Guidelines for selecting a Routing Algorithm . . . . . . . . . . . . . . . . . 12
1.7 Turn model for Adaptive Routing . . . . . . . . . . . . . . . . . . . . . . . 14
1.8 West First Routing Examples . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.9 North Last Routing Examples . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.10 Negative First Routing Examples . . . . . . . . . . . . . . . . . . . . . . . 15
1.11 Transaction Level Modelling Use Cases, Coding Styles and Mechanisms . . 18
1.12 TLM Transaction Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.13 TLM Base Protocol Phases . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.14 OCP Read Transaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.15 OCP Burst Write Transaction . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1 Head Flit Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Message Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Torus Topology NoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Block diagram of a Router with Virtual Channels . . . . . . . . . . . . . . 29
2.5 Virtual Channel connections to Router . . . . . . . . . . . . . . . . . . . . 30
2.6 General Router Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Link Utilisation for Hotspot 10 %. Traffic going Right - Down . . . . . . . 38
2.8 Link Utilisation for Hotspot 10 %. Traffic going Left Up . . . . . . . . . . 39
2.9 Timing statistics for Hotspot 10 %. Traffic . . . . . . . . . . . . . . . . . . 40
2.10 Link Utilisation for Matrix Transpose Traffic going Right - Down . . . . . 41
2.11 Link Utilisation for Matrix Transpose Traffic going Left Up . . . . . . . . . 42
2.12 Timing statistics for Matrix Transpose Traffic . . . . . . . . . . . . . . . . 43
2.13 VHDL Router Black Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.14 VHDL Block Diagram for the XY Routing Module . . . . . . . . . . . . . 46
2.15 VHDL Block Diagram for Input Port Module . . . . . . . . . . . . . . . . 46
2.16 VHDL Block Diagram for Output Port Module . . . . . . . . . . . . . . . 47
2.17 VHDL Block Diagram for the Router . . . . . . . . . . . . . . . . . . . . . 48
2.18 TLM Phases in a NIC Read Operation . . . . . . . . . . . . . . . . . . . . 53
2.19 TLM Phases in a NIC Write Operation . . . . . . . . . . . . . . . . . . . . 54
2.20 VHDL Block Diagram for Network Interface Card . . . . . . . . . . . . . . 57
2.21 State Machine for the Handshaking Control . . . . . . . . . . . . . . . . . 58
2.22 State Machine for the NIC End to End Flow Control . . . . . . . . . . . . 59
2.23 NoC Performance for a 4× 4 Matrix Multiplication . . . . . . . . . . . . . 60
2.24 NoC Performance for a 8× 8 Matrix Multiplication . . . . . . . . . . . . . 61
2.25 Total simulation time at each node with North Last Routing . . . . . . . . 61
2.26 Total simulation time at each node with West First Routing . . . . . . . . 62
Introduction
Multiprocessor systems on chip (MPSoCs) are becoming ubiquitous platforms in current
devices; large integration of modules has permitted the successful creation of multi-core
architectures (2 to 20 cores) in the past, and now provides the means and technology for
developing the so-called many-core ones (hundreds of processors). These platforms, however,
require better design practices in order for both hardware and software entities to be
ready for release on time.
One of the most recent proposals for creating complex embedded systems and many-core
platforms is known as Networks on Chip (NoC); instead of utilizing bus systems, interconnections
between components are made through routers and Network Interface Cards (NICs);
whilst routers are in charge of transporting data throughout the chip, NICs gather information
from/to end-modules (ports, memories, cores, etc.) and send it to the router network
for delivery. In spite of being hardware, due to its complexity, a NoC can be better
designed from higher levels of abstraction rather than traditional RTL/HDL; new languages
such as SystemC are more appropriate for this task, as they can be used to rapidly create
software representations of hardware (known as virtual platforms) [3].
SystemC is currently an IEEE standard for high level modelling with which C++ descriptions
of hardware platforms can be made; it includes a release called Transaction Level
Modelling (TLM 2.0) that was designed to speed up the development of embedded platforms,
as it omits unnecessary physical communication details such as clocks or pin-out
specifications; in addition to all C++ features, SystemC and TLM 2.0 provide libraries
that ease the emulation of the real platform with as much timing detail as needed. To this
date, however, most NoC designs are conceived from lower abstraction levels; according
to [1], simulation and synthesis are the most common ways to evaluate them; only a few
have been synthesised as ASICs and the rest are implemented on FPGAs. High level models
of NoCs have also been proposed: in [6] a 4x4 mesh network is built and simulated with
SystemC; [7] creates C++ libraries that make up a simulator for NoCs; reference [8] designs
a 6x6 mesh network and tests it with SystemC and VHDL before implementing it on an
FPGA; a basic, low-level SystemC description of a scalable NoC is presented in [9] and is
validated with an MPEG encoder. Many more proposals can be found in [1] and [2], yet
none of them use the TLM 2.0 standard for design.
As suggested by some of the previous references, and especially by [5], higher abstraction
levels are needed when constructing these architectures; to see this, consider the
specifications for network design defined by the OSI reference model [4]; although not all
of OSI's layers have a direct equivalence in NoCs, most of its principles can be extrapolated
to this field. In Figure 1, the OSI protocol stack is shown; for this work, the three upper
layers can be joined into a single one. From the figure, the dependency between hardware and
software is evident when reading it from top to bottom; although it is possible to test
each group separately, SystemC descriptions and a correct application of the TLM standard can
help co-design the NoC as a whole and iteratively improve it on both aspects.
Low level layers of the protocol define everything related to routers: architecture, routing
algorithms, flow control, switching techniques, etc.; higher ones determine the NIC's structure.
Because bus systems are the medium through which processors and most peripherals
transfer information, NICs use them to interchange data with end modules; given the
great variety of bus specifications, this work considers only the most common ones, that is,
AMBA from ARM [11] and OCP [10] from the OCP-IP group; the latter was selected for
its simplicity and strong support from the ESL community.
On the other hand, and as previously stated, the top layers of the OSI model can be
summarized into a greater group that refers to the software models necessary to access the network;
there are mainly two approaches in the field of multi-processor programming: shared or
distributed memory. As its name indicates, shared memory implies that processing units can
access the same physical or logical memory spaces at any time; a well-known API implementation
of this is called OpenMP [12], and it is led by a non-profit corporation, composed
of several companies and researchers, named the OpenMP ARB. The distributed model, on
Figure 1: OSI Protocol Stack. Networks are usually defined according to the layers shown.
the contrary, assigns separate physical memory sectors to each unit, and data is shared
between modules through message passing. One of the first API implementations
of this protocol is called MPI [13] and is still developed by the MPI Forum.
Message passing has seen wide application in computer networks and appears better suited
for them, as computers don't share the same memory. For the purposes of this work, some
of the MPI specifications were adopted for the NIC design.
The following sections provide a better insight into all the topics already mentioned;
a functional TLM 2.0 model of a Network On Chip is proposed and validated through
simulations; traffic patterns and other NoC parameters are analysed with the design, and
finally a VHDL synthesis is presented to evaluate area consumption.
Chapter 1
Networks On Chip Design
To properly approach NoC design, several aspects need to be defined in terms of the
aforementioned OSI model. Through the evaluation of each layer, all aspects needed for our
high level model will be defined.
1.1 Parallel Computing Memory Model
Parallel computing has faced big challenges since its creation: task dependencies, race
conditions, mutual exclusion and parallel slowdown are concerns that can't be omitted.
Whether memory is shared or distributed, these are software aspects that have to be solved
at a high level, and so they represent an additional task for the programmer.
Because none of the previous issues makes any difference to the choice of memory model,
it is necessary to consider elements that clearly affect this decision: portability and
scalability. A shared memory configuration is shown in Figure 1.1; if a program is to be
run on such a platform, either a software compiler aware of all system resources has to be
provided along with the hardware, or the programmer has to know low level details
to write an application for it. Apart from that, if the number of cores is modified, the
cost at the software level shouldn't be expensive at all, and again, it will depend either on the
compiler or the programmer.
The distributed memory model is shown in Figure 1.2; as indicated, processors interchange
information through messages. In contrast to the previous approach, no additional compiler or deep
knowledge of the hardware is needed; only network access methods are required. In
case the number of cores changes, a correctly parametrized software description would solve
the problem.
Many more pros and cons of each configuration could be mentioned, but that goes
beyond the scope of this work; it suffices to state that the distributed memory model
better suits the NoC's behaviour and is the one implemented here.
Figure 1.1: Shared Memory Model. All processing elements share a big memory area; each core may have as many caches as desired, yet the main memory is common to all of them.
Figure 1.2: Distributed Memory Model. Interconnection between cores is done through a network; if data has to be shared, it is sent via message passing.
1.2 Networks On Chip
Designing Networks On Chip is a process that requires consideration of several variables
in order to separate communication from computation. The OSI model shown in Figure 1
can be taken as a reference for these systems; to understand this association, each level of
the stack can be defined as follows:
1. Physical Layer: Defines voltage levels, length and width of wires, timing details
and topology, among others.
2. Data Link Layer: It is in charge of safe data delivery; it specifies flow control
mechanisms between hardware modules.
3. Network Layer: Controls message delivery from one node to another. It's responsible
for storing data and implementing routing algorithms.
4. Transport Layer: In charge of establishing connections between end-nodes and
providing the information for them. This module packages (unpackages) data and sends
(receives) it to (from) the routers.
5. Session, Presentation and Application Layers: Can be condensed into a single
Application group for NoCs; it refers to higher level aspects of the communication
such as software.
By following the above scheme, it is possible to define a functional and
synthesizable NoC model considering all its aspects. Although a high level SystemC model of
the network is constructed, hardware details are considered for future implementation.
The following sections show the specifications at each layer for the model developed.
1.2.1 Physical Layer: Topology
Some aspects of the physical layer depend on the technology to be used for fabrication and
can't be specified from the beginning; operating frequency and voltage levels are examples
of such limitations; because synthesis is not the main target of this work, the previous items
were discarded. The bus width was selected to match most standard processors nowadays,
that is, 32-bit ones.
Another important issue, and perhaps the most relevant at this layer, is the topology;
contrary to computer networks, NoCs have a fixed structure that cannot be modified for
the rest of the chip's lifetime. On this subject, several configurations have been proposed:
Figure 1.3 illustrates the most common topologies for networks; SPIN [14], Mesh, Torus,
Folded Torus, Octagon and Trees are a few examples. According to [1], Mesh and Torus
topologies constitute 62% of the overall designs; trees represent 12% and the rest, smaller
percentages. There are also specific ad-hoc implementations, which can be seen in Figure
1.4; the addition of links and the combination of basic structures constitute the differences; despite
reducing worst case paths or improving latency, the cost in area consumption and in the
creation of new routing algorithms might be too high.
The guidelines for picking a topology were its scalability and the availability of routing
algorithms. As mentioned before, Mesh and Torus structures are used by the majority
of researchers, mainly because of their scalability: the cost of adding one or two
cores to a grid is quite low, as it doesn't critically change the structure; additionally,
routing algorithms need not be modified. The difference between them lies in the turnaround links,
which can significantly reduce some of the worst case conditions. In reference [18], both
structures show similar behaviour in power consumption, throughput and saturation, but
Torus topologies perform better with adaptive routing algorithms which, as will be seen in
the next section, are needed. After considering all the previous restrictions and the results of
the cited references, the Torus topology was selected for this work.
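The effect of the turnaround links can be made concrete with a small sketch. Assuming a row of n nodes per dimension (the function names and layout here are illustrative, not the thesis's implementation), the wrap-around link of a Torus roughly halves the Mesh's worst-case hop count in each dimension:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Per-dimension hop count between two node coordinates.
// A mesh row offers only the direct path; a torus row adds a wrap-around
// link, so the router may take whichever direction is shorter.
int mesh_hops(int src, int dst) {
    return std::abs(dst - src);
}

int torus_hops(int src, int dst, int n) {
    int direct = std::abs(dst - src);
    return std::min(direct, n - direct);  // wrap-around path has n - direct hops
}
```

For an 8-node row, the mesh worst case is 7 hops while the torus worst case is 4, which is the reduction in worst-case paths mentioned above.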
1.2.2 Data Link Layer: Flow Control
Routers are complex modules that use simple handshaking protocols to transfer data.
Whether interacting with another router or a network interface card (NIC), the mechanism
is the same. Again, some differences with respect to computer networks exist; modules
inside the same chip transmit data much more reliably than physically separated
ones, so it suffices to control when to send and receive information, assuming that it is
properly transmitted. Some router implementations such as Æthereal [19] or MANGO [20]
Figure 1.3: Common Network Topologies. For both computer networks and NoCs, the most common structures are shown in the graphic.
Figure 1.4: Ad-Hoc Network Topologies. Academic proposals for NoC topologies: Mesh Connected Crossbars [16] (left), Spidergon [17] (center), and Diagonal Mesh [15] (right).
offer Quality of Service (QoS) guarantees, but that requires highly specialized work that
goes beyond the scope of this project.
Flow control techniques are shown in Table 1.1; most implementations use the Credit
Based approach; STALL/GO has never been implemented, and the rest of the literature
uses handshaking and ACK/NACK-like solutions. The handshaking approach is adopted
for our design.
It is important to note that flow control in the NoC's SystemC model is abstracted
with the TLM 2.0 standard and may correspond to any of the available techniques when
ready for synthesis.
Table 1.1: Flow Control Techniques for NoCs [2].

Credit Based: Every router keeps an internal counter of the spaces available for data storage (credits); once a space is freed, a credit is sent back to inform of its availability.

Handshaking Signal Based: A VALID signal is sent whenever a flit is transmitted. The receiver acknowledges by asserting a VALID signal after consuming it.

ACK/NACK: A copy of a data packet is kept in a buffer until an ACK is received; if asserted, the flit is deleted. If a NACK signal is asserted, the flit is scheduled for retransmission.

STALL/GO: Two wires are used for flow control; when a buffer space is available, a GO signal is activated. When no space is available, a STALL signal is asserted.
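As an illustration of the Credit Based entry above, the sender side reduces to a single counter. This is a hedged sketch with invented names (CreditSender, try_send), not code from the thesis:

```cpp
#include <cassert>
#include <cstddef>

// Sender-side view of credit-based flow control: the counter mirrors the
// number of free buffer slots at the downstream router.
class CreditSender {
public:
    explicit CreditSender(std::size_t buffer_slots) : credits_(buffer_slots) {}

    // Consume one credit and send; refuse when no free slot is known.
    bool try_send() {
        if (credits_ == 0) return false;  // receiver buffer full: stall
        --credits_;
        return true;
    }

    // The receiver freed a slot and returned a credit.
    void credit_returned() { ++credits_; }

    std::size_t credits() const { return credits_; }

private:
    std::size_t credits_;
};
```

The same counter discipline applies per virtual channel when those are present.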
1.2.3 Network Layer: Switching Policy and Routing Algorithm
The switching policy determines the way information is transmitted; it can be either packet or
circuit switched. Circuit switching is the least implemented: it states that a path from
source to destination must be reserved before transmitting data and shall only be released
after the message has been fully delivered. This policy is time expensive and may increase
network congestion, because messages can be blocked for a long time if the data is big; such
a situation can easily lead to deadlock issues.
Packet switching is widely used in both computer networks and on-chip ones; it can
be implemented in any of the following three versions:
1. Wormhole: Packets are split into smaller units called flits (Flow Control Units).
The head flit contains the address information, which each router uses to forward it towards the
destination; body flits follow it in a worm-like way. Only a 1-flit space is necessary
at each router input for implementation.
2. Store and Forward: Routers accept and send data when there is enough capacity
to fully store the packet. A minimum space equal to the packet's maximum length
is required per router.
3. Virtual Cut Through: Data is transmitted per flit but is only accepted when
there is enough buffer space to save the whole packet; all routers must be able to
store at least the maximum packet length.
Figure 1.5 illustrates how information is transmitted with packet switching techniques;
around 80% of proposed NoCs implement the wormhole one because of its low area
requirements; wormhole switching was also selected for this work given those advantages.
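The wormhole scheme just described can be pictured as a simple packetizer. The field names below (FlitKind, payload) are illustrative assumptions; the actual flit layout used in this work is defined in Chapter 2:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split one packet into wormhole flits: a head flit carrying routing
// information, followed by body flits and a closing tail flit.
enum class FlitKind : std::uint8_t { Head, Body, Tail };

struct Flit {
    FlitKind kind;
    std::uint32_t payload;  // head: destination address; body/tail: data word
};

std::vector<Flit> packetize(std::uint32_t dest,
                            const std::vector<std::uint32_t>& data) {
    std::vector<Flit> flits;
    flits.push_back({FlitKind::Head, dest});  // head flit: routing info only
    for (std::size_t i = 0; i < data.size(); ++i) {
        FlitKind kind = (i + 1 == data.size()) ? FlitKind::Tail : FlitKind::Body;
        flits.push_back({kind, data[i]});     // body/tail flits follow the head
    }
    return flits;
}
```

Because routers forward flit by flit, only the head flit needs to be inspected for routing; the rest simply follow the path it reserved.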
Another item addressed by the Network Layer that highly affects the platform's performance
is the routing algorithm; because of the Torus's resemblance to Mesh arrangements,
Figure 1.5: Packet Switching on NoCs: Wormhole (left), Store and Forward (center) and Virtual Cut Through (right) [19]. Only the wormhole technique significantly reduces area consumption.
most algorithms that work for the latter can also operate on Torus networks with minor
modifications.
A good guideline for selecting an appropriate algorithm, irrespective of the structure,
is the scheme shown in Figure 1.6; several router implementation details can be established from
that graph: router complexity increases with the number of destinations it can deliver
information to. Due to area restrictions and the possibility of solving it at the software
level, multicast routing is discarded for the current work.
Routing decisions also determine the chip's design: centralized routing requires a
controlling entity, aware of all nodes and of the traffic throughout the network, to decide how
the information should traverse it; source routing might increase the packet's size for long paths;
and multiphase routing implies some of the previous problems as well. Distributed
routing is by far the best suited for NoCs and facilitates the adoption of the algorithms
proposed.
As for implementation, both lookup tables and FSMs are feasible to adopt; the area cost
of both options is similar and doesn't affect the design drastically; one variable that could
determine which to choose is whether the algorithm is deterministic (always the same path
between two nodes) or adaptive (dependent on network congestion). Thanks to the fact that a
high level model of the network will be created, tests are to be carried out with deterministic
and adaptive algorithms; adaptive ones can be backtracking (fault tolerant), mis-routing
(able to route away from the destination if necessary) or partial (not considering all possible
routing paths).
Figure 1.6: Guidelines for selecting a Routing Algorithm [2].
For grid-like structures, the most common deterministic algorithm is the XY one, where
information travels along the X direction until it reaches the X coordinate of the destination;
it then travels along the Y direction. Adaptive routing is more complex, as it attempts to send
data through less congested paths that aren't always minimal; because of that, two
conditions usually restrict an algorithm's adaptability: deadlock, where several
messages block each other's paths, preventing any of them from ever advancing, and livelock,
where data keeps travelling throughout the chip without ever reaching the target.
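The XY decision just described fits in a few lines. This is a minimal sketch; the port names and the convention that X grows eastwards and Y grows northwards are assumptions for illustration, not the thesis's router interface:

```cpp
#include <cassert>

// One XY routing decision at a router located at (x, y), for a packet
// addressed to (dest_x, dest_y): correct X fully, then correct Y.
enum class Port { East, West, North, South, Local };

Port xy_route(int x, int y, int dest_x, int dest_y) {
    if (dest_x > x) return Port::East;   // travel in X first
    if (dest_x < x) return Port::West;
    if (dest_y > y) return Port::North;  // X matches: travel in Y
    if (dest_y < y) return Port::South;
    return Port::Local;                  // arrived: deliver to the attached NIC
}
```

Because every packet corrects X before Y, the only turns taken are X-to-Y ones, which is why plain XY routing is deadlock free on a mesh.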
A few semi-adaptive, deadlock- and livelock-free algorithms that are widely adopted are known
as turn model solutions [21], [22]; of all possible 90° turns, 2 are prohibited in order
to avoid deadlock. Figure 1.7 shows three algorithms derived from this theory. To better
understand each one, a brief explanation, taken from [23], is presented:
• West-First: Packets first travel west if necessary; they are then adaptively routed
south, east and north. The prohibited turns are the two into the west direction. Figure
1.8 shows some path examples with this algorithm.
• North-Last: When going north, packets cannot turn anywhere else; the only way for
packets to go northwards is when north is the last direction to take. Examples are shown
in Figure 1.9.
• Negative-First: The prohibited turns are the two from a positive direction to a negative
one; if a packet has to travel in a negative direction, it must start in that direction. Figure
1.10 exemplifies this behaviour.
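The forced/adaptive split that these rules produce can be sketched for a minimal-path variant of West-First (a hedged illustration; names and the coordinate convention are assumptions, and the non-minimal freedom of the full algorithm is not modelled):

```cpp
#include <cassert>
#include <vector>

enum Dir { EAST, WEST, NORTH, SOUTH };

// West-First candidate outputs: if the destination lies to the west, the
// packet must go west first (turns into WEST are the two forbidden ones);
// otherwise it may choose adaptively among the productive directions.
std::vector<Dir> west_first(int cur_x, int cur_y, int dst_x, int dst_y) {
    std::vector<Dir> out;
    if (dst_x < cur_x) { out.push_back(WEST); return out; }  // forced phase
    if (dst_x > cur_x) out.push_back(EAST);                  // adaptive phase
    if (dst_y > cur_y) out.push_back(NORTH);
    if (dst_y < cur_y) out.push_back(SOUTH);
    return out;  // empty vector: the flit has arrived
}
```

An arbiter can then pick any of the returned directions, e.g. the least congested one, which is where the adaptiveness comes from.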
Any of the aforementioned algorithms can be used with the SystemC model of the
network, since describing them does not require much development time; studies presented
in [18] show that no significant difference exists among them.
Other algorithms have been proposed in [24], [25], [26], [27] and many more references,
but they are left for future work.
1.2.4 Transport Layer: Network Interface Card
Up to this point, most design specifications affected the router’s final structure; this
layer, however, has more implications for the Network Interface Card. The problems to be
solved at this level are end-to-end flow control and the (un)packing of information.
In order to control packet injection into the network, our NIC design is based on the
message passing model previously mentioned; the way processors communicate with
each other can be summarized in two activities, sending and receiving data: for each
message transmitted by a core (write operation), another core should be expecting it (read
operation).
Figure 1.7: Turn model for Adaptive Routing: two turns are prohibited on each model to avoid deadlocks; minimal and non-minimal paths are possible for all options [22].
Figure 1.8: West First Routing Examples [23].
Figure 1.9: North Last Routing Examples [23].
Figure 1.10: Negative First Routing Examples [23].
It is clear that the processors won’t be synchronized at all times, and at some point two
or more cores could send messages to another one that isn’t ready yet; this would only
increase network congestion, require retransmission protocols and message-discard support,
and might also lead to a high-level deadlock if not properly handled.
Considering these problems, and especially the area constraints, the proposed Network
Interface Card implements end-to-end flow control with the following protocol: when
a core requests data, that is, performs a read operation, it sends a 1-flit-size packet to the
core that is supposed to write to it; upon reception, the second NIC sends the information
only if its core has a pending write transaction that matches the requester’s address;
if the second core does not expect that specific request, the NIC discards it, and the first
core has to retry after some time. On the other hand, when a NIC receives a write transaction,
it starts packing data, so that when a request arrives, most if not all of the information is
ready to be transmitted; if an application is properly written, the number of read requests
should match the number of write statements.
The cost of this implementation is that for every read/write pair, at least one flit has
to be sent between the two nodes in order to “establish” a connection; this is nonetheless
far more efficient than allowing all cores to send their packets at any time and obliging
NICs to constantly delete those that don’t correspond to expected transactions.
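The request/discard/retry exchange described above could be modelled roughly as follows (all names are illustrative; the real NIC operates at flit level and over OCP signalling, both omitted here):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Sketch of the end-to-end flow control: each NIC keeps packed write
// messages indexed by the id of the core expected to read them.
struct NicModel {
    // destination core id -> packed message waiting to be requested
    std::map<int, std::vector<uint32_t>> pending_writes;

    // Core-side write: pack the data ahead of time so it is ready on request.
    void write(int dest_core, const std::vector<uint32_t>& data) {
        pending_writes[dest_core] = data;
    }

    // Network-side request (the 1-flit read packet). Returns true and the
    // data when a matching write is pending; otherwise the request is
    // discarded and the requester must retry later.
    bool request(int requester_core, std::vector<uint32_t>& out) {
        auto it = pending_writes.find(requester_core);
        if (it == pending_writes.end()) return false;  // discard; retry later
        out = it->second;
        pending_writes.erase(it);
        return true;
    }
};
```

The single map lookup stands in for the address-matching step; a retry timer on the requesting side would complete the protocol.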
Other important items regarding NIC end-to-end flow control behaviour are:
1. No read transactions requested by a processing element are accepted by the NIC while
another read is in progress; violation of the algorithm sequence can lead to incorrect
results.
2. If data is being transmitted, the NIC can accept a read transaction from the processor
but won’t send the request until the previous transaction has terminated.
3. If a NIC is receiving packets from the network (read transaction), a write transaction
can be started from processor to NIC; data can be stored at a send buffer but won’t
be sent until a request from the correct module is received.
4. A write transaction starts when a processing element sends data to the NIC for
transmission. For the processing element, it ends when all the information has been
transferred to the NIC; for the latter, when all flits have been injected into the
network.
5. A read transaction starts when a processing element requests data from the NIC; it
ends when all the information requested is successfully delivered from the NIC to the
processing element.
6. Regardless of the type of transaction a NIC is performing, under no circumstances
can it skip the execution order when another read/write transaction is received.
7. The buffer size for storing incoming and outgoing transactions was defined to be 64
words. Separate buffers are implemented to improve performance.
As stated before, the protocol used for communication between the NIC and the processing
elements is the OCP-IP one; it is explained in Section 1.4 rather than here.
1.3 SystemC and Transaction Level Modelling TLM 2.0
Transaction Level Modelling (TLM) is a standard developed by the Open SystemC Initiative
(OSCI) which provides tools to rapidly create virtual descriptions of embedded platforms;
its main objective is to decouple computation from communication at a high abstraction
level so that complex systems can be modelled. According to the OSCI group [28],
simulations run from 10X up to 1000X faster than the corresponding HDL descriptions.
The TLM 2.0 standard allows two coding styles: loosely timed (LT) and approximately
timed (AT). When a quick, lightly detailed model of a design is required, the loosely
timed approach can be adopted; LT transactions are modelled as a single function call
(read or write) that either returns after some delay, or returns immediately with an additional
delay argument so that the caller reacts after that time. AT descriptions, on the contrary,
provide mechanisms for specifying as much timing detail as desired, so they are better suited
for architectural analysis and hardware verification. The Network-on-Chip model developed
here only uses AT descriptions, and the emphasis of this explanation is therefore on them.
Figure 1.11 shows the broader context in which it is worth applying TLM 2.0.1 descriptions.
The basic unit in all TLM transactions is the object interchanged, the generic payload;
it is a C++ class whose members include the minimum elements needed to execute a transaction:
command, address and data; apart from those, additional variables such as byte
enables, streaming width, bus width and response status are included to model more
complex protocols. Generic payload objects also support user-defined extensions that can
carry an unlimited number of attributes if required. Table 1.2 explains the basic attributes
just mentioned.
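A simplified stand-in for the generic payload, mirroring the attributes of Table 1.2, might look as follows (the field names follow the table, not the actual tlm::tlm_generic_payload headers, so this is a sketch rather than the real API):

```cpp
#include <cassert>
#include <cstdint>

// Simplified mirror of the generic payload attributes of Table 1.2.
enum Command { READ, WRITE };
enum ResponseStatus { INCOMPLETE, OK_RESPONSE, GENERIC_ERROR };

struct GenericPayloadSketch {
    Command        command = READ;
    uint64_t       address = 0;
    uint8_t*       data_ptr = nullptr;        // data is read/written here
    unsigned       data_length = 0;           // bytes to transfer
    uint8_t*       byte_enable_ptr = nullptr; // enables specific bytes
    unsigned       byte_enable_length = 0;
    unsigned       streaming_width = 0;       // words per burst transfer
    bool           dmi_allowed = false;       // Direct Memory Interface usable
    ResponseStatus response_status = INCOMPLETE;
};

// Typical initiator-side setup of a write transaction.
GenericPayloadSketch make_write(uint64_t addr, uint8_t* data, unsigned len) {
    GenericPayloadSketch t;
    t.command = WRITE;
    t.address = addr;
    t.data_ptr = data;
    t.data_length = len;
    t.streaming_width = len;  // no streaming: width equals the full length
    return t;
}
```

The target is expected to overwrite response_status before the transaction completes, which is how the initiator learns whether the access succeeded.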
All TLM 2.0.1 transactions are carried out between an Initiator and at least one Target;
the channel through which they communicate is called a socket, and the only module
allowed to start transactions is the Initiator; Target modules can only reply to in-progress
transactions. Interconnect modules (such as routers or buses) can also be integrated with
Figure 1.11: Transaction Level Modelling Use Cases, Coding Styles and Mechanisms [28].
the previous ones. Figure 1.12 shows an example with one Initiator, one Interconnect
component and one Target.
AT transactions can be split into four phases, as shown in Figure 1.13; communication
takes place through the functions non-blocking forward transport (nb forward) and
non-blocking backward transport (nb backward); both functions have three parameters:
1. Trans: Pointer to the generic payload object.
2. Phase: Current transaction phase; it can be any of the phases shown in Figure 1.13.
3. Delay: Time that a module has to wait before responding to a transaction.
Initiators call nb transport forward, with BEGIN REQ as phase argument, to start
transmitting data; they use phase END RESP to conclude a transaction. Targets call
Table 1.2: Generic Payload Attributes according to [29].
Generic Payload Attribute Meaning
Command Can be either Write or Read.
Address Target address to execute transaction.
Data Pointer Pointer to the data array. Data should be read or written
to this variable.
Data Length Length of the data to be transferred, computed as
BUSWIDTH/4.
Byte Enable Pointer Used to enable access to specific data bytes.
Byte Enable Length To specify the number of valid elements of the byte enable
pointer.
Streaming Width States the number of words per burst transfer.
DMI Allowed Marks whether the Direct Memory Interface can be used
or not.
Response Status Used for storing the status of the transaction.
nb transport backward with phase END REQ to acknowledge the reception of a transaction,
and use phase BEGIN RESP to indicate its correct execution, regardless of whether
it is a read or a write.
At some points it might be unnecessary to use all four phases to model a platform’s
behaviour, e.g. when a write transaction is performed: an initiator (CPU) sends data to a
target (memory) which can execute the order immediately; in this case, the target can reply
to the initiator with a phase update, changing it from BEGIN REQ to BEGIN RESP and
adding some delay. The way each agent becomes aware of such status updates is by checking
the return value of an nb transport call. Return values can be either TLM ACCEPTED (no
change in phase), TLM UPDATED (phase updated) or TLM COMPLETED (transaction
executed).
Specific rules concerning each module’s permission to modify the generic payload attributes,
the possible return values of each nb transport call, and a detailed explanation of
the whole standard can be found in [29].
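The early-completion shortcut just described can be illustrated with a minimal sketch (the enum values mimic, but are not, the real TLM 2.0 symbols; the payload itself is omitted):

```cpp
#include <cassert>

// Illustrative phase and return-status enums for the AT base protocol.
enum Phase { BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP };
enum SyncStatus { ACCEPTED, UPDATED, COMPLETED };

// A target that can execute a write immediately replies by updating the
// phase from BEGIN_REQ straight to BEGIN_RESP and returning UPDATED; the
// caller must then re-check the (in/out) phase argument.
SyncStatus target_nb_transport_fw(Phase& phase, bool can_execute_now) {
    if (phase == BEGIN_REQ && can_execute_now) {
        phase = BEGIN_RESP;   // skip the explicit END_REQ phase
        return UPDATED;
    }
    if (phase == BEGIN_REQ) return ACCEPTED;  // END_REQ will come later,
                                              // on the backward path
    return COMPLETED;         // END_RESP: the transaction is over
}
```

This is the mechanism by which a four-phase exchange collapses into two phases when the target needs no extra time.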
Figure 1.12: TLM Transaction Flow [28]. The generic payload object is created by the Initiator but is only referenced by interconnection modules or targets. Socket arrows indicate how the information flows.
As previously mentioned, extensions can be added to the generic payload object for
routing purposes; they can be either global or instance specific, that is, each module can
add attributes to the transaction object and be the only one able to access them. This
work adds two extensions to the generic payload object: a global one for end-to-end
verification purposes, and an instance-specific one for router operation. The next chapter
shows more details about this.
Figure 1.13: TLM Base Protocol Phases [28]; the initiator is the vertical line on the left and the target the one on the right.
1.4 Open Core Protocol
The Open Core Protocol International Partnership (OCP-IP) is a community in charge of
“proliferating a common standard for intellectual property core interfaces, or sockets that
facilitate “plug and play” System-on-Chip design” [10]. Their specification for interconnecting
modules is a bus model as complete as ARM’s AMBA-AXI and can be fully
described with OSCI’s TLM 2.0.1 standard.
Because of the amount of detail the OCP has, a light version of it is used in this
work; all basic signals shown in Table 1.3 are used, and burst support is added.
The standard OCP burst extension requires 8 additional signals, of which all but MBurstLength
can be skipped; to see why, consider Table 1.4: MAtomicLength is used
when the length of the data is bigger than the word size, which is not the case here; MBurstPrecise
indicates that the length of the burst is known at the start of the transmission, as it always
is in our design; MBurstSeq specifies how the addresses of the burst are emitted, which in
this work are assumed to be incrementing; MBurstSingleReq implies that only one request
is made per burst transfer; MDataLast, MReqLast and SRespLast are unnecessary, as each
module keeps track of the number of data words transferred.
Table 1.3: Basic OCP Signals extracted from [10]. Signal MDataValid is skipped in our implementation. Width measured in bits.
Name Width Driver Function
Clk 1 varies OCP Clock
MAddr configurable master Transfer address
MCmd 3 master Transfer command
MData configurable master Write data
MDataValid 1 master Write data valid
MRespAccept 1 master Master accepts response
SCmdAccept 1 slave Slave accepts transfer
SData configurable slave Read data
SDataAccept 1 slave Slave accepts write data
SResp 2 slave Transfer response
Table 1.4: Burst OCP Signals [10]. Only MBurstLength is needed for this work’s NIC.
Name Width Driver Function
MAtomicLength configurable master Length of atomic burst.
MBurstLength configurable master Burst Length.
MBurstPrecise 1 master Burst length precise.
MBurstSeq 3 master Address sequence.
MBurstSingleReq 1 master Single request/multiple
data protocol
MDataLast 1 master Last data in burst.
MReqLast 1 master Last request in burst.
SRespLast configurable slave Last response in burst.
To better understand how transfers with the OCP protocol work, consider Figure 1.14;
only signal MRespAccept is missing from the diagram, yet the behaviour is practically the
same. Figure 1.15 shows a scenario for burst transfers; handshaking is carried out the same
way.
Figure 1.14: OCP Read Transaction [10]; signal behaviour when performing a read request: when the master issues the command it has to wait for SCmdAccept to assert before changing the MCmd line. After some time the slave indicates valid data on the SData bus by issuing a Data Valid command on the SResp line.
Figure 1.15: OCP Burst Write Transaction [10]; signals MBurstSeq and MBurstPrecise never change. Handshaking between master and slave is basically the same as in the previous non-burst example.
Chapter 2
NoC Implementation
Once the implementation details and design flow have been clarified, as in the previous
chapter, it is now possible to describe the router and the NIC at any level of abstraction.
Although most items regarding each structure are well defined, some aspects still lack
specification and will be analysed hereafter. The code of each description can be found in
the Appendix.
2.1 Flit and Message structure
In order to determine the structure of both the router and the NIC, it is necessary to define
the units they deal with: flits and messages. Messages are composed of
one or more flits, which are the units injected into the router network; because wormhole
routing is used, one of the flits must include information about the origin and destination
of the whole message; the NICs, however, require additional data fields to properly
implement end-to-end flow control. As a starting point, the explanations provided in Section
1.2.4 and the constraints mentioned in Section 2.3 are needed to define all the fields.
If more TCP-like control parameters are needed for high-level control, those parameters
must be set by the processing elements and transmitted to the NIC as ordinary data;
NICs only support the minimum number of control fields needed to ensure correct functionality.
The head flit structure is displayed in Figure 2.1, the message structure in Figure 2.2,
and the flit fields are explained in Table 2.1.
Figure 2.1: Head Flit Structure
Figure 2.2: Message Structure. Payload can be up to 64 bytes long.
Table 2.1: Flit fields explanation.
Field Use
Type Flits can be either: Head, Body, Tail or Single; single
flits are used to ask for data and for barrier operations.
Source X Flit’s origin X coordinate.
Source Y Flit’s origin Y coordinate.
Destination X Flit’s destination X coordinate.
Destination Y Flit’s destination Y coordinate.
Length Message length. Maximum 64 words.
Single Indicates whether flit is a single-flit transaction or not.
Message Number Message number stated by source module.
Broadcast States whether the message is broadcast or not.
BarrierID Stores a BarrierID according to the source.
ReadWrite If the message is single-flit, this bit is set when it is a barrier
write.
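One possible packing of these fields into a head flit is sketched below; the bit widths are assumptions for a small (e.g. 4x4) torus and do not reflect the thesis’ actual layout:

```cpp
#include <cassert>
#include <cstdint>

// The four flit types used by the platform.
enum FlitType : uint8_t { HEAD, BODY, TAIL, SINGLE };

// Illustrative bit-field packing of the head-flit fields of Table 2.1.
struct HeadFlit {
    FlitType type       : 2;  // Head, Body, Tail or Single
    uint8_t  src_x      : 2;  // 2 bits address a 4-column torus
    uint8_t  src_y      : 2;
    uint8_t  dst_x      : 2;
    uint8_t  dst_y      : 2;
    uint8_t  length     : 7;  // message length, up to 64 words
    uint8_t  msg_num    : 4;  // message number stated by the source
    bool     broadcast  : 1;  // broadcast message or not
    uint8_t  barrier_id : 4;  // BarrierID according to the source
    bool     read_write : 1;  // set on single-flit barrier writes
};
```

Widening the coordinate fields is all that is needed to scale the same layout to a larger torus.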
Through SystemC descriptions and simulations it was possible to establish the correct
behaviour of the platform. Now that the units needed by both router and NIC are defined,
their designs can be presented.
2.2 Router Architecture
Studies presented in Chapter 1 yielded the following conclusions regarding the router
implementation:
• Topology: Torus, displayed in Figure 2.3 (taken from [30]).
• Switching Policy: Wormhole packet switching.
• Flow Control Technique: Handshaking signals.
• Routing Algorithms: Deterministic XY and adaptive turn model.
Only two aspects of the router’s structure remain undefined: the arbitration technique
and the number of Virtual Channels. When two or more inputs attempt to use a
router output, a mechanism is needed to assign output control. Table 2.2
lists the usual solutions to this problem; most implementations listed in [2] use Round-Robin
or First Come - First Served techniques for best-effort routers, and priority approaches for
Guaranteed Traffic (GT) ones such as [19] and [14].
Specialized routers are required when GT services are to be provided, and just a few NoCs
like the ones cited have implemented them. Best-effort Round-Robin arbitration is used
in this work.
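A minimal round-robin arbiter of the kind adopted here can be sketched as follows (an illustration, not the thesis code):

```cpp
#include <cassert>
#include <vector>

// Round-robin arbiter: starting from the input after the last winner, grant
// the output to the first input with a pending request, so that all inputs
// are served equally over time.
struct RoundRobinArbiter {
    int last_grant = -1;

    // requests[i] is true when input i wants the output; returns the granted
    // input index, or -1 when nobody is requesting.
    int arbitrate(const std::vector<bool>& requests) {
        int n = static_cast<int>(requests.size());
        for (int k = 1; k <= n; ++k) {
            int i = (last_grant + k) % n;
            if (requests[i]) { last_grant = i; return i; }
        }
        return -1;
    }
};
```

Because the search starts just past the previous winner, no input can monopolize the output, which is the fairness property that makes round-robin attractive for best-effort routers.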
On the other hand, Virtual Channels (VCs) are buffers added to the router’s inputs
(or outputs) to alleviate congestion in the network; despite using the same physical
paths, the additional buffers decrease the probability of deadlock and improve performance,
as delayed messages can wait inside routers and still advance to their destinations. Area is
the main cost of adding Virtual Channels and is also one of the most critical issues in
Figure 2.3: Torus Topology NoC
Table 2.2: Router Arbitration Techniques [2].
Arbitration Technique Policy
Round Robin Output is assigned equally starting from the
first element.
First Come - First Served Output control is assigned in request order.
Priority Based All packets are assigned a priority and get
output control according to their importance.
Priority Based Round Robin Round Robin is implemented, but a priority
proportional to the frequency of usage is assigned.
embedded system design; because of that, an optimal placement and integration of buffers
is required. Figure 2.4 shows a router with input VCs which are, in principle, connected to
all possible outputs. Studies in [31] show that for unicast routing, having one VC per output
at each input can reduce area consumption significantly; with this result, the router and
VC integration can be seen in Figure 2.5.
Figure 2.4: Block diagram of a Router with Virtual Channels. Area constraints have to be considered to choose an appropriate number of buffers.
Now that all specifications related to the router’s behaviour are defined, a high-level
block diagram can be constructed; no major implementation details are shown, for it is
an abstraction of the real hardware and all functional blocks are described in software. Figure
2.6 shows the general block diagram that will be used to describe the TLM model of the
router.
Figure 2.5: Virtual Channel connections to the Router. A single VC per output is available at each input so as to decrease area consumption. Extracted from [31].
Figure 2.6: General Router Block Diagram. Four virtual channels at each input are placed to reach all possible outputs; no packets are routed back through the same input.
2.2.1 Router TLM Model
SystemC’s Transaction Level Modelling is a standard for decoupling communication from
computation in high-level designs; most mechanisms offered by the standard are easily
abstracted to bus models, because buses are the traditional way to interconnect Systems On
Chip (SoCs). As routers use different flow control techniques from traditional bus
systems, a different interpretation of the TLM 2.0.1 phases is required for the model
to stay faithful to the hardware. Table 2.3 explains the meaning of each phase for inter-router,
packet-based communication.
Table 2.3: TLM 2.0.1 Phases Interpretation for Routers.
Phase Flow Direction Meaning
BEGIN REQ Init. Router To Target Router Flit is being transmitted.
END REQ Target Router To Init. Router Flit is stored, can be erased
on initiator.
BEGIN RESP Target Router To Init. Router A new space is free. Can
send more flits.
END RESP Init. Router To Target Router Final reply.
Another addition to the TLM 2.0.1 base protocol described in the previous chapter
is the routing extensions; as mentioned before, extensions can be accessed locally or globally.
The proposed model uses both, for debugging and verification purposes; a local extension
is created for every transaction as it traverses a router, each router adding its own
instance to the transaction. It has the following fields:
(a) Port: Stores the number of the incoming port through which the transaction entered.
(b) Port VC: Stores the number of the outgoing port through which the transaction will
go out.
(c) TimesBlocked: Counter increased by one unit every time a transmission attempt of
the transaction is blocked. This allows recognizing deadlock situations.
The global extension is created by the initiator, can be accessed by all modules, and
adds the following information to the transaction object:
(a) MainInitiator: Stores the ID of the module that first issued the transaction.
(b) FinalTarget: Stores the ID of the module where the transaction is to be delivered.
(c) TransID: Records the transaction number for debugging purposes.
(d) FlitType: Stores the type of flit of the current transaction.
(e) TransCounter: Incremented every time a transaction passes through a router.
(f) TransPath: Array for storing the path the flit goes. Used for debugging.
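Both extensions could be sketched as plain structs (the field names follow the lists above; the binding to the actual TLM extension mechanism is omitted):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Global extension: created by the initiator, visible to all modules.
struct GlobalRoutingExtension {
    int main_initiator = -1;      // module that first issued the transaction
    int final_target = -1;        // module where it must be delivered
    uint32_t trans_id = 0;        // transaction number, for debugging
    int flit_type = 0;            // type of flit of the current transaction
    int trans_counter = 0;        // incremented at every router traversed
    std::vector<int> trans_path;  // routers visited, for debugging
};

// Instance-specific extension: added and read only by one router.
struct RouterInstanceExtension {
    int port = -1;           // incoming port of this router
    int port_vc = -1;        // outgoing port / virtual channel chosen
    int times_blocked = 0;   // blocked attempts: helps spotting deadlock
};
```

In the real model each router would attach its own RouterInstanceExtension to the payload, while the single GlobalRoutingExtension travels with the transaction end to end.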
At this point it is necessary to clarify that there are four types of flits: head flits, which
contain routing information; body flits, which carry the data itself; tail flits, which mark the
end of a packet (and may or may not contain data); and full flits, which are single-flit messages
used for (a) sending read requests from one core to another (end-to-end flow control) and
(b) single-flit writes used for barrier operations.
SystemC implementation of the router is composed of five functions that act on each
port:
I Non-blocking Transport Forward: A standard, mandatory function that receives
three parameters: a transaction pointer, a TLM phase argument and a time value
called delay. When a module wants to send a flit, it calls this function with those
parameters and BEGIN REQ as the phase argument; the delay is the time after
which the target has to react to the call. The function checks the type of flit
and the space availability, computes the output port, returns TLM ACCEPTED and tells the
simulator to execute Forward Payload Event Queue at the time indicated by the delay.
The detailed behaviour of this method is shown in Algorithm 2.1.
II Non-blocking Transport Backward: Also a mandatory function that receives
the same three parameters, but the correct phase arguments are either END REQ or BEGIN
RESP. On END REQ, the method Backward Payload Event Queue is notified
for execution after the delay time; on BEGIN RESP, the method Transaction
Update is notified.
III Forward Payload Event Queue: Function invoked by nb transport forward; it
takes the transaction object, stores it in the corresponding Virtual Channel and notifies
the Transaction Update method to execute after an internal delay. It also
returns phase END REQ to the initiator to acknowledge the correct storage of
the transaction.
IV Backward Payload Event Queue: Function invoked by nb transport backward, in
charge of double-checking that the transaction is correct. It notifies the Transaction
Update method for immediate execution.
V Transaction Update: Considered the brain of the router; it starts transactions previously
stored in the VCs, deletes transactions already sent, notifies modules of newly
available spaces, and implements output arbitration. Algorithm 2.2 describes
the method thoroughly.
2.2.2 Traffic Evaluation and Routing Algorithm Testing
MPSoC platforms are generic systems that can implement any algorithm, and their inter-module
traffic can only be known once task partitioning is done; because it is uncertain which
application will be executed on such platforms, it is necessary to test synthetic traffic
patterns on the chip to establish its performance under random circumstances. There
Algorithm 2.1 Non-blocking Transport Forward.
Require: Transaction object, phase, delay
1: if phase = BEGIN REQ then
2: if (FlitType = Head) or (FlitType = Full) then
3: OutPort = Value returned by Routing Algorithm.
4: if (VC Empty) then
5: Reserve Virtual Channel
6: Set response status to TLM OK RESPONSE
7: if (OutPort Free) then
8: Take control of OutPort
9: end if
10: else
11: Set response status to TLM GENERIC ERROR RESPONSE
12: Return TLM ACCEPTED
13: end if
14: Notify Forward Payload Event Queue to execute after delay time
15: Decrease VC Space
16: Return TLM ACCEPTED
17: else if (VC has space) then
18: Set response status to TLM OK RESPONSE
19: Decrease VC Space
20: Notify Forward Payload Event Queue to execute after delay time
21: Return TLM ACCEPTED
22: else
23: Set response status to TLM GENERIC ERROR RESPONSE
24: Return TLM ACCEPTED
25: end if
26: else if phase = END RESP then
27: Return TLM ACCEPTED
28: else
29: Abort Execution
30: end if
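For one input port, Algorithm 2.1 condenses to roughly the following C++ (a sketch: the routing call and the event-queue notifications are reduced to comments and flags, and all names are illustrative):

```cpp
#include <cassert>

enum Phase { BEGIN_REQ, END_RESP, OTHER_PHASE };
enum Status { ACCEPTED, ABORT };
enum FlitKind { HEAD, BODY, TAIL, FULL };

struct RouterPort {
    bool vc_empty = true;
    int  vc_space = 4;         // assumed virtual-channel depth
    bool out_port_free = true;
    bool ok_response = false;  // stands in for the TLM response status
    bool queued = false;       // Forward Payload Event Queue notified

    // Condensed version of Algorithm 2.1.
    Status nb_transport_fw(Phase phase, FlitKind kind) {
        if (phase == END_RESP) return ACCEPTED;
        if (phase != BEGIN_REQ) return ABORT;          // abort execution
        if (kind == HEAD || kind == FULL) {
            // the routing algorithm would compute the output port here
            if (!vc_empty) { ok_response = false; return ACCEPTED; }
            vc_empty = false;                          // reserve the VC
            ok_response = true;
            if (out_port_free) out_port_free = false;  // take the output
        } else {                                       // BODY or TAIL
            if (vc_space == 0) { ok_response = false; return ACCEPTED; }
            ok_response = true;
        }
        queued = true;  // notify Forward Payload Event Queue after delay
        --vc_space;
        return ACCEPTED;
    }
};
```

Note that, as in the algorithm, the call always returns TLM ACCEPTED on a valid phase; success or failure is conveyed through the response status, not the return value.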
Algorithm 2.2 Transaction Update Method Implemented by Routers
Require: Virtual Channel to Update
Require: InputPort, OutputPort
1: if A transaction on VC can be started then
2: Call non-blocking forward method on the next module with phase BEGIN REQ.
3: end if
4: for i = 0 to V CSize do
5: if A Transaction can be freed then
6: Delete transaction.
7: Increase VC space.
8: Call non-blocking backward method on the previous module with phase BEGIN RESP
to indicate that a new space is available.
9: if Transaction is type “Tail” then
10: Free Virtual Channel.
11: Stop controlling Output Port.
12: end if
13: end if
14: end for
15: if Output Port is not busy then
16: for i = 0 to Number of Router Inputs do
17: NewInput = (InputPort + i + 1) mod Number of Router Inputs
18: if NewInput is ready to use OutputPort then
19: Give NewInput control of OutputPort.
20: Execute again from the start.
21: end if
22: end for
23: end if
are a few typical tests conducted on NoC designs that help assess routing algorithm
performance:
Uniform Traffic
Nodes communicate with each other with the same probability.
Matrix Transpose Traffic
Each node sends messages only to the node whose address is its own address with
the upper and lower halves swapped.
Hotspot Traffic
Each node sends messages to other nodes with an equal probability except for a
specific node (called Hotspot) which receives messages with a greater probability.
The percentage of additional messages that a Hotspot node receives compared to the
other nodes is indicated after the Hotspot name, e.g. Hotspot 15%.
Complement Traffic
Each node sends messages only to a node corresponding to the one’s complement of
its own address.
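For reference, the destination computations for Matrix Transpose and Complement traffic reduce to simple bit manipulations; the sketch below assumes a 16-node network with 4-bit node addresses:

```cpp
#include <cassert>
#include <cstdint>

// Matrix transpose: swap the upper and lower halves of the 4-bit address.
uint8_t transpose_dest(uint8_t node) {
    return static_cast<uint8_t>(((node & 0x3u) << 2) | (node >> 2));
}

// Complement: one's complement of the address within 4 bits.
uint8_t complement_dest(uint8_t node) {
    return static_cast<uint8_t>(~node & 0xFu);
}
```

Uniform and Hotspot traffic, in contrast, pick destinations with a random-number generator, biased toward the hotspot node in the latter case.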
Several scenarios were tested under some of these traffic conditions, and three
routing algorithms were implemented: West-First (adaptive), North-Last (adaptive) and
XY (deterministic). Additionally, an aspect not studied so far, VC depth, was
also considered; the results can be seen in the following figures.
Figures 2.7 and 2.8 show link utilisation under Hotspot 10% (on node 7) traffic conditions;
two groups of figures are provided, as all links transmit information in both directions
(up-down or right-left); for a better discrimination of link congestion, the plots were drawn
separately. XY routing in Figure 2.7(a) has 4 high-traffic links (higher bars); West First
in Figure 2.7(b) presents only two congested links, and North Last in Figure 2.7(c) just
one. In the other direction, Figure 2.8(a) shows XY with 2 congested links, Figure 2.8(b)
presents West-First behaviour with 1 high-traffic link, and Figure 2.8(c) has 2 congested
links with North-Last routing.
From the previous figures, West-First and North-Last routing apparently spread traffic
better along the network, as they have only 3 high-traffic links in their corresponding
graphs. No turnaround links show significant utilisation, despite the adaptiveness of those
algorithms.
In order to check the overall behaviour of all routing algorithms under this traffic pat-
tern, plots for average flit latency and total simulation time are shown in Figure 2.9. From
2.9(a) it can be seen that the more Virtual Channels there are, the higher the flit latency
on the network; that is because NICs can inject more packets into the network at any given
time. The adaptive algorithms show lower values than XY's, indicating that information
is forwarded faster with them. Figure 2.9(b) gives more information about routing perfor-
mance: for large messages, XY and North-Last perform better than West-First regardless
of the Virtual Channel depth, while for shorter transmissions West-First decreases simula-
tion time. In general, the results for this traffic pattern are very close to each other and a
deeper analysis might be needed to make a routing decision.
A second pattern was studied under the same conditions as before; Matrix Transpose
traffic was implemented and the results are shown in Figures 2.10, 2.11 and 2.12. From
graphs 2.10(a) and 2.11(a), 8 congested links can be distinguished when using XY routing;
West-First, shown in 2.10(b) and 2.11(b), presents only 2 high-traffic links, as does
North-Last routing in 2.10(c) and 2.11(c). Although, as before, the adaptive algorithms
attempt to distribute traffic better throughout all available paths, the results in 2.12 demon-
strate that long messages with low Virtual Channel depth reach their destination faster
with XY routing, which also has the lowest flit latency; medium-size messages are better
suited to West-First routing under this pattern.
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.7: Link Utilisation for Hotspot 10 %. Traffic going Right - Down
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.8: Link Utilisation for Hotspot 10%. Traffic going Left - Up
(a) Average Flit Latency
(b) Total Simulation Time
Figure 2.9: Timing statistics for Hotspot 10% traffic. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.10: Link Utilisation for Matrix Transpose Traffic going Right - Down
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.11: Link Utilisation for Matrix Transpose Traffic going Left - Up
(a) Average Flit Latency
(b) Total Simulation Time
Figure 2.12: Timing statistics for Matrix Transpose traffic. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.
Although more traffic patterns could have been evaluated, for the scope of this study
the above is enough to show the capabilities of the constructed model. Each of the
timing graphs (3D plots) considered 8 message sizes and 8 Virtual Channel depths, that is,
64 simulations in total. Each run used 100 messages on each of the 16 nodes, for a total of
1600 messages, which amounted to between 6400 and 51200 flits transported along the NoC.
2.2.3 Router VHDL Model
Once the router's behaviour was validated with the high-level model presented, a detailed
HDL design was implemented; three main blocks compose this design: Input Port, Output
Port and Multiplexers. Input ports are in charge of data-reception control, routing and
Virtual Channel storage; output ports control data transmission and round-robin channel
arbitration; multiplexers interconnect all input buffers with the router's outputs.
Because the flit size is 34 bits, there are 34 lines for data transmission and 34
for data reception; additionally, two lines are used for handshaking transmission control, Tx and
Tx_Ack, and two lines for reception control, Rx and Rx_Ack. In summary, each port has 36
inputs and 36 outputs. Figure 2.13 shows the router's black box.
The Input Port module is composed of 4 FIFOs, a routing unit, one multiplexer and one
de-multiplexer; deterministic XY routing was chosen for the state-machine implementation.
Flow control was implemented according to the flow-control studies presented earlier,
where handshaking signals were selected. The control module receives incoming requests from
external modules through the Rx input; it sends a request signal to the routing unit, which
replies when the output has been computed; after getting a response from the routing
module, the flit is stored at the corresponding queue, if space is available. When no space
is available, the Rx_Ack line remains de-asserted.
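The reception rule above can be condensed into a few lines. The sketch below is a behavioural illustration only (the real design is a VHDL state machine): a flit is accepted, and Rx_Ack asserted, only when the routing unit has answered and the selected virtual-channel queue has room; the function name and the use of `std::queue` as the FIFO are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>

// The VHDL design later uses 16-flit FIFOs for the Virtual Channels.
const std::size_t kVcDepth = 16;

// Returns true when the flit is stored (Rx_Ack asserted), false when
// the routing result is not ready yet or the queue is full
// (Rx_Ack stays de-asserted and the sender must retry).
bool try_accept(std::queue<std::uint64_t>& vc, std::uint64_t flit,
                bool route_ready) {
    if (!route_ready || vc.size() >= kVcDepth)
        return false;              // Rx_Ack stays de-asserted
    vc.push(flit);                 // store at the selected queue
    return true;                   // Rx_Ack asserted
}
```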
The routing module was designed with a state machine that uses external comparators
to determine whether the coordinates of the destination are larger or smaller than the
router's. Depending on the results of all comparisons, an output is computed and stored into
Figure 2.13: VHDL Router Black Box. Five ports are needed for torus or (most routers in) mesh configurations.
an internal register; a block diagram of this module is presented in Figure 2.14, and that
of the whole Input Port is shown in Figure 2.15.
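The comparator-driven decision of the XY routing module can be sketched in software. This is a behavioural illustration under stated assumptions: the port encoding is hypothetical, Y is assumed to grow downward, and the torus turnaround-link selection (choosing the shorter wrap-around direction) is omitted for brevity.

```cpp
#include <cassert>

// Hypothetical port encoding for the 5-port router.
enum Port { LOCAL, NORTH, SOUTH, EAST, WEST };

// Deterministic XY routing: correct the X coordinate first; only
// when dest_x == my_x is the Y coordinate considered. These are the
// same comparator results the VHDL state machine consumes.
Port xy_route(int my_x, int my_y, int dest_x, int dest_y) {
    if (dest_x > my_x) return EAST;
    if (dest_x < my_x) return WEST;
    if (dest_y > my_y) return SOUTH;   // assuming Y grows downward
    if (dest_y < my_y) return NORTH;
    return LOCAL;                      // flit reached its destination
}
```

Note that restricting this function so it never selects a turnaround link is exactly the torus-to-mesh change discussed in the concluding chapter.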
The Output Port module is composed of a control unit in charge of arbitrating outputs and
negotiating data transmission with another router or NIC; two de-multiplexers, a flit
decoder and a multiplexer are also included in this block. The flit decoder is in charge of
notifying the control unit when a tail or single flit has been transmitted, so that the
output can be assigned to another input. A block diagram of this block is shown in Figure 2.16.
Finally, the full block diagram of the router is shown in Figure 2.17. The VHDL code is
attached at the end, in the Appendix section. Due to the number of inputs/outputs of
this module, full implementation may only be possible on an ASIC; however, FPGA synthesis
allowed us to gather information about area consumption. The study in [32] lists
statistics about the number of slices consumed on a Virtex-II FPGA, where a 5-input/output
router consumed 397 slices; likewise, [8] reports a consumption of 1762 CLBs on a Virtex-II-
8000 FPGA.
Figure 2.14: VHDL Block Diagram for the XY Routing Module.
Figure 2.15: VHDL Block Diagram for the Input Port Module. Four FIFOs are needed to route information to each output port.
Figure 2.16: VHDL Block Diagram for Output Port Module.
Our router model was synthesised on a Virtex-5. Virtual Channels were generated as
FIFO memories with Xilinx's IP Core Generator, with 16-flit depth. Resource utilisation
is shown in Table 2.4. Because low-level detailed design was not the objective of this
work, HDL simulations are skipped in this document, yet the VHDL code is attached in the
Appendices section.
It is important to note that the FPGA pin-out was not enough for synthesizing a single
router (note the 103% IOB utilisation in Table 2.4); however, if several of these modules
are embedded together, a small network of them can be constructed, and the chip could be
connected to external processors through the FPGA outputs.
2.3 Network Interface Card Architecture
The Network Interface design is intended to support and validate message-passing transactions,
which are composed of two tasks for communication, send and receive, and one for syn-
Figure 2.17: VHDL Block Diagram for the Router. The multiplexers shown in the diagram are the same as those in Figure 2.16 for data selection.
Table 2.4: Router Area Consumption on Virtex 5 (XC5VFX30T-1FF665)

Logic Utilisation                      Used   Available   Utilisation
Number of Slice Registers               730       20480        3 %
Number of Slice LUTs                    846       20480        4 %
Number of fully used LUT-FF pairs       230        1346       17 %
Number of bonded IOBs                   372         360      103 %
Number of Block RAM/FIFOs                10          68       14 %
Number of BUFG/BUFGCTRLs                  1          32        3 %
chronization, called barrier. These functions were taken from the MPI standard and suffice
for the functionality required.
2.3.1 Network Interface TLM Model
The SystemC TLM model of the NIC has one target socket for receiving the core's transactions,
one initiator socket for sending data to the local router and another target socket to get
data from it; for each target socket there is a corresponding nb_transport_fw function,
and for the initiator socket an nb_transport_bw method is provided.
On sockets connected to routers, TLM phases are interpreted the same way as stated
in Table 2.3; however, on sockets connected to processing elements (end-modules), phases
are considered as specified by the standard in Section 1.3.
In order for the system to react at the appropriate time (because of transaction delays),
there are three payload event queues, one linked to each nb_transport function. Other
methods are in charge of standard operations such as storing data in the send or receive
buffers, arbitrating output control, replying to processing elements, etc. A list of all the
NIC methods' functionality is presented next.
I. BuildHeadFlit: Method in charge of creating a transaction’s header flit. It stores
the message number, flit type, and the initiator and target addresses in a single word.
II. GetHeaderInfo: In charge of extracting all header information from a head flit.
III. CheckIfExpectedTransaction: Method invoked when a new request arrives; it
establishes whether the request corresponds to a write transaction started by the
local processor. If the transaction doesn't match the expected one, it is stored
in an incoming-requests buffer.
IV. StoreAtSendBuffer: Function used for storing flits in the send buffer (for a write
transaction) or in the send-request buffer (for a read transaction).
V. StoreAtReceiveBuffer: In charge of storing flits in the receive buffer (for a read
transaction) or in the receive-request buffer (for a write transaction) when the request
doesn't match the processor's request.
VI. RESPONSE_TransactionUpdate: Dynamic-event-triggered method used for sending
phase BEGIN_RESP back to the router when a tail flit has been received; it also
sends BEGIN_RESP to the local processor to indicate that data is ready to be
transmitted.
VII. REQUEST_TransactionUpdate: Method invoked to send a new write request
when the timer has expired.
VIII. RRESPONSE_TransactionUpdate: In charge of returning phase BEGIN_RESP
back to the router when a read request has been received.
IX. SEND_TransactionUpdate: Acts as a central control unit for the NIC module;
this method checks whether the IncomingRequestQueue has a valid transaction that
matches the one specified by the processor and, if so, grants output access to the send
queue. It then sends the first flit in the send buffer and updates debugging
information; afterwards, it frees already-transmitted flits from the queue and checks if
it was a tail flit and, if so, releases the output port. More tasks are performed by this
function; they are better described in Algorithm 2.3.
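The single-word header handled by BuildHeadFlit and GetHeaderInfo (items I and II) can be sketched as a bit-packing routine. The thesis states that the 34-bit head flit carries the flit type, the message number and the initiator/target addresses; the exact field widths below are assumptions for illustration only.

```cpp
#include <cstdint>

// Assumed layout of the 34-bit head flit:
//   [33:32] flit type (e.g. 00 head, 01 body, 10 tail, 11 single)
//   [31:16] message number
//   [15: 8] initiator address
//   [ 7: 0] target address
using Flit = std::uint64_t;

Flit build_head_flit(unsigned type, unsigned msg,
                     unsigned src, unsigned dst) {
    return (Flit(type & 0x3) << 32) | (Flit(msg & 0xFFFF) << 16)
         | (Flit(src & 0xFF) << 8)  |  Flit(dst & 0xFF);
}

// GetHeaderInfo counterpart: unpack the same fields from a head flit.
void get_header_info(Flit f, unsigned& type, unsigned& msg,
                     unsigned& src, unsigned& dst) {
    type = unsigned(f >> 32) & 0x3;
    msg  = unsigned(f >> 16) & 0xFFFF;
    src  = unsigned(f >> 8)  & 0xFF;
    dst  = unsigned(f)       & 0xFF;
}
```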
Apart from reading and writing, all cores are capable of executing barrier operations for
synchronization. Depending on the core's ID, a barrier is implemented differently: there
must always be one master core and one or more slave cores; the master waits for the slaves
to send a barrier message and, once everyone's has arrived, issues a command for all of them
to resume executing their tasks. Because it is necessary to address all nodes when issuing
barrier transactions from the master core, and because routers are unaware of this,
NICs were designed to support a broadcast command that sends the same data to every
node. This functionality is also useful when processors need to share information stored at
one of them; however, a node will not start transferring data until it gets request flits
from all the others. This approach might prove useful in some scenarios but can
also decrease overall performance in others.
In order to improve performance and reduce processor computation, the NIC implements
barrier operations as follows:
Slave cores : Send a normal write-request transaction to the master core and expect a
one-flit write.
Master core : Builds a single-flit write transaction and stores it in the send buffer; when
requests from all modules have been received, it sends that flit to all the modules. When all
flits have been transmitted, the NIC replies back to the core.
The mechanical computation implied by the barrier function is done at the NIC so that the
core can perform other operations; the cost is an increase in area consumption.
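The master-side bookkeeping just described amounts to counting slave arrivals and releasing everyone with a single broadcast flit. A minimal behavioural sketch (class and method names are illustrative, not taken from the VHDL/SystemC code):

```cpp
// Sketch of the master NIC's barrier bookkeeping: count incoming
// barrier requests from the slave nodes and report when the release
// broadcast (the stored single-flit write) may be sent.
class BarrierMaster {
public:
    explicit BarrierMaster(int slaves) : expected_(slaves), arrived_(0) {}

    // Called when a slave's barrier write-request reaches the NIC.
    // Returns true when every slave has checked in (release time).
    bool slave_arrived() {
        ++arrived_;
        if (arrived_ < expected_) return false;
        arrived_ = 0;              // re-arm for the next barrier
        return true;               // broadcast the single release flit
    }

private:
    int expected_;  // number of slave cores
    int arrived_;   // requests received so far in this barrier epoch
};
```

Keeping this counter in the NIC rather than in software is precisely what frees the master core for other work, at the cost of the extra area noted above.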
TLM phases for read operations can be seen in Figure 2.18. These transactions take
a long time to complete because, once the NIC is notified of a read transaction, it sends
a request-data flit to the appropriate module and has to wait for the information to come;
Algorithm 2.3 Transaction Update pseudo-algorithm implemented by NICs.
1: if Write Pending and Not Read In-progress then
2: Check Request queue.
3: if Transaction Requested is expected then
4: Give Output control to Send Queue
5: end if
6: end if
7: if Send Queue controls Output then
8: Send first flit on Queue
9: if Flit accepted then
10: Mark Flit as accepted.
11: Notify method for later execution to delete Flit.
12: end if
13: if Write is Unicast then
14: Delete transmitted flits in Send Queue.
15: end if
16: if Write is broadcast then
17: if Write is Burst and Burst Completed then
18: Create new Head Flit.
19: Notify method to later start transmission to the next node.
20: Reset Transmission counters
21: end if
22: end if
23: if Write is Burst and Not all data packed then
24: Store next flit at Send Queue
25: end if
26: end if
27: if All data is transmitted then
28: Send phase BEGIN RESP back to Initiator to release transactions.
29: end if
it is only after getting all packets that the processing element is notified of the data
availability and the transaction is concluded. On the other hand, write transactions between
the NIC and the processing element can be finalised faster. A phase diagram for write
transactions can be seen in Figure 2.19.
Figure 2.18: TLM Phases in a NIC Read Operation. CPUs ask for data; NICs send a request to the corresponding module and wait for data to arrive. After all information is received, phase BEGIN_RESP is issued to the CPU to indicate the end of the transaction.
Figure 2.19: TLM Phases in a NIC Write Operation. Processing elements send all data to the NICs and finalize the transaction after transmitting all the information. NICs await a read request and send packets when the corresponding one is received.
2.3.2 Network Interface Hardware Design
The Network Interface design was extrapolated from the SystemC high-level description;
high HDL complexity was found in this module, as it has to implement part of the router's
functionality, solve end-to-end flow control and communicate with the processing element
through the OCP-IP bus model. Several control units were necessary for this design to
support all the features implemented in SystemC and listed in the previous section; because
of space constraints, only a general block diagram of the overall module is shown in Figure
2.20; control signal paths are shown in red and data paths in yellow.
To better understand the figure, a correspondence between the TLM 2.0 model and the
VHDL one is presented in Table 2.5; although the equivalence is not exact, it matches
the main aspects. The functions shown in the table are also described in the previous section.
One of the most complex modules of the NIC was the OCP-Handshaking Control, which
required careful design in order to support transactions and respect their execution
order; from the diagram in Figure 2.20 it can be seen that another control unit (End-to-End
Flow Control) was necessary. State diagrams of both are shown in Figures 2.21 and 2.22.
When a new transaction is started from the processing element, the handshaking control
verifies whether it is possible to initiate it internally; if so, an appropriate header
flit is stored in the corresponding queue and the information is packed (for write transactions).
The end-to-end flow control is notified of the operation in progress and commands the
transmission and reception units to carry on with the transaction: if a read is to be
performed, a request flit is sent to the corresponding module; if a write is requested,
the reception control must report when a request matching the write address is received.
The VHDL implementation of the NIC is left for future work, as it does not constitute a
common test metric in the NoC field.
Table 2.5: VHDL-SystemC equivalence of NIC blocks

VHDL Block                 TLM Methods/Objects                   Function
OCP-Handshaking Control    nb_transport_fw,                      Transfer data from (to)
                           RESPONSE_TransactionUpdate            processing elements
End-to-End Flow Control    Target Payload Event Queue,           Execute transactions tidily
                           CheckIfExpectedTransaction
Tx Control                 nb_fw_router,                         Initiate transactions with
                           SEND_TransactionUpdate                the router
Rx Control                 nb_transport_bw_router,               Receive transactions from
                           StoreAtReceiveBuffer                  the router
FIFO DataIN and Requests   Double-ended queue                    Store read data and
                                                                 incoming requests
Bank DataOUT               Double-ended queue                    Store output data and
                                                                 read requests
Rest of Blocks             BuildHeadFlit, GetHeaderInfo          Set and retrieve head-flit
                                                                 information
Figure 2.20: VHDL Block Diagram for Network Interface Card
Figure 2.21: State Machine for the Handshaking Control
Figure 2.22: State Machine for the End-to-End Flow Control
2.4 Software Performance Results
After validating both the router and NIC TLM models, software applications were pro-
grammed to analyse performance results on the whole NoC. Matrix multiplication was
implemented for its straightforward parallelization; the previous performance graphs could
also be obtained but are not shown due to space constraints.
2.4.1 4× 4 Matrix Multiplication
The first test scenario was a 4 × 4 matrix multiplication split across 16 cores (1 master, 15
slaves), where each one performed a row-column product and returned its result to the
master. MPI directives such as MPI_Send, MPI_Receive and MPI_Broadcast were used for
data sharing between modules.
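The partitioning used above assigns one row-column product per node. A host-side sketch of that split (the MPI-like transport is abstracted away; the mapping of node k to element C[k/4][k%4] is an illustrative assumption consistent with one product per core):

```cpp
const int N = 4;  // 4 x 4 matrices, 16 products, one per node

// The value node k would compute and send back to the master:
// element C[k/N][k%N] of the product C = A * B.
int row_col_product(const int A[N][N], const int B[N][N], int k) {
    int row = k / N, col = k % N, acc = 0;
    for (int i = 0; i < N; ++i)
        acc += A[row][i] * B[i][col];
    return acc;
}
```

In the simulated platform, the master broadcasts A and B, each node evaluates its single product, and 16 one-element results travel back, so network traffic rather than arithmetic dominates the run time.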
Figure 2.23 shows the full Network-on-Chip performance regarding operation time
for the three routing algorithms and several Virtual Channel depths. In concordance with
previous simulations, XY routing had the worst behaviour while West-First had the best;
increasing buffer storage produced an exponential-like decrease in timing measures.
Figure 2.23: NoC Performance for a 4× 4 Matrix Multiplication
2.4.2 8× 8 Matrix Multiplication
In order to increase data transfer, a second test was performed with an 8 × 8 matrix
multiplication. Each core (including the master) computed 4 row-column products and
sent its result back to the master; the platform behaviour is shown in Figure 2.24. Compared
to the first scenario, the decrease is not exponential but linear with both adaptive routing
algorithms, and it does not improve with buffer increases for XY routing.
Because of the way MPI_Broadcast was implemented, that is, if the master core is N,
data is sent first to node N + 1 and so on, cores are expected to finish their operation
in the same order; this was actually verified and is presented in Figures 2.25 and
2.26.
Results obtained from the SystemC Network-on-Chip model demonstrate the advan-
tages of having a high-level abstraction of the hardware platform and the richness of the statistics
Figure 2.24: NoC Performance for a 8× 8 Matrix Multiplication
Figure 2.25: Total simulation time at each node with North Last Routing
Figure 2.26: Total simulation time at each node with West First Routing
extracted from it. Further studies and better tuning of the NoC can lead to concise and
robust design decisions when continuing with the design flow.
Chapter 3
Concluding Remarks
A co-design methodology was successfully validated with the adoption of the new IEEE
standard for high-level modelling, SystemC; it was possible to specify both hardware and
software constraints from the start, and therefore a clearer and more concrete approach was
possible. As in any design, unspecified conditions or unexpected behaviour were encoun-
tered throughout the design process, yet correction and validation were much faster with
the virtual platform constructed.
Contrary to traditional hardware design, a virtual, portable platform of the real hard-
ware is available for quick software development and testing, and it only requires a C++
compiler to run; no specialized or licensed software is needed to start writing applications
for this system, which makes it versatile. If more statistics or performance metrics about
hardware or software behaviour are required, it suffices to add a few lines to the libraries
provided; code that does not make calls to the simulator will not change or modify the
platform statistics.
Ideas borrowed from software theories for distributed programming were extrapolated
to create a hardware platform capable of running MPI-like applications; despite being an
old standard, most of its premises are used nowadays for non-shared-memory architectures
and are still under development. Thanks to this approach, an end-to-end flow-control
technique was proposed to reduce undesired traffic inside the network and to avoid
wasting processing time in end-modules, thereby improving performance.
The traffic simulations carried out showed that, with the proposed NoC structure, 4x4 torus
topologies have close performance metrics with adaptive algorithms as well as with deter-
ministic ones; two turn-model [23] algorithms for mesh topologies were implemented and
adapted to a torus, but no significant difference between them was found under the
traffic patterns implemented. Real software applications are needed to select a suitable
routing algorithm that matches the required performance.
Most design decisions presented in Chapter 1 were taken to cope with the desired func-
tionality but also to reduce hardware complexity (area constraints); because of that, many
control tasks were left for high-level implementation, either at the NIC or at the processor
level. Contrary to the studies shown in [31] and others, which allow multicast-capable routers,
the model developed here can only process unicast transactions at low level (routers) or
broadcast transactions at a higher level (NIC). If multicast support is required, additional
logic can be integrated inside the NICs; router modifications are far more complex but can
also be added.
Implementation statistics of the VHDL router model suggest that it is possible to syn-
thesize more than one router on FPGAs, though they still consume a large area, especially
RAM blocks; reducing the Virtual Channel depth would allow including more routers on the
same chip so that a real, considerably sized NoC can be implemented.
3.1 Significance of the Result
The intent of this work was mainly to develop and validate a high-level model of a NoC to
allow the implementation of parallel algorithms targeted at this platform; although pure C++
code can be used for software development, there are other open-source tools on the market
that can also be integrated with the libraries created here. Among the most common,
Open Virtual Platforms (OVP) from Imperas is a collection of high-level descriptions of
processors and hardware modules that can be easily integrated with TLM models such as
the one created; more detailed, almost cycle-accurate simulations are possible with OVP
if needed.
Even though TLM modelling is becoming a common design practice, to the best of our
knowledge, no TLM 2.0 model of an entire NoC has been proposed; there are approaches
such as Noxim [33] that use SystemC to simulate a customizable mesh network of routers,
but they do not implement the TLM 2.0 standard. Another advantage of this platform is that
it provides both NICs and routers, which makes it an immediate option for embedded
software development.
Versatility and high reconfiguration capability are also important characteristics of
our NoC model. Default timing parameters can be changed with minor effort and can be
set to match real hardware constraints if needed. Apart from that, if the MPI approach
doesn't suit a design's specifications, a new NIC model can be integrated with the router
network, provided it follows the indications shown in Section 2.2.1. A final remarkable aspect
of this work is that more complex statistics, such as average message blocking time, most-used
paths, throughput, etc., can be extracted from the model with minor additions to the source code.
The router developed was designed for torus-topology NoCs and therefore has 5 in-
put/output ports; nonetheless, mesh topologies are straightforward to obtain by modifying
the routing algorithms: setting them to avoid sending data beyond the limits of the mesh, i.e.
impeding the use of turnaround links, will automatically change the network from torus to
mesh. If another topology like those presented in Section 1.2.1 is required, our base router
can always be taken as a starting point.
Finally, another contribution to the state of the art regarding NoC, and specifically NIC,
design is that, through an MPI abstraction, a hardware module was created for end-to-end
flow control; approaches using MPI as a software layer for NoC programming, such as [34],
do not synthesize it into the NIC module but implement it at a high level. Which approach is
better remains undetermined, and benchmarks need to be conducted to gain better insight
into this.
3.2 Future Work
From the beginning of this work, it was stated that when creating complex platforms such
as NoCs, it is necessary to be able to co-develop software applications to validate
their correct behaviour; here only a high-level TLM 2.0 model and a VHDL model of a NoC
were constructed, and software applications integrated with the SystemC description are
still lacking. Although typical traffic and load-balancing studies were applied, only real
software implementations can demonstrate the validity of the results shown.
Software engineering faced the problem of distributed computing years ago and has been
dealing with it for a long time; thanks to that, concepts such as shared or distributed
memory models are now well known and have been addressed through APIs like MPI or OpenMP.
Hardware engineers are starting to raise abstraction levels for embedded design and are
facing the same problems as software ones; this means that a deeper integration
between both branches can lead to better design strategies, such as the ones required for this
kind of platform. This work's objective was to implement MPI support; however, MPI is
a standard created more than 10 years ago, which leads one to think that far better solutions
are currently available but remain unknown to the hardware community.
A router VHDL model was provided along with this work, and a NIC design was mostly described as well; future work should include a hardware implementation of this design to verify its correctness at that level, and integration with 32-bit compatible processors would complete such verification.
References
[1] Salminen et al., “Survey of Network on Chip Proposals,” OCP-IP White Paper, OCP-IP, 2008.
[2] A. Agarwal, C. Iskander and R. Shankar, “Survey of Network on Chip Architectures
and Contributions,” Journal of Engineering, Computing and Architectures. Volume 3,
Issue 1, 2009.
[3] K. Popovici and A. Jerraya, “Virtual Platforms in System On Chip Design,” 47th Design Automation Conference. http://webadmin.dac.com/knowledgecenter/2010/documents/POPOVICI-VP-ABK-FINAL.pdf
[4] International Organization for Standardization, “Information technology - Open Systems Interconnection - Basic Reference Model: The Basic Model,” ISO/IEC 7498-1, Second Edition. http://standards.iso.org/ittf/licence.html
[5] T. Kogel, R. Leupers and H. Meyr, “Integrated System-Level Modeling of Network-On-
Chip enabled Multi-Processor Platforms,” Chapter 4: System Level Design Principles
Springer, 2007
[6] S. Chai, C. Wu, Y. Li and Z. Yang, “A NoC Simulation and Verification Platform
based on SystemC,” 2008 International Conference on Computer Science and Software
Engineering. IEEE Computer Society, 2008, pp. 423–426.
[7] D. Wiklund, S. Sathe and D. Liu, “Network on Chip Simulations for Benchmarking,” Linköping University, Sweden.
[8] P. T. Wolkotte, P. K.F. Holzenspies and G. J.N. Smit, “Fast Accurate and Detailed
NoC Simulations,” Proceedings of the first International Symposium on Networks on
Chip. IEEE Computer Society, 2007
[9] A. Portero, R. Pla and J. Carrabina, “SystemC Implementation of a NoC,” Escuela Técnica Superior de Ingenierías, España.
[10] OCP-IP Organization “Open Core Protocol Specifications,”
http://www.ocpip.org/the complete socket.php
[11] ARM, “AMBA 4 AXI4-lite and AXI4-Stream Protocol Assertions User Guide,”
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.amba/index.html
[12] OpenMP Architecture Review Board, “OpenMP Application Program Interface,” Ver-
sion 3.0, May 2008. http://www.openmp.org/mp-documents/spec30.pdf
[13] Message Passing Interface Forum “MPI: A Message-Passing Interface Standard,” Ver-
sion 2.2, September 2009 http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
[14] A. Adriahantenaina et al., “SPIN: a Scalable, Packet Switched, On-Chip Micro-network,” Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, IEEE Computer Society, 2003.
[15] W-H. Hu, S. E. Lee and N. Bagherzadeh, “DMesh: a Diagonally-Linked
Mesh Network-On-Chip Architecture,” University of California, Irvine.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.4454&rep=rep1&type=pdf
[16] A. Tavakkol, R. Moraveji and H. Sarbazi-Azad, “Mesh Connected Crossbars: A Novel NoC Topology with Scalable Communication Bandwidth,” 2008 International Symposium on Parallel and Distributed Processing with Applications, IEEE Computer Society, 2008, pp. 319 – 326.
[17] A. Zitouni, M. Zid, S. Badrouchi and R. Tourki, “A Generic and Extensible Spidergon
NoC,” World Academy of Science, Engineering and Technology 2007, pp. 14 – 19.
[18] M. Mirza-Aghatabar, S. Koohi, S. Hessabi and M. Pedram, “An Empirical Investi-
gation of Mesh and Torus NoC Topologies Under Different Routing Algorithms and
Traffic Models,” 10th Euromicro Conference on Digital System Design Architectures,
Methods and Tools DSD 2007 IEEE Computer Society, 2007.
[19] K. Goossens, J. Dielissen and A. Radulescu, “AEthereal Network on Chip: Concepts,
Architectures and Implementations,” IEEE Design & Test of Computers IEEE 2005,
pp. 414 – 421
[20] T. Bjerregaard and J. Sparso, “Implementation of Guaranteed Services in the MANGO Clockless Network-on-Chip,” IEE Proceedings - Computers and Digital Techniques, Vol. 153, No. 4, July 2006.
[21] A. Y. Seydim, “Wormhole Routing in Parallel Computers,” Southern Methodist Uni-
versity.
[22] L. M. Ni and P. K. McKinley, “A Survey of Wormhole Routing Techniques in Direct
Networks,” Michigan State University, COMPUTER 1993, pp. 62 – 76.
[23] C. J. Glass and L. M. Ni, “The Turn Model for Adaptive Routing,” 1992 ACM 0-
89791-509-7 1992, pp. 278 – 287.
[24] Z. Xiaohu, C. Yang and W. Liwei, “A Novel Routing Algorithm for Networks On
Chip,” 2007 IEEE 1-4244-1312-5 2007, pp. 1877 – 1879.
[25] E. Behrouzian-Nezhad and A. Khademzadeh, “BIOS: A New Efficient Routing Algo-
rithm for Network On Chip,” Contemporary Engineering Sciences Vol. 2, 2009. No 1.
pp. 37 – 46.
[26] W. Zhang et al, “Comparison Research between XY and Odd-Even Routing Algo-
rithm of a 2-Dimension 3x3 Mesh Topology Network On Chip,” Global Congress on
Intelligent Systems IEEE Computer Society 2009, pp. 329 – 333.
[27] T. Schonwald et al, “Fully Adaptive Fault-Tolerant Routing Algorithm for Network
on Chip Architectures,” 10th Euromicro Conference on Digital System Design Archi-
tectures, Methods and Tools IEEE Computer Society 2007.
[28] Open SystemC Initiative, “The Transaction Level Modelling standard of the Open
SystemC Initiative,” 2007 - 2009 OSCI Group.
[29] Open SystemC Initiative, “OSCI TLM-2.0 Language Reference Manual,” July 2009.
[30] S. V. Tota et al., “MEDEA: A Hybrid Shared-memory/Message-passing Multiprocessor NoC-based Architecture,” Design, Automation and Test in Europe (DATE), 2010.
[31] B. Yin, “Design and Implementation of a Wormhole Router Supporting Multicast for Networks On Chip,” Master of Science Thesis, Stockholm, Sweden, 2005.
[32] U. T. Ogras et al, “Challenges and promising results in NoC Prototyping using FP-
GAs,” IEEE MICRO 2007, IEEE Computer Society September - October 2007, pp. 86
– 95
[33] M. Palesi, D. Patti and F. Fazzino, “NOXIM, the NoC Simulator User Guide,” University of Catania, 2005 - 2010.
[34] J. Joven et al, “xENoC - An eXperimental Network-On-Chip Environment for Par-
allel Distributed Computing on NoC-based MPSoC architectures,” 16th Euromicro
Conference on Parallel, Distributed and Network-Based Processing. 2008, pp. 141 –
148