Facoltà di Ingegneria Corso di Laurea in Ingegneria delle Telecomunicazioni
Fault tolerant Network on Chip (NoC) design
Relatore: Laureando:
Prof. Roberto Passerone Andrea Foradori
Correlatore:
Dr. Marcello Lajolo
Anno Accademico 2008/2009
Acknowledgments
First of all I would like to thank my advisor, Prof. Roberto Passerone, for giving me
the chance to study abroad at an important research laboratory such as NEC Labs. He
has always been available, for example when I needed help submitting the request for a
grant for my stay in the United States.
The same goes for my supervisor at NEC Labs, Dr. Marcello Lajolo. He followed
me during my internship, giving me important advice regarding my thesis work. I am
still in touch with him and I really appreciate this.
I thank my father Roberto and my mother Gina for supporting me through these six
years of university. Together with my "little" brother Danilo, they encouraged me in many
situations during my university career. It is especially thanks to them that I will graduate.
Furthermore, I thank Gianni for giving me material, suggestions and several
opportunities for discussion during the preparation of this thesis, and for helping me
during my first weeks in the USA. I also thank Peilong and Sonny, with whom I lived in the
United States; I remember many fun moments spent together.
Finally, last but not least, I would like to thank all my friends who have stood by
me in these last years. Since I would certainly forget somebody if I started naming people, I
simply thank all of them, but in particular Elena and Francesco, who helped me during
the thesis review and have always been available whenever needed. Moreover, Elena
motivated me every time I was in difficulty.
Contents
Acknowledgments
List of Figures
List of Tables
Abstract
Introduction
Glossary
Chapter 1: The Network on Chip
   1.1 NoC vs. BUS
   1.2 NoC basic concepts overview
      1.2.1 Transport layer
      1.2.2 Network layer
      1.2.3 Link and Physical layer
   1.3 Research Activities
   1.4 NoC design flow
Chapter 2: NEC NoC
   2.1 Topology and structure
   2.2 NoC components
      2.2.1 AMBA AXI Network Interfaces (NIs)
      2.2.2 Router
   2.3 NI Message encoding and Routing Algorithm
   2.4 Header and payload structures
   2.5 The backpressure protocol
   2.6 Pros and cons
Chapter 3: NoC Router redesign
   3.1 Previous Router
   3.2 The Router redesigning
Chapter 4: Fault Tolerant NoC
   4.1 Fault tolerance and Network on Chip
   4.2 How to make NoCs reliable
   4.3 Redundancy in the NEC NoC
Chapter 5: Case study: a 5x2 tile NoC with 2 AXI Masters and 5 AXI Slaves
   5.1 Experimental platform
   5.2 The NI Sender queue selection policy
   5.3 Latency results
   5.4 3-master case
   5.5 Summary
Chapter 6: Conclusions and future work
Bibliography
Appendix 1
Appendix 2
List of Figures
Figure 1.0.1 : Evolution of the cores number in a single chip
Figure 1.1.1 : Examples of communication structures in Systems-on-Chip. a) Traditional bus-based, b) dedicated point-to-point links, c) chip area network
Figure 1.2.1 : 4x4 grid NoC structure
Figure 1.2.2 : Layered research approach, TCP/IP stack vs. NoC stack
Figure 1.2.3 : The Network Adapter
Figure 1.2.4 : The Network Interface hides the protocol communication to each IP core
Figure 1.2.5 : Typical regular network topologies
Figure 1.2.6 : Irregular network topologies
Figure 1.2.7 : ST OctagonTM and ST SpidergonTM topology
Figure 1.2.8 : SpidergonTM topology layout
Figure 1.2.9 : Direct and indirect network
Figure 1.2.10 : Generic Router model
Figure 1.2.11 : Units of resource allocation
Figure 1.2.12 : The concept of Virtual Channel (VC)
Figure 1.2.13 : Wormhole routing deadlock example
Figure 1.2.14 : Channel dependencies graph method
Figure 1.2.15 : VCs Router model
Figure 1.3.1 : Current NoC state of art
Figure 1.3.2 : TeraFLOPS vs. ASCI Red (Source: Maurizio Palesi, Catania University, IT)
Figure 1.4.1 : NoC design flow
Figure 2.1.1 : Tile-based NoC architecture (*)
Figure 2.1.2 : From concentrated to distributed routers architecture
Figure 2.1.3 : Input/output routers directions
Figure 2.1.4 : Internal tile signals (*)
Figure 2.1.5 : Block diagram of the network architecture (*)
Figure 2.2.1 : NI initiator block diagram (*)
Figure 2.2.2 : NI target block diagram (*)
Figure 2.2.3 : NI Sender architecture (*)
Figure 2.2.4 : FIFO architecture (*)
Figure 2.2.5 : Data flow direction
Figure 2.2.6 : Router architecture (*)
Figure 2.3.1 : Multi-flit NoC packet format (*)
Figure 2.3.2 : Flit type encoding (*)
Figure 2.3.3 : Supported and unsupported routing
Figure 2.4.1 : AXI header structure (request phase) - (*)
Figure 2.4.2 : NoC header structure (request phase) - (*)
Figure 2.4.3 : NoC header structure (response phase) - (*)
Figure 2.5.1 : Backpressure action (*)
Figure 2.6.1 : Generic Input-queuing router
Figure 2.6.2 : Generic concentrated Virtual Output Queuing Router
Figure 2.6.3 : VOQs in the NEC NoC architecture
Figure 2.6.4 : Used routers to reach destination: (a) standard tile-based topology, (b) NEC NoC one
Figure 2.6.5 : The connection of the Sender/Receiver with the Routers in the NEC NoC
Figure 2.6.6 : Wiring options
Figure 3.2.1 : Modules hierarchy
Figure 3.2.2 : Architecture of Router
Figure 3.2.3 : RFSM states diagram
Figure 4.1.1 : Semiconductor Failure Rate (courtesy of M. Lajolo, NEC LA Inc.)
Figure 4.1.2 : CMOS technology scaling
Figure 4.1.3 : Failures influence the yield
Figure 4.1.4 : Immersion lithography
Figure 4.1.5 : VLSI design variations (courtesy of M. Lajolo, NEC LA Inc.)
Figure 4.1.6 : Yield loss (courtesy of M. Lajolo, NEC LA Inc.)
Figure 4.1.7 : Razor Double Data Sampling Technique
Figure 4.2.1 : Experimental results - Source [10]
Figure 4.2.3 : Enable Dynamic Fault Discovery & Repartitioning (Source Intel [12])
Figure 4.2.2 : Intel Larrabee (Source: [31])
Figure 4.3.1 : Self-repair NoC (courtesy of NEC Labs)
Figure 4.3.1 : Architecture evolution of NEC media chips (courtesy of M. Lajolo, NEC LA Inc.)
Figure 5.1.1 : Experimental platform (base or standard configuration)
Figure 5.1.2 : Connection between M1 and its Master Customizer in tile1
Figure 5.1.3 : End to end average packets latency from M1 @ prgn traffic
Figure 5.1.4 : End to end average packets latency from M2 @ prgn traffic
Figure 5.1.5 : Top level with the tiles and the latency_counter module
Figure 5.1.6 : Example of end-2-end latency plot generated with GNU Plot
Figure 5.1.7 : Average latencies comparison @ different injected data rate
Figure 5.1.8 : Xilinx schematic file of tile1
Figure 5.1.9 : General monitor setting window
Figure 5.1.10 : Queue utilization and backpressure graphs of the router T1_R4 (channel 0)
Figure 5.1.11 : Standard with alternative path #1
Figure 5.1.12 : Standard with alternative path #2
Figure 5.1.13 : Standard with alternative path #1 and #2
Figure 5.2.1 : Queue selector FSM
Figure 5.2.2 : Block diagram of the algorithm implementation: at behavioral level (a) and in detail (b)
Figure 5.3.1 : Project sub-folders
Figure 5.3.2 : Sequence of actions in order to obtain the experimental results
Figure 5.3.3 : Simulation parameters
Figure 5.3.4 : Average end-to-end latency (from Master 1) measured in clock cycles
Figure 5.3.5 : Average end-to-end latency (from Master 2) measured in clock cycles
Figure 5.3.6 : The best NoC configuration in terms of average latency improvement
Figure 5.3.7 : Percentage of latency improvement
Figure 5.4.1 : Experimental platform with one more Master, placed in Tile 3
Figure 5.4.2 : Standard vs. best configuration average latency (a), Latency improvement adopting the best configuration (b)
List of Tables
Table 1.1.1 : BUS vs. NoC: analysis of the advantages/drawbacks
Table 2.2.1 : FIFO flags setting conditions (*)
Table 3.1.1 : Synthesis results of the previous router
Table 3.2.1 : Synthesis results of the new router redesigned
Table 5.1.1 : Comparison of the resources utilization in the several platform configurations
Table 5.4.1 : Latency of NoC components in absence of backpressure
Table 5.4.2 : Possible configurations of the experimental platform
Table 5.4.3 : Average latency value for each configuration of 3 Master / 5 Slaves platform
Table 5.5.1 : Summary of the latency/area results (standard and best configuration)
Abstract
On-chip communication architectures are known to have a significant impact on system
performance, power dissipation and time-to-market. Therefore system designers, as
well as the research community, have focused on the issue of exploring, evaluating, and
designing communication architectures to meet the targeted design goals. The
emergence of multi-core architectures and heterogeneous multiprocessor Systems-on-
Chip (MPSoCs) further underscores the importance and the criticality of a suitable on-
chip communication architecture. Such an architecture must handle the ever-increasing
volume of on-chip communication traffic and operate under severe performance constraints
with limited energy and thermal budgets. On the other hand, aggressive scaling of VLSI
technology has resulted in nanoscale effects that adversely affect interconnect
performance, reliability, power dissipation, and predictability. Thus new approaches to
on-chip communication architectures need to be devised in order to overcome these
effects.
The employment of Networks on Chip (NoCs) can cope with the issues mentioned
above. NoC designs consist of a number of interconnected heterogeneous devices (e.g.,
general or special purpose processors, embedded memories, application specific
components, mixed-signal I/O cores) where communication is achieved by sending
packets over a scalable interconnection network. Many models, techniques and tools
widely used in the macro-network design field can be applied to SoC design. This
means that a NoC can be developed in order to satisfy quality-of-service requirements
such as reliability, performance, and energy bounds.
Process variability and the ceaseless scaling of CMOS technology are the main causes of
transient and permanent failures. The consequence is a lower yield due to
unexpected power consumption and performance. Scope for optimization is limited by the
architecture and hardware structure, thus device-level solutions cannot completely
solve this problem. New design models, able to tolerate failures by operating at a higher
abstraction level, are necessary. Fault Tolerant NoCs are a possible solution to the
problems mentioned above. They can cope with malfunctions by supporting multipath
communication and network reconfiguration.
This thesis explores the crucial factors that lead to faults after the manufacturing of a
chip. Moreover, it analyzes the possibility of handling these defects using Fault Tolerant
NoCs. Since most of the work involved in this thesis was done during a study exchange
program at NEC Laboratories America (Princeton, USA) in the System Architecture
Department, the experimental results and case studies refer to the NEC NoC
architecture. In particular Chapter 5, where a case study is introduced, assumes a good
familiarity with the NEC Network-on-Chip. For this reason this architecture is
described in Chapter 2.
The rest of the thesis is organized as follows: Chapter 1 provides a broad overview of
NoC concepts, existing research projects, the state of the art and the basic principles of
on-chip communication. In Chapter 3, the router architecture designed as part of my thesis work
is presented. The architecture is compared in terms of area and performance with
respect to the implementation that was available at the beginning of my thesis. Chapter
4 explores the concept of Fault Tolerance, analyzing the factors that induce defects. An
overview of the possible NoC solutions to prevent or handle faults is provided.
At the end of Chapter 4, a possible solution for the design of a
reliable NEC NoC in the presence of post-manufacturing faults is then introduced. Furthermore, Chapter 5
explores a case study, where the NoC employs many of the solutions presented in the
previous chapter. Implementation details and performance results are explored.
Concluding remarks and some thoughts about possible future work are given in
Chapter 6.
During my experience at NEC Labs I participated in more than one of the ongoing
activities. At first I contributed to the design of an asynchronous router for a NoC based
on the GALS (Globally Asynchronous Locally Synchronous) approach. Afterwards I
developed a new version of the synchronous router, obtaining the synthesis results
presented in Chapter 3. I contributed to the realization of a simulation platform for
training purposes and, finally, I implemented and analyzed a combination of
configurations involving spatial redundancy (multipath communication) obtaining the
experimental results that are presented in Chapter 5.
Introduction
On-chip communication architectures have a significant impact on system performance,
power dissipation and time-to-market; moreover, system designers, as well as the
research community, are focused on the problems of exploring, evaluating and designing
communication architectures that are able to meet the design goals. The emergence of
multi-core architectures and of heterogeneous multiprocessor Systems-on-Chip (MPSoCs)
further underscores the importance and the criticality of suitable on-chip communication
architectures. They must be able to handle the rapid increase of the overall volume of
on-chip communication traffic and to operate under strict performance constraints with
limited energy and thermal budgets. On the other hand, the continuous progress of VLSI
technologies leads to ever smaller transistor dimensions. This implies an increase of
undesired effects (nanoscale effects), which adversely affect interconnect performance,
reliability, power dissipation and predictability. Consequently, in order to cope with
these effects, new approaches to on-chip communication architectures are needed.
The employment of Networks on Chip (NoCs) can address the problems mentioned above.
A NoC design consists of a number of heterogeneous devices interconnected with each
other (for example: general/special purpose processors, embedded memories,
application-specific components, mixed-signal I/O cores), where communication is
achieved by sending packets over a scalable interconnection network. Many models,
techniques and tools widely used for the realization of macro-scale communication
networks can be applied to the design of SoCs. This means that a Network on Chip can
be developed to satisfy Quality-of-Service requirements such as reliability, performance
and energy bounds.
The phenomenon known as variability and the continuous technological progress (CMOS
scaling) are the main causes of transient and permanent faults. All this translates into
a lower manufacturing yield caused by unexpected power consumption and performance.
The possible improvements of the yield are limited by the architectural and hardware
structures; therefore, solutions at the device level alone are not able to completely
solve these problems. New design models, able to tolerate faults by operating at a
higher abstraction level, are necessary. Fault Tolerant NoCs are a possible solution to
the problems mentioned above. They can cope with malfunctions by providing multipath
communication (redundancy) and network reconfiguration.
This thesis explores the crucial factors that lead to the appearance of faults after the
manufacturing of the chip and analyzes the possibility of handling these defects using
Fault Tolerant NoCs. Since the thesis work was carried out, during a study exchange
program abroad, at NEC Laboratories America Inc. (Princeton, USA) in the System
Architecture Department, the experimental results and case studies refer to the NEC
NoC architecture. In particular Chapter 5, in which a case study is examined, assumes
familiarity with the NEC NoC. For this reason Chapter 2 is entirely dedicated to its
description.
The remaining part of the thesis is organized as follows. Chapter 1 provides a broad
overview of NoCs: basic concepts, existing research projects, the state of the art and
basic communication principles are covered. Chapter 3 presents a new version of the
Router for the NEC NoC; the previous version is compared with the new one in terms of
area and performance. Chapter 4 explores the concept of Fault Tolerance, analyzing the
factors that produce the defects that appear after fabrication (post-manufacturing).
The possible solutions offered by the employment of NoCs are examined. At the end of
Chapter 4 a possible solution to make the NEC NoC reliable in the presence of faults is
presented. Chapter 5 analyzes a case study, where the NoC includes the ideas proposed
in the previous chapter, showing implementation details and experimental results.
Finally, in Chapter 6 some remarks on the work done are made and some possible future
developments are considered.
During my experience at NEC Labs I had the chance to take part in more than one of the
projects active in that period. Initially I contributed to the design of an asynchronous
router for a NoC based on a GALS (Globally Asynchronous Locally Synchronous) approach.
Afterwards I developed a new version of the synchronous router, obtaining the synthesis
results shown in Chapter 3. I contributed to the preparation of a simulation platform
used by the group for training purposes and, finally, I implemented and analyzed a
simple case of spatial redundancy (multipath communication), obtaining the experimental
results presented in Chapter 5.
Glossary
AMBA AXI: Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface (AXI) is the 3rd generation of the ARM AMBA bus protocol. The AMBA AXI protocol is targeted at high-performance, high-frequency system designs and includes a number of features that make it suitable for a high-speed submicron interconnect.

Asynchronous circuit: An asynchronous circuit is a circuit in which the parts are largely autonomous. They are not governed by a clock circuit or global clock signal, but instead need only wait for the signals that indicate completion of instructions and operations. These signals are specified by simple data transfer protocols. This digital logic design style is contrasted with a synchronous circuit, which operates according to clock timing signals. Asynchronous circuits have many benefits; we underline one of them, which is particularly relevant to this thesis: immunity to transistor-to-transistor variability in the manufacturing process, one of the most serious problems facing the semiconductor industry as dies shrink. Asynchronous circuits also have disadvantages. In particular, they require people experienced in synchronous design to learn a new style. Furthermore, performance analysis of asynchronous circuits may be challenging.

AWK: AWK is a language for processing text files. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed. AWK was created at Bell Labs in the 1970s. The name AWK is derived from the family names of its authors: Alfred Aho, Peter Weinberger and Brian Kernighan.

BDL: Behavioral Design Language is a language based on the C language with extensions for hardware description, developed to describe hardware at levels ranging from the algorithm level to the functional level.

Burn-in: Burn-in is the process by which components of a system are exercised prior to being placed in service (and often, prior to the system being completely assembled from those components). The intention is to detect those particular components that would fail as a result of the initial, high-failure-rate portion of the bathtub curve of component reliability. If the burn-in period is made sufficiently long (and, perhaps, artificially stressful), the system can then be trusted to be mostly free of further early failures once the burn-in process is complete. A precondition for a successful burn-in is a bathtub-like failure rate, that is, there are noticeable early failures with a decreasing failure rate following that period. By stressing all devices for a certain burn-in time, the devices with the highest failure rate fail first and can be taken out of the cohort. The devices that survive the stress have a later position on the bathtub curve (with an appropriately lower ongoing failure rate). Thus, by applying a burn-in, early in-use system failures can be avoided at the expense (tradeoff) of a reduced yield caused by the burn-in process. For electronic components, burn-in is frequently conducted at elevated temperature and perhaps elevated voltage. This process may also be called heat soaking. The components may be under continuous test or simply tested at the end of the burn-in period.

BUS: In computer architecture, a bus is a subsystem that transfers data between on-chip components inside a chip, between computer components inside a computer or between computers.

CRC code: A cyclic redundancy check (CRC) is a non-secure hash function designed to detect accidental changes to raw computer data, and is commonly used in digital networks and storage devices such as hard disk drives. A CRC-enabled device calculates a short, fixed-length binary sequence, known as the CRC code or just CRC, for each block of data and sends or stores them both together. When a block is read or received, the device repeats the calculation; if the new CRC does not match the one calculated earlier, then the block contains a data error and the device may take corrective action such as rereading or requesting that the block be sent again.

CWB: NEC Cyber Work Bench is a behavioral synthesis system that can be used to generate a hardware implementation for a system. It takes a behavioral description in Behavior Description Language (BDL) or SystemC as input. Then, it generates the RTL description for this input.

Dopant: A dopant, also called doping agent and dope, is an impurity element added to a crystal lattice in low concentrations in order to alter the optical/electrical properties of the crystal. The addition of a dopant to a semiconductor, known as doping, has the effect of shifting the Fermi level within the material. This results in a material with predominantly negative (n type) or positive (p type) charge carriers depending on the dopant species. Pure semiconductors altered by the presence of dopants are known as extrinsic semiconductors (cf. intrinsic semiconductor). Dopants are introduced into semiconductors by a variety of techniques: solid sources, gases, spin-on liquid and ion implanting.
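As a concrete illustration of the CRC entry above, the following minimal C sketch computes an 8-bit CRC over a data block and re-checks it on reception; the generator polynomial (0x07), the function name crc8 and the example data are illustrative assumptions, not taken from this thesis or from any specific NoC implementation.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Bitwise CRC-8 with an illustrative generator polynomial (x^8 + x^2 + x + 1 -> 0x07). */
    static uint8_t crc8(const uint8_t *data, size_t len)
    {
        uint8_t crc = 0x00;                 /* initial remainder */
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];                 /* bring the next byte into the remainder */
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07) : (uint8_t)(crc << 1);
        }
        return crc;
    }

    int main(void)
    {
        uint8_t block[4] = { 0xDE, 0xAD, 0xBE, 0xEF };
        uint8_t sent_crc = crc8(block, sizeof block);   /* the sender appends this to the block */

        block[1] ^= 0x01;                               /* simulate a single-bit transmission error */
        if (crc8(block, sizeof block) != sent_crc)
            printf("CRC mismatch: block corrupted, request retransmission\n");
        return 0;
    }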
DSM: Deep Submicron VLSI technology.

DVS: Dynamic voltage scaling is a power management technique in computer architecture, where the voltage used in a component is increased or decreased, depending upon circumstances. Dynamic voltage scaling to increase voltage is known as overvolting; dynamic voltage scaling to decrease voltage is known as undervolting. Undervolting is done in order to conserve power, particularly in laptops and other mobile devices, where energy comes from a battery and thus is limited. Overvolting is done in order to increase computer performance.

FIFO: FIFO is an acronym for First In, First Out, an abstraction in ways of organizing and manipulating data relative to time and prioritization. This expression describes the principle of a queue processing technique, servicing conflicting demands by ordering processes with first-come, first-served (FCFS) behaviour: what comes in first is handled first, what comes in next waits until the first is finished, and so on.

FSM: A finite state machine (FSM), or simply a state machine, is a model of behavior composed of a finite number of states, transitions between those states, and actions. It is similar to a "flow graph" in which we can inspect the way the logic runs when certain conditions are met. A finite state machine is an abstract model of a machine with a primitive internal memory.

Hamming code: A Hamming code is a linear error-correcting code named after its inventor, Richard Hamming. Hamming codes can detect up to two simultaneous bit errors, and correct single-bit errors; thus, reliable communication is possible when the Hamming distance between the transmitted and received bit patterns is less than or equal to one. By contrast, the simple parity code cannot correct errors, and can only detect an odd number of errors.

High Level Synthesis or Behavioral synthesis: With a goal of increasing designer productivity, research efforts on the synthesis of circuits specified at the behavioral level have recently led to the emergence of commercial solutions, which are used for complex ASIC and FPGA design. These tools automatically synthesize circuits specified at C level into a register transfer level (RTL) specification, which can be used as input to a gate-level logic synthesis flow. Today, High Level Synthesis, also known as ESL synthesis and behavioral synthesis, essentially refers to circuit synthesis from high-level languages like ANSI C/C++ or SystemC, whereas Logic Synthesis refers to synthesis from a structural or functional description in RTL.

Latency: Latency is a measure of the time delay experienced in a system, the precise definition of which depends on the system and the time being measured. In digital electronics, the latency of a system is measured as the number of clock cycles necessary to perform the system operation.

Lithography: The process used to transfer a pattern from the mask (the reticle used in lithography to block resist exposure to the irradiation in selected areas) to the layer of resist (a material sensitive to irradiation, i.e. it changes its chemical properties when irradiated; in the form of a thin film it is used as a pattern transfer layer in lithographic processes in semiconductor manufacturing) deposited on the surface of the wafer. The kind of lithography depends on the wavelength of the radiation used to expose the resist: photolithography (or optical lithography) uses UV radiation, X-ray lithography uses X-rays, e-beam lithography uses an electron beam, ion beam lithography uses an ion beam.

Many-core: A many-core processor is a processing system in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient (this threshold is somewhere in the range of several tens of cores), and it likely requires a network on chip (NoC).

Metastability: Metastability in electronics is the ability of an unstable equilibrium electronic state to persist for an indefinite period in a digital system. Usually the term is used to describe a state that does not settle into a stable '0' or '1' logic level within the time required for proper operation. This can cause the circuit to go into an undefined state and act in unpredictable ways, so it is considered a failure mode in a digital circuit. Metastable states are believed to be inherent features of asynchronous digital systems and of systems with more than one clock domain, but careful design can often make the probability of a system failing very small indeed. Metastable states do not occur in fully synchronous systems when the set-up time specifications on logic gates are satisfied.

MPSoC: The multiprocessor System-on-Chip (MPSoC) is a system-on-a-chip (SoC) which uses multiple processors (see multi-core), usually targeted at embedded applications. It is used by platforms that contain multiple, usually heterogeneous, processing elements with specific functionalities reflecting the needs of the expected application domain, a memory hierarchy (often using scratchpad RAM and DMA) and I/O components. All these components are linked to each other by an on-chip interconnect. These architectures meet the performance needs of multimedia applications, telecommunication architectures, network security and other application domains while limiting the power consumption through the use of specialised processing elements and architectures.

Multi-core: A multi-core processor is a processing system composed of two or more independent cores. The cores are typically integrated onto a single integrated circuit die (known as a chip multiprocessor or CMP), or they may be integrated onto multiple dies in a single chip package.

PDA: A personal digital assistant (PDA) is a handheld computer, also known as a palmtop computer. Newer PDAs commonly have color screens and audio capabilities, enabling them to be used as mobile phones (smartphones), web browsers or portable media players.

Parity bit: A parity bit is a bit that is added to ensure that the number of bits with the value one in a set of bits is even or odd. Parity bits are used as the simplest form of error-detecting code.
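As a small companion to the Parity bit and Hamming code entries above, the C sketch below (with hypothetical names) computes an even parity bit for a data word and shows how a single flipped bit is detected on the receiving side; as the entries note, a double-bit error would pass unnoticed.

    #include <stdint.h>
    #include <stdio.h>

    /* Even parity: returns 1 if the number of 1-bits in `word` is odd,
     * so that appending the returned bit makes the total count even. */
    static unsigned even_parity(uint32_t word)
    {
        unsigned p = 0;
        while (word) {
            p ^= (word & 1u);
            word >>= 1;
        }
        return p;
    }

    int main(void)
    {
        uint32_t data = 0x5A5A5A5Au;
        unsigned sent_parity = even_parity(data);  /* transmitted alongside the data */

        data ^= (1u << 7);                         /* simulate a single-bit fault on the link */
        if (even_parity(data) != sent_parity)
            printf("parity error detected (single-bit fault)\n");
        return 0;
    }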
Pipeline: In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

RTL: In integrated circuit design, a register transfer level (RTL) description is a way of describing the operation of a synchronous digital circuit. In RTL design, a circuit's behavior is defined in terms of the flow of signals (or transfer of data) between hardware registers, and the logical operations performed on those signals. Register transfer level abstraction is used in hardware description languages (HDLs) like Verilog and VHDL to create high-level representations of a circuit, from which lower-level representations and ultimately actual wiring can be derived. RTL is used in the logic design phase of the integrated circuit design cycle. An RTL description is usually converted to a gate-level description of the circuit by a logic synthesis tool. The synthesis results are then used by placement and routing tools to create a physical layout. Logic simulation tools may use a design's RTL description to verify its correctness.

Synchronous circuit: A synchronous circuit is a digital circuit in which the parts are synchronized by a clock signal. In an ideal synchronous circuit, every change in the logical levels of its storage components is simultaneous. These transitions follow the level change of a special signal called the clock. Ideally, the input to each storage element has reached its final value before the next clock edge occurs, so the behaviour of the whole circuit can be predicted exactly. Practically, some delay is required for each logical operation, resulting in a maximum speed at which each synchronous system can run.

Logic Synthesis: Logic synthesis is a process by which an abstract form of desired circuit behavior (typically at register transfer level (RTL)) is turned into a design implementation in terms of logic gates. Common examples of this process include synthesis of HDLs, including VHDL and Verilog. Some tools can generate bitstreams for programmable logic devices such as PALs or FPGAs, while others target the creation of ASICs. Logic synthesis is one aspect of electronic design automation.

SoC: System-on-a-chip or system on chip (SoC or SOC) refers to integrating all components of a computer or other electronic system into a single integrated circuit (chip). It may contain digital, analog, mixed-signal, and often radio-frequency functions, all on one chip. A typical application is in the area of embedded systems.

Tile: The elementary module of a NoC. It contains the IP core and the module called Network Interface, which splits the data into packets in order to send them over the interconnection network.

VLSI: Very-large-scale integration (VLSI) is the process of creating integrated circuits by combining thousands of transistor-based circuits into a single chip. VLSI began in the 1970s when complex semiconductor and communication technologies were being developed. The microprocessor is a VLSI device. The term is no longer as common as it once was, as chips have increased in complexity to billions of transistors.

Yield: In the semiconductor industry, synonymous with "manufacturing yield", i.e. a number defining the percentage of operational devices out of all devices manufactured.
Chapter 1: The Network on Chip
The design of a chip is based on four distinct aspects: computation, memory,
communication and I/O. The increase in processing power and the emergence of
data-intensive applications have drawn major attention to the challenge of the
communication aspect in single-chip systems (SoCs). This chapter gives an overview
of an important concept for communication in SoCs, which is known as Network
on Chip (NoC). The NoC does not constitute an explicitly new alternative for intra-chip
communication, but rather a unification of on-chip communication solutions. The most
important driving factors behind the development of a global communication
solution are the continuous increase of the on-chip resource density and the need to
use these resources with minimum effort. The preferred solution is to take
advantage of economies of scale in system design, dividing the processing resources
into smaller pieces and reusing them as much as possible inside the overall design.
With this strategy it is possible to obtain shorter design cycles, because the
global chip development can be divided into independent sub-problems.
Figure 1.0.1 : Evolution of the cores number in a single chip
Nowadays the number of cores on a single chip is increasing quickly (see Figure
1.0.1) and the inter-core communication is becoming the bottleneck in many multi-
core platforms. For this reason the design focus is shifting from a traditional
processing-centric view to a communication-centric one.
NoC interconnection models provide a standard global communication scheme,
which enables a brick-like, plug-and-play design style, allowing good use of the
available resources and fast product design cycles.
1.1 NoC vs. BUS
Figure 1.1.1 shows some examples of basic communication structures in a sample
SoC, for example a PDA. Since the introduction of the SoC concept in the 90s, the
solutions for SoC communication structures have generally been characterized by
custom-designed ad hoc mixes of buses and point-to-point links. The bus builds on
well understood concepts and is easy to model. In a highly interconnected multi-core
system, however, it can quickly become a communication bottleneck. In fact, it is not
ultimately scalable, since as more units are added to it the capacitive load grows,
leading to higher power usage per communication event.
Figure 1.1.1 : Examples of communication structures in Systems-on-Chip. a) Traditional bus-based, b)
dedicated point-to-point links, c) chip area network.
For multi-master busses, the problem of arbitration is also not trivial. Table 1.1.1
summarizes the pros and cons of buses and networks. A crossbar overcomes some of
the limitations of the buses. However, it is not ultimately scalable and, as such, it is
an intermediate solution. Dedicated point-to-point links are optimal in terms of
bandwidth availability, latency, and power usage as they are designed especially for
this given purpose. Also, they are simple to design and verify and easy to model. But
the number of links needed grows quadratically with the number of cores.
For this reason it can lead to area and possibly routing problems.
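For concreteness, assuming full point-to-point connectivity among N cores, the number of dedicated links is

    links(N) = N * (N - 1) / 2

so 20 cores already require 190 links and 50 cores require 1225: this is the quadratic growth referred to above.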
Bus Pros & Cons vs. Network Pros & Cons

Bus (-): Every unit attached adds parasitic capacitance; therefore electrical performance degrades with growth.
Network (+): Only point-to-point one-way wires are used, for all network sizes, thus local performance is not degraded when scaling.

Bus (-): Bus timing is difficult in a deep submicron process.
Network (+): Network wires can be pipelined because links are point-to-point.

Bus (-): Bus arbitration can become a bottleneck. The arbitration delay grows with the number of masters.
Network (+): Routing decisions are distributed, if the network protocol is non-centric.

Bus (-): The bus arbiter is a specific instance.
Network (+): The same router may be re-instantiated for all network sizes.

Bus (-): Bandwidth is limited and shared by all units attached.
Network (+): Aggregate bandwidth scales with the network size.

Bus (+): Bus latency is wire-speed once the arbiter has granted control.
Network (-): Internal network contention may cause high latencies.

Bus (+): Any bus is almost directly compatible with most available IPs, including software running on CPUs.
Network (-): Bus-oriented IPs need smart wrappers. Software needs clean synchronization in multiprocessor systems.

Bus (+): The concepts are simple and well understood.
Network (-): System designers need re-education for new concepts.
Table 1.1.1 : BUS vs. NoC: analysis of the advantages/drawbacks
From the point of view of design-effort, one may argue that, in small systems of less
than 20 cores, an ad hoc communication structure is viable. But, as the systems grow
and the design cycle time requirements decrease, the need for more generalized
solutions becomes pressing. For maximum flexibility and scalability, it is generally
accepted that a move towards a shared, segmented global communication structure is
needed. This notion translates into a data routing network consisting of
communication links and routing nodes that are implemented on the chip [1] [2]. In
contrast to the traditional SoC communication methods outlined previously, such a
distributed communication medium scales well with chip size and complexity.
Additional advantages include increased aggregate performance by exploiting
parallel operation.
1.2 NoC basic concepts overview
Figure 1.2.1 shows a sample NoC structured as a 4x4 grid, which provides global
chip level communication.
Figure 1.2.1 : 4x4 grid NoC structure
Traditional parallel computers typically have homogeneous architectures but, in
general, SoCs do not necessarily exhibit such a regular architecture. NoC-based
systems must accommodate a very high degree of variety in composition and in traffic
diversity, in order to take into account the actual system composition in terms of
homogeneity and granularity.
The three fundamental blocks of a Network-on-Chip are:
Network Interfaces (NI): They implement the interface by which every
single IP core connects to the NoC. Their function is to decouple computation
(the cores) from communication (the network).
Routing Nodes: They route data according to the chosen NoC protocol and
implement the routing strategy.
Links: Connect the nodes, providing the raw bandwidth. They may consist of
one or more logical or physical channels.
On-chip communication reuses classical networking paradigms with some specific
modifications [1] [2]. By applying classical communication standards, the NoC
community can build on previously designed mechanisms; however, there is a strong
need to design new protocols and algorithms for on-chip communication that are
reliable, consume low power and are extremely fast.
In order to understand the research work done today in relation to NoC architectures,
it is convenient to partition the fields of NoC research into four areas: 1) system and
application, 2) network interface, 3) router and 4) physical link. Figure 1.2.2 shows
the relation between these research areas; the NoC stack, with the corresponding
components, and the TCP/IP layers are compared.
Figure 1.2.2 : Layered research approach, TCP/IP stack vs. NoC stack
1.2.1 Transport layer
In macro-scale networks, the Transport layer provides transparent transfer of data
between end users, thus relieving the upper layers from any concern while providing
reliable and cost-effective data transfer. It controls the reliability of a given link
through flow control, segmentation/de-segmentation, and error control. Some
protocols are stateful and connection-oriented. This means that the transport layer can
keep track of the packets and retransmit those that fail. The best known is the
Transmission Control Protocol (TCP).
From the NoC point of view, the Network Interface (NI) is the main component at the
Transport Layer. It interfaces the core to the network and makes communication
services transparently available with a minimum of effort from the core. It handles
the end-to-end flow control, through encapsulation of the messages generated by the
IP core. These data are broken into packets, which may or may not have information
about their destination. In the latter case there must be a path setup phase prior to the
actual packet transmission.
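As a rough illustration of the encapsulation performed by the NI, the following C sketch breaks a message generated by an IP core into packets that each carry their destination; the packet fields, sizes and the function name packetize are illustrative assumptions and do not reproduce the packet format of the NEC NoC (described in Chapter 2).

    #include <stdint.h>
    #include <string.h>

    /* Illustrative packet format: destination, source, payload length and payload. */
    struct packet {
        uint8_t dst;
        uint8_t src;
        uint8_t len;                /* number of valid payload bytes (<= 8) */
        uint8_t payload[8];
    };

    /* Break a message from the IP core into packets; returns the number of packets
     * written to `out`, or -1 if `out` is too small. */
    static int packetize(uint8_t src, uint8_t dst,
                         const uint8_t *msg, int msg_len,
                         struct packet *out, int max_pkts)
    {
        int n = 0;
        for (int off = 0; off < msg_len; off += 8) {
            if (n >= max_pkts)
                return -1;
            int chunk = (msg_len - off < 8) ? (msg_len - off) : 8;
            out[n].dst = dst;       /* destination info carried in every packet */
            out[n].src = src;
            out[n].len = (uint8_t)chunk;
            memcpy(out[n].payload, msg + off, (size_t)chunk);
            n++;
        }
        return n;
    }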
Figure 1.2.3 : The Network Adapter
The NI is also known as the Network Adapter (NA). In this case the term Network
Interface takes on a different meaning, and it is often used to denote only a part of the NA.
In particular, Figure 1.2.3 shows the NA structure; the component exposes a Core
Interface (CI) to the core and a Network Interface (NI) to the network side. The NA
decouples the core from the network, enabling the implementation of a layered
system design approach. Typically the CI of the Network Adapter is implemented to
adhere to a SoC socket standard1. The CI of the Network Adapter allows, in principle,
any IP core compliant with the given socket to be attached to the network. Furthermore,
IP cores attached to the network through different sockets can communicate with each
other without noticing this difference. Figure 1.2.4 shows an example of this.
Figure 1.2.4 : The Network Interface hides the protocol communication to each IP core
The Network Adapter performs encapsulation of the traffic for the underlying
communication medium. The basic tasks are packet creation in a packet-based NoC,
buffer management in order to prevent network congestion, global addressing and
routing. Moreover, re-order buffering and data acknowledgement could be performed.
The design of the Network Adapter is a critical task in the overall NoC design
process. Often this component handles tasks such as frequency conversion and data size
conversion between the core side and the network side, in order to improve flexibility.
1 Socket standards are almost always identified with some legacy bus protocols; examples are the ARM AMBA Bus [16], the OCP Bus [18], IBM CoreConnect [17], etc.
From this point forward, the term Network Interface will be used as a synonym of
Network Adapter, except in the cases where it refers to the upper part of the NA (see
Figure 1.2.3). In those situations this will be stated explicitly, while in all
other cases NI and NA will be regarded as the same thing.
1.2.2 Network layer
The Network level provides the hardware support for the basic communication
primitives, in order to deliver the message from the source to the destination. It is
possible to define the Network layer using basically two concepts:
Topology, which specifies the layout and connectivity of nodes and links,
and Protocol, which dictates how these nodes and links are used.
Topology
The network topology is defined by the connection pattern of the routers via the
physical links.
Figure 1.2.5 : Typical regular network topologies
The choice of network topology has a significant impact on the SoC
price/performance ratio. There are two basic approaches to interconnecting the
routers in a NoC: either a well-defined regular topology (see Figure 1.2.5) is used, or
the routers can be interconnected in a way that is specific to the application (irregular
topology). The latter approach is clearly more versatile in terms of the system
configurability, but it presents severe disadvantages in terms of design and
verification time and exposes SoC architects to network design issues such as
deadlock avoidance that require specific expertise.
Figure 1.2.6 : Irregular network topologies
A simple way to distinguish different regular topologies is in terms of k-ary n-
cubes, where k is the degree of each dimension and n is the number of dimensions.
The k-ary tree and the k-ary n-dimensional fat tree are two alternative regular networks
explored for NoCs (see Figure 1.2.5). Most NoCs implement topologies that can be
easily laid out on a chip surface; for example the k-ary 2-cube, which corresponds to
typical grid topologies like the mesh and the torus.
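As a quick sanity check of this terminology: a k-ary n-cube has k^n nodes, so a 4-ary 2-cube is the 4x4 grid of Figure 1.2.1. The short C sketch below, which assumes nodes are simply numbered row by row, derives the (x, y) grid coordinates of a node from its index; this numbering convention is only illustrative.

    #include <stdio.h>

    /* In a k-ary 2-cube (k x k mesh/torus) with row-by-row numbering,
     * node id -> coordinates is a simple modulo/divide. */
    struct coord { int x; int y; };

    static struct coord node_coord(int id, int k)
    {
        struct coord c = { id % k, id / k };
        return c;
    }

    int main(void)
    {
        int k = 4;                              /* 4-ary 2-cube: 4*4 = 16 nodes */
        struct coord c = node_coord(9, k);      /* node 9 sits at column 1, row 2 */
        printf("node 9 -> (x=%d, y=%d)\n", c.x, c.y);
        return 0;
    }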
Figure 1.2.7 : ST OctagonTM and ST SpidergonTM topology
ST Microelectronics developed its proprietary SpidergonTM topology [19], which
promises to deliver the best trade-off between theoretical performance and the
commercial realities of the SoC market. The SpidergonTM topology is similar to a
simple polygonal ring (see Figure 1.2.5), except that each node has, in addition to
links to its clockwise and counter-clockwise neighboring nodes, a direct link to its
diagonally opposite neighbor. From a routing point of view, any packet that arrives at
a node which is not its final destination can be forwarded clockwise, counter-
clockwise or across the network to its diagonally opposite node. The schematic
SpidergonTM topology translates easily into a low-cost practical layout: Figure 1.2.8
shows an example with N=16 nodes.
Figure 1.2.8 : SpidergonTM topology layout
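To make the three forwarding choices just described concrete, the following C sketch implements a plausible shortest-path decision for a Spidergon with N nodes (N even): destinations within a quarter of the ring are reached going clockwise or counter-clockwise, while farther ones take the cross link first. This is a reconstruction for illustration only, not ST's actual routing algorithm.

    /* Possible forwarding directions at a Spidergon node. */
    enum sp_dir { SP_CLOCKWISE, SP_COUNTERCLOCKWISE, SP_ACROSS, SP_LOCAL };

    /* Decide the next hop for a packet at node `cur` headed to `dst`
     * in a Spidergon with `n` nodes (n even). */
    static enum sp_dir spidergon_route(int cur, int dst, int n)
    {
        int ahead = (dst - cur + n) % n;        /* hops needed going clockwise */
        if (ahead == 0)
            return SP_LOCAL;                    /* packet has arrived */
        if (ahead <= n / 4)
            return SP_CLOCKWISE;                /* short way around, clockwise */
        if (ahead >= n - n / 4)
            return SP_COUNTERCLOCKWISE;         /* short way around, the other direction */
        return SP_ACROSS;                       /* far away: use the diagonal link first */
    }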
Networks where every node is connected to a source or a sink for the messages are
called direct networks. Conversely, topologies with a subset of nodes, which are not
connected to any source or sink, are called indirect networks (see Figure 1.2.9).
Figure 1.2.9 : Direct and indirect network
Protocol
The protocol is concerned with the strategy of moving data through the NoC. It includes two basic concepts: switching, which is the actual transport of data, and routing, which determines the path the data takes.
These and other relevant aspects of the protocol are discussed below.
Switching policy: Circuit vs. packet switching. In circuit switching an entire path (circuit) from source to destination is set up and reserved for a single communication until the transport of data is complete. Packet-switched traffic, on the other hand, is forwarded on a per-packet basis, each packet containing routing information as well as data. According to [20], packet switching is more common and is utilized in about 80% of NoCs.
Deterministic vs. adaptive routing. In a deterministic routing strategy, the traversal path is determined by its source and destination alone. Popular deterministic routing schemes for NoC are source routing and X-Y routing
(2D dimension-order routing). In source routing, the source core specifies the route to the destination. In X-Y routing, the packet follows the rows first, then moves along the columns toward the destination, or vice versa. With adaptive routing the routing decision is taken at each hop. Adaptive mechanisms involve dynamic arbitration, where the arbiter takes into account the local state of the network, for example the local link congestion. This results in a more complex router implementation in order to avoid deadlock, but it often offers benefits like load balancing. According to [20], packet-switched networks mostly utilize deterministic routing (about 70% of cases), but some means of adaptivity or routing-policy reprogramming is necessary for fault tolerance. Many works have been presented on this topic; an interesting one splits the traffic across several paths to reduce congestion in certain areas of the network [21]. Out-of-order packet delivery works against the benefits of adaptive routing: in many works this phenomenon is entirely neglected, while others assume that software performs the re-ordering of packets.
Minimal vs. non-minimal routing. A routing algorithm is minimal if it chooses only among the shortest paths between source and destination; otherwise it is non-minimal.
Delay vs. loss. In the delay model, datagrams (flits, phits1) are never lost; the worst that can happen is that the arrival of data is delayed. In the loss model, instead, datagrams can be dropped. In this case means for data retransmission are required at the level of the routers, introducing significant overhead. This model has some advantages, however: for example, dropping flits can be used to resolve network congestion.
Central vs. distributed control. In centrally controlled systems routing decisions are taken globally, for example by means of an arbiter. In distributed control, instead, routing decisions are made locally. NoCs usually employ the latter solution.
1 Flits and phits are sub-parts of a packet. The meaning of these two terms will be made clear in the next sections.
The protocol defines the use of the available resources, and thus the node implementation reflects design choices based on the aspects described above. Figure 1.2.10 shows the major components of any routing node: buffers, switch, routing and arbitration unit, and link controller. The switch connects the input buffers to the output buffers, while the routing and arbitration unit implements the routing policy. In a centrally controlled system, the routing and arbitration units would be common to all nodes.
As already mentioned, the optimal design of a router is strictly related to the services that it has to provide. For example, support for adaptive bandwidth control can be provided simply by adding to the basic architecture of Figure 1.2.10 an additional bus, allowing the crossbar switch to be bypassed when congestion occurs [22].
Figure 1.2.10 : Generic Router model
The three common choices of how packets are forwarded and stored at routers are
store-and-forward, cut-through and wormhole. Before entering in details of these
techniques, we introduce the meaning of flit and phit.
A message is a contiguous group of bits that is delivered from a source terminal to a destination terminal. A message consists of packets, which are the basic units for routing and sequencing. Packets may be divided into flits (flow control digits), the basic units of bandwidth and storage allocation. Flits do not carry any routing or sequencing information and have to follow the route of the whole packet. A phit (physical transfer digit), instead, is the unit that is transferred across a channel in a single clock cycle; phit and flit width may be identical. Figure 1.2.11 shows these units of resource allocation.
Figure 1.2.11 : Units of resource allocation
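To make these units more concrete, the following minimal Verilog sketch shows a flit being transferred as a sequence of phits; the module name and the widths (a 32-bit flit split into two 16-bit phits) are purely illustrative assumptions and are not tied to any specific NoC.

    // Illustrative sketch: one flit is shifted out as FLIT_W/PHIT_W phits,
    // one phit per clock cycle (widths are assumptions, not from the text).
    module flit_serializer #(
        parameter FLIT_W = 32,   // flit width
        parameter PHIT_W = 16    // phit width: what crosses the link per cycle
    ) (
        input  wire              clk,
        input  wire              load,       // load a new flit
        input  wire [FLIT_W-1:0] flit_in,
        output wire [PHIT_W-1:0] phit_out    // one phit per clock cycle
    );
        reg [FLIT_W-1:0] shift_reg;
        always @(posedge clk)
            if (load) shift_reg <= flit_in;              // capture the flit
            else      shift_reg <= shift_reg >> PHIT_W;  // emit next phit
        assign phit_out = shift_reg[PHIT_W-1:0];
    endmodule

When phit and flit widths are identical, the serializer degenerates to a single register and the flit crosses the channel in one clock cycle.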
Store-and-forward. It waits for the whole packet before making routing decisions, so every node of the network stores the entire packet before forwarding it to the next node along the route. The transmission is stalled when the downstream node does not have sufficient space in its internal buffers to hold the entire packet. It requires a buffering capacity of at least one full packet.
Cut-through. It does not wait for the whole packet, but forwards it as soon as the header information is available. The header is propagated to the next node only if that node guarantees that it has enough space to hold the whole packet; otherwise propagation is stalled and the packet is gathered at the current node. For cut-through forwarding, too, the minimum buffering capacity is one packet.
Wormhole. It is the most popular technique and the best suited to on-chip use. Here routing is done as soon as possible, as in cut-through, but the buffer space can be smaller (as small as one flit). The packet may therefore spread across many consecutive routers and links like a "worm". Latency at a single node is significantly reduced with respect to store-and-forward. The major drawback is that stalling the packet has the effect of stalling all the links occupied by the packet along the path. In the following we will see that the use of Virtual Channels can alleviate this problem.
Flow control
Flow control determines how the resources of a network, such as channel bandwidth and buffer capacity, are allocated to packets traversing the network. The basic purposes of flow control policies are to ensure correctness in the packet propagation process and to use resources as efficiently as possible, supporting a high throughput. Efficient flow control is a prerequisite for good network performance. Flow control primitives thus also form the basis of differentiated communication services. In the following, a selection of topics related to flow control is discussed. [35]
The concept of Virtual Channels (VC) deals with the sharing of a physical channel
by several logically separated channels, which have individual and separated buffer
queues (see Figure 1.2.12).
Figure 1.2.12 : The concept of Virtual Channel (VC)
Generally, in NoCs the number of VCs per physical channel varies between 2 and 16. The use of Virtual Channels can cause significant implementation overhead, especially due to the hardware cost of the additional buffer queues and the more sophisticated control logic of the physical channel, but it offers a number of important advantages. Among these are:
Deadlock avoidance. Since VCs are not mutually dependent on each other, by adding VCs to links and choosing the routing scheme properly it is possible to break cycles in the resource dependency graph [24].
Optimizing wire utilization. In future technologies, wire costs are projected to dominate over transistor costs. Having several logical channels share a single physical channel enables more efficient wire utilization. Further advantages include reduced leakage power and less wire routing congestion.
Improving performance. VCs are used to relax the inter-resource dependencies in the network, thus minimizing the frequency of stalls. According to [23], it is possible to improve the network performance at high loads by dividing a fixed buffer budget across a number of VCs.
The most important task of any flow control mechanism is to ensure deadlock avoidance. Deadlock can occur in an interconnection network when a group of packets cannot make progress because they are waiting on each other to release resources (buffers, channels). [35]
Figure 1.2.13 : Wormhole routing deadlock example
If a sequence of waiting agents forms a cycle, the network is deadlocked. A deadlock can arise only if packets are allowed to hold some resources while requesting others. Wormhole routing is particularly susceptible to deadlock; Figure 1.2.13 shows an example of wormhole deadlock. It is possible to resolve a deadlock situation by allowing the involved packets to be preempted. Preempted packets can be:
rerouted through adaptive non-minimal routing techniques,
or discarded, and then recovered at the source and retransmitted.
Although it is possible to resolve a deadlock in this way, these methods are not used in most direct network architectures. It is more common to avoid deadlocks through the routing algorithm, by ordering network resources and requiring that packets use these resources in strictly monotonic order. In particular, circular wait is avoided.
Figure 1.2.14 : Channel dependencies graph method
According to Duato's theorem, "A routing function R is deadlock-free if there are no cycles in its channel dependency graph". So, to avoid deadlocks it is sufficient to break cyclic dependencies in the resource dependency graph. This condition can actually be relaxed, as shown in [24]: it is in fact enough to require the existence of a channel subset which defines a connected routing sub-function with no cycles in the extended channel dependency graph. Using VCs it is sometimes possible to avoid stalls due to packets already blocked inside the network.
Figure 1.2.15 shows a general example of router with Virtual Channels.
Figure 1.2.15 : VCs Router model
Quality of Service (QoS)
Quality of service is defined as the ability to provide different priorities to different applications, users, or data flows, or to guarantee a certain level of performance to a data flow. For example, a required bit rate, delay, jitter, packet dropping probability and/or bit error rate may be guaranteed. The nature of QoS, in relation to NoCs, is captured by two basic QoS classes: best-effort services (BE), which offer no commitment, and guaranteed services (GS), which do. The latter in turn presents different levels of commitment: 1) correctness of the result, 2) completion of the transaction, 3) bounds on the performance, etc.
BE service refers to communication for which no commitment whatsoever can be given. In most NoC-related works, however, BE covers the traffic for which only correctness and completion are guaranteed, while with GS additional guarantees are given (e.g., on the performance of a transaction). In macro-networks, service guarantees are often of a statistical nature; the guarantees offered by NoC systems, instead, are almost always hard guarantees. In order to provide them, GS communication must be logically independent from other traffic in the system. This requires connection-oriented routing. Connections are instantiated as virtual circuits which use logically independent resources, thus avoiding contention. The virtual circuits can be implemented by virtual channels, time slots, parallel switch fabrics, etc. As the complexity of the system increases and as GS requirements grow, so does the number of virtual circuits and resources (buffers, arbitration logic, etc.) needed to sustain them. [35]
1.2.3 Link and Physical layer
Link-level research studies the architectures of node-to-node links. These links consist
of one or more channels, which can be virtual or physical. In the following, we will
present two of the areas of interest in link-level research: 1) synchronization and 2)
implementation.
Synchronization
For link-level synchronization in a multi-clock-domain SoC, the critical problem is the FIFO design. It is very important for multi-clock FIFOs to be particularly robust with regard to metastability. The FIFO design can be made arbitrarily robust against metastability, since settling time and latency can be traded off.
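As an illustration of the kind of circuit involved, the sketch below shows the classic two-flip-flop synchronizer, the basic building block that robust multi-clock FIFOs use to move pointer information across clock domains; the names and the single-bit payload are illustrative assumptions, not taken from any specific design.

    // Two-stage synchronizer: the first register may go metastable, the
    // second gives it a full clock period to settle (latency vs. robustness).
    module sync_2ff (
        input  wire clk_dst,   // destination clock domain
        input  wire rst_n,
        input  wire d_async,   // level signal coming from another clock domain
        output reg  d_sync     // synchronized version, safe to use in clk_dst
    );
        reg meta;
        always @(posedge clk_dst or negedge rst_n)
            if (!rst_n) begin
                meta   <= 1'b0;
                d_sync <= 1'b0;
            end else begin
                meta   <= d_async;
                d_sync <= meta;
            end
    endmodule

Adding further synchronization stages increases robustness at the cost of one extra cycle of latency per stage, which is exactly the trade-off mentioned above.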
Nowadays, implementing links using asynchronous circuit techniques is an obvious possibility. This approach is gaining considerable attention thanks to the emergence of the GALS concept (Globally Asynchronous Locally Synchronous). In the GALS model, a system is built by putting together a number of blocks that communicate with each other
through asynchronous links, while communication internal to each block is fully synchronous with a given local clock. One of the major advantages of asynchronous design styles, relevant for NoC, is that, apart from leakage, no power is consumed when the links are idle. On the other hand, asynchronous logic is necessary to implement the local handshake control; this logic implies some area and power overhead with respect to synchronous logic. Examples of NoCs based on asynchronous circuit techniques are CHAIN [25] and MANGO [26].
Implementation issues
As chip technologies scale into the DSM domain, the effect of wires on link delay and power consumption increases.
A number of techniques have been proposed in the literature to improve the performance of NoC node-to-node links in the context of DSM technology. The first of these techniques is wire segmentation: a common solution has for some time been to insert repeaters at regular intervals, in order to keep the delay linearly dependent on the length of the wire. Another widely used technique is pipelining of wire links, which effectively increases the link throughput. The use of pipelining implies some overhead in terms of area, since pipeline stages are more complex than simple repeaters; but since in future DSM technologies wire effects tend to dominate area occupation, the overhead associated with pipelining is expected to decrease.
1.3 Research Activities
The communication aspect is becoming the bottleneck in SoC architecture, both from
physical and distributed computation point of view. Wiring delays is dominating the
gate delays. In larger SoC the overall computation is heterogeneous and localized.
These factors motivate NoC, which brings the techniques developed for macro-scale
network in a single chip. NoCs have been largely reported in many papers, special
journal issues and numerous special sessions at conferences. A dedicated NoC symposium1 has therefore recently been created.
The major goal of communication-centric design and of the NoC paradigm is to achieve greater design productivity and performance by handling the increasing parallelism, manufacturing complexity, wiring problems and reliability issues. The three critical challenges for NoC are power, latency and CAD (Computer Aided Design) compatibility [20] [27].
Currently, more than 30 NoC research projects are active, both in universities and in industry [28]. Figure 1.3.1 shows the most important ones.
According to the analyses in [28] and [20], the chosen techniques converge on wormhole packet switching (80% of cases), 2D mesh/torus topologies (50-60%) and deterministic routing (about 70%).
Figure 1.3.1 : Current NoC state of art.
1 www.nocsymposium.org
Asynchronous NoCs, which include GALS concepts, are also becoming ever more important, as shown in the previous illustration (Figure 1.3.1), and important universities and research institutes, such as DTU (Denmark) and LETI (France), are working on this aspect.
Network-on-Chip is a very active research field with many practical applications in industry. New NoC start-ups have recently been founded (INoCS), and earlier start-ups are becoming successful companies (e.g., Tilera). In 2006 Intel realized a research chip with 80 cores (160 FP engines) which communicate over a 2D mesh interconnection network. The chip is named TeraFLOPS [29], since it is the first on-chip solution able to reach one teraflop of processing. The first computer able to reach this performance was built in 1996 (ASCI Red). Although only 10 years separate them, the difference in power consumption between ASCI Red and TeraFLOPS is striking (see Figure 1.3.2).
Figure 1.3.2 : TeraFLOPS vs. ASCI Red – Source: Maurizio Palesi (Catania University, IT)
NoCs are mostly evaluated through simulation and synthesis, but these should be complemented with analytical studies and (FPGA) prototypes. The work in [20] identifies the following topics as crucial for the continued success of the NoC paradigm: procedures and test cases for benchmarking, traffic characterization and modeling, design automation, latency and power minimization, fault tolerance, QoS policies, prototyping and network interface design.
1.4 NoC design flow
The "Network-on-Chip Architecture" project, started in 2001 and jointly conducted by the Laboratory of Electronics and Computer Systems at the Royal Institute of Technology and VTT, was one of the first research projects with the goal of developing a new architecture template, called Network on Chip (NoC), for future integrated communication systems. During the project, the following topics were addressed: physical issues, the NoC architecture with the definition of its communication layers, a high-level design flow methodology and a working NoC simulator.
Every company (or research laboratory) develops dedicated and proprietary solutions for creating NoCs, which are used to connect and manage the communication between the variety of design elements and intellectual property blocks required in complex systems-on-chip.
Figure 1.4.1 shows the basic concept of the NoC design flow. In the most general
case, the Design Tool provides design support both for application-specific standard
and custom network topologies, and therefore it lends itself to the implementation of
both homogeneous and heterogeneous system interconnects.
Figure 1.4.1 : NoC design flow
The design flow is subdivided into three phases. In the first one, the design requirements necessary to specify the on-chip interconnection network are set. Based on these specifications, the NoC design tool generates the hardware description of the on-chip network interconnect (Verilog, VHDL, SystemC, etc.), which together with the description of the IP cores composes the whole system-on-chip (second phase). In the last phase, synthesis, floorplanning, placement and routing are performed. After each of these phases the design constraints are verified; if the verification fails, it is necessary to go back to the first phase and adjust the design parameters in order to resolve the constraint violations. In the case of NoCs, the violations that most commonly cause failed timing closure are related to the routing phase.
Chapter 2: NEC NoC
In this chapter we describe the NEC NoC architecture, a configurable tile-based Network on Chip able to scale to hundreds of IP cores. The figures marked with the star symbol (*) are courtesy of NEC Laboratories America Inc.
2.1 Topology and structure
NEC NoC is a heterogeneous tile-based architecture where a two-dimensional fabric
of tiles is connected to form a mesh or torus architecture (Figure 2.1.1). Each tile
typically consists of one or more bus based subsystems (internal tile architecture) and
each subsystem can contain multiple IP cores (processors, memory modules and
dedicated hardware components).
The NoC tile wrapper provides access and isolation to each tile by routing and
buffering messages between tiles. The tile wrapper is connected to four other
neighboring tiles through input and output channels. A channel consists of two
unidirectional point-to-point links between two tile wrappers. A tile wrapper has
internal queues to handle congestion.
Figure 2.1.1 : Tile-based NoC architecture (*)
On the right side of Figure 2.1.1, the internal organization of a tile is shown: there are
four routers (SW1 to SW4), a receiver and a sender. Dedicated receiver and sender
units act as adaptation layers (interfaces) between the internal bus protocol and the
tile wrapper.
We can notice in Figure 2.1.1 that this NoC architecture differs from the classical concentrated one, which has one router for every tile. The NEC distributed-router architecture can be derived from a concentrated one (see Figure 2.1.2). First, imagine splitting each of the concentrated routers into four equal parts (each part is called a distributed router). Then the distributed routers
that fall outside the NoC are moved to the opposite position inside the network, obtaining the final configuration.
Figure 2.1.2 : From concentrated to distributed routers architecture
The position of the switches1, both at the intra-tile and at the inter-tile level, allows one to obtain both a mesh and a torus topology.
1 The term switch is used as a synonym of router.
Figure 2.1.3 shows the directions of the input/output router connections. Each of the routers contained in a wrapper can reach only one of the adjacent tiles. This implies that every tile is reached by four of its neighboring routers (see Figure 2.1.3).
Figure 2.1.3 : Input/output routers directions
In a wrapper, each router has four outgoing connections (see Figure 2.1.4a): outside (0), straight (1), internal (2) and across (3), but the data path is shared by all destinations in order to save wiring. Dedicated control signals, different for each outgoing connection, allow the destination to understand whether the delivered information is valid or not. Similarly, each router has four incoming links (Figure 2.1.4b).
Figure 2.1.4 : Internal tile signals (*)
NEC NoC supports packet switching. All the messages that two IP cores have to exchange are packetized and cross the network switch-to-switch in order to reach their destination (see Figure 2.1.5). This technique implies contention, because packet arrivals within a switch cannot be predicted; therefore arbitration mechanisms and buffering resources must be implemented at each router.
Figure 2.1.5 : Block diagram of the network architecture (*)
AMBA AXI is used as the end-to-end communication protocol between the different cores, and the network interface takes care of the protocol conversion needed to adapt it to the network protocol.
The network topology is a configurable tile-based organization which gives designers the possibility to customize the architecture best suited to their design. Whatever the final tile-based configuration is, the network makes use of a deterministic source routing algorithm.
2.2 NoC components
In this section we describe all the elements shown in Figure 2.1.5 in order to give a quick and complete overview of the NEC NoC. Further details about the NEC Network Interface (NI) and the Routers will be explored in the next chapters.
2.2.1 AMBA AXI Network Interfaces (NIs)
The NEC NoC network interface (NI) is designed as a bridge between an AXI interface and the NEC NoC network (as shown in Figure 2.1.5).
Its purposes are:
the packetization of AXI transactions into NEC NoC flits and vice versa, to hide the details of the network communication protocol from the core;
the computation of routing information;
the buffering of flits (the basic unit of a packet) to improve performance.
The NEC NoC NI is designed to comply with the AMBA AXI specification (3rd generation of AMBA).
The NI hides all the details of the network communication protocol from the core. This means that an NI must be able to read the message that one IP wants to send
to another IP, build the packets to be sent through the network, receive the response packets and finally decode them for the IP that generated the request.
Depending on the data flow direction, two different kinds of NIs can be identified: the initiator (see Figure 2.2.1) and the target (see Figure 2.2.2). The NI initiator is attached to a system master and, according to its requests, sends messages towards the network. The NI target is attached to a system slave; it receives messages from the network and translates them into the AXI standard signals, so that the receiving IP core can satisfy the original requests.
The NI block can be divided further both in the vertical and the horizontal
dimensions. From the vertical point of view, it can be split into two different
channels, one for requests and one for responses. The former carries system master
commands towards slaves, while the latter provides a way for slaves to respond.
From the horizontal point of view, the NI block can be split into three stages: front-
end, queues and back-end. The front-end deals with (un-)packetization and protocol
conversion (from AXI to NEC NoC packets and vice versa), the queues provide
internal buffering and the back-end implements routing and flow control.
Figure 2.2.1 : NI initiator block diagram (*)
The NEC NoC network interface supports up to four response channels (see Figure
2.2.1) which allow the interface to avoid blocking transactions every time a request
expecting a response is issued (e.g. reads or non-posted writes).
Similarly, up to four request channels are supported at the destination (see Figure
2.2.2).
In the current implementation, request and response channels are decoupled.
Figure 2.2.2 : NI target block diagram (*)
The back-end is more strictly related to the NoC architecture, as it explicitly
communicates with its routers. It is composed of an input buffer and an output buffer,
with dual interfaces towards the network.
The NI initiator is attached to the IP core master and its task is to initiate the request
transmission and then wait for the response coming from the slave.
As shown in Figure 2.2.1, it is divided horizontally into two different sub-modules to allow a separation between the different functionalities performed, in terms of:
direction of the communication data flow;
supported external interface (AXI or NEC NoC).
The two sub-modules are:
NI Sender: request and data flow from AXI to NEC NoC;
NI Rdata Receiver: response and data flow from NEC NoC to AXI.
The NI target is attached to the slave IP core and its task is to wait for the request
transaction coming from the master and then initiate the response transmission.
As shown in Figure 2.2.2, it is divided horizontally into two different sub-modules to allow a separation between the different functionalities performed, according to the same criteria described for the initiator.
The two sub-modules are:
NI Receiver: request and data flow from AXI to NEC NoC;
NI Rdata Sender: response and data flow from NEC NoC to AXI.
All the following AXI transactions are supported:
Single read
Single write
Burst read
Burst write posted (non-posted writes are not yet supported)
A quick overview of these transactions is given below.
SINGLE READ
Only the two header flits (NoC header and AXI header) are sent through the network, because it is only necessary to know the path to the receiver IP and the address of the internal memory location to be read. The response packet is composed of the header (only the NoC header, with some AXI-related bits) and the payload (containing the data read).
SINGLE WRITE
The header alone is not sufficient: the message itself is carried in the payload flits. The current implementation of the single write is posted, which means that the return signals required by the AXI protocol are generated by the NI initiator itself, rather than by the destination.
BURST READ
This kind of transmission is combined with the TLEN field, which encodes the burst length (see Figure 2.4.1). In the read case, only the two header flits are sent across the network, as in the single read. A stream of flits is then generated in response by the destination, and the number of generated flits depends on the burst length.
BURST WRITE POSTED
This kind of transmission is also combined with the TLEN field, which encodes the burst length. In the write case, a stream of flits is sent from the source to the destination, and the number of generated flits depends on the burst length. At the end of the transmission, the return signals required by the AXI protocol are generated by the NI initiator itself, rather than by the destination (posted transaction).
In this section we analyze the architecture of the NI Sender only (Figure 2.2.3). The architecture diagrams of the other NIs are reported in Appendix 1, together with a summary table of the synthesis data.
NI Sender supports read and write transfers from AXI to NEC NoC protocol and its
structure is shown in Figure 2.2.3.
The architecture is divided into a front-end (AXI dependent) and a back-end (NEC
NoC dependent). The following is a list of its sub-blocks.
Front-end:
get_nextQ: computes the nextQ and destination ID in case of read/write transactions. The nextQ is a field of the network header, used by the first Router of the network to route the packet. In order to obtain the nextQ value, the same routing algorithm performed by the Router is used. This module has to know which of the four routers the NI Sender is connected to. Get_nextQ must have this information before, or at the same time as, the transaction starts from the AXI Master: the Router is either chosen by the selection_logic module or fixed at the beginning.
selection_logic: chooses, transaction by transaction, which back-end to use (and thus which Router). It checks the destination address (AXI signal AWADDR or ARADDR) and takes the decision in terms of the shortest source-destination path.
rtl_func: handles the traffic coming through the AXI AW or AR channel. It deals with the generation and delivery of temporary data to the assign_data_in_h sub-block. It also manages the write control signal of the header FIFO.
assign_data_in_h: assembles the temporary data signal coming from rtl_func
and from get_nextQ modules in a format suitable for the internal FIFOs.
rtl_runc_W: handles the traffic coming through the AXI W channel.
bresp_fsm_nslogic: state machine that generates the response handshake.
bresp_fsm_update: completes the state machine for the response handshake.
Back-end:
ns_write_logic_h: handles the write pointer update for the header FIFO.
ns_write_logic_p: handles the write pointer update for the payload FIFO.
Header_FIFO: Header FIFO.
Payload_FIFO: Payload FIFO.
ns_read_logic_h: handles the read pointer update for the header FIFO.
ns_read_logic_p: handles the read pointer update for the payload FIFO.
assign_backend_signals: drives the output ports.
assign_flit_id_out: computes the flit_id signal.
bkend_ns_logic: assembles the NoC packet fetching information from the
FIFOs.
The sub-blocks (processes in the Verilog code) ns_write_logic_h, Header_FIFO and ns_read_logic_h compose the FIFO entity, which has been reused to redesign the Router (see Chapter 3: NoC Router redesign).
Figure 2.2.4 shows the FIFO architecture. It has been developed at a low level in order to obtain better performance in terms of total slack. The figure breaks the FIFO down into its component parts: there is one module for writing a new value into the FIFO and another for reading a value out of it.
Figure 2.2.4 : FIFO architecture (*)
The white block represents all the registers present in the FIFO, while the gen_flags block is responsible for generating the empty and full flags.
If the full flag is not active, the ns_write_logic can store a datum at the position given by pointer_write and then update that pointer. Conversely, when the empty flag is not active, the ns_read_logic can read the datum at the pointer_read address and then update that pointer.
Figure 2.2.5 : Data flow direction
The write_pointer points to the next invalid location, while the read_pointer points to the next valid one. Moreover, there are two additional special pointers called wrap pointers, one associated with the write pointer (write_wrap_pointer) and the other with the read pointer (read_wrap_pointer). They are 1 bit wide and toggle whenever the corresponding pointer reaches the end of the FIFO storage and returns to the initial position (wrapping). Using these four pointers it is possible to generate the empty and full flags; Table 2.2.1 shows how they are set.
Table 2.2.1 : FIFO flags setting conditions (*)
The FIFO needs a clock cycle for data writing and a clock cycle for data reading [3].
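A minimal Verilog sketch of this pointer-and-wrap scheme is shown below. Module and signal names are illustrative, the depth is assumed to be a power of two, and the flag conditions follow the usual convention (assumed here to match Table 2.2.1): equal pointers with equal wrap bits mean empty, equal pointers with different wrap bits mean full.

    // Illustrative FIFO with write/read pointers plus 1-bit wrap pointers.
    module simple_fifo #(
        parameter WIDTH = 35,   // e.g. one flit per entry
        parameter AW    = 2     // address width; depth = 2**AW entries
    ) (
        input  wire             clk, rst_n,
        input  wire             wr_en, rd_en,
        input  wire [WIDTH-1:0] wr_data,
        output wire [WIDTH-1:0] rd_data,
        output wire             empty, full
    );
        reg [WIDTH-1:0] mem [0:(1<<AW)-1];
        reg [AW-1:0] wr_ptr, rd_ptr;   // next invalid / next valid location
        reg          wr_wrap, rd_wrap; // toggle when the pointers wrap around

        always @(posedge clk or negedge rst_n)
            if (!rst_n) begin
                wr_ptr <= 0; rd_ptr <= 0; wr_wrap <= 0; rd_wrap <= 0;
            end else begin
                if (wr_en && !full) begin              // one cycle per write
                    mem[wr_ptr] <= wr_data;
                    {wr_wrap, wr_ptr} <= {wr_wrap, wr_ptr} + 1'b1;
                end
                if (rd_en && !empty)                   // one cycle per read
                    {rd_wrap, rd_ptr} <= {rd_wrap, rd_ptr} + 1'b1;
            end

        assign rd_data = mem[rd_ptr];
        assign empty   = (wr_ptr == rd_ptr) && (wr_wrap == rd_wrap);
        assign full    = (wr_ptr == rd_ptr) && (wr_wrap != rd_wrap);
    endmodule

The concatenated increment makes the wrap bit toggle automatically whenever the pointer rolls over, which is what allows full and empty to be distinguished even though the two pointers are equal in both cases.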
2.2.2 Router
The NEC NoC Router is a 4-cycle-latency, input-queued router which supports round-robin arbitration on the input lines and on-off switch-to-switch flow control. It has four input lines and one output line and consists of four stages: enqueue filter, input queues, arbiter and routing logic (see Figure 2.2.6).
The enqueue filter checks the validity of the flits on all input ports and enqueues only the valid ones. The arbiter constantly monitors the status of the queues and at every cycle tries to send one flit to the output port of the router. Because more than one queue might have at least one flit to send out, an arbitration step is performed. Before going out, the flit dequeued by the arbiter is sent to the routing logic stage, which checks the destination coordinates and the nextQ1 field contained in the header. NextQ is used to set the appropriate valid bit at the output of the router, which is associated with the outgoing direction to be taken.
The destination coordinates are used to compute the route and to update the nextQ field in the header prior to sending out the flit.
All these operations are performed only on the head flit, while all subsequent flits find the arbiter locked and are not modified by the routing logic.
1 NextQ identifies one of the 4 outgoing outputs of the router.
Finally, the tail flit will unlock the arbiter.
Figure 2.2.6 : Router architecture (*)
An input flit can be rejected for one of the following reasons:
The valid bit is not set;
The buffering space for that input port is already full.
After a packet has won the arbitration, its header flit is properly modified in order to prepare it for the next router along the path to its destination. The end-to-end path is chosen following an XY routing algorithm: two bits (nextQ) are used to indicate the route of all the flits of the same packet. The routing path from the sender to the receiver is therefore deterministic and is decided at the source by the NI Sender.
The complexity of the router can be reduced based on the number of input ports actually required by the system architecture. The number of input ports can vary from 1 to 4, and lower numbers correspond to a lower complexity of the arbiter, which is not needed at all in the case of a single input port.
The depth of each individual input buffer is also configurable.
2.3 NI Message encoding and Routing Algorithm
All the messages sent from an IP core are divided into packets, and the basic unit of each packet is called a flit: from a physical point of view, each message is thus encoded by the NI into a series of flits, as shown in Figure 2.3.1. The flit size is 35 bits (3 control bits and 32 data bits). The first two flits are the NoC header and the AXI header, followed by a variable number of up to 16 payload flits. The maximum packet size is therefore 18 flits.
As said previously, the first two flits are headers, but only the first is marked as Head. It contains NoC-related information, while the second header flit carries AXI-related information. A variable number of flits containing the message payload then follows, the last of which is marked as Tail. In the case of a packet initiating an AXI read transaction, no payload flits are sent and hence the second header flit is marked as Tail.
NEC NoC uses wormhole routing which means that a connection is setup by the head
flit and then released by the tail flit.
When considering NoCs, it is typically observed that significant bandwidth is available, and more can be added simply by increasing the link width. In contrast, latency can easily grow to unacceptable levels. For these reasons, the choice of slightly constraining the NEC NoC format was made: instead of aiming at maximum packing of information within the flits, this NoC uses a format with fixed field offsets and an immediate forwarding policy, since this is more effective in terms of area and latency. It also helps keep the amount of packing and unpacking logic low.
According to the contents, all the messages can be split into two different sections:
header and payload. The former embodies information about the sender, the receiver
and the type of transaction along with the routing information. Instead the latter
embodies the message itself.
Header and payload are never allowed to mix in the same flit, thus simplifying the
required logic.
The header is always fully transmitted while, depending on the type of transaction, part of the payload can be useless (e.g., the AXI W wires are meaningful in a write request but not in a read operation), so a variable number of payload flits may be sent. Another difference between header and payload is that the header is present just once, whereas the payload may be present more than once if the transaction is a burst (the number of repetitions depends on the burst length).
Figure 2.3.1 : Multi-flit NoC packet format (*)
Every flit has a Flit Type tag, which is three bits wide: valid bit (V), head bit (H) and tail bit (T) (see Figure 2.3.2).
These tags are related to the physical sequence of flits, not to the logical packet content they embody: thus, only the very first flit of a packet is marked Head, the very last Tail, and everything else is Body. The minimum size of a packet is two flits, as is the case for packets sent from an NI initiator for a read request, so it can never happen that a flit is both the head and the tail of a packet.
Figure 2.3.2 : Flit type encoding (*)
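As a sketch, the 35-bit flit could be decoded as shown below; the widths (3 control bits plus 32 data bits) come from the text, while the exact bit positions and names are assumptions made only for illustration.

    // Illustrative decoding of the flit type tag and data field.
    module flit_fields (
        input  wire [34:0] flit,        // 3 control bits + 32 data bits
        output wire        flit_valid,  // V: the flit carries meaningful data
        output wire        flit_head,   // H: first flit of the packet
        output wire        flit_tail,   // T: last flit of the packet
        output wire [31:0] flit_data    // header word or payload word
    );
        assign flit_valid = flit[34];   // bit positions assumed, not from the spec
        assign flit_head  = flit[33];
        assign flit_tail  = flit[32];
        assign flit_data  = flit[31:0];
    endmodule

With this encoding, a read-request packet would consist of a first flit marked Head (NoC header) and a second flit marked Tail (AXI header), consistent with the two-flit minimum packet size described above.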
As said before, the NEC NoC implements wormhole routing. The path from the sender to the receiver is deterministic and decided at the source by the NI Sender. In this type of routing the communication between sender and receiver is connection-oriented, because the path is known prior to transmission. The routing algorithm is called XY and minimizes the number of hops, first of all by using paths with only one turn. This can be explained by referring to Figure 2.3.3: in that example we have a 4x3-tile NoC with a Sender and a Receiver. If we imagine sending data from the Sender to the Receiver, the green paths are the only valid ones, while the red one is invalid because it contains more than one turn. So if the traffic first takes the X direction and then the Y direction, the hop count is minimized along the X direction first and then along the Y axis, and vice versa.
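The dimension-order decision itself can be sketched in a few lines of Verilog; the port names and output encoding below are illustrative and do not reproduce the actual nextQ computation of the NEC NoC, they only show the "X first, then Y" rule.

    // Illustrative XY routing decision: correct the X coordinate first,
    // then the Y coordinate, so the path contains at most one turn.
    module xy_route #(
        parameter XW = 5,
        parameter YW = 2
    ) (
        input  wire [XW-1:0] cur_x, dst_x,
        input  wire [YW-1:0] cur_y, dst_y,
        output reg  [2:0]    out_dir          // assumed direction encoding
    );
        localparam LOCAL = 3'd0, EAST = 3'd1, WEST = 3'd2, NORTH = 3'd3, SOUTH = 3'd4;
        always @(*) begin
            if      (dst_x > cur_x) out_dir = EAST;   // move along X first
            else if (dst_x < cur_x) out_dir = WEST;
            else if (dst_y > cur_y) out_dir = NORTH;  // then along Y
            else if (dst_y < cur_y) out_dir = SOUTH;  // exactly one turn at most
            else                    out_dir = LOCAL;  // destination tile reached
        end
    endmodule
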
2.4 Header and payload structures
Request phases make use of both the NoC and the AXI headers, while response phases forward only the NoC header, since there is less AXI-related information and it can be packed within the NoC header itself.
The structure of the AXI header is shown in Figure 2.4.1 and it does not depend upon
the type of requested transmission.
Figure 2.3.3 : Supported and unsupported routing
Figure 2.4.1 : AXI header structure (request phase) - (*)
In the following lines there is a short explanation of all the AXI header fields:
W/R: one bit identifier of the type of transaction (0 = write transaction; 1=
read transaction);
TID: tag identifier for the read/write group of signals;
TLEN: burst length;
TSIZE: burst size;
TBRST: burst type;
TLOCK: lock type;
TCACHE: cache type;
TPROT: protection type;
Un: bits currently unused.
The structure of the NoC header for the request phase is shown in Figure 2.4.2; it does not depend upon the type of requested transmission, but it is different in the response phase, because in that phase the AXI header is not sent and the few AXI-related bits are encoded in the NoC header.
Figure 2.4.2 : NoC header structure (request phase) - (*)
Here is a complete explanation of all the NoC header fields:
Un: bits currently unused;
NextQ: 2 bits indicating the outgoing direction that the packet needs to take
(00: outside, 01: straight, 10: internal, 11: across);
Y: Y coordinate of the destination tile;
X: X coordinate of the destination tile;
Tile: identifier of the destination tile;
LocalAddress: address to be accessed in the destination tile.
Two bits are used for the Y coordinate and five bits for the X coordinate, hence the NoC supports a 32x4 mesh structure. The header can also be reconfigured pre-synthesis when the size of the mesh changes.
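A small sketch of the relation between coordinate widths and mesh size is given below; the parameter names are illustrative and only the widths (5 bits for X, 2 bits for Y) come from the text.

    // With X_BITS = 5 and Y_BITS = 2 the header can address a 32 x 4 mesh,
    // i.e. 128 tiles; changing these parameters pre-synthesis resizes the mesh.
    module mesh_size #(
        parameter X_BITS = 5,
        parameter Y_BITS = 2
    ) ();
        localparam COLS  = 1 << X_BITS;
        localparam ROWS  = 1 << Y_BITS;
        localparam TILES = COLS * ROWS;
        initial $display("Supported mesh: %0d x %0d (%0d tiles)", COLS, ROWS, TILES);
    endmodule
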
The structure of NoC header for the response phase is shown in Figure 2.4.3. Besides
NoC related information (NextQ, Y and X), the header carries the following AXI
related information:
RID: transaction ID;
RRESP: transaction response.
Figure 2.4.3 : NoC header structure (response phase) - (*)
No additional information is needed for the proper reconstruction of the return data
transaction, since the number of payload flits equals the number of words required to
transfer back to the master all the data read.
2.5 The backpressure protocol
The data traffic passing through the network is handled by a simple protocol, which ensures the delivery of packets from the source to the destination.
The control signals of the protocol are basically two:
The valid bit
The backpressure bit
The meaning of the valid signal has already been explained in Section 2.3: if the valid bit is high, the corresponding flit is enqueued, otherwise it is discarded. The task of the backpressure bit, instead, is to stall the link whenever the corresponding Router or Receiver is busy, either because it is serving another channel or because the bandwidth is limited by the downstream portion of the path towards the destination.
Figure 2.5.1a shows the timing diagram of a single Router input channel and illustrates the action of the backpressure bit: when it goes high, the current data is held and the input data flow is stalled. This situation can happen in the two cases mentioned above:
The router is serving another channel;
The bandwidth is limited by the downstream part of the NoC.
Analogously, the backpressure bit performs the same function in the back-end of the NIs. The NoC protocol is used in three different connections, as shown in Figure 2.5.1b.
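The sketch below illustrates, from the upstream side, the valid/backpressure behaviour just described: the link advances a flit only when the downstream stage is not applying backpressure, otherwise the current flit is held. Module and signal names are illustrative, not those of the NEC NoC RTL.

    // Illustrative sending side of a valid/backpressure link.
    module link_tx (
        input  wire        clk, rst_n,
        input  wire        flit_ready,    // a flit is ready to be sent
        input  wire [34:0] flit_in,
        input  wire        backpressure,  // asserted by the downstream router/receiver
        output reg         valid,         // qualifies the data on the link
        output reg  [34:0] data
    );
        always @(posedge clk or negedge rst_n)
            if (!rst_n) begin
                valid <= 1'b0;
                data  <= 35'd0;
            end else if (!backpressure) begin
                valid <= flit_ready;   // advance only when downstream can accept
                data  <= flit_in;      // otherwise the current flit is held
            end
    endmodule
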
Figure 2.5.1 : Backpressure action: (a) timing diagram of a single input channel, (b) the three connections using the NoC protocol (*)
2.6 Pros and cons
+ VIRTUAL OUTPUT QUEUING
The NEC NoC structure includes the Virtual Output Queuing (VOQ) concept. The NEC NoC router is an input-queued router, which means that the buffering is done before the routing.
The memory architecture of a generic NxN input-queued router (Figure 2.6.1) is composed of N FIFO queues, one for each input line. In every time slot, at most one flit can arrive at each of the N queues. When a flit is at the head of the
queue (Head of Line1), it has to contend for access to the output ports with the (N-1) flits at the head of the other queues. In the worst case, N flits are directed to the same output port.
When, in a single time slot, there is contention for an output, only one queue is served in that time slot. The other queues are stalled and have to wait for the next time slots, even though the flits behind the head of a stalled queue could have been routed, being directed to other output ports.
This phenomenon is known as Head-of-Line blocking, and [32] shows that it limits the maximum theoretical throughput to 58.6%.
One possible solution to Head-of-Line blocking is Virtual Output Queuing (VOQ). Here packets are stored at an input port according to the output port they are destined to (Figure 2.6.2). This requires N queues for each input rather than just one, so the total number of queues is N2. At every cycle the scheduler decides which VOQs may forward a flit and configures the crossbar accordingly, removing the memory-access and throughput limitations of plain input queuing. While in input queuing the choice is among N HoL (Head of Line) flits, in virtual output queuing it is among N2; thus the complexity of the crossbar and its scheduling increases significantly.
1 The first packet of each queue is called Head of Line
Figure 2.6.1 : Generic Input-queuing router
The distributed router NEC NoC architecture copes with this complexity. In fact each
distributed router has N input queues and this results in a much simpler scheduler
module.
Figure 2.6.2 : Generic concentrated Virtual Output Queuing Router
Figure 2.6.3 : VOQs in the NEC NoC architecture
+ FEWER HOPS
Compared to the generic tile-based topology, the NEC NoC topology saves essentially one hop along the sender-receiver path. To understand this benefit, consider the example in the following figure.
Figure 2.6.4 : Used routers to reach destination: (a) standard tile-based topology, (b) NEC NoC one
Looking at Figure 2.6.4, we notice that in the NEC NoC the effective number of routers used is always lower than in a generic tile-based NoC. This benefit derives from the possibility of using any of the four routers contained in a tile for the connection with the NI Sender and, analogously, any of four of the twelve neighboring routers of a tile for the NI Receiver connection (see Figure 2.6.5).
Figure 2.6.5 : The connection of the Sender/Receiver with the Routers in the NEC NoC
This is not possible in a generic tile-based architecture, because the IP cores are normally connected to only one of the four neighboring routers (Figure 2.6.4).
+ MULTICAST and BROADCAST
The NEC NoC lends itself to broadcast and multicast transmission. Multicast is not implemented at the moment (only unicast is supported), but it is easy to extend the architecture to support it. Going back to Section 2.1, recall that a router has four possible destinations which share the same output data path (in order to save wiring). Since the destination is chosen by setting one of the four valid bits (see Figure 2.1.4), we can set all the output valid
bits high to obtain a broadcast, or more than one valid bit high to obtain a multicast.
Naturally, it will be necessary to include in the NoC header flit the appropriate information to support these kinds of transmission.
+ OTHER BENEFITS OF THE DISTRIBUTED ROUTER ARCHITECTURE
With the distributed-router architecture we have a flexible "physical architecture" that can easily be reconfigured into multiple "logical topologies", and a high predictability of the inter-tile signals. Furthermore, this kind of structure enables the independent design of every tile:
Frequency selection (local clock generator)
Voltage selection (individual power ring)
It therefore naturally leads to a hierarchical layout:
Multi-frequency design
GALS (Globally Asynchronous Locally Synchronous)
– ACROSS LINK LONGER THAN THE OTHERS
Another "drawback" of the NEC NoC concerns the across link, which is physically longer than the others. This could compromise the regularity of the NoC and introduce clock skew issues.
Basically, we can consider two application fields for it:
MPSoCs (Multi-Processor Systems-on-Chip): these are characterized by a high level of customization. For this reason it is a good choice to limit the number of links and the router complexity, so in this situation it is reasonable to remove the across link in order to simplify the NoC.
CMPs (Chip Multi-Processors): a CMP is characterized by high regularity. In this case we can preserve the across link, using a homogeneous
version of the NEC NoC architecture, where every link has equal length. However, this results in a total wire length greater than that of a concentrated-router NoC architecture.
Figure 2.6.6 : Wiring options
Chapter 3: NoC Router redesign
In my first months at NEC Labs I worked for a short period of time on redesigning the router architecture. At that moment, the router was the only component of the network implemented at a higher level of abstraction than all the others: it was written in BDL (Behavioral Design Language). This kind of language allows writing hardware code in a "C-like" style, and the RTL version of the behavioral router description was obtained using a behavioral synthesis system called Cyber Work Bench (CWB), which takes a hardware behavioral description in BDL as input and generates an RTL description in VHDL or Verilog. [REF]
Using some modules already designed and included in the NIs, I developed a new version of the router directly at RTL level. The new router improved the performance; however, it is only the starting point of the RTL version and will be updated and improved further in the future. It is nevertheless important to underline that with this step the whole NoC architecture is now described at RTL level in Verilog code.
In this phase of my job I used Modelsim to simulate and debug the hardware code
and Synopsys Design Compiler, with the NEC CB-90 library (90 nm) [34], to
synthesize it to gate level.
3.1 Previous Router
As discussed, this previous version of the router was described and developed in
BDL. BDL is based on C language with extensions for hardware description, and was
developed to describe hardware at levels ranging from the algorithmic level to the
functional level. BDL is relatively easy to learn because it is based on a well-known
programming language.
In order to describe hardware, BDL extends the C language with the following features [30]:
Physical type of variable: in addition to the traditional C variables, which are
called logical variables, the BDL language integrates new types called
physical variables. They are used in order to represent hardware like
terminals, registers, ports, and so on;
Bit width: any bit width can be specified for variables in ascending or
descending order;
I/O type: these types can be used in declarations of variables for circuit inputs and outputs. The declared variables are "in" for the input type, "out" for the output type, and "inout" for the bidirectional (I/O) type;
Process declaration: in C, the function called "main" (hereafter referred to as
the main function) is a special function, in that this main function is always
called to start execution of a program. However, in BDL, program execution
is started by a function declared by the process declaration (hereafter referred
to as the process function). This means that each description must include a
process function;
Data transfer type: in addition to the assignment operator (=) from C, other kinds of data transfer have been introduced. The continuous assignment (::=) operator indicates a physical connection. The register transfer (<-)
operator indicates assignment to the variable types that represent registers, and the terminal transfer (:=) operator indicates assignment to variables that describe terminals;
Operators: new operators have been created in order to describe hardware starting from C. An example is the concatenation operator, which allows one to link variables with other variables;
Timing descriptor: it is used to specify the clock cycle boundaries;
Control statement: used to represent multiplexers;
Special constants like HiZ, used to represent high impedance.
Because the BDL level of abstraction is higher than that of a standard hardware description language like Verilog or VHDL, it is difficult to show a detailed block diagram of the router architecture, but it mirrors the one represented in Figure 2.2.6.
A Verilog description of the BDL code can be obtained using a high-level synthesizer; in particular, I used the NEC tool Cyber Work Bench (CWB) [30].
In order to obtain results comparable in terms of performance and area with the new version of the router, the generated Verilog version has been synthesized with Synopsys Design Compiler (using the NEC CB-90 library). The synthesis results of this previous version of the router are shown in Table 3.1.1.
SYNOPSYS SYNTHESIS RESULTS
levels of logic: 27
critical path [ns]: 0.94
clock [ns]: 1.1
frequency [GHz]: 0.909
cell count: 2,643
area [µm2]: 24,916
nets: 2,789
VOQ size: 2
date: 19-Feb-09
Table 3.1.1 : Synthesis results of the previous router
3.2 The Router redesigning
The target is to obtain a router with the same functionality as the previous one, but developed directly at RTL level in Verilog, reusing the NI FIFO queues and the NI Round Robin Arbiter. Code reuse allows fast development; furthermore, the elements common to the NIs and the Router become the same modules, so the NoC hierarchy becomes simpler and more homogeneous.
Figure 3.2.2 shows the router architecture. It is a hierarchical structure with three
modules:
Input line;
Arbiter;
Routing Logic.
The following figure (Figure 3.2.1) shows the diagram of the router module hierarchy.
Figure 3.2.1 : Modules hierarchy
[Hierarchy: Router → Input Line (containing the FIFO), Arbiter, Routing Logic]
Figure 3.2.2 : Architecture of Router
While the FIFO and the Arbiter are those of the NIs, the Routing Logic has been developed starting from that of an asynchronous router, with an appropriate conversion from an asynchronous to a synchronous architecture.
An FSM (Finite State Machine) is needed in order to interface and synchronize the components. Its development has been the main task of the redesign, since the other components have been reused.
I implemented the FSM using three states: one where it waits for packet arrival, one where it operates (managing the data flits inside the queues, choosing the input line decided by the arbiter) and one that provides the delay necessary for the arbiter boot. The startup state is WAIT.
Figure 3.2.3 shows the state diagram of the Router FSM (RFSM).
Figure 3.2.3 : RFSM states diagram
[State diagram: states WAIT FLIT, BOOT DELAY and OPERATING; RESET enters WAIT FLIT. WAIT FLIT is left when (empty_0=="0" || empty_1=="0" || empty_2=="0" || empty_3=="0" || fetch_packet=="1"), going to BOOT DELAY if boot=="0" or to OPERATING if boot=="1"; a further transition is taken on header == "01".]
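A hedged Verilog sketch of this three-state machine is shown below. The transition conditions of the real design are abstracted: any_req stands for "some input queue is not empty or fetch_packet is asserted" and tail_sent for "the tail flit has been handed to the Routing Logic"; state and signal names are otherwise illustrative.

    // Simplified RFSM sketch: WAIT for a request, insert one boot-delay cycle
    // the first time the arbiter is used, then OPERATE until the tail flit.
    module rfsm (
        input  wire       clk, rst_n,
        input  wire       any_req,     // request from input lines or fetch_packet
        input  wire       booted,      // arbiter has already been started once
        input  wire       tail_sent,   // tail flit handed to the routing logic
        output reg  [1:0] state
    );
        localparam WAIT_FLIT  = 2'd0,
                   BOOT_DELAY = 2'd1,
                   OPERATING  = 2'd2;

        always @(posedge clk or negedge rst_n)
            if (!rst_n)
                state <= WAIT_FLIT;                       // startup state
            else case (state)
                WAIT_FLIT:  if (any_req)
                                state <= booted ? OPERATING : BOOT_DELAY;
                BOOT_DELAY: state <= OPERATING;           // one cycle for arbiter boot
                OPERATING:  if (tail_sent)
                                state <= WAIT_FLIT;       // packet completed
                default:    state <= WAIT_FLIT;
            endcase
    endmodule
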
The FSM design choice depends mainly on the arbiter: the RFSM has to wait for the arbiter decision in the presence of channel contention. It therefore has to communicate with the arbiter packet by packet, in order to know which input channel is currently active.
The arbiter monitors the requests coming from the input lines and grants only one of them, following a round-robin priority policy. It starts the arbitration when a signal called "arbitrate" goes high, computes the priority count and decides which input channel wins the contention by setting a signal called "sel" to the number of the chosen input line. At the same time it sets the signal named "fetch_packet" high.
The RFSM interfaces with the arbiter by exploiting these control signals: when there is a request (a queue is not empty) coming from one of the input lines, or when the control signal "fetch_packet" is high, the RFSM goes to the OPERATING state. Here it sends the data flits to the Routing Logic only if there is no backpressure coming from the connected component of the NoC. The FSM goes back to the WAIT state when the tail flit has been sent to the Routing Logic. The router state machine has to set the "arbitrate" signal to high logic level at the correct time so that the arbiter can handle the new request. Going back to Figure 3.2.2, we can notice the importance of the "sel" signal: it is the input of 4 processes, namely the RFSM, a de-multiplexer for the read control signal of the input lines and two other multiplexers, one for the empty signal that comes from the input line and goes into the RFSM, and the other for the input line data output.
The Router FSM also deals with the arbiter boot. The state machine sets a signal called boot to high level and goes to the DELAY state only for the first contention handled by the arbiter.
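As an illustration of the behaviour just described, the RFSM can be sketched in Verilog as follows. This is only a minimal sketch, not the thesis implementation: the signal widths, the tail-flit encoding (header == 2'b01) and the exact handshakes with the arbiter and the Routing Logic are assumptions made here to show the three-state structure.

// Minimal sketch of the Router FSM (RFSM): WAIT_FLIT, BOOT_DELAY, OPERATING.
module rfsm_sketch (
    input        clk,
    input        rst_n,
    input        empty_0, empty_1, empty_2, empty_3, // FIFO empty flags of the input lines
    input        fetch_packet,                       // set by the arbiter together with sel
    input  [1:0] header,                             // header field of the current flit
    input        backpressure,                       // from the downstream NoC component
    output reg   arbitrate,                          // asks the arbiter for a new decision
    output reg   send_flit                           // pushes the current flit to the Routing Logic
);
  localparam WAIT_FLIT = 2'd0, BOOT_DELAY = 2'd1, OPERATING = 2'd2;
  reg [1:0] state;
  reg       boot; // high after the first contention has been handled

  wire request = ~empty_0 | ~empty_1 | ~empty_2 | ~empty_3 | fetch_packet;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      state <= WAIT_FLIT; boot <= 1'b0; arbitrate <= 1'b0; send_flit <= 1'b0;
    end else begin
      arbitrate <= 1'b0;
      send_flit <= 1'b0;
      case (state)
        WAIT_FLIT:  if (request) begin
                      arbitrate <= 1'b1;             // let the arbiter pick the input line (sel)
                      if (!boot) begin               // first contention only: extra delay state
                        boot  <= 1'b1;
                        state <= BOOT_DELAY;
                      end else
                        state <= OPERATING;
                    end
        BOOT_DELAY: state <= OPERATING;              // one-cycle delay for the arbiter boot
        OPERATING:  if (!backpressure) begin
                      send_flit <= 1'b1;             // forward flits from the selected queue
                      if (header == 2'b01)           // tail flit (assumed encoding): packet done
                        state <= WAIT_FLIT;
                    end
        default:    state <= WAIT_FLIT;
      endcase
    end
  end
endmodule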
The router has been tested and debugged in several experimental platforms in order to validate its functionality: first in a dedicated one without NIs and AXI cores, and subsequently in a real NoC platform. The router proved fully functional and replaced the previous version.
In order to compare the performance of this new version with the previous one, the Synopsys synthesis results are reported in Table 3.2.1.
SYNOPSYS SYNTHESIS RESULTS
levels of logic | critical path [ns] | clock [ns] | frequency [GHz] | cell count | area [µm2] | nets  | VOQ size | date
13              | 0.59               | 0.7        | 1.429           | 3,237      | 31,694     | 3,381 | 4        | 13-May-09
Table 3.2.1 : Synthesis results of the new redesigned router
We notice that the clock frequency increases from 0.909 to 1.429 GHz. On the other hand, the area also increases, from 24,916 to 31,694 µm2. This is inevitable, because a higher frequency comes at the cost of area: to break the critical path we need to add new sequential logic. However, the synthesis results are good enough, considering that we used four buffers for the FIFO instead of two as in the previous version.
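For reference, the "frequency" column in the two tables is simply the reciprocal of the clock period: 1 / 1.1 ns ≈ 0.909 GHz for the previous router and 1 / 0.7 ns ≈ 1.429 GHz for the redesigned one.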
Another strong point of this new router is the partial auto-configuration of the parameters. While in the previous version it was necessary to set four parameters for every possible next tile destination (16 values), now the router deals with this computation directly and needs only the local position in the NoC (tile coordinates) and in the tile (switch number). This speeds up the NoC setup and avoids manual configuration errors.
The initial target1 has been achieved successfully. However, this new version of the router has been only a starting point for the direct RTL development. Although it has strong points, like the partial auto-configuration, and increases the performance compared to the previous one, there are also some weak points: mainly the latency, which increased from 3 to 4 clock cycles, and the levels of logic, which are fewer than in the previous router but still high. Starting from this version of the router, we have then developed a new version using a simpler arbiter, which provides better performance both in terms of latency and in terms of area utilization.
1 Obtain a router with the same functionality as the previous one, but developed directly at the RTL level in Verilog, reusing the NI FIFO queues and the NI Round Robin Arbiter.
Chapter 4: Fault Tolerant NoC
In this chapter we describe the concept of Fault Tolerance and the reason why it is particularly relevant to NoCs.
4.1 Fault tolerance and Network on Chip
Fault-tolerance or graceful degradation is the property that enables a system to
continue operating properly in the event of the failure of (or one or more faults
within) some of its components. If its operating quality decreases at all, the decrease
is proportional to the severity of the failure, as compared to a naively-designed
system in which even a small failure can cause total breakdown.
Fault-tolerance is not just a property of individual machines; it may also characterize
the rules by which they interact. For example, the Transmission Control Protocol
(TCP) is designed to support reliable two-way communication in a packet-switched
network, even in the presence of communications links which are imperfect or
overloaded. Within the scope of an individual system, fault-tolerance can be achieved
by anticipating exceptional conditions and building the system to cope with them,
and, in general, aiming for self-stabilization so that the system converges towards an
error-free state. [4]
Nowadays, the SoC design challenges concern first of all the design complexity; the goals are the separation of computation from communication and the use of structured communication means. It is also important to achieve design reliability in order to cope with process variability, and to guarantee resilience against soft and hard errors. Another key point is power and thermal management: chips operate at ever lower voltage levels, and a lower supply voltage is strongly correlated with higher error rates.
In this decade, many attempts have been made to provide a structured methodology for realizing on-chip communication in terms of modularity and flexibility, to cope with the inherent limitations of busses (performance and power of busses do not scale up), and to support reliable operation through layered approaches to error detection and correction.
Networks on Chip are the best means to achieve this conceptual and physical separation. NoCs are a SoC sub-system fully devoted to realizing the on-chip communication; they are reconfigurable and customizable, they can enhance error resiliency, and they support array-based design and 3D integration. Furthermore, they are becoming a prerequisite as technologies scale down, with consequently higher system and communication complexity and higher defect and failure rates. [5]
We can distinguish three different periods in the life of a semiconductor device during which the failure rate changes for different reasons. The following figure (Figure 4.1.1) shows the so-called "bathtub" curve, which represents the semiconductor failure rate.
Figure 4.1.1 : Semiconductor Failure Rate (courtesy of M. Lajolo, NEC LA Inc.)
Semiconductor corporations will ceaselessly proceed with CMOS scaling by introducing various new technologies. Figure 4.1.2 shows the CMOS technology scaling during the current decade [6].
Figure 4.1.2 : CMOS technology scaling
This continuous race toward extremely reduced transistor dimensions involves variations of dopants, thresholds and geometries. At the same time, operation at a very low voltage level reduces the noise immunity, which is the cause of soft errors. Furthermore, permanent malfunctions increase due to high temperatures and temperature variations.
All these effects are generally known as "design variations". In the finer process technologies the product yield tends to decrease due to:
an increase of systematic defects caused by fine lithography (Figure 4.1.4) and small inherent device defects;
an increase of parametric failures caused by the higher variation sensitivity (Figure 4.1.3).
Figure 4.1.3 : Failures influence the yield
Reference: M. Lajolo, "Toward NoC adoption at NEC", DATE 2009
Variations are a new challenge in LSI design. Different instances of the same chip differ from each other. Moreover, there are variations also inside a chip: identical components of a chip can behave differently (Figure 4.1.5).
We can distinguish two main sources of variations:
Manufacturing-induced variations: they can be subdivided into two sub-categories. The systematic one regards the irregularities caused by the lithography process: the mask is ideal, but the printed layout never follows a regular grid, hence the layout will be subject to variations. The other category is called random. It includes all the variation phenomena at the atomic level, such as random dopant fluctuations and oxide thickness variations (from an atomic point of view the transistor is not flat, but is characterized by steps).
Operation-induced variations: they appear during the operating phase of the device. The most important are the spatial temperature variation and the temporal voltage drop.
Figure 4.1.4 : Immersion lithography
Figure 4.1.5 : VLSI design variations (courtesy of M. Lajolo, NEC LA Inc.)
The ever-increasing number of transistors in SoCs makes them extremely difficult to validate. Nowadays, the yield is getting ever lower due to unexpected performance, which originates from variations and power issues. Many chips do not pass the burn-in, and that involves heavy economic losses for the silicon foundries.
Figure 4.1.6 : Yield loss (courtesy of M. Lajolo, NEC LA Inc.)
For example, for 15 million good units, a 20% yield loss entails $25 million of extra cost (Figure 4.1.6). [7]
Networks on Chip can be made resilient to errors and can compensate for malfunctions in computing/storage elements by supporting multi-path communication and reconfiguration. They can extend the SoC life by supporting communication redundancy and network reconfiguration.
With the presence of the NoC, we can handle defect immunity and error resilience directly on-line. This is a new paradigm: it is not necessary to attempt an almost perfect design, because malfunctions can be tolerated by operating at higher abstraction levels.
The conventional "worst case" design model is too conservative: the overhead needed to reach the target performance is too high, and device-level solutions cannot completely solve this problem. We need a "variation-aware" design model in order to achieve aggressive, better-than-worst-case approaches. Recently, self-calibrating circuits have been developed, which operate at the edge of failure. An example is the concept of dynamic voltage scaling used in [8].
Figure 4.1.7 : Razor Double Data Sampling Technique
It is an aggressive, better-than-worst-case approach, presented for processor pipelines. In such a design, the voltage margins that traditional methodologies require are eliminated, and the system is designed to dynamically detect and correct the circuit timing errors that may arise under worst-case noise variations. Dynamic Voltage Scaling (DVS) is used along with the aggressive design methodology, allowing the system to operate robustly with minimum power consumption.
The DVS concept used in [8] has been introduced in NoCs as well: in T-error [9] (a timing-error tolerant mechanism that makes the interconnect resilient against timing errors arising from such delay variations on the wires), the double data sampling technique (Figure 4.1.7) is used on the links of the network in order to correct timing errors.
With the ever-increasing complexity of embedded and multi-core architectures, NoCs provide a new paradigm for the design of such systems. While a lot of research has already been performed on this topic, several key open points remain to be faced, which involve the design of the overall system as well as of the communication sub-system itself. The aspect of fault tolerance assumes an increasing importance in these systems, due to the several fabrication and design constraints, which result in devices more and more sensitive to hard and soft failures of the system or of parts of it.
4.2 How to make NoCs reliable
Communication links can fail due to:
violation of the critical path delay in over-clocked circuits;
very low supply voltage levels used to minimize power (the circuit is under-powered).
From the NoC point of view, this means that some flits/packets cannot reach the destination. So, it becomes necessary to introduce a flow control mechanism able to overcome the failures. The most common techniques derive from the nature of NoCs: since a NoC is a network, we can think of using the traditional mechanisms of standard communication networks, such as error detection/retransmission ("ed") and error correction ("ec").
In error detection/retransmission, data are transferred with an error detection coding (e.g., parity bits or Cyclic Redundancy Check (CRC) codes). They are checked at the destination and, upon detection of an error, the data are retransmitted from the source. There are basically two options: end-to-end or switch-to-switch retransmission.
In the error correction scheme the data are transferred with an error correction code (e.g., a Hamming code). At the destination a decoder corrects the potential error.
It is also possible to use both techniques together, achieving a hybrid scheme.
In order to apply end-to-end "ed" schemes to NoCs, parity check or CRC codes have to be added to each packet/flit. CRC/parity encoders have to be integrated into each sender NI, and the encoded packets have to be transmitted and stored (for retransmission). The NI Receiver will then have to check for errors:
the Ack/Nack signal to the sender can be either piggybacked with the response packet or sent separately;
the Open Core Protocol requires a request-response transaction.
Moreover, we need a sequence number (relative position) for each packet in order to re-order and identify duplicate packets. A time-out mechanism will be necessary too.
Analogous considerations can be made for the switch-to-switch approach.
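As a minimal illustration of an "ed" scheme of this kind, the following Verilog sketch computes a single even-parity bit over a flit in the sender NI and re-checks it at the receiver NI. The flit width, module and port names are assumptions made only for the example; a real implementation would rather use CRC together with the sequence numbers and time-out discussed above.

// Sender side: append one even-parity bit to the flit.
module parity_encoder #(parameter FLIT_W = 32) (
    input  [FLIT_W-1:0] flit_in,
    output [FLIT_W:0]   flit_out            // {parity, payload}
);
  assign flit_out = { ^flit_in, flit_in };  // ^ = reduction XOR (even parity)
endmodule

// Receiver side: recompute the parity and raise an error (nack) on mismatch.
module parity_checker #(parameter FLIT_W = 32) (
    input  [FLIT_W:0]   flit_in,            // {parity, payload}
    output [FLIT_W-1:0] flit_out,
    output              error               // would trigger a retransmission request
);
  assign flit_out = flit_in[FLIT_W-1:0];
  assign error    = (^flit_in[FLIT_W-1:0]) ^ flit_in[FLIT_W];
endmodule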
In [10], an architectural-level support for fault tolerance is presented, which makes use of error detection/correction mechanisms borrowed from traditional macro-networks in order to protect the system from transient errors occurring in the communication sub-system. This work compares the detection/correction capability, the area-power overhead and the performance of the various error detection/correction
schemes mentioned above. Some experimental results from [10] are reported in Figure 4.2.1 (4x4 mesh network, 4 flits/packet, flit size 64 bits, 200 MHz clock rate).
Figure 4.2.1 : Experimental results - Source [10]
Looking at the previous figure, we notice that with an error-aware flow control it is possible to save power and improve the latency performance. Moreover, according to [5] and [10], the end-to-end error control scheme is more efficient when the links are long, and is thus suggested for multi-cycle links, while switch-to-switch is more efficient when the links are short and the hop count is high. Here the NI buffering may become an issue.
In the presence of permanent failures, the approaches mentioned before cannot do anything to cope with the errors. In this case the network has to support communication redundancy, using more than one source-destination path. Another approach is the reconfiguration of the network, which makes it possible to get around the failure and repair the end-to-end communication.
The NoC routing scheme can be either static or dynamic in nature. In static routing, one or more paths are selected at design time for driving the traffic flows in the NoC. In the case of dynamic routing, the paths are selected based on the current traffic characteristics of the network. Due to its simplicity, and to the fact that application traffic can be well characterized for most SoC designs, static routing is widely employed for NoCs. [11]
New paths mean more buffers and hence more network power consumption. In particular, special buffers could be required at the destination in order to re-order the packets. On the other hand, with multiple paths there is an improvement of the end-to-end latency performance and a more even traffic spreading in the network. We know that for most SoC designs the NoC operating frequency can be set to match the application requirements. In this case, reducing the traffic bottlenecks (with the use of multipath) leads to a lower required NoC operating frequency. The reduced operating frequency translates into lower power consumption in the NoC, which compensates for the extra power required by the redundant paths.
Verifying the initial correctness of SoCs and handling operational errors is increasingly difficult. That is another reason why fault-tolerant approaches are ever more required in SoC design. For many years, testing the chip was the only method for checking its functionality: if a unit did not pass the testing phase, it was thrown away. Only recently have systems become reconfigurable, making it possible to also save units with partial errors. NoCs are a new frontier for fault-tolerant system design. They will be necessary for downscaled technologies and will provide SoCs with fast and reliable interconnect fabrics.
An example of an on-line solution for error checking and correction after manufacturing is represented by the Intel Larrabee GPU. The first Larrabee chips will feature 32 x86 processor cores and come out in late 2009, fabricated on a 45-nanometer process. Chips with a few defective cores due to yield issues will be sold as a 24-core version.
According to [12], NoCs can help solve power management, system architecture, process variability and application debug issues. Moreover, they are significant enablers for future many-core designs. In many-core architectures, application isolation is important in order to prevent errors. NoCs can enforce application isolation by allowing domain isolation for performance and security, "arbitrarily" shaped domains and shared channel reservation. NoCs can provide support to "route around" router and link failures, and to route to spare tiles in case of core failures. Applications and interconnect must work together to provide platform reliability.
Figure 4.2.3 : Enable Dynamic Fault Discovery & Repartitioning (Source Intel [12])
Figure 4.2.2 : Intel Larrabee (Source: [31])
4.3 Redundancy in the NEC NoC
The NEC research on NoCs aims at comparing the performance of a NoC solution versus the traditional shared-bus solution of commercial NEC media chips. These chips are an evolution of older multi-core ones, and the future trend is toward a many-core solution. [13]
Many of today's NoC architectures are based on static single-path routing. The reason for this choice is that with multi-path routing, packets can reach the destination out of order, due to different path lengths and different path traffic loads. Many applications do not tolerate out-of-order data delivery and, as a consequence, packet re-ordering is necessary at the destination. Multi-processor chips requiring data coherency are typical examples of architectures where in-order packet arrival at the destination is mandatory.
The NEC NoC uses a static single-path routing where the source-destination path is chosen at design time: in fact, it has to guarantee in-order packet delivery, because the media chips are employed in video and multimedia applications. Packet ordering needs to be maintained for displays and for many of the processing blocks in the application.
At the moment of my arrival at the NEC Laboratories, the NEC NoC did not employ any form of fault tolerance. The target of my job has been to introduce adaptiveness in the network by adding redundant source-destination paths, in order to guarantee communication also in the presence of permanent errors that could be discovered post-manufacturing.
The goal is to integrate post-manufacturing diagnosis and self-repair (Figure 4.3.1).
This is achieved by means of routing constraints determined based on diagnostic
feedback reporting the list of faulty links and faulty processing elements like routers.
Faulty links can have an associated backpressure bit that is kept constantly high, thus giving the network the capability to route the traffic through alternative paths.
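As a small illustration of this mechanism, the backpressure seen by the upstream router could be masked as in the following Verilog fragment; the signal names are purely illustrative and the way the diagnostic information reaches the mask is not detailed here.

// Illustrative only: if the diagnostic feedback marks this link as faulty,
// the upstream component permanently sees backpressure and the selection
// logic will steer traffic onto an alternative path.
module faulty_link_mask (
    input  link_faulty,       // from post-manufacturing diagnosis (assumed register bit)
    input  bp_from_receiver,  // normal backpressure of the link
    output bp_to_sender
);
  assign bp_to_sender = link_faulty ? 1'b1 : bp_from_receiver;
endmodule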
A faulty processing element, such as a router, can be managed in the same way, since it is sufficient to mark as faulty the link just before the faulty resource in the topology of the network. The result is the capability to sustain a satisfactory level of yield at the cost of decreased performance.
In other words, the basic static single-path routing is enhanced with the capability to "route around" faults.
Constrained routing is thus a way to implement an on-line self-repair strategy based on post-manufacturing information.
Figure 4.3.1 : Self-repair NoC (courtesy of NEC Labs)
Chapter 5: Case study: a 5x2 tile NoC with 2 AXI Masters and 5 AXI Slaves
The redundancy injected inside the NEC NoC has been conceived to cope with chip defects that appear after the manufacturing phase. On the other hand, the alternative paths can also be used to increase the performance in terms of latency.
In this chapter we show the explored case study: a 5x2 tile NoC with AXI Masters and Slaves. The NoC shape and the disposition of the Processing Elements (PEs) depend on the NEC research goal on NoCs, which aims at estimating the performance of a NoC solution for data communication in video-processing chips (see Figure 4.3.1).
Figure 4.3.1 : Architecture evolution of NEC media chips (courtesy of M. Lajolo, NEC LA Inc.)
In Figure 4.3.1 we notice that the PEs of the media chips are disposed horizontally. For this reason the NoC experimental platform used as case study inherits a similar shape, with 5 horizontal tiles and 2 vertical ones.
In this phase of my thesis work, we used the following CAD tools and applications:
a text editor for writing the code;
Cyber Work Bench for the generation of the AXI masters and slaves;
the Mentor Modelsim 6.2h simulator for the functional simulations;
Synopsys Design Compiler for the component synthesis;
Xilinx ISE 9.2i for the netlist assembly.
5.1 Experimental platform
The experimental platform (or simulation platform) is the 5x2 tile NoC shown in
Figure 5.1.1. There are 2 AXI Masters, one in tile1 (M1) and one in tile2 (M2), while
in the bottom tiles we find the AXI Slaves (SL1, 2, 3, 4, 5). Every master reaches
every slave through the NoC, using the routers represented in the figure. The source-destination paths have been chosen in order to minimize the number of routers, obtaining the basic logical configuration of the NoC. It uses three 2-input routers and five 1-input routers.
Figure 5.1.1 : Experimental platform (base or standard configuration)
The platform is able to handle single write and burst write AXI transactions. Read transactions are not supported: in fact, looking at Figure 5.1.1, we can notice that there are no return links (the NI Rdata Receiver and NI Rdata Sender are missing). Although the platform supports only write transactions, it is a good example, considering also that no similar work had previously been done at NEC Labs.
The NEC NoC is defined at the RTL level and, at the moment, a connection with a software application is missing. The PEs are behavioral descriptions used only for NoC simulation purposes; this makes it impossible to simulate a failure. For these reasons, in this chapter several configurations of the experimental platform are analyzed and compared in terms of latency. This assumes the presence of a re-order
module at the destination in order to maintain the coherence of the data. The alternative paths have been chosen in such a way that the number of hops from source to destination is always the same for each path. Moreover, with the introduction of the redundant paths the traffic is spread over the network homogeneously. For these reasons we have an inherent in-order delivery. The first simulations confirmed this theory and, also in order to have a simpler platform (we recall that this is the first time that multipath routing is used in the NEC NoC), the reorder module was not used in the platform. However, the NoC needs it, and one of the first future improvements could concern its implementation and integration.
The developed work also includes the addition of some utilities to the NoC. One of these is the possibility to choose among three different types of traffic:
single destination (only one slave);
round robin destination;
pseudo-random destination.
In order to achieve this functionality, we generated the masters via Cyber Work Bench (CWB) in a way different from the advised one. By adding two ports at the top of the master, it became customizable in terms of intra-packet delay and destination address.
We developed an external Verilog module (called Master Customizer), which is able to set the new master parameters. With this new component it is possible to choose one of the kinds of traffic listed above, while before the only possibility was the single destination. Moreover, with the possibility of inserting a delay between packets, we avoid regenerating the master from CWB every time we want a different delay.
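A possible shape of the Master Customizer is sketched below. This is only an illustrative Verilog fragment: it assumes that the two extra master ports carry a delay value and a destination address, that the five slaves are mapped to one address page each, and that the traffic type is chosen with a parameter; the real module, its port names and the address map may differ.

// Illustrative sketch of a Master Customizer.
// TRAFFIC: 0 = single destination, 1 = round robin, 2 = pseudo-random (prgn).
module master_customizer_sketch #(
    parameter        TRAFFIC     = 2,
    parameter [31:0] DELAY       = 32'd10,          // intra-packet delay in clock cycles
    parameter [31:0] SINGLE_DEST = 32'h0000_2000
) (
    input             clk,
    input             rst_n,
    input             packet_done,                  // pulses when the master has injected a packet
    output reg [31:0] delay_value,                  // drives the extra "delay" port of the master
    output reg [31:0] dest_addr                     // drives the extra "destination address" port
);
  reg [2:0]  rr_index; // next slave index for round robin (5 slaves)
  reg [15:0] lfsr;     // small LFSR used as pseudo-random generator

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      delay_value <= DELAY;
      dest_addr   <= SINGLE_DEST;
      rr_index    <= 3'd0;
      lfsr        <= 16'hACE1;
    end else if (packet_done) begin
      lfsr <= {lfsr[14:0], lfsr[15] ^ lfsr[13] ^ lfsr[12] ^ lfsr[10]};
      case (TRAFFIC)
        0: dest_addr <= SINGLE_DEST;                               // single destination
        1: begin                                                   // round robin over the 5 slaves
             rr_index  <= (rr_index == 3'd4) ? 3'd0 : rr_index + 3'd1;
             dest_addr <= {17'd0, rr_index, 12'h000};              // one address page per slave (assumed map)
           end
        default: dest_addr <= {17'd0, (lfsr[2:0] % 3'd5), 12'h000}; // pseudo-random slave
      endcase
    end
  end
endmodule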
By operating on the delay parameter, it is possible to run simulations at different data rates. Figure 5.1.3 and Figure 5.1.4 show latency simulation results at several data rates,
in the case of random traffic (prgn, Pseudo-Random GeNerator), because it is the most interesting one. The intra-packet delay was changed with the following sequence of values: 1000, 100, 10, 1, 0 clock cycles.
Figure 5.1.2 : Connection between M1 and its Master Customizer in tile1
Figure 5.1.3 and Figure 5.1.4 show the end-to-end average latencies for packets that start from M1 and M2, respectively. On the x-axis there are the five destinations, while the y-axis reports the injected data rate, expressed in bytes/s. The two graphs have to be read together: this means that when M1 injects 23 MB/s (see Figure 5.1.3), at the same time M2 also injects 23 MB/s (see Figure 5.1.4), and so on for the other data rate values.
Figure 5.1.3 : End to end average packets latency from M1 @ prgn traffic
Figure 5.1.4 : End to end average packets latency from M2 @ prgn traffic
We can naturally notice that the latency increases when the injected data rate grows.
The latency data have been obtained by including a dedicated monitor, which saves to files the information necessary to compute the latency. I wrote this monitor (called latency_counter) in Verilog and included it at the top level of the netlist (Figure 5.1.5).
Figure 5.1.5 : Top level with the tiles and the latency_counter module
Latency_counter monitors some signals of the masters and of the slaves, saving the transition traces to file. In particular, it keeps track of the time (clock cycle) when a packet starts from master X and when it reaches slave Y. A piece of a trace relative to packets that reach Slave 1 is shown below.
ClockCycle: Trigger sample signal: AWADDR: WDATA:
60 WVALID_S1 00002000 16
71 WVALID_D1 00012000 16
75 WVALID_S2 00003000 48
86 WVALID_S1 00001000 64
89 WVALID_S2 00000000 80
89 WVALID_D1 00023000 48
93 WVALID_S1 00002000 80
97 WVALID_D1 00011000 64
104 WVALID_D1 00020000 80
112 WVALID_D1 00012000 80
122 WVALID_S1 00002000 144
... … … …
... … … …
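For illustration, a monitor of this kind can be sketched as the small behavioral Verilog module below. It is a simplified, hypothetical version: the file name, port list and trace format are assumptions, while the real latency_counter taps the WVALID signals of the masters and slaves and produces the trace shown above.

// Simplified sketch of a latency monitor: every time the observed valid
// signal goes high it writes the current clock cycle, the address and the
// data to a text file.
module latency_counter_sketch (
    input         clk,
    input         rst_n,
    input         wvalid,   // e.g. WVALID of a master (packet start) or of a slave (packet arrival)
    input  [31:0] awaddr,
    input  [31:0] wdata
);
  integer fd;
  integer cycle;

  initial fd = $fopen("latency_trace_S1.txt", "w"); // hypothetical file name

  always @(posedge clk) begin
    if (!rst_n)
      cycle <= 0;
    else begin
      cycle <= cycle + 1;
      if (wvalid)
        // one trace line per sampled event: clock cycle, address, data
        $fdisplay(fd, "%0d WVALID %h %0d", cycle, awaddr, wdata);
    end
  end
endmodule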
Starting from these traces, with appropriate processing it is possible to obtain the end-to-end latency of the packets that reach Slave 1. Awk scripts have been written for each one of the traces generated by the Modelsim simulation. These scripts read the trace and generate the final latency results. The final format of the latency data is compatible with GNU Plot1 for direct plotting from file. [14] [15]
An example of the format is reported below.
##########################################
# End to end Latency MASTER_1 -> SLAVE_1 #
##########################################
#Packet_ID Latency
1 11
2 15
3 16
4 11
… …
1GNU Plot is a command-driven interactive function and data plotting program
With these results we know the end-to-end latency of every NoC packet, and we can handle these data in the most appropriate manner. The simplest solution is direct plotting with GNU Plot. In this way it is possible to look at the latency of each single packet from master X to slave Y. As an example, Figure 5.1.6 reports the end-to-end latency plot of the packets that start from Master1 / Master2 and reach Slave 1.
Another solution is to process the data again in order to plot the average latency (examples are Figure 5.1.3 and Figure 5.1.4).
Figure 5.1.7 compares the average latency data that have been obtained with the simulations at different data rates. Analyzing the figure, we notice that the latency value depends on the destination. For example, in the case of Master1, the average latency at the SL1 and SL2 destinations is lower than for the other slaves. This is correct, because the
position of Master1 in the platform is nearer to SL1 and SL2 than to the other slaves (see the platform in Figure 5.1.1). An analogous situation occurs for Master2.
Other kinds of monitors have been implemented. They keep track of the FIFO buffer utilization, of the backpressure bit values, and of the NI Sender selected queue signal. They have been developed in Verilog. Using Xilinx ISE it is possible to connect them directly to the signals under test, specifying the path and the name of the text file where the trace will be saved. We named this module general_monitor; it can monitor four FIFO buffers, four backpressure bits and one NI Sender selected queue signal (queue_selector in the code). The following figures show the tile1 Xilinx schematic netlist (Figure 5.1.8) and the window where it is possible to set the monitor file parameters (Figure 5.1.9).
Figure 5.1.7 : Average latencies comparison @ different injected data rates (two panels, "from Master 1 - prgn" and "from Master 2 - prgn", destinations SL1-SL5; injected data rates from 23 MB/s up to 3078 MB/s and 3265 MB/s respectively)
Figure 5.1.8 : Xilinx schematic file of tile1
Figure 5.1.9 : General monitor setting window
The traces obtained from the general monitors are also in this case in a format
compatible with GNU Plot. So, it is possible to plot the graphs directly from file.
All these data can be used to understand which router is the busiest in the network. Furthermore, this analysis will help us find an alternative routing solution for relieving the work of this router. The buffer utilization traces allow us to determine the optimal number of FIFO elements.
The simulation trace data can be used in many other ways; we limit ourselves to presenting the data results. An example is shown below.
Figure 5.1.10 : Queue utilization and backpressure graphs of the router T1_R4 (channel 0)
Starting from the standard experimental platform, I added alternative paths to it in order to have the possibility to reconfigure the NoC in the presence of post-manufacturing failures. As mentioned before, the benefits of these solutions depend on a compromise between the latency and the utilized resources.
I introduced several redundant paths in the NoC in order to relieve especially the load of routers T34_R2 and T1_R4 (see Figure 5.1.1). Basically, the alternative paths are
two and can also be used together. So, we obtain three new configurations of the simulation platform in addition to the standard one:
Standard (std)
Standard with alternative path 1 (alt1)
Standard with alternative path 2 (alt2)
Standard with alternative path 1 and 2 (alt1+2)
Figure 5.1.11 shows the platform with the alternative path number 1.
Figure 5.1.11 : Standard with alternative path #1
The redundant path is drawn in red. As we can notice, it involves the use of four additional 1-input routers compared to the standard configuration. Moreover, it is necessary to replace the 2-output NI Sender (two backends) of tile2 with a 3-output one (three backends). The 3-output NI Sender is described in the following section (5.2 - The NI Sender queue selection policy).
Figure 5.1.12 shows the experimental platform with alternative path number 2. It is drawn in violet and it needs five more 1-input routers than the standard configuration platform, plus a 2-input router that replaces T2_R4 (1-input). Also for this configuration it is necessary to employ an NI Sender with 3 backends.
Figure 5.1.12 : Standard with alternative path #2
We can use both alternative paths together. This configuration needs four 1-input routers and one 2-input router more than the standard configuration platform. Moreover, it is necessary to replace one 1-input router (T2_R4) with a 2-input one. Figure 5.1.13 shows the simulation platform with both alternative paths together.
Figure 5.1.13 : Standard with alternative path #1 and #2
A first comparison, in terms of utilized resources, of the several platform configurations can be made. As we can notice, the three new platform solutions need more routers and a more complex NI Sender than the standard configuration. Table 5.1.1 shows a summary of the utilized resources in the various cases.
                   single area [µm2] | Standard (# / area) | Alternative #1 (# / area) | Alternative #2 (# / area) | Alternative #1+#2 (# / area)
1-input Router     5951              | 5 / 29755           | 9 / 53559                 | 9 / 53559                 | 8 / 47608
2-input Router     11142             | 3 / 33426           | 3 / 33426                 | 4 / 44568                 | 5 / 55710
2-output Sender    40810             | 2 / 81620           | 1 / 40810                 | 1 / 40810                 | - / -
3-output Sender    64888             | - / -               | 1 / 64888                 | 1 / 64888                 | 2 / 129776
1-input Receiver   17185             | 4 / 68740           | 2 / 34370                 | 2 / 34370                 | 2 / 34370
2-input Receiver   33428             | 1 / 34428           | 3 / 103284                | 3 / 103284                | 3 / 103284
Total area [µm2]                     | 247969              | 330337                    | 341479                    | 370738
Table 5.1.1 : Comparison of the resources utilization in the several platform configurations
5.2 The NI Sender queue selection policy
The NI Sender queue selection policy has been implemented following two basic ideas:
we can switch to an alternative path when the backpressure reaches the source and then switch back to the standard path when there is backpressure again;
or, we can switch to the alternative path and back to the standard path continuously, packet-by-packet.
These two approaches are explained in detail by Algorithm 1 and Algorithm 2 below.
DEST_RANGE_DOWN and DEST_RANGE_UP are two of the NI Sender Frontend parameters. They define an address mask. Packets with a destination address lower than DEST_RANGE_DOWN are routed using backend #1. If the destination address of a packet is between DEST_RANGE_DOWN and DEST_RANGE_UP, it is routed using backend #2. Packets with a destination address higher than DEST_RANGE_UP, instead, are routed using either backend #2 or backend #3. Backends #1 and #2 are connected to the standard paths, while backend #3 is the beginning of the alternative path.
Another important parameter is QS_AUTOCOMPUTATION. The name means queue selection with auto-computation, because the NI Sender Frontend decides packet-by-packet which backend to use. We can choose the queue selection policy by setting this parameter to one of the following values:
"1": standard path;
"2": alternative path with repetitive switching packet-by-packet;
"3": alternative path with switching on backpressure detection.
Algorithm 1: Repetitive Switching Path Selection
 1: old path = backend #2;
 2: wait AXI transaction;
 3: read destination address (DEST_ADDR);
 4: if (DEST_ADDR <= DEST_RANGE_DOWN) {
 5:   selected queue = backend #1;
 6: } else if (DEST_ADDR > DEST_RANGE_DOWN && DEST_ADDR <= DEST_RANGE_UP) {
 7:   selected queue = backend #2;
 8:   old path = backend #2;
 9: } else if (old path == backend #2) {
10:   selected queue = backend #3;
11:   old path = backend #3;
12: } else {
13:   selected queue = backend #2;
14:   old path = backend #2;
15: }
16: goto line 2;
Algorithm 2: Path Selection Switching with Backpressure Detection
 1: old path = backend #2;
 2: wait AXI transaction;
 3: read destination address (DEST_ADDR);
 4: if (DEST_ADDR <= DEST_RANGE_DOWN) {
 5:   selected queue = backend #1;
 6: } else if (DEST_ADDR > DEST_RANGE_DOWN && DEST_ADDR <= DEST_RANGE_UP) {
 7:   selected queue = backend #2;
 8:   old path = backend #2;
 9: } else if (bp_detected == 1) {
10:   if (old path == backend #2) {
11:     selected queue = backend #3;
12:     old path = backend #3;
13:   } else {
14:     selected queue = backend #2;
15:     old path = backend #2;
16:   }
17: } else {
18:   selected queue = old path;
19: }
20: bp_detected = detection(bp);  // it monitors the bp signal when the packet is injected into the NoC
21: goto line 2;
The queue selection policies are used for all destinations except in these cases:
from Master1 to SL2 we always use the standard path (starting from T1_R4);
from Master2 to SL3 we always use the standard path (starting from T2_R4).
These exceptions depend on the routing algorithm, which minimizes the hops by turning right or left just once (see Figure 2.3.3 of Section 2.3). For this reason M2 cannot reach SL3 through alternative path #1. Analogously, M1 cannot reach SL2 through redundant path #2.
The implementation of the path selection algorithm is integrated into the NI Sender Frontend. This choice has been made because the SwitchNumber1 must be known immediately when the AXI transaction arrives at the NI Sender.
The implementation of the algorithms results in two finite state machines (FSMs): one for the backpressure detection (bp-detector) and the other for the backend selection (queue-selector). Figure 5.2.2.a shows a block diagram of them. The queue-selector is an FSM with two states; Figure 5.2.1 shows its diagram. In one state we wait for the beginning of a new AXI transaction (state A), and in the other we wait for its end (state B). State A monitors the AXI signals in order to understand when a transaction reaches the NI Sender; when a transaction starts, it goes to state B. Moreover, it chooses the backend by setting a signal named queue_selector. This choice is based on the value of the QS_AUTOCOMPUTATION parameter and on the output of the bp-detector. This computation is performed in one clock cycle, because the frontend needs this information as soon as possible in order to perform the routing computation and to generate the nextQ field.
State B keeps the value of the queue_selector signal stable until the AXI transaction has been packetized and stored in the chosen backend. The other task of state B is saving the value of the queue_selector signal, which is necessary to state A for choosing the next backend.
1 Number that identifies which router is connected to the NI Sender Backend
Figure 5.2.1 : Queue selector FSM
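A minimal Verilog sketch of the two-state queue-selector FSM is shown below. It only illustrates the A/B structure and the handshake on AWVALID/ARVALID and BVALID described above; the backend selection decision and the queue_selector encoding are simplified assumptions (the real frontend also evaluates DEST_RANGE_DOWN/UP, QS_AUTOCOMPUTATION and the bp-detector output during state A).

// Sketch of the queue-selector FSM: state A waits for the beginning of an
// AXI transaction and chooses the backend; state B keeps queue_selector
// stable until the transaction has been packetized and stored (BVALID).
module queue_selector_sketch (
    input            ACLK,
    input            ARESETn,
    input            AWVALID, ARVALID, BVALID,
    input            use_alternative,        // simplified selection decision (assumption)
    output reg [1:0] queue_selector          // backend coding assumed, not the thesis one
);
  localparam STATE_A = 1'b0, STATE_B = 1'b1;
  reg state;

  always @(posedge ACLK) begin
    if (!ARESETn) begin
      state          <= STATE_A;
      queue_selector <= 2'b10;
    end else begin
      case (state)
        STATE_A:                             // wait for the beginning of a transaction
          if (AWVALID || ARVALID) begin
            queue_selector <= use_alternative ? 2'b01 : 2'b10;
            state          <= STATE_B;
          end
        STATE_B:                             // hold the selection until the end of the transaction
          if (BVALID)
            state <= STATE_A;
      endcase
    end
  end
endmodule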
The bp-detector monitors the VALID_X and BP_X signals of the selected backend (backend X): if both signals are high at the same time, it sets the corresponding detected_bp_X to high level and keeps that of the other backend unchanged. The detected_bp_X signal is set low again when the next AXI transaction enters the frontend. It is possible to identify the beginning of the transaction by monitoring the AWVALID signal in the case of write transactions, or the ARVALID signal in the case of read ones. When one of these two signals goes high, the detected_bp_X signal is reset.
The bp-detector Verilog code is shown below.
Verilog code part 1: Backpressure detector
always @(posedge ACLK)
begin : bp_detection_update
  if (ARESETn == 1'b0)
  begin
    detected_bp_2 <= 1'b0;
    detected_bp_1 <= 1'b0;
    valid_reg     <= 1'b0;
  end
  else
  begin
    detected_bp_2 <= detected_bp_next_2;
    detected_bp_1 <= detected_bp_next_1;
    if (AWVALID == 1'b1 || ARVALID == 1'b1)
      valid_reg <= 1'b1;
    else
      valid_reg <= 1'b0;
  end
end

always @(VALID_2 or VALID_1 or BP_2 or BP_1 or AWVALID or ARVALID or valid_reg
         or fend_queue_selector or detected_bp_2 or detected_bp_1)
begin : bp_detection
  case (fend_queue_selector)
    2'b10: begin // R2
      if (VALID_2 == 1'b1 && BP_2 == 1'b1)
      begin
        detected_bp_next_2 <= 1'b1;
        detected_bp_next_1 <= detected_bp_1;
      end
      else
      begin
        if ((AWVALID == 1'b1 && valid_reg == 1'b0) || (ARVALID == 1'b1 && valid_reg == 1'b0))
          detected_bp_next_1 <= 1'b0;
        else
          detected_bp_next_1 <= detected_bp_1;
        detected_bp_next_2 <= detected_bp_2;
      end
    end
    2'b01: begin // R4
      if (VALID_1 == 1'b1 && BP_1 == 1'b1)
      begin
        detected_bp_next_1 <= 1'b1;
        detected_bp_next_2 <= detected_bp_2;
      end
      else
      begin
        if ((AWVALID == 1'b1 && valid_reg == 1'b0) || (ARVALID == 1'b1 && valid_reg == 1'b0))
          detected_bp_next_2 <= 1'b0;
        else
          detected_bp_next_2 <= detected_bp_2;
        detected_bp_next_1 <= detected_bp_1;
      end
    end
    default: begin
      detected_bp_next_1 <= detected_bp_1;
      detected_bp_next_2 <= detected_bp_2;
    end
  endcase
end
Figure 5.2.2 : Block diagram of the algorithm implementation: at behavioral level (a) and in detail (b)
(The diagram shows the BP DETECTOR and QUEUE SELECTOR blocks with their input signals (AWADDR, ARADDR, AWVALID, ARVALID, BVALID, VALID1/2, BP1/2, ARESETn, ACLK), the frontend parameters (QS_AUTOCOMPUTATION, DEST_RANGE_DOWN/UP, SWITCH_NUM_0/1/2, mySwitchNum), the detected_bp_1/2 signals exchanged between the two blocks, and the fend_queue_selector output that selects the chosen backend.)
5.3 Latency results
I organized the project folder by adding 3 sub-folders; Figure 5.3.1 shows this organization. The monitors folder contains all the monitor files generated during the simulation. Latency_computation contains the awk scripts necessary for the end-to-end latency computation. The gnuplot_graphs folder, instead, includes all the GNU Plot scripts that we use for plotting the graphs of the end-to-end latency and of the queue utilization (qu).
Figure 5.3.1 : Project sub-folders
In order to obtain the experimental results, we perform the sequence of actions shown in Figure 5.3.2. At the beginning we erase all the monitor files, in order to make sure that all previous data are removed.
Then we set the parameters for the current simulation. Figure 5.3.3 shows a summary of the simulation parameters.
Figure 5.3.2 : Sequence of actions in order to obtain the experimental results
Figure 5.3.3 : Simulation parameters
At this point, we can run the simulation and generate the latency results.
We ran several simulations in order to explore all the possible configurations of the NoC in the presence of the redundant paths. We used pseudo-random traffic (prgn) because it stresses the network more than the other ones (round robin and single destination – see Section 5.1). The injection data rate at the source is the maximum possible (the delay parameter corresponds to "0"). We distinguish nine different configurations of the network in terms of alternative path and of its selection policy (see Figure 5.3.3).
Figure 5.3.4 and Figure 5.3.5 show the average end-to-end latency from Master 1 and from Master 2, respectively.
Figure 5.3.4 : Average end-to-end latency (from Master 1) measured in clock cycles
Figure 5.3.5 : Average end-to-end latency (from Master 2) measured in clock cycles
(Both figures report, for destinations SL1-SL5 with prgn traffic and delay = (0,0), the nine configurations: Standard, Alternative1_rep, Alternative2_rep, Alternative1+2_rep, Alternative1_bp, Alternative2_bp, Alternative1+2_bp, Alt1_bp+Alt2_rep, Alt1_rep+Alt2_bp.)
We notice that the configuration that gives the best latency results both for M1 and M2 is the "Alternative1+2_rep" one. Thus, a repetitive packet-by-packet switching from the standard backend to the alternative one improves the end-to-end latency more than a switching policy based on backpressure detection. Looking at Figure 5.3.4 and Figure 5.3.5, we can notice that there are some configurations which could seem better than the best case chosen above. The problem with these configurations is that they allow the network to obtain this improvement only for packets coming from one of the two masters, while for the other one we do not observe the same behaviour.
Figure 5.3.6 shows the best NoC configuration in terms of end-to-end average latency. This configuration allows the NoC to reduce the latency by a value close to 10% globally. On the other hand, if we analyze each single destination there are cases (see Figure 5.3.7) where the latency does not improve. Figure 5.3.7 shows the latency improvement percentage and summarizes all the previous graphs in a single one. Considering this latter figure, we understand that the NoC frequency could be reduced by a value equal to 10% of the actual one while preserving the same performance that is achieved without alternative paths. In other words, the introduction of redundancy in the NoC can be used for improving the latency performance and for saving energy.
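As a concrete example of how these percentages are obtained, the latency benefit for a single destination corresponds to (standard − best) / standard: for the packets from Master 1 to SL2 the average latency drops from 12 to 10 clock cycles, i.e. an improvement of (12 − 10) / 12 ≈ 16.7 % (the same figures appear later in Table 5.5.1).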
(Two panels, "M1 best case" and "M2 best case", compare the Standard and Alternative1+2_rep average latencies for destinations SL1-SL5.)
Figure 5.3.6 : The best NoC configuration in terms of average latency improvement
This thesis does not explore the energy aspect through simulation, but other works [10] have proved that, by using a static multipath routing and reducing the NoC frequency, it is possible to obtain an effective energy saving, even in the presence of the re-order buffer at the destination.
Figure 5.3.7 : Percentage of latency improvement
5.4 3-master case
Using the current 2 Masters / 5 Slaves experimental platform, we obtain average end-to-end latency values between 10 and 20 clock cycles. The reason is that in such a configuration, even with the maximum injected data rate at the sources, it is impossible to stress the network much more. In fact, the NoC is able to manage the current amount of data traffic while delivering such average latency values. The values shown in Table 5.4.1 are the proof of this consideration. The table shows
the latency of the single components in the absence of backpressure. Summing the values along the path from source to destination, we obtain the final end-to-end latency value. We notice that it is comparable to the average values of the previous section. This means that the network is not sufficiently stressed and can handle the amount of traffic with minimal contention of resources.
COMPONENT        LATENCY [clock cycles]
NI Sender 2x     5
NI Sender 3x     5
Router 1x        2
Router 2x        3
Router 3x        3
Router 4x        3
NI Receiver 1x   2
NI Receiver 2x   3
Table 5.4.1 : Latency of the NoC components in the absence of backpressure
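To make this concrete, consider for instance a packet whose path traverses a 2-output NI Sender, one 2-input router, one 1-input router and a 1-input NI Receiver: summing the corresponding values of Table 5.4.1 gives 5 + 3 + 2 + 2 = 12 clock cycles, which is indeed in the 10-20 cycle range observed in Section 5.3 (the exact composition of each path is only an illustrative assumption here).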
For this reason, in this section we analyze another platform, with 3 Masters and 5 Slaves, in order to introduce more resource contention inside the NoC and to increase the latency at the destinations. The new AXI Master, named Master 3, is located in Tile 3. Its introduction involves a sequence of modifications, which are listed below:
introduction of an NI Sender 3x1 and a Router 1x (T3_R3) inside Tile 3;
modification of the Routers T3_R2, T3_R4, T35_R2 and T2_R3 from 1x to 2x;
removal of the link between T1_R2 and T2_R4.
Figure 5.4.1 shows the new experimental platform.
1 1x = 1 input or 1 output, 2x = 2 inputs or 2 outputs, and so on.
Figure 5.4.1 : Experimental platform with one Master more, placed in Tile 3
With one more Master, we have 27 possible configurations of the network, in terms of source-destination path and alternative path selection policy. In fact, we have 3 Masters and 3 possible path selection policies (standard path, repetitive switching and backpressure detection switching), which gives 3^3 = 27 combinations. Table 5.4.2 shows all these configurations.
CONFIGURATIONS
Each configuration is identified by a three-digit code: the first, second and third digits give the selection policy of Master 1, Master 2 and Master 3 respectively (1 = standard path, 2 = repetitive switching, 3 = backpressure detection switching). The configurations therefore range from Config 111 (all masters on the standard path) to Config 333 (all masters using backpressure detection switching).
Table 5.4.2 : Possible configurations of the experimental platform
We ran the Modelsim RTL simulations for each configuration in order to obtain the end-to-end latency. The simulation results are summarized in Table 5.4.3: configuration 111 corresponds to the standard configuration, while configuration 312 is the best one. As in the 2 Masters / 5 Slaves platform, we discard the configurations that increase the latency with respect to the standard one and keep only those that give improvements for every destination. We identified configuration 312 as the best case. This means having the backpressure detection switching policy for Master 1, the standard path for Master 2 and the repetitive switching policy for Master 3 (for more details about the selection policies see Section 5.2).
AVERAGE LATENCY [clock cycles]
CONFIG. | M1: S1 S2 S3 S4 S5 | M2: S1 S2 S3 S4 S5 | M3: S1 S2 S3 S4 S5
111 16 14 19 25 28 26 24 15 21 23 30 30 22 13 16
112 16 15 20 25 28 26 25 16 22 25 31 30 23 13 14
113 16 15 19 25 27 27 23 15 22 23 31 29 23 13 16
121 16 14 19 25 28 26 25 14 19 19 31 30 22 15 19
122 16 14 19 24 29 26 25 15 18 19 32 31 22 15 15
123 16 14 20 25 28 27 25 14 18 19 32 31 22 15 17
131 16 14 19 24 27 26 25 15 20 21 31 30 22 14 17
132 16 14 19 24 27 26 25 15 20 23 32 30 22 14 15
133 16 14 19 25 27 27 25 15 20 21 31 30 22 14 17
211 16 13 17 20 21 26 22 15 21 24 29 27 22 15 18
212 17 13 17 20 21 26 23 15 21 24 31 29 22 15 15
213 16 13 17 21 21 26 23 15 20 24 30 28 22 15 17
221 16 13 17 21 22 26 23 14 18 20 31 29 21 16 19
222 16 13 17 22 22 26 24 16 20 19 32 29 22 17 15
223 16 13 18 21 23 25 23 15 19 21 30 28 20 16 19
231 16 13 18 21 21 26 24 15 20 21 31 29 22 16 19
232 16 13 17 21 22 26 24 15 21 22 26 24 15 21 22
233 16 13 18 20 22 26 23 15 20 23 30 27 21 16 18
311 16 13 18 22 23 26 24 16 21 24 31 28 22 14 18
312 16 13 18 21 24 25 22 15 20 23 29 26 21 13 15
313 16 13 18 23 25 26 23 16 21 25 31 27 22 14 17
321 16 14 18 22 25 26 23 15 18 20 30 30 22 15 18
322 16 14 18 23 26 26 23 15 19 21 32 30 21 16 17
323 16 13 18 23 24 26 23 16 18 20 30 30 22 16 18
331 16 13 18 22 25 26 23 15 21 22 31 30 21 16 19
332 16 14 18 23 24 26 23 15 20 22 30 29 22 15 16
333 16 13 18 23 24 26 23 15 20 21 31 29 22 15 17
Table 5.4.3 : Average latency value for each configuration of 3 Master / 5 Slaves platform
(Panels (a) compare, for each master, the average latency of the standard configuration 111 with that of the best configuration 312 at destinations s1-s5; panels (b) report the corresponding latency improvement [%].)
Figure 5.4.2 : Standard vs. best configuration average latency (a), latency improvement adopting the best configuration (b)
More details on the simulation results are presented in Appendix 2.
5.5 Summary
In this chapter we analyzed a particular case study, namely a 5x2 tile NoC with 2 AXI Masters and 5 AXI Slaves. Applying the ideas of Section 4.3, we introduced alternative paths from source to destination. A new version of the NI Sender with 3 backends and the logic for the selection of the output queue has been developed. We distinguished several platform configurations in terms of queue selection policy (standard / repetitive switching / backpressure detection switching) and alternative path (used / not used). In Section 5.4 we explored the previous case study with one additional Master, in order to increase the amount of data traffic in the NoC. The Master has been placed in Tile 3 and named Master 3.
We compared the latency improvement with the increase in area occupation. Table 5.5.1 shows a summary of these experimental results. The table considers the two cases 2M/5S and 3M/5S; for both of them, the standard and the best configuration of the platform (chosen in the previous sections) are explored.
Using the best configurations, it is possible to improve the end-to-end latency. In the 2M/5S case, the benefit in terms of end-to-end average latency is about 10% globally1, but the increase of the total NoC area occupation is about 49%. In the 3M/5S case, the global average latency benefit is near 6%, with an increase of the total NoC area occupation of around 28%.
The benefit/drawback trade-off is more or less the same: in the first case (2M/5S) we favor the latency improvement, with a resulting large increase in area; in the other case (3M/5S) we have a smaller increase in area, but the latency benefit is also smaller than in the first case.
1 It is the average of ten values. Ten because we have two Masters and five Slaves: M1 to SL1 (1st value), M1 to SL2 (2nd value), and so on.
Average end-to-end latency [clock cycles] per flow, standard vs. best configuration, and latency benefit [%]:

Flow                   | 2M/5S standard | 2M/5S best | Benefit [%] | 3M/5S standard | 3M/5S best | Benefit [%]
Master 1 - Slave 1     | 13             | 13         | 0           | 16             | 16         | 0
Master 1 - Slave 2     | 12             | 10         | 16.67       | 14             | 13         | 7.14
Master 1 - Slave 3     | 16             | 14         | 12.50       | 19             | 18         | 5.26
Master 1 - Slave 4     | 20             | 17         | 15.00       | 25             | 21         | 16.00
Master 1 - Slave 5     | 22             | 19         | 13.64       | 28             | 24         | 14.29
Master 2 - Slave 1     | 16             | 15         | 6.25        | 26             | 25         | 3.85
Master 2 - Slave 2     | 15             | 13         | 13.33       | 24             | 22         | 8.33
Master 2 - Slave 3     | 12             | 12         | 0           | 15             | 15         | 0
Master 2 - Slave 4     | 16             | 14         | 12.50       | 21             | 20         | 4.76
Master 2 - Slave 5     | 19             | 17         | 10.53       | 23             | 23         | 0
Master 3 - Slave 1     | -              | -          | -           | 30             | 29         | 3.33
Master 3 - Slave 2     | -              | -          | -           | 30             | 26         | 13.33
Master 3 - Slave 3     | -              | -          | -           | 22             | 21         | 4.56
Master 3 - Slave 4     | -              | -          | -           | 13             | 13         | 0
Master 3 - Slave 5     | -              | -          | -           | 16             | 15         | 6.25
Mean value improvement | -              | -          | 10          | -              | -          | 6

Area occupation (number of units and area per component type):

Component          | 2M/5S standard       | 2M/5S best           | 3M/5S standard       | 3M/5S best
                   | # units, area [µm²]  | # units, area [µm²]  | # units, area [µm²]  | # units, area [µm²]
Router 1x          | 5, 29755             | 8, 47608             | 3, 17853             | 6, 35706
Router 2x          | 3, 33426             | 5, 55710             | 7, 77984             | 8, 89136
NI Sender 2x       | 2, 81620             | -                    | 3, 122430            | 1, 40810
NI Sender 3x       | -                    | 2, 129776            | -                    | 2, 129776
NI Receiver 1x     | 4, 68740             | 2, 34370             | 3, 51555             | 2, 34370
NI Receiver 2x     | 1, 34428             | 3, 103284            | 2, 66856             | 3, 100284
Total area [µm²]   | 247969               | 370738               | 336678               | 430082
Increase [%]       | -                    | 49.50                | -                    | 27.75
Table 5.5.1 : Summary of the latency/area results (standard and best configuration)
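For reference, the "mean value improvement" and "increase" rows of Table 5.5.1 can be reproduced from the per-flow benefits and the total areas reported above; a short Python sketch of the arithmetic (all values copied from the table):

# Per-flow latency benefits [%] copied from Table 5.5.1.
benefit_2m5s = [0, 16.67, 12.50, 15.00, 13.64,        # Master 1 -> Slave 1..5
                6.25, 13.33, 0, 12.50, 10.53]         # Master 2 -> Slave 1..5
benefit_3m5s = [0, 7.14, 5.26, 16.00, 14.29,          # Master 1 -> Slave 1..5
                3.85, 8.33, 0, 4.76, 0,               # Master 2 -> Slave 1..5
                3.33, 13.33, 4.56, 0, 6.25]           # Master 3 -> Slave 1..5

print(sum(benefit_2m5s) / len(benefit_2m5s))   # ~10.0 -> "Mean value improvement" (2M/5S)
print(sum(benefit_3m5s) / len(benefit_3m5s))   # ~5.8  -> "Mean value improvement" (3M/5S)

# Total NoC area [um^2] in the standard and best configurations:
area = {"2M/5S": (247969, 370738), "3M/5S": (336678, 430082)}
for case, (std, best) in area.items():
    print(case, round(100 * (best - std) / std, 2))   # ~49.5 % and ~27.7 % area increase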
This thesis work allowed us to obtain important results, which can be used as a starting point for a fault-tolerant design of the NEC NoC. A simple case of spatial redundancy has been analyzed, with encouraging results.
A faulty link typically has its associated backpressure bit stuck constantly high; therefore, by adopting the backpressure detection switching policy, the NI Sender does not route packets through links whose backpressure bit is high. This gives the network a self-repair capability in the presence of failures due to delay variations. In fact, the real bandwidth of some links may be lower than the ideal value expected at design time, because of technology variations. A way to keep using these "slow links" anyway is to monitor the associated backpressure bits and, when they indicate congestion, temporarily switch to another path in order to preserve the ideal performance. This is possible by adopting the "bp-detection" switching policy presented in Section 5.2.
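As an illustration only (the actual NI Sender is a hardware block; the naming and structure below are our assumptions), the queue-selection logic of such a "bp-detection" policy can be sketched as follows:

# Illustrative sketch, not the actual NI Sender implementation: output-queue
# selection under a "bp-detection" style policy.  A packet normally uses the
# default path; if the backpressure bit associated with that path is asserted
# (congested or permanently faulty link), the sender falls back to the first
# alternative path whose backpressure bit is low.

def select_output_queue(backpressure, default_queue=0, alt_queues=(1, 2)):
    """Return the index of the output queue (path) for the next packet.

    backpressure -- one boolean per output queue/link; True means the
                    corresponding link is currently signalling backpressure.
    """
    if not backpressure[default_queue]:
        return default_queue              # default path free: keep using it
    for q in alt_queues:                  # default path congested or faulty:
        if not backpressure[q]:           # probe the alternative paths
            return q
    return default_queue                  # every path congested: wait on default

# A faulty link keeps its backpressure bit stuck high, so traffic is
# transparently rerouted onto an alternative path:
print(select_output_queue([True, False, True]))   # -> 1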
The same NoC link appears in every manufactured copy of the chip. In some of these chips the link delay will be acceptable, while in others it will be too high. In the latter case the associated backpressure bit will be asserted more frequently, and the "bp-detection" switching policy then behaves as an adaptive routing scheme that uses each link according to its available bandwidth.
The result is an adaptive fault-tolerant scheme able to increase the yield, at the cost of an increase in area and resource utilization.
In fact [7], given a forecast yield of 80 %, yield losses close to 25 % are observed when all parametric variations are considered. A fault-tolerant design technique is therefore needed in order to sustain an acceptable yield; in other words, fault-tolerant design makes it possible to produce reliable chips out of unreliable components.
This thesis does not perform failure simulations to quantify the yield enhancement, but a related work [33] shows that, by adopting methods based on redundant links and crosspoints, significant interconnect yield improvements (up to 72 %) can be obtained. Redundant components must be carefully planned in order to maximize their contribution to the yield increase while keeping the area overhead acceptable.
This thesis work is only a first step toward the design of fault-tolerant NoCs. The proposed approach has to be integrated with post-manufacturing diagnostic feedback for error discovery and localization. Packet reordering is also needed at the destination, and this could be addressed at the application level.
Chapter 6: Conclusions and future work
The aim of testing chips is to detect errors that occurred during the fabrication process. Fault tolerance is the ability of a system to operate in the presence of faults. There are five key elements in tolerating faults: avoidance, detection, containment, isolation and recovery. Fault tolerance can be achieved via error detection and correction, stochastic communication, adaptive routing, and both temporal and spatial redundancy. Decreasing feature sizes, higher frequencies and increased process variation expose modern SoCs to various faults, and countermeasures must be actively sought and studied.
Even though the NoC concept is still in its infancy in terms of commercial adoption, it is very actively explored and discussed in the research community. This thesis analyzed the concept of fault tolerance in relation to the Network on Chip design paradigm.
The NEC Network on Chip was overviewed in order to allow the reader to understand the explored case study. In Chapter 4 the design of a new version of the router has also been presented.
In Chapter 5 we explored an example of a spatial redundancy solution, which has been implemented and validated on the NEC NoC. The spatial redundancy was realized by adopting a static multipath routing approach. During correct operation of the network, the alternative paths are intelligently employed to reduce latency, and the results of Chapter 5 show that clear advantages can be obtained.
Experimental simulations show that, by adopting the static multipath routing approach, we can reduce the average end-to-end latency by up to 10 % with a 49.50 % area overhead in the 2 Masters / 5 Slaves platform. In the 3 Masters / 5 Slaves platform, the latency improvement is 6 % with a 27.75 % area overhead. Moreover, although we did not perform this evaluation, related work demonstrates that the adopted approach can also reduce power consumption while keeping performance constant.
In the presence of faults on one of the two paths (default or alternative), the network is reconfigured to use only the operative one; in this way the end-to-end communication is preserved.
The multipath scheme also allows the network to use links according to their available bandwidth: if the delay of some links is too high (due to delay variations), the network can switch to a redundant link. This gives the network a self-repair capability in the presence of failures caused by delay variations. The price of these benefits is an increase in on-chip area, due to the redundant alternative paths.
The state of the art of NoCs has been introduced in order to emphasize the challenges in this research area. The comparison with bus-based solutions shows many advantages of the NoC concept, particularly for large on-chip systems. Research activities were presented in order to build an overall picture and show the trends of the research community.
The ideas for future developments of this thesis project are the following:
Re-order buffers:
As a logical completion of the work, it is necessary to integrate re-order buffers in the NoC, so that packets can be put back in order at the destination.
Fault detection:
Another step in continuing this work concerns the ability to diagnose a fault. This can be obtained only by developing a software entity able to detect faults and to configure the NoC so as to avoid them.
Routing algorithm modification:
The routing algorithm currently used in the NEC NoC limits the use of multipath approaches, because it allows packets to turn left or right only once before reaching their destination. Since some faults cannot be "routed around" without performing two turns, the routing algorithm will have to be modified so that packets can reach their destination also in the presence of these faults (see the sketch after this list).
Bibliography
[1] L. Benini, G. De Micheli, "Network on Chip: A New SoC Paradigm", IEEE Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.
[2] W. J. Dally, B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", Proc. DAC, pp. 684-689, June 2001.
[3] J. Williams, "Digital VLSI Design with Verilog", a textbook from Silicon Valley Technical Institute, Springer, 2008.
[4] http://en.wikipedia.org/wiki/Fault-tolerant_system
[5] G. De Micheli, "On-Chip Networks and Reliable SoC Design", keynote address of the "Diagnostic Services in Network-on-Chips" workshop at DATE, 2007.
[6] M. Lajolo, "Toward NoC adoption at NEC", in DATE, 2009.
[7] M. Lajolo, "Network on Chip", slides of the intensive Master course on NoC at the ALaRI Institute (University of Lugano, CH), 2009.
[8] D. Ernst, N. S. Kim, S. Pant, S. Das, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge, "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation", Proc. of the International Symposium on Microarchitecture, pp. 7-18, Dec. 2003.
[9] R. Tamhankar, S. Murali, S. Stergiou, A. Pullini, F. Angiolini, L. Benini, G. De Micheli, "Timing Error Tolerant Network-on-Chip Design Methodology", IEEE Transactions on Computer-Aided Design, 2007.
[10] S. Murali, "Methodologies for Reliable and Efficient Design of Networks on Chips", Ph.D. dissertation, Stanford University, 2007.
[11] J. Hu, R. Marculescu, "Energy-Aware Mapping for Tile-based NoC Architectures Under Performance Constraints", Proc. ASPDAC, pp. 233-239, Jan. 2003.
[12] P. Kundu, "On-Chip Interconnects for Tera-scale Processors", New Developments and Trends in Networks on Chip, DATE, 2009.
[13] http://www.necel.com/digital_av/en/mpegdec/emma3sllp.html
[14] http://www.gnuplot.info
[15] http://sourceforge.net/projects/gnuplot
[16] AMBA AXI Protocol v1.0 Specification, http://www.arm.com
[17] The IBM CoreConnect Bus Architecture, http://www.chips.ibm.com/products/coreconnect
[18] Open Core Protocol Specification, release 2.1, http://www.ocpip.org
[19] R. Locatelli, G. Maruccia, L. Pieralisi, A. Scandura, M. Coppola, "Spidergon: A Novel On-Chip Communication Network", Proceedings of the International Symposium on System-on-Chip, 2004.
[20] E. Salminen, A. Kulmala, T. D. Hämäläinen, "Survey of Network-on-Chip Proposals", white paper, OCP-IP, 2008.
[21] D. Bertozzi et al., "NoC Synthesis Flow for Customized Domain Specific Multiprocessor System-on-Chip", IEEE Trans. Parallel and Distributed Systems, vol. 16, no. 2, pp. 113-129, Feb. 2005.
[22] J. Duato, S. Yalamanchili, L. Ni, "Interconnection Networks: An Engineering Approach", Morgan Kaufmann, 2003.
[23] W. J. Dally, "Virtual-Channel Flow Control", IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp. 194-205, Mar. 1992.
[24] W. J. Dally, C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks", IEEE Trans. Computers, vol. 36, pp. 547-553, May 1987.
[25] J. Bainbridge, S. Furber, "CHAIN: A Delay-Insensitive Chip Area Interconnect", IEEE Micro, vol. 22, pp. 16-23, Oct. 2002.
[26] T. Bjerregaard, J. Sparsø, "A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip", Proceedings of the Design, Automation and Test in Europe Conference (DATE), pp. 1226-1231, 2005.
[27] J. Owens et al., "Research Challenges for On-Chip Interconnection Networks", IEEE Micro, vol. 27, no. 5, pp. 96-108, 2007.
[28] P. Vivet, "Efficient NoC Design for MPSoC, Based on GALS Architecture and Fine Grain DVFS", NoC tutorial, DATE, 2009.
[29] S. Vangal et al., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS", IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29-41, Jan. 2008.
[30] Behavioral Synthesis System Cyber Reference Manual (Rev. 2.6), NEC internal document.
[31] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, vol. 27, no. 3, 2008.
[32] M. J. Karol et al., "Input Versus Output Queueing on a Space-Division Packet Switch", IEEE Transactions on Communications, vol. COM-35, no. 12, pp. 1347-1356, 1987.
[33] C. Grecu, A. Ivanov, R. Saleh, P. P. Pande, "NoC Interconnect Yield Improvement Using Crosspoint Redundancy", Proceedings of the 21st IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT'06), 2006.
[34] http://www.necel.com/cbic/en/core/memory_cb90.html
[35] G. Mereu, "Conception, Analysis, Design and Realization of a Multisocket Network on Chip Architecture and of the Binary Translation Support for a VLIW Core Targeted to System on Chip", Ph.D. dissertation, Cagliari University, 2006.
Appendix 1
Architecture diagrams of NEC NoC NIs, area occupation data of NoC elements, and AXI-NoC comparison graphs.
Courtesy of NEC Laboratories America Inc.

Synthesis results and number of component instances per platform. For each instance, the number of logic levels is reported for the maximum-frequency and for the 400 MHz synthesis targets, together with the minimum clock period [ns] and the corresponding maximum frequency [GHz].

Instance name            | Logic levels (Max. Freq.) | Logic levels (400 MHz) | Min. clock period [ns] | Max. frequency [GHz] | 1M/5S | 2M/5S | 3M/5S | 8M/8S | 16M/16S
AXI_NI_Sender_1x         | 6  | 13 | 0.43 | 2.325 | 0  | 0  | 0 | 1  | 1
AXI_NI_Sender_2x         | 8  | 13 | 0.44 | 2.272 | 1  | 2  | 3 | 7  | 15
AXI_NI_Sender_3x         | 0  | 0  | 0    | 0     | 0  | 0  | 0 | 0  | 0
AXI_NI_Receiver_1x       | 10 | 15 | 0.51 | 1.96  | 5  | 4  | 3 | 1  | 1
AXI_NI_Receiver_2x       | 12 | 18 | 0.57 | 1.754 | 0  | 1  | 2 | 7  | 15
AXI_NI_Receiver_3x       | 15 | 19 | 0.6  | 1.666 | 0  | 0  | 0 | 0  | 0
AXI_NI_Receiver_4x       | 0  | 0  | 0    | 0     | 0  | 0  | 0 | 0  | 0
AXI_NI_Rdata_Sender_1x   | 8  | 11 | 0.42 | 2.38  | 5  | 4  | 3 | 1  | 1
AXI_NI_Rdata_Sender_2x   | 0  | 0  | 0    | 0     | 0  | 1  | 2 | 7  | 15
AXI_NI_Rdata_Sender_3x   | 0  | 0  | 0    | 0     | 0  | 0  | 0 | 0  | 0
AXI_NI_Rdata_Receiver_1x | 8  | 13 | 0.44 | 2.272 | 0  | 0  | 0 | 1  | 1
AXI_NI_Rdata_Receiver_2x | 7  | 15 | 0.44 | 2.272 | 1  | 2  | 3 | 7  | 15
AXI_NI_Rdata_Receiver_3x | 0  | 0  | 0    | 0     | 0  | 0  | 0 | 0  | 0
Router_1x [2 elements]   | 9  | 10 | 0.41 | 2.439 | 11 | 10 | 9 | 2  | 2
Router_2x [2 elements]   | 9  | 12 | 0.43 | 2.325 | 3  | 3  | 3 | 16 | 32
Router_3x [2 elements]   | 8  | 11 | 0.44 | 2.272 | 0  | 2  | 4 | 12 | 28
Router_4x [2 elements]   | 11 | 12 | 0.49 | 2.04  | 0  | 0  | 0 | 0  | 0
Router_1x [4 elements]   | 8  | 9  | 0.43 | 2.325 | 11 | 10 | 9 | 2  | 2
Router_2x [4 elements]   | 9  | 12 | 0.47 | 2.127 | 3  | 3  | 3 | 16 | 32
Router_3x [4 elements]   | 9  | 12 | 0.47 | 2.127 | 0  | 2  | 4 | 12 | 28
Router_4x [4 elements]   | 10 | 13 | 0.53 | 1.886 | 0  | 0  | 0 | 0  | 0
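Note that the maximum-frequency column is simply the reciprocal of the minimum clock period, e.g.:

min_period_ns = 0.51                 # AXI_NI_Receiver_1x, from the table above
max_freq_ghz = 1 / min_period_ns     # 1 / 0.51 ns ~= 1.96 GHz, as reported
print(round(max_freq_ghz, 2))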
Courtesy of NEC Laboratories America Inc.
[Figure: Maximum frequency (GHz) of the NoC compared with the AXI interconnect (LPI), low and high configurations, for the 1M/5S, 2M/5S, 3M/5S, 8M/8S and 16M/16S platforms.]
Courtesy of NEC Laboratories America Inc.
[Figure: Number of logic levels for each platform (1M/5S, 2M/5S, 3M/5S, 8M/8S, 16M/16S), synthesized at maximum frequency and at 400 MHz.]
[Figure: NoC area composition at maximum frequency, broken down into NI Master total, NI Slave total and Routers total, for routers with 2-element and with 4-element buffers. Courtesy of NEC Laboratories America Inc. The percentages shown in the pie charts are:]

Platform  | 2-element router buffer              | 4-element router buffer
          | NI Master | NI Slave | Routers       | NI Master | NI Slave | Routers
1M/5S     | 20 %      | 47 %     | 33 %          | 16 %      | 40 %     | 44 %
2M/5S     | 29 %      | 41 %     | 30 %          | 24 %      | 35 %     | 41 %
3M/5S     | 34 %      | 38 %     | 28 %          | 29 %      | 32 %     | 39 %
8M/8S     | 36 %      | 34 %     | 30 %          | 30 %      | 29 %     | 41 %
16M/16S   | 36 %      | 34 %     | 30 %          | 30 %      | 28 %     | 42 %
Appendix 2
5x2 tiles NoC with 3 AXI Masters / 5 AXI Slaves experimental platform: latency results for each NoC configuration and latency graphs for single packets.
[Figure: End-to-end average latency [clock cycles] of packets coming from Master 1, for each destination (S1-S5) and each NoC configuration (111 to 333).]
[Figure: End-to-end average latency [clock cycles] of packets coming from Master 2, for each destination (S1-S5) and each NoC configuration (111 to 333).]
[Figure: End-to-end average latency [clock cycles] of packets coming from Master 3, for each destination (S1-S5) and each NoC configuration (111 to 333).]
[Figure: End-to-end latency from M1, M2, M3 to SLAVE_1 with standard configuration (a) and best configuration (b).]
[Figure: End-to-end latency from M1, M2, M3 to SLAVE_2 with standard configuration (a) and best configuration (b).]
[Figure: End-to-end latency from M1, M2, M3 to SLAVE_3 with standard configuration (a) and best configuration (b).]
[Figure: End-to-end latency from M1, M2, M3 to SLAVE_4 with standard configuration (a) and best configuration (b).]