noc.design.and.optimization.of.multicore.media.processors.thesis
TRANSCRIPT
-
8/12/2019 NoC.design.and.Optimization.of.Multicore.media.processors.thesis
1/194
NoC Design & Optimization of Multicore Media
Processors
A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
Basavaraj T
DEPARTMENT OF ELECTRICAL AND COMMUNICATION
ENGINEERING
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA
July 2013
Abstract
Network on Chips (NoCs) [1][2][3][4] are critical elements of modern System on Chip (SoC) as well as Chip Multiprocessor (CMP) designs. NoCs help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. With power having become a serious design constraint [5], there is a great need for designing NoCs that meet the target communication requirements while minimizing power, using all the tricks available at the architecture, microarchitecture and circuit levels of the design. This thesis presents a holistic, QoS based, power optimal design solution for a NoC inside a CMP, taking into account link microarchitecture and processor tile configurations.
Guaranteeing QoS by NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. The Label Switching based Network-on-Chip (LS-NoC) uses a centralized LS-NoC Management framework that engineers traffic into QoS guaranteed routes. LS-NoC uses label switching, enables bandwidth reservation, allows physical link sharing and leverages the advantages of both packet and circuit switching techniques. A flow identification algorithm takes into account the bandwidth available in individual links to establish QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs.
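Conceptually, identifying QoS guaranteed routes over links with residual bandwidth is a max-flow computation (Appendix C describes the Ford-Fulkerson variant used in the thesis). The sketch below is an illustrative Edmonds-Karp implementation run on a hypothetical four-node fragment; the node names and capacities are invented for the example, not taken from the thesis:

```python
from collections import deque

def max_flow(cap, src, dst):
    """Edmonds-Karp max-flow: repeatedly push along shortest augmenting paths.
    cap: {node: {neighbour: capacity}}; returns total routable bandwidth."""
    # Residual graph: forward edges plus zero-capacity reverse edges.
    g = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            g.setdefault(v, {}).setdefault(u, 0)
    flow, total = {}, 0
    while True:
        # BFS for a path with spare residual capacity.
        parent, q = {src: None}, deque([src])
        while q and dst not in parent:
            u = q.popleft()
            for v, c in g[u].items():
                if v not in parent and c - flow.get((u, v), 0) > 0:
                    parent[v] = u
                    q.append(v)
        if dst not in parent:
            return total
        # Walk back to find the bottleneck, then push flow along the path.
        path, v = [], dst
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(g[u][w] - flow.get((u, w), 0) for u, w in path)
        for u, w in path:
            flow[(u, w)] = flow.get((u, w), 0) + push
            flow[(w, u)] = flow.get((w, u), 0) - push
        total += push

# Hypothetical fragment: source S reaches sink D through routers R0 and R1.
# Capacities are residual link bandwidths in Gbit/s (invented numbers).
links = {'S': {'R0': 80, 'R1': 80}, 'R0': {'D': 40}, 'R1': {'D': 40}}
print(max_flow(links, 'S', 'D'))  # 80: two 40 Gbit/s pipes can be provisioned
```

A pipe request would be admitted only if the max-flow between its endpoints meets the requested bandwidth; the augmenting paths found along the way are the candidate routes.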
A multicast, broadcast capable label switched router for the LS-NoC has been designed, verified, synthesized, placed and routed, and timing analyzed. A 5 port, 256
bit data bus, 4 bit label router occupies 0.431 mm2 in 130nm and delivers peak bandwidth of 80 Gbits/s per link at 312.5 MHz. The LS Router is estimated to consume 43.08 mW.
Bandwidth and latency guarantees of LS-NoC have been demonstrated on streaming applications like HiperLAN/2 and the Object Recognition Processor, Constant Bit Rate traffic patterns and video decoder traffic representing Variable Bit Rate traffic. LS-NoC was found to have a competitive Area×Power/Throughput figure of merit compared with state-of-the-art NoCs providing QoS. We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements.
Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage, threshold voltage, activity and coupling factors. An optimal link configuration, in terms of the number of pipeline stages for a given link length and desired operating frequency, is arrived at. Optimal configurations of all links in the NoC are identified and a power-performance optimal NoC is presented. We present a latency, power and performance trade-off study of NoCs using link microarchitecture exploration. The design and implementation of a framework for such a design space exploration study is also presented. We present the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters.
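As a back-of-the-envelope illustration of how a pipeline depth falls out of link length and target frequency: each wire segment between flops must fit in one clock period. The 0.25 ns/mm repeated-wire delay below is an assumed round number, not a value from the thesis:

```python
import math

def pipeline_stages(length_mm, freq_ghz, wire_delay_ns_per_mm=0.25):
    """Minimum number of pipeline stages so that each wire segment's
    delay fits within one clock period. The per-mm delay of an
    optimally repeated wire is an assumed constant here."""
    total_delay_ns = length_mm * wire_delay_ns_per_mm
    period_ns = 1.0 / freq_ghz
    return max(1, math.ceil(total_delay_ns / period_ns))

# The 2D Torus case study's longest link is 8.15 mm (Figure 3.7); at a
# 2 GHz target and the assumed 0.25 ns/mm it needs 5 stages, while the
# 2.5 mm longest Mesh link needs only 2.
print(pipeline_stages(8.15, 2.0), pipeline_stages(2.5, 2.0))
```

In the actual flow this calculation is done per link with extracted wire parameters, which is why each topology ends up with its own pipelining table.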
A SystemC based NoC exploration framework is used to explore the impact of various architectural and microarchitectural parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., and allows experimentation with various microarchitectural options for the individual links, like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results from using this framework to study a 4x4 CMP are presented. The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks [6].
One of the key findings is that the average latency of a link can be reduced by increasing
pipeline depth to a certain extent, as it enables link operation at higher link frequencies.
There exists an optimum degree of pipelining which minimizes the energy-delay product of the link. In a 2D Torus, when the longest link is pipelined by 4 stages, the least latency (1.56 times the minimum) is achieved while power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between various pipeline configurations achieving the same frequency at constant voltages.
Also, in some cases, we find that switching to a higher pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the mesh performs the best amongst the three topologies (Mesh, Torus and Folded Torus) considered in the case studies.
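The existence of an interior energy-delay optimum can be reproduced with a toy model: deeper pipelining shortens the critical wire segment (raising the achievable clock) but adds flop delay to every cycle and flop energy to every traversal. All constants below are illustrative, chosen only to show the shape of the trade-off:

```python
def link_latency_ns(stages, wire_delay_ns=2.0, flop_delay_ns=0.1,
                    router_cycles=4):
    """Clock period is set by the slowest segment: wire_delay/stages
    plus flop overhead. A flit pays router_cycles + stages cycles."""
    period = wire_delay_ns / stages + flop_delay_ns
    return (router_cycles + stages) * period

def link_energy_pj(stages, wire_energy_pj=5.0, flop_energy_pj=0.5):
    """Wire energy is roughly fixed; each extra stage adds flop energy."""
    return wire_energy_pj + stages * flop_energy_pj

def best_edp(max_stages=8):
    """Pipeline depth minimizing the energy-delay product of the link."""
    return min(range(1, max_stages + 1),
               key=lambda p: link_energy_pj(p) * link_latency_ns(p))

print(best_edp())  # 4 under these toy constants
```

Shallower configurations lose on delay, deeper ones on flop energy and per-stage latency, so the product bottoms out at an intermediate depth.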
The effects of communication overheads on the performance, power and energy of a multiprocessor chip are presented, using L1 and L2 cache sizes as primary exploration parameters and accurate interconnect, processor, on-chip and off-chip memory modelling. On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. Large caches imply larger tile area, which results in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Smaller caches potentially have a higher number of misses and more frequent off-tile communication. Energy efficient tile design is thus a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal configuration for the CMP.
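This tension admits a simple caricature: miss rate falls with L2 size (a square-root rule of thumb) while the tile edge, and hence inter-tile link latency, grows with cache area, so average memory access time has an interior minimum. Every constant below is invented for illustration and is not calibrated to the thesis's Cacti/Sapphire numbers:

```python
import math

def tile_metrics(l2_kb, base_miss=0.10, base_kb=64, kb_to_mm2=0.01):
    """Toy tile model: miss rate follows a sqrt rule of thumb; the tile
    edge (and hence inter-tile link length) grows with cache area."""
    miss_rate = base_miss * math.sqrt(base_kb / l2_kb)
    tile_area_mm2 = 2.0 + l2_kb * kb_to_mm2   # fixed core area + cache
    link_mm = math.sqrt(tile_area_mm2)        # edge of a square tile
    return miss_rate, link_mm

def mem_time_per_access(l2_kb, hit_ns=2.0, miss_penalty_ns=60.0,
                        link_ns_per_mm=0.25, hops=2):
    """Average memory time: hit cost + inter-tile transit + miss penalty."""
    miss, link = tile_metrics(l2_kb)
    on_chip = hops * link * link_ns_per_mm
    return hit_ns + on_chip + miss * miss_penalty_ns

# Sweep L2 sizes: small caches pay the miss penalty, large caches pay
# longer link latency, so an intermediate size wins.
best = min([64, 128, 256, 512, 1024, 2048, 4096], key=mem_time_per_access)
print(best)  # 1024 under these toy constants
```

The thesis's experiments perform the same sweep with cycle accurate simulation in place of this closed-form stand-in.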
Trade-offs are explored using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K point FFT). Link latencies are estimated for a 16 core CMP simulation on Sapphire. Each tile has a single processor, L1 and L2 caches and a router. Different sizes of L1 and L2 lead to different tile clock speeds, tile miss rates and tile area and hence
Acknowledgements
I thank my advisor, Prof. Bharadwaj Amrutur for his invaluable guidance throughout my
Ph.D. I thank all of you who have shared many precious moments with me and enriched
my journey through life.
Publications
Journals

• Basavaraj Talwar and Bharadwaj Amrutur, “Traffic Engineered NoC for Streaming Applications”, Microprocessors and Microsystems, 37 (2013), 333-344.
Conferences
• Basavaraj Talwar and Bharadwaj Amrutur, “A System-C based Microarchitectural Exploration Framework for Latency, Power and Performance Trade-offs of On-Chip Interconnection Networks”, First International Workshop on Network on Chip Architectures, Nov. 2008.
• Basavaraj Talwar, Shailesh Kulkarni and Bharadwaj Amrutur, “Latency, Power and Performance Trade-offs in Network-on-Chips by Link Microarchitecture Exploration”, 22nd Intl. Conference on VLSI Design, Jan. 2009.
Contents
Abstract
Acknowledgements
1 Introduction
  1.1 Network-on-Chip
  1.2 Switching Policies
    1.2.1 Circuit Switching
    1.2.2 Packet Switching
    1.2.3 Label Switching
  1.3 QoS in NoCs
  1.4 QoS Guaranteed NoC Design
  1.5 Contributions of the Thesis
    1.5.1 Link Microarchitecture Exploration
    1.5.2 Optimal CMP Tile Configuration
    1.5.3 QoS in NoCs
  1.6 Organization of the Thesis
2 Related Work
  2.1 Traffic Engineered NoC for Streaming Applications
    2.1.1 QoS in Packet Switched Networks
    2.1.2 QoS in Circuit Switched Networks
    2.1.3 QoS by Space Division Multiplexing
    2.1.4 Static routing in NoCs
    2.1.5 MPLS and Label Switching in NoCs
    2.1.6 Label Switched NoC
  2.2 Link Microarchitecture and Tile Area Exploration
    2.2.1 NoC Design Space Exploration
  2.3 Simulation Tools
    2.3.1 Link Exploration Tools
    2.3.2 Router Power and Architecture Exploration Tools
    2.3.3 Complete NoC Exploration
    2.3.4 CMP Exploration Tools
    2.3.5 Communication in CMPs - Performance Exploration
  2.4 Summary
3 Link Microarchitecture Exploration
  3.1 Motivation for a Microarchitectural Exploration Framework
  3.2 NoC Microarchitectural Exploration Framework
    3.2.1 Traffic Generation and Distribution Models
    3.2.2 Router Model
    3.2.3 Power Model
  3.3 Case Study: Mesh, Torus & Folded-Torus
    3.3.1 NoC Topologies
    3.3.2 Round Trip Flit Latency & NoC Throughput
    3.3.3 NoC Power/Performance/Latency Tradeoffs
    3.3.4 Power-Performance Tradeoff With Frequency Scaling
    3.3.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.4 Case Study: Torus, Reduced Torus & Tree based NoC
    3.4.1 NoC Topologies
    3.4.2 NoC Throughput
    3.4.3 NoC Power/Performance/Latency Tradeoffs
    3.4.4 Power-Performance Tradeoff With Frequency Scaling
    3.4.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.5 Conclusion
4 Tile Exploration
  4.1 Motivation
  4.2 Observations and Contributions
  4.3 Background
  4.4 Communication Time and Energy Efficiency
  4.5 Experimental Setup
    4.5.1 Experimental Methodology
  4.6 Effect of Link Latency on Performance of a CMP
  4.7 Communication in CMPs
  4.8 Program Completion Time
  4.9 Ideal Interconnects, Custom Floorplanning, L2 Banks and Process Mapping
  4.10 Remarks & Conclusion
5 Label Switched NoC
  5.1 Streaming Applications in Media Processors
    5.1.1 HiperLAN/2
    5.1.2 Object Recognition Processor
  5.2 LS-NoC - Motivation
  5.3 LS-NoC - The Concept
  5.4 LS-NoC - Working
  5.5 Label Switched Router Design
    5.5.1 Pipes & Labels
    5.5.2 Label Swapping
  5.6 Simulation and Functional Verification
  5.7 Synthesis Results
  5.8 Conclusion
6 LS-NoC Management
  6.1 LS-NoC Management
    6.1.1 NoC Manager
    6.1.2 Traffic Engineering in LS-NoC
  6.2 Flow Based Pipe Identification
  6.3 Fault Tolerance in LS-NoC
  6.4 Overhead of NoC Manager
    6.4.1 Computational Latency
    6.4.2 Configuration Latency
    6.4.3 Scalability of LS-NoC
  6.5 Number of Pipes in an NoC
    6.5.1 Minimum, Maximum and Typical Pipes in a Network
  6.6 Conclusion
7 Label Switched NoC
  7.1 HiperLAN/2 baseband processing + Object Recognition Processor SoC
  7.2 Video Streaming Applications
  7.3 Discussion
    7.3.1 Design Philosophy of LS-NoC
    7.3.2 LS-NoC Application
    7.3.3 LS-NoC Evaluation
  7.4 Conclusion
8 Conclusion and Future Work
  8.1 Link Microarchitecture Exploration
  8.2 Optimal CMP Tile Configuration
  8.3 Label Switched NoC for Streaming Applications
  8.4 Future Work
A Interface and Outputs of the SystemC Framework
B Testing & Validation of LS-NoC
  B.1 Implementation of LS-NoC Router
  B.2 Testing and Validation of LS-NoC Router
    B.2.1 Individual Router
    B.2.2 Router in 8×8 Mesh
  B.3 Synthesis & Place and Route
C The Flow Algorithm
  C.1 Ford-Fulkerson’s MaxFlow Algorithm
  C.2 Input Graph
  C.3 Edges in the Input Graph
Bibliography
List of Tables
3.1 ICN exploration framework parameters.
3.2 Traffic Generation/Distribution Model and Experiment Setup for the Mesh, Torus & Folded-Torus case study.
3.3 Links and pipelining details of NoCs.
3.4 DLA traffic, Frequency crossover points in 2D Mesh.
3.5 Comparison of 3 topologies for DLA traffic.
3.6 Experimental Setup.
3.7 Links and pipelining details of NoCs.
3.8 Power optimal frequency trip points in various NoCs.
3.9 Comparison of 3 topologies. Maximum interconnect network performance and power consumption for varying pipe stages.
4.1 Configuration parameters of processors, caches & interconnection network used in experiments.
4.2 Scaled processor power over L1 configurations.
4.3 Primary and Secondary cache parameters (access time, area) obtained from Cacti. L2 access latencies as a function of L1 access times are also shown.
4.4 Max operating frequencies and dynamic energy per access of various L1/L2 caches. Values were calculated using Cacti power models using 32nm PTM.
4.5 Lengths of links between L1/L2 caches & routers and between routers of neighbouring tiles for a regular mesh placement. No. of pipeline stages required to meet the maximum frequency are also shown.
4.6 FFT. Power spent in links (in mW).
4.7 Total messages in transit (in Millions).
4.8 Clustered tile placement floorplan for L1: 256KB and L2: 512KB. Lengths of links between neighbouring routers and number of pipeline stages are shown. Frequency: 1.38 GHz.
5.1 Communication characteristics between HiperLAN/2 nodes.
5.2 Routing table of an n port (n = 5) router with an lw bit (lw = 4) label, indexed by labels used in the label switched NoC. Size of the routing table = 2^lw × n × lw.
5.3 Simulation parameters used for functional verification of the label switched router design.
5.4 Synthesis Parameters.
5.5 Synthesis results for 2 Router and Mesh networks. Area of a Router is 0.431 mm2.
6.1 NoC Manager Overhead.
7.1 Pipes set up for HiperLAN/2 baseband processing SoC and Object Recognition Processor SoC (Figure 7.1(a)). PEC[0-7]→PEC[0-7]: every PEC communicates with every other PEC.
7.2 Standard test videos used in experiments.
7.3 Evaluation of the proposed Label Switched Router and NoC. CS: Circuit switched, PS: Packet switched.
A.1 ICN exploration framework parameters and their default values.
C.1 Routing tables at R0 I0, R0 I2 and R1 I4 nodes after pipes P0 and P1 have been set up.
List of Figures
1.1 Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.
2.1 Floorplan used in estimating wire lengths. Wire lengths estimated from these floorplans are used as input to Intacte to arrive at a power optimal configuration and latency in clock cycles. Horizontal R-R: Link between neighbouring routers in the horizontal direction, Vertical R-R: Link between neighbouring routers in the vertical direction.
3.1 Architecture of the SystemC framework.
3.2 Flow of the ICN exploration framework.
3.3 Flit header format. DSTID/SRCID: Destination/Source ID, SQ: Sequence Number, RQ & RP: Request and Response Flags, and a 13 bit flit id.
3.4 Example flit header formats considered in this experiment. (DST/SRCID: Destination/Source ID, HC: Hop Count, CHNx: Direction at hop x).
3.5 Schematic of 3 compared topologies (L to R: Mesh, Torus, Folded Torus). Routers are shaded and Processing Elements (PE) are not.
3.6 Normalized average round trip latency in cycles vs. Traffic injection rate in all the 3 NoCs.
3.7 Max. frequency of links in 3 topologies. Lengths of longest links in Mesh, Torus and Folded 2D Torus are 2.5mm, 8.15mm and 5.5mm.
3.8 Total NoC throughput in 3 topologies, DLA traffic.
3.9 Avg. round trip flit latency in 3 NoCs, DLA traffic.
3.10 2D Mesh Power/Throughput/Latency trade-offs for DLA traffic. Normalized results are shown.
3.11 2D Mesh Power/Throughput/Latency trade-offs for SLA traffic.
3.12 DLA Traffic, 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.13 DLA Traffic, Folded 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.14 Frequency scaling on 3 topologies, DLA Traffic.
3.15 Dynamic voltage scaling on 2D Mesh, DLA Traffic. Frequency scaled curve for P=8 is also shown.
3.16 Schematic representation of the three compared topologies (L to R: 2D Torus, Tree, Reduced 2D Torus). Shaded rectangles are Routers and white boxes are source/sink Processing Element (PE) nodes.
3.17 Floorplans of the three compared topologies.
3.18 Maximum attainable frequency by links in the respective topologies. Estimated length of the longest link in a 2D Torus is 7mm. Estimated longest link in the Tree based and Reduced 2D Torus is 3.5mm.
3.19 Variation of total NoC throughput with varying pipeline stages in all three topologies.
3.20 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.21 Reduced 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.22 Variation of NoC power with throughput for each topology.
3.23 Effects of dynamic voltage scaling on the power and performance of a 2D Torus. Highest frequencies of operation for P=1, 2, 4 and 7 are 0.93GHz, 1.68GHz, 2.92GHz and 4.22GHz. Power consumption of the frequency scaled NoC is shown for comparison.
4.1 Error in performance measurement between real and ideal interconnect experiments.
4.2 Schematic of a multiprocessor architecture comprising tiles and an interconnecting network. Each tile is made up of a processor, L1 and L2 caches.
4.3 Flowchart illustrating the steps in the experimental procedure.
4.4 Tile floorplans for different (L1, L2) sizes. From left: (8KB, 64KB), (64KB, 1MB), (128KB, 4MB).
4.5 Mesh floorplans used in experiments. From left: conventional 2D Mesh topology, a clustered topology, a clustered topology with L2 bank and thread mapping, and a mesh topology with L2 bank and thread mapping.
4.6 Benchmark execution time vs. Communication time - DRAM access time and On-chip transit time vs. L2 cache size vs. Program completion time.
4.7 Program energy vs. Communication time.
4.8 64K point FFT benchmark execution time vs. Total time spent in on-chip message transit. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.9 64K point FFT execution time vs. Total time spent in DRAM (off-chip) accesses. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.10 Total messages over all the links during the execution of the benchmark and Average transit time of a message.
4.11 FFT. Total instructions executed and power spent in the memory hierarchy and on-chip links during the execution.
4.12 FFT Benchmark. Energy per Instruction and Instructions per second² per Watt.
4.13 Y1: PCT, Y2: on-chip transit and off-chip comm. times.
4.14 FFT benchmark results. (Program Completion Time, comm.: communication)
4.15 FFT benchmark results.
4.16 Program Completion Times.
4.17 Alternative Tile Placements, custom process scheduling example and ideal interconnect comparison results. Benchmark: FFT, L1: 256K, L2: 512K.
5.1 (a) Process graph of a HiperLAN/2 baseband processing SoC[7] and (b) NoC of the Object recognition processor[8].
5.2 A 64 Node, 8×8 2D LS-NoC along with NoC Manager interface to routing tables.
5.3 Pipe establishment and label swapping example in a 3×3 LS-NoC.
5.4 Label Switched Router with single cycle flit traversal. Valid signal identifies Data and Label as valid. PauseIn and PauseOut are flow control signals for downstream and upstream routers. Routing table has output port and label swap information. Arbiter receives input from all the input ports along with the flow control signal from the downstream router.
5.5 Label conflict at R1 resolved using Label swapping. il: Input Label, Dir: Direction, ol: Output Label.
6.1 Surveillance system showing the application of LS-NoC in the Video computation server.
6.2 (a) A 2 router, 6 communicating nodes linear network. (b) Multiple source, multiple sink flow calculation in a network.
6.3 (a) Number of pipes in a linear network (Fig. 6.2(a)), lw = 3 bits, varying constraints. Constraint 1: Max 1 pipe per sink. (b) Max. number of pipes in 2D Mesh (Fig. 5.2).
7.1 (a) Process blocks of HiperLAN/2 baseband processing SoC and Object recognition processor mapped on to an 8×8 LS-NoC. Pipe 1: PEC0 → PEC6, Pipe 2: MP → PEC3. (b) Flows set up for CBR & VBR traffic.
7.2 Latency of HiperLAN/2 and ORP pipes in LS-NoC over varying injection rates of non-streaming application nodes. Latencies of non-provisioned paths are titled (U).
7.3 (a) Latency of CBR traffic over various injection rates of non-streaming nodes in LS-NoC. (b) Latency of VBR traffic over various injection rates of non-streaming nodes in LS-NoC.
7.4 LS-NoC being used alongside a best effort NoC.
B.1 Modules in LS-NoC router design shown along with testbench, implemented in Verilog.
B.2 Test cases used to verify an individual LS-NoC router.
B.3 8×8 mesh used for testing LS-NoC.
B.4 Traffic test cases used to verify proper functioning of LS-NoC router.
B.5 Flowchart illustrating steps in Synthesis and Place & Route of the LS-NoC router.
B.6 Placed and routed output - Single Router.
C.1 Steps in the flow algorithm example. (a) Input Graph. Maximum flows have to be identified between nodes X & Y. (b) Available capacities of links after flows X→A→C→Y & X→B→C→Y are set up. (c) Residual network showing available capacities of links in the forward direction and utilized capacity in the reverse. (d) Residual network after adding the flow X→A→C→B→D→E→Y. (e) Final output of the maxflow algorithm showing 3 flows from X to Y.
C.2 (a) A 2 router, 6 source+sink system used for validation of the LS-NoC router design. Graph representation of the system used as input to the flow algorithm is shown in (b).
C.3 The NoC after two pipes, P0 and P1, have been established. P0: R0S0 → R1D2 and P1: R0S2 → R1D0.
Chapter 1
Introduction
1.1 Network-on-Chip
Network on Chips[1][2][3][4] are critical elements of modern Chip Multiprocessors (CMPs)
and System on Chips (SoCs). Network on Chips (NoCs) help manage high complexity of
designing large chips by decoupling computation from communication. SoCs and CMPs
have a multiplicity of communicating entities like programmable processing elements,
hardware acceleration engines, memory blocks as well as off-chip interfaces. Using an
NoC enables modular design of communicating blocks and network interfaces. NoCs
help achieve a well structured design enabling higher performance while servicing larger
bandwidths compared to bus based systems[1]. Links in NoCs designed with controlled
electrical parameters can use aggressive signaling circuits to reduce power and delay[9].
Network resources are utilized more efficiently in NoCs as compared to global wires[10].
Communication patterns between communicating entities are application dependent.
As a result, NoCs are expected to cater to diverse connections varying in forms of connec-
tivity, burstiness, latency and bandwidth requirements. NoCs servicing communication
requirements in CMPs or SoCs are expected to meet Quality of Service (QoS) demands such
as maximum or average latency, typical or peak bandwidth and required throughput of
executing applications. Further, with power having become a serious design constraint[5],
there is a great need for designing NoCs that meet the target communication requirements
while minimizing power, using various strategies at the architecture, microarchitecture
and circuit levels of the design.
1.2 Switching Policies
Switching policies configure paths in routers to facilitate data transfer between input and
output ports. Programming of internal switches in routers to connect input ports to out-
put ports and determination of when and which data units are transferred is accomplishedusing switching policies. Flow control mechanisms synchronize data transfer between
router and traffic sources and between two routers. Switching policies and flow control
mechanisms influence the design of internal switches, routing and arbitration units, and
the amount of buffers in a router. The major types of switching policies are introduced
here.
1.2.1 Circuit Switching
Circuit switching is a reservation based switching policy in which network resources are
allocated to a communication path before data is transferred. At the end of data transfer,
reserved resources are de-allocated and are available for future circuits. As circuits are
used on a reservation basis, circuit switching requires a simple router design with few
or no buffers.
Circuits are established using path identifying probe packets that reserve resources
as they propagate towards the destination. The circuit establishment is complete after
an acknowledgment message is received by the source. Data is transferred along the cir-
cuit without further monitoring or control. After the transfer is complete, the circuit is
torn down and resources freed using a tail packet. Popular examples of circuit switched
networks are Autonomous Error-Tolerant Cell[11], Asynchronous SoC[12], Crossroad[13],
dTDMA[14], Point to point network on real time systems[15], Programmable NoC for
FPGA-based systems[16], ProtoNoC[17], Space Division Multiplexing based NoC[18],
SoCBuS[19], Reconfigurable Circuit Switched NoC[7], etc.
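The probe / acknowledgment / tail-packet sequence described above can be sketched as a toy reservation protocol. The `Router` class, port names and rollback behaviour below are illustrative assumptions, not taken from any of the cited designs:

```python
class Router:
    """Minimal circuit-switched router model: one reservable crosspoint per port."""
    def __init__(self, name):
        self.name = name
        self.reserved = set()          # output ports currently held by a circuit

    def reserve(self, port):
        if port in self.reserved:      # port locked by another circuit
            return False
        self.reserved.add(port)
        return True

    def release(self, port):
        self.reserved.discard(port)

def establish_circuit(path):
    """Probe phase: walk the path, reserving each hop's output port.
    Returns True once the 'acknowledgment' reaches the source; rolls back
    partial reservations on contention."""
    taken = []
    for router, out_port in path:
        if not router.reserve(out_port):
            for r, p in taken:         # probe blocked: free partial reservation
                r.release(p)
            return False
        taken.append((router, out_port))
    return True                        # circuit ready; data needs no further control

def teardown_circuit(path):
    """Tail packet: free every reserved resource for future circuits."""
    for router, out_port in path:
        router.release(out_port)
```

A second circuit requesting any port held by an established circuit is refused until the first circuit is torn down, mirroring the lock-down behaviour of reservation-based switching.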
1.2.2 Packet Switching
In packet switching, the message to be transmitted is partitioned and transmitted as
fixed-length packets. Routing and control is handled on a per packet basis. The packet
header includes routing and other control information needed for the packet to reach
the destination. Packet switching increases network resource utilization as communica-
tion channels share resources along the path. Buffers and arbitration units in routers
manage resource conflicts and storage demands in communication paths. Packet switch-
ing networks aid IP block re-use and are scalable[20]. Packet-switching is more flexible
than circuit switching though it requires buffering and introduces unpredictable latency
(jitter). Popular packet switched networks are Asynchronous NoC[21], FAUST[22], Ar-
teris NoC[23], Butterfly Fat Tree[24], DyAD[25], Eclipse[26], MANGO[27], Proteo[28],
QNoC[29], SPIN[30], etc. Some NoC designs can adaptively work in circuit or packet
switched modes based on traffic requirements. A few examples are Æthereal[31], Hetero-
geneous IP Block Interconnection[32], dynamically reconfigurable NoC[33], Octagon[34],
etc.
1.2.3 Label Switching
Label switching is used by technologies such as ATM[35][36] and Multiprotocol Label
Switching (MPLS)[37] as a packet relaying technique. Individual packets carry route in-
formation in the form of labels. A label denotes a common route that a set of data packets
traverse. Therefore, a minimalistic label identifies the source hop and the destination hop
along with the intermediate transit routers. Along with routing information, labels can
be used to specify service priorities to packets. This feature of labels enables use of dif-
ferentiated services for packets using common labels. Routers along the path use the
label to identify the next hop, forwarding information, traffic priority, Quality of Service
guarantees and the next label to be assigned. Label switching inherently supports traffic
engineering, as labels can be chosen based on desired next hop or required QoS services.
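The per-hop lookup described above can be sketched as a label table. The table layout, field names and label values here are invented for illustration (real MPLS-style tables carry considerably more state):

```python
# Per-router label table: incoming label -> (output port, service class, outgoing label).
# The label-swap step mirrors MPLS-style forwarding; table contents are hypothetical.
LABEL_TABLE = {
    5: ("east",  "GT", 9),   # guaranteed-throughput flow, relabelled to 9 downstream
    7: ("north", "BE", 7),   # best-effort flow, label unchanged
}

def forward(packet):
    """Resolve next hop and service class from the packet's label alone,
    then swap in the label expected by the next hop's table."""
    out_port, service_class, next_label = LABEL_TABLE[packet["label"]]
    packet["label"] = next_label
    return out_port, service_class
```

Because the label alone selects route, priority and the next label, routers need neither source nor destination addresses in the packet header.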
A few proposals of label switched NoCs are MPLS NoC[38], Nexus[39] and Blackbus[40].
1.3 QoS in NoCs
NoCs servicing CMPs and SoCs are expected to meet Quality of Service (QoS) demands
of executing applications. Latency sensitive applications demand a guaranteed average
and maximum latency on communication traffic. Jitter sensitive applications may tolerate
longer latencies but require fixed delay along communication paths. Further, among
application classes, some have higher priority than others. For example, application
data usually has higher priority than acknowledgment packets or control information.
The two basic approaches in NoC designs to enable QoS guarantees are: creation of
reserved connections between source and destinations via circuit switching or support for
prioritized routing (in case of packet switched, connectionless paths).
Circuit switched NoCs guarantee high data transfer rates in an energy efficient manner
by reducing intra-route data storage[41]. Circuit switched NoCs provide guaranteed QoS
for worst case traffic scenarios leading to higher network resource requirements[42]. These
are well suited for streaming traffic generated by media processors where communication
requirements are well known a priori. One drawback is under-utilization of network
resources, as resources are reserved for peak bandwidth while the average
requirement may be lower.
Packet switched networks provide efficient interconnect utilization and high throughputs[43]
while providing fairness amongst best effort flows. However, network resources in packet
switched networks need to be over-provisioned to support QoS for various traffic classes
and have high buffer requirements in routers. Packet switching networks usually provide
QoS by differentiated services to traffic by classifying them into various classes[29]. Pri-
oritized services are provided to traffic belonging to each class. Due to the sharing of
network resources, packet switched networks can be configured to provide Guaranteed
Throughput (GT) for a few classes of traffic and Best Effort (BE) services for remaining
classes.
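A minimal sketch of class-based differentiated service, assuming just two classes (GT and BE) and strict priority between them; real designs such as QNoC use more classes and fairer schedulers:

```python
from collections import deque

# One queue per traffic class. "GT" (Guaranteed Throughput) is always served
# before "BE" (Best Effort); the two-class split is the smallest form of the
# differentiated services described above.
queues = {"GT": deque(), "BE": deque()}

def arbitrate():
    """Return the next flit to transmit, strictly prioritizing GT traffic."""
    for cls in ("GT", "BE"):
        if queues[cls]:
            return queues[cls].popleft()
    return None   # both queues empty
```

The sketch also exposes the drawback noted below: if all traffic is enqueued in one class, the prioritization is vacuous and the scheme degenerates to plain FIFO service.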
With traffic engineering enabled label switching networks, communication loads can
be distributed over the NoC, resulting in fair allocation of network resources. Network
resource guarantees enable paths with little or no jitter while keeping network utilization
fairly high. Further, the design of routers is simplified compared to conventional wormhole
routers[40].
1.4 QoS Guaranteed NoC Design
Media processors with streaming traffic such as HiperLAN/2 Baseband Processors[7],
Real-time Object Recognition Processors[8] and H.264 encoders[44][45] demand adequate bandwidth and bounded latencies between communicating entities. They also have
well known communication patterns and bandwidth requirements. Adequate throughput,
latency and bandwidth guarantees between process blocks have to be provided for such
applications. Nature of streaming applications in media processors and characteristics of
streaming traffic are illustrated in Section 5.1 of Chapter 5.
Guaranteeing QoS by NoCs involves guaranteeing bandwidth and throughput for con-
nections and deterministic latencies in communication paths. This thesis proposes a QoS
guaranteeing NoC using label switching where bandwidth can be reserved while links are
shared. The traffic is engineered during route setup and it leverages advantages of both
packet and circuit switching techniques. We propose a QoS based Label Switched NoC
(LS-NoC) router design. We present a latency, power and performance optimal intercon-
nect design methodology considering low level circuit and system parameters. Further,
optimal tile configurations are identified using effects of application communication traffic
on performance and energy in chip multiprocessors (Figure 4.2).
A label switched, QoS guaranteeing NoC, that retains advantages of both packet
switched and circuit switched networks is the main focus of this thesis. Congestion free
communication pipes are identified by a centralized Manager with complete network vis-
ibility. Label Switched NoC (LS-NoC) sets up communication channels (pipes) between
communicating nodes that are independent of existing pipes and are contention free at the
routers. Deterministic delays and bandwidth are guaranteed in newly established pipes,
taking into account established flows. Residual bandwidth in links reserved by a pipe can
be utilized by other pipes, thus enabling sharing of physical links between pipes without
compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of
spatial separation of the communicating entities.

Figure 1.1: Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.
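The link-sharing admission decision described above can be sketched as a check run by a centralized manager. The link names, bandwidth units and single-route check below are simplifying assumptions, not the thesis's actual flow identification algorithm:

```python
# Residual capacity per directed link, in MB/s (hypothetical values). A new pipe
# is admitted only if every link on its route can carry the requested bandwidth;
# admission then debits the links, so later pipes may still share the same
# physical links using the leftover capacity.
residual = {("R0", "R1"): 800, ("R1", "R2"): 800}

def admit_pipe(route, bandwidth):
    """Admit a bandwidth-guaranteed pipe along `route` (a list of links), or refuse."""
    if any(residual[link] < bandwidth for link in route):
        return False
    for link in route:
        residual[link] -= bandwidth    # reserve; the remainder stays shareable
    return True
```

A refused request leaves all link capacities untouched, so QoS guarantees of already established pipes are never compromised.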
Interconnect delay and power contribute significantly towards the final performance
and power numbers of a CMP[46]. Design variables for interconnect exploration include
wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage
(Vdd), threshold voltage (Vth), and activity and coupling factors. A power and performance
optimal link microarchitecture can be arrived at by optimizing these low level link
parameters. A methodology to arrive at the optimal link configuration in terms of number
of pipeline stages (cycle latency) for a given length of link and desired operating fre-
quency is presented. Optimal configurations of all links in the NoC are identified and a
power-performance optimal NoC thus achieved.
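The link-configuration step can be sketched under the simplifying assumption that wire delay grows linearly with segment length; the delay and frequency numbers are placeholders, not results from the thesis framework, where a repeater-inserted wire would be characterized by circuit-level simulation:

```python
import math

def min_pipeline_stages(link_mm, delay_ps_per_mm, freq_ghz):
    """Smallest number of pipeline stages such that each wire segment's delay
    fits in one clock cycle at the desired operating frequency.

    Placeholder model: segment delay scales linearly with segment length.
    """
    cycle_ps = 1000.0 / freq_ghz              # clock period in picoseconds
    total_delay = link_mm * delay_ps_per_mm   # end-to-end wire delay
    return max(1, math.ceil(total_delay / cycle_ps))
```

For example, a 6 mm link at an assumed 100 ps/mm needs two stages at 2 GHz but three at 5 GHz, illustrating how the target frequency drives the cycle latency of each link.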
Primary and secondary cache sizes have a major bearing on the amount of on-chip
and off-chip communication in a Chip Multiprocessor (CMP). On-chip and off-chip com-
munication times have significant impact on execution time and the energy efficiency of
CMPs. From a performance point of view, cache accesses should suffer minimum delay
and off-tile communication due to cache misses should be negligible. Large caches dissi-
pate more leakage energy and may exceed area budgets though they reduce cache misses
and decrease off-tile communication. Larger caches result in longer inter-tile communi-
cation link lengths and latencies, thus adversely impacting communication time. Small
caches reduce occupied tile area, have higher activity and dissipate less leakage
energy. The drawback of smaller caches is a potentially higher number of misses and more
frequent off-tile communication. This illustrates the trade-off between cache size, miss rate,
NoC communication latency and power. Energy efficient tile design is a configuration
exploration and trade-off study using different cache sizes and tile areas to identify a
power-performance optimal cache size and NoC configuration for the CMP.
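One way to frame the trade-off study is selecting the tile configuration with the minimum energy-delay product. The candidate points below are invented for illustration, not measured results from the thesis experiments:

```python
# Hypothetical (cache_kB, execution_time_s, energy_J) points illustrating the
# trade-off: larger caches cut miss-driven communication time but add leakage energy.
configs = [
    (32,  1.40, 2.0),
    (64,  1.10, 2.1),
    (128, 1.00, 2.6),
]

def best_config(points):
    """Pick the power-performance optimal point by minimum energy-delay product."""
    return min(points, key=lambda p: p[1] * p[2])
```

In this invented sweep, the middle configuration wins: the smallest cache loses on execution time, the largest on leakage energy, matching the qualitative argument above.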
1.5 Contributions of the Thesis
Work in this thesis presents methodologies for label switched QoS guaranteed NoC design,
link microarchitecture exploration and optimal Chip Multiprocessor (CMP) tile configu-
rations. Contributions from this thesis are listed here:
1.5.1 Link Microarchitecture Exploration
• Optimal Link Design and Exploration Framework: We present a simulation framework
developed in SystemC which allows the designer to explore NoC design across low
level link parameters such as pipelining, link width, wire pitch, supply voltage, op-
erating frequency and NoC architectural parameters such as router type and topol-
ogy of the interconnection network. We use the simulation framework to identify
power-performance (Energy-Delay) optimal link configuration in a given NoC over
particular traffic patterns. Such an optimum exists because increasing pipelining
allows for shorter length wire segments which can be operated either faster or with
lower power at the same speed.
• Optimum Pipe Depth: Contrary to intuition, we find that increasing pipeline depth
can actually help reduce latency in absolute time units, by allowing shorter links
& hence higher frequency of operation. In some cases, we find that switching to
a higher pipelining configuration can actually help reduce power as the links can
be designed with smaller repeaters. Larger NoC power savings can be achieved by
voltage scaling along with frequency scaling. Hence it is important to include the
link microarchitecture parameters as well as circuit parameters like supply voltage
during architecture design exploration of NoCs.
1.5.2 Optimal CMP Tile Configuration
• Optimal Cache Size: The performance-power optimal L1/L2 configuration of a tile
is close to the configuration that spends least amount of time in on-chip and off-chip
communication.
• Effect of Floorplanning and Process Mapping: Communication aware floorplanning can reduce up to 2.6% of the energy spent in the execution of an instruction and up to
11% savings in communication power during the execution of the program. Mapping
L2 banks in the same core as the processes accessing it reduces time spent in commu-
nication and hence the overall program completion time and also has a bearing on
the Total Energy spent in the execution of the program. Experiments have revealed
that as much as 2% of energy per instruction can be saved by communication-aware
process scheduling compared to conventional thread mapping policies in a 2D Mesh
architecture.
1.5.3 QoS in NoCs
• A Label Switching NoC providing QoS guarantees: We present a LS-NoC to service
QoS demands of streaming traffic in media processors. A centralized NoC Man-
ager capable of traffic engineering establishes bandwidth guaranteed communication
channels between nodes. LS-NoC guarantees deterministic path latencies, satisfies
bandwidth requirements and delivers constant throughput. Delay and throughput
guaranteed paths (pipes) are established between source and destinations along con-
tention free, bandwidth provisioned routes. Pipes are identified by labels unique to
each source node. Labels need fewer bits compared to node identification numbers
- potentially decreasing memory usage in routing tables.
• NoC Manager with traffic engineering capabilities: The NoC Manager utilizes flow
identification algorithms to identify contention free, bandwidth provisioned paths
in LS-NoC called pipes. The LS-NoC Manager has complete visibility of the state
of LS-NoC. Bandwidth requirements of the application are taken into account to
provision routes between communicating nodes by the flow identification algorithm.
The flow based pipe establishment algorithm is topology independent and hence the
NoC Manager supports applications mapped to both regular chip multiprocessors
(CMPs) and customized SoCs with non-conventional NoC topologies. Additionally,
fault tolerance is achieved by the NoC Manager by considering link status during
pipe establishment.
• Design of a Label Switched Router: The Label Switched (LS) Router used in LS-
NoC achieves single cycle traversal delay in the absence of contention and is multicast and
broadcast capable. Source nodes in the LS-NoC can work asynchronously as cycle
level scheduling is not required in the LS Router. The LS Router supports multiple clock
domain operation. Dual clock buffers can be used at output ports in the LS-NoC
router. This eases clock domain crossovers and reduces the need for a single globally
synchronous clock. As a result, clock tree design is less complex and clock power is
potentially saved.
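The flow identification used by the NoC Manager above is a max-flow computation (a worked example appears in Appendix C). A minimal Edmonds–Karp sketch on a dict-of-dicts capacity graph, with an invented topology, is:

```python
from collections import deque

def max_flow(cap, src, dst):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity graph; each augmenting
    path found corresponds to one candidate contention-free pipe route."""
    # Residual graph: forward capacities plus zero-capacity reverse edges.
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {src: None}
        q = deque([src])
        while q and dst not in parent:            # BFS for a shortest augmenting path
            u = q.popleft()
            for v, c in res.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if dst not in parent:                     # no augmenting path remains
            return flow
        path, v = [], dst
        while parent[v] is not None:              # walk back to the source
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:                         # push flow, update residuals
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck
```

On a small graph where two unit-capacity paths join at a shared node, the algorithm correctly finds both, as in the Appendix C example's incremental path discovery.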
1.6 Organization of the Thesis
Chapter 2 highlights several works from current literature related to the broad areas
of QoS guaranteed NoCs, link microarchitecture, design space exploration of NoCs and
effects of communication on energy and performance trade-offs in CMPs.
Chapter 3 presents a latency, power and performance trade-off study of NoCs through
link microarchitecture exploration using microarchitectural and circuit level parameters.
The NoC exploration framework used in the trade-off studies is described. The interface to
the SystemC framework and sample output logs generated are presented in Appendix A.
Effects of on-chip and off-chip communication due to various CMP tile configurations
are explored in Chapter 4. The need to use detailed interconnection network models to
identify optimal energy and performance configurations is also highlighted. On-chip and
off-chip communication effects on power and performance of CMPs are explored. Effects of
communication on program execution times and program execution energy are presented.
Further, energy-performance results for tile configurations and effects of custom L2 bank
mapping and thread mapping on power and performance of CMPs are presented.
Design and implementation of a label switching, traffic engineering capable NoC de-
livering guaranteed QoS for streaming traffic in media processors has been presented in
Chapter 5. Traffic characteristics of streaming applications are also presented in the chap-
ter. Functional verification of the LS-NoC router using various test cases is presented
in Appendix B. Chapter 6 illustrates the LS-NoC management framework and the flow
identification algorithm used to establish pipes. An example of the use of the flow algorithm
is presented in Appendix C. Streaming application test cases and various types of
video traffic are used to establish LS-NoC as a QoS guaranteeing framework in Chapter
7. The thesis concludes in Chapter 8 after outlining possible future extensions of the
proposed work.
Chapter 2
Related Work
Several publications have highlighted the need for solutions to pressing problems in various
domains in the broad area of Network-on-Chips[47][48][49][50]. This chapter introduces
relevant works in the broad areas of QoS guaranteed Network-on-Chips, design space
exploration of NoCs and effects of communication on energy and performance trade-offs
in CMPs.
2.1 Traffic Engineered NoC for Streaming Applica-
tions
Providing QoS guarantees in on-chip communication networks has been identified as one
of the major research problems in NoCs[48]. QoS solutions in packet switched networks use
priority based services, while circuit switched NoCs use some form of resource reservation.
We introduce a few well known QoS solutions from the literature and compare our work with
the state of the art. Packet switched NoCs use differentiated services for traffic classes
[29][22][21][8] to provide latency and bandwidth guarantees. Circuit switched NoCs use
resource reservation mechanisms to guarantee QoS[34][51][41][19]. Resource reservation
mechanisms involve identifying a sufficiently resource rich path, reserving resources along
the path, configuration, actual communication and path tear down. A fairly extensive
survey of NoC proposals has been presented in [50]. Relevant QoS NoCs are discussed in
this section.
2.1.1 QoS in Packet Switched Networks
QoS NoC (QNoC) presented by Bolotin et al.[29] is a customized QoS NoC architecture
based on a 2D Mesh to satisfy QoS by allocating frequently communicating nodes close-by,
doing away with unnecessary links, tailoring link width to meet bandwidth requirements
and balancing link utilization. Inter-module communication traffic is classified into four
classes of service: signaling, real-time, RD/WR and block-transfer. FAUST[22] is a recon-
figurable baseband platform based on an asynchronous NoC providing a programmable
communication framework linking heterogeneous resources. FAUST uses 2 level priority
based virtual circuit design in its Network Interface (NI) to provide QoS guarantees. Asyn-
chronous NoCs[21] use clock-free interconnect to improve reliability and delay-insensitive
arbiters to solve routing conflicts. A QoS Router with both soft (Soft GT) and hard (Hard
GT) guarantees for globally asynchronous, locally synchronous (GALS) NoCs is presented
in [52]. Leftover bandwidth in routers servicing Hard GT is utilized by Soft GT connec-
tions and best effort traffic. NoCs presented in [21], [52] and [53] employ multiple priority
levels to provide differentiated services and guarantee QoS. The MANGO [27][54] NoC
provides hard GT by prioritizing each GT connection and adopts Asynchronous Latency
Guarantee (ALG) scheduling to prevent starvation of packets with lower priority.
One of the major drawbacks of priority based QoS schemes is that an increase in traffic
in one priority class affects the delay of traffic belonging to other classes. A priority
network loses the differentiated services advantage if all traffic belongs to the same
priority level. Further, deadlock-free routing algorithms using virtual circuits with a
priority approach may lead to degradation in NoC throughput. In cases where connections
cannot be overlapped with each other (eg. MANGO NoC), increased number of hard GT
connections will lead to increased cost in network resources.
Another class of packet switched NoCs using priority based QoS solutions are applica-
tion specific SoCs. A tree based hierarchical packet-switched NoC for a real-time object
recognition processor is implemented in [8]. The tree topology NoC with three crossbar
switches interconnects 12 IPs and supports both bursty (for image traffic) and non-bursty (for
control and synchronization signals) traffic. Network resources in this NoC are tailored
to meet throughput and bandwidth demands of the application, and hence the design is
not a generic solution for servicing QoS in a CMP environment.
2.1.2 QoS in Circuit Switched Networks
Resource reservation between communicating nodes involves identifying a path using
point-to-point links, a path probing service network, or an intelligent, traffic aware
distributed or centralized manager. Hu et al.[15] introduce point-to-point (P2P) commu-
nication synthesis to meet timing demands between communicating nodes using bus width
synthesis. Circuit switched bus based QoS solutions such as Crossroad[13], dTDMA[14]
and Heterogeneous IP Block Interconnection (HIBI)[32] rely on communication localiza-
tion to satisfy timing demands. NEXUS[39] is a resource reservation based QoS NoC
for globally asynchronous, locally synchronous (GALS) architectures. NEXUS uses an
asynchronous crossbar to connect synchronous modules through asynchronous channels
and clock-domain converters.
P2P networks do not share communication links between multiple nodes leading to
inefficient utilization of network resources. This increases wiring resources inside the
chip and results in poor scalability. Crossbar based solutions using protocol handshakes
(for example, 4-way handshakes in NEXUS[39] and ProtoNoC[17]) force communicating
nodes to wait until the handshake is complete and the path is established. Non-interference of
communication channels is achieved by over-provisioning resources in the crossbar. This
leads to complex and poorly scalable networks. Connecting frequently communicating
nodes on a single bus will increase demand on the bus and lead to larger waiting times at
the nodes. Static routing along shortest paths does not guarantee latency bound routes
due to arbitration delays in the network.
Amongst the NoCs that use probe based circuit establishment solutions are Intel's
8×8 circuit switched NoC[41], SoCBUS[19][55] and distributed programming model in
Æthereal[51]. In these NoCs, probe packets are used to reconnoiter shortest communica-
tion paths and configure routing tables if a path (circuit) is available. Routers are locked
down and no other circuits can use the port during the lifetime of an established circuit.
If the shortest X-Y path is not available, the probe packets initiate route discovery mech-
anisms in other paths. The method involves some dynamic behaviour, as the probe might
repeat route discovery steps or retry after a random period of time if circuit setup does
not succeed. This leads to nondeterministic and sometimes large route setup times which
may be unacceptable for real time application performance.
Centralized Circuit Management
Reserved communication channels can be identified and configured using an application
aware hardware or software entity[51][34]. Such a traffic manager can provide programma-
bility of routes.
The Æthereal NoC[51] aims at providing hard guaranteed QoS using Time Division
Multiplexing (TDM) to avoid contention in a synchronous network. The centralized
programming model in Æthereal NoC[51] uses a root process to identify free slots and
configure network interfaces. Time slot tables are used in routers to reserve output ports
per input port in a particular time slot. To avoid collisions and the loss of data, con-
secutive time slots are then reserved in routers along the circuit path. The number of
paths established in the NoC is restricted by the scheduling constraints during time slot
reservation. Increasing the number of time slots in TDM based NoCs increases router size.
In cases where a communication channel cannot be found due to slots exhaustion, the
traffic division over multiple physical paths may be required[56]. Traffic division involves
reordering packets at the target node leading to increased memory and computational
costs.
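The consecutive-slot reservation used by Æthereal-style TDM can be sketched as follows; the slot-table size, router names and modular slot arithmetic are illustrative assumptions rather than details of the Æthereal implementation:

```python
NUM_SLOTS = 4

# One slot table per router along a path: slot index -> owning connection (None = free).
tables = {r: [None] * NUM_SLOTS for r in ("R0", "R1", "R2")}

def reserve_slots(path, start_slot, conn):
    """Reserve consecutive time slots along `path`: hop i uses slot start+i
    (modulo the table size), so data forwarded one hop per slot never collides.
    Rolls back to nothing on any conflict."""
    needed = [(r, (start_slot + i) % NUM_SLOTS) for i, r in enumerate(path)]
    if any(tables[r][s] is not None for r, s in needed):
        return False                  # scheduling constraint: slot already owned
    for r, s in needed:
        tables[r][s] = conn
    return True
```

The scheduling constraint mentioned above is visible here: a second connection that needs an already-owned slot at any hop is refused, even though most of the table is still empty.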
TDM techniques using slot tables in Æthereal[51] and sequencers in Adaptive System-
on-Chip[12] require a single synchronous clock distributed over the chip. Accurate global
synchronous clock distribution is expensive in terms of power. Global synchronicity can be
achieved in a distributed manner using tokens such that every router synchronizes every
slot with all of its neighbors [57]. This method will bring down the operating speed of the
NoC, as the slowest router will dictate the speed of the NoC. Further, power management
techniques such as multiple clock domains are not feasible with this approach. AElite[58]
and dAElite[59] have been proposed as improved next generation Æthereal NoCs. AElite
inherits the guaranteed services model from Æthereal. To overcome the global synchronic-
ity problem, AElite proposes use of asynchronous and mesochronous links as a possibility.
As noted in the paper[58], using mesochronous links alone may not be sufficient if routers
and NIs are plesiochronous[60]. One of the drawbacks of AElite was the number of slots
occupied by header flits: a header flit in AElite occupied one in three slots, an overhead
of up to 33%. dAElite circumvents the header flit overhead by routing
based on the time of packet injection and packet receiving. One of the disadvantages of
dAElite is an increase in the number of link wires, due to the configuration network and
also because of separate wires for end-to-end credit communication.
The Octagon NoC[34] implements a centralized best fit scheduler to configure and
manage non-overlapping connections. The scheduler cannot establish a new connection
through a port if it is blocked by another connection. This results in increased connection
establishment time at the routers and also packet losses.
2.1.3 QoS by Space Division Multiplexing
As an alternative to TDM techniques, Spatial Division Multiplexing (SDM) techniques
for QoS have been proposed in [23], [61] and [62]. SDM techniques involve sharing fractions
of links between connections simultaneously, based on bandwidth requirements of the
corresponding connections. An approach comparable to a static version of SDM called
Lane-Division-Multiplexing has been proposed in [7]. Lane-Division-Multiplexing is based
on a reconfigurable circuit switched router composed of a crossbar and data converters.
A disadvantage of the solution in [7] is that it does not support channel sharing and BE
traffic. An additional network is required for configuring the switches and for carrying
the BE traffic. Sharing a subset of wires between connections as in [63] leads to a more
complex switch design with huge delay. SDM and TDM techniques have been combined
in [64], allowing the number of supported connections to grow by increasing the number
of sub-channels in the link or the number of time slots. This increases path
establishment probability in the NoC.
In SDM based techniques, the sender serializes data on the allocated wires and the receiver
deserializes the data before forwarding it to the IP block. One of the issues in SDM based
circuits is the complexity of implementing serializers and deserializers.
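The serialize/deserialize step can be sketched in a few lines; the flit width and the number of allocated wires ("lanes") below are illustrative:

```python
def serialize(word_bits, lanes):
    """Split a flit into per-cycle groups of `lanes` bits (the subset of wires
    this connection was allocated); narrower allocations take more cycles."""
    return [word_bits[i:i + lanes] for i in range(0, len(word_bits), lanes)]

def deserialize(groups):
    """Receiver side: reassemble the flit before handing it to the IP block."""
    return [b for g in groups for b in g]
```

An 8-bit flit over a 2-wire allocation takes 4 cycles to cross the link, showing how per-connection bandwidth scales with the wire fraction granted.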
2.1.4 Static routing in NoCs
Most NoCs use traffic oblivious static routing[51] to establish communication channels
between nodes. Dimension ordered routing[41][53][17][51][34] or routes decided at de-
sign time[65] are not flexible and cannot circumvent congested paths. Routing in FPGAs
also present a similar scenario where routes between communicating nodes are bandwidth
and latency guaranteed, but are static. These routes occupy network resources along the
path for the entire lifetime of the application. QoS is guaranteed in this case by over
provisioning resources along the route.
2.1.5 MPLS and Label Switching in NoCs
Use of Multi-Protocol Label Switching for QoS[38] in NoCs and advantages of identifying
communication channels using labels have been investigated in [39],[40]. A conventional
NoC is connected to an MPLS backbone using Label Edge Routers (LERs)[38]. The
MPLS backbone applies traffic engineering and priority-based QoS services to communication channels identified by labels. The work is a direct mapping of the Internet MPLS implementation onto NoCs; the router and NoC design approach is not optimized for a hardware implementation. Results from Network Simulator-2 (NS-2) are at a functional level and may not reflect the exact performance achievable inside a chip.
Use of labels to identify communication channels instead of source and destination
identification numbers reduces the amount of metadata transmitted in the NoC. Unique
addressing at source allows label reuse and enables efficient use of the label space. Implementation of label-based addressing in streaming applications has resulted in significant
reduction in router area[40]. The work employs a method similar to label switching to achieve non-global label addressing, hence reducing the label bit width. A C × N → C routing strategy is described in conjunction with the label addressing scheme. The work in [40] presents a simple data transfer scheme and does not concentrate on rendering QoS between communicating nodes. The route establishment process is not explicitly described; one can assume that standard routing algorithms are used.
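The appeal of label-based addressing is that each router forwards on a short, locally unique label rather than on full source/destination addresses. A hypothetical per-router label table makes this concrete; the names and structure here are illustrative, not taken from [38] or [40]:

```python
class LabelSwitchRouter:
    """Forwarding based on an (input port, label) lookup. Labels are swapped
    at each hop, so they only need to be unique per link rather than
    network-wide, keeping the label field narrow and reducing per-flit
    metadata."""

    def __init__(self):
        self.table = {}  # (in_port, in_label) -> (out_port, out_label)

    def install(self, in_port, in_label, out_port, out_label):
        """Install one forwarding entry (done once, at route setup time)."""
        self.table[(in_port, in_label)] = (out_port, out_label)

    def forward(self, in_port, in_label):
        """Per-flit forwarding decision: a single table lookup."""
        return self.table[(in_port, in_label)]

router = LabelSwitchRouter()
router.install(in_port=0, in_label=5, out_port=2, out_label=9)
```

Because the lookup key is small and installed ahead of time, the per-flit datapath reduces to a table read, which is what enables the router area reductions reported in [40].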
2.1.6 Label Switched NoC
In the proposed work, we describe a Label Switched QoS guaranteeing NoC that retains
advantages of both packet switched and circuit switched networks. Contention at output
ports is tackled using communication pipes. Pipes are communication routes established along a bandwidth-rich, contention-free router path. Pipes are identified by a centralized Manager with complete network visibility.
The NoC Manager utilizes flow identification algorithms[66][67] (Algorithm 1) to establish pipes. The flow identification algorithm guarantees a deterministic delay in identifying and configuring pipes, and takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. This guarantees QoS-serviced communication paths between communicating nodes. Multiple pipes can be set up in a single link if the QoS requirements of all the pipes are satisfied, enabling physical links to be shared between pipes without compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of the spatial separation of communicating entities.
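The flavour of this flow identification step can be sketched as a search over links with sufficient residual bandwidth. The following is an illustrative reconstruction of the idea, not Algorithm 1 itself:

```python
from collections import deque

def find_pipe(links, src, dst, bw):
    """Find a route from src to dst in which every link has at least `bw`
    units of residual bandwidth. `links` maps node -> {neighbour: residual}.
    A breadth-first search visits each node at most once, so the time to
    identify a pipe is bounded by the network size."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:   # walk parents back to src
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr, residual in links[node].items():
            if residual >= bw and nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None  # no bandwidth-feasible route exists

def reserve(links, path, bw):
    """Deduct the pipe's bandwidth so later pipes only see the residue;
    this is what lets several pipes share a physical link safely."""
    for u, v in zip(path, path[1:]):
        links[u][v] -= bw
```

Reserving against residual bandwidth, rather than exclusive link ownership, is the property that distinguishes pipes from conventional circuit switching.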
2.2 Link Microarchitecture and Tile Area Exploration
2.2.1 NoC Design Space Exploration
Current research in architectural-level exploration of NoCs in SoCs concentrates on understanding the impact of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and multicore chips) using suitable traffic models[68]. The paper illustrates a consistent comparison and evaluation methodology based on a set of quantifiable critical parameters for NoCs. The work suggests that evaluation of NoCs must take applications into account: the usual critical evaluation parameters are not exhaustive, and different applications may require additional parameters such as testability, dependability and reliability.
Work in [69] emphasizes the need for co-design of interconnects, processing elements and memory blocks to understand the effects on overall system characteristics. Results from this work show that the architecture of the interconnect interacts closely with the design and architecture of the cores and caches. The work studies the area-bandwidth-performance trade-off in on-chip interconnects. The increase in the area demands of shared caches in CMPs is also documented: not using detailed interconnect models during CMP design leads to non-optimal, larger shared L2 caches inside the chip.
2.3 Simulation Tools
Simulation tools have been developed to aid designers in interconnection network (ICN)
space exploration[70][71]. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies.
2.3.1 Link Exploration Tools
Work on link exploration tools makes a case for microarchitectural wire management in future processors, where communication is a prominent contributor to power and performance. Dedicated wire exploration tools, such as those presented in [71], [72], [73], [74] and [75], estimate the delay of a wire for a particular wire length and operating frequency.
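As an illustration of the kind of estimate these tools produce, the first-order delay of an unbuffered wire follows the Elmore model for a distributed RC line. The coefficient values below are typical textbook assumptions, not figures from the cited tools:

```python
def wire_delay_ps(r_per_mm, c_ff_per_mm, length_mm):
    """Elmore delay of an unbuffered distributed RC wire: 0.38 * R * C.
    Because both R and C grow with length, delay grows quadratically,
    which is why long links need repeaters or pipelining."""
    r_total = r_per_mm * length_mm                # total resistance, ohms
    c_total = c_ff_per_mm * length_mm * 1e-15     # total capacitance, farads
    return 0.38 * r_total * c_total * 1e12        # seconds -> picoseconds

def cycles_needed(delay_ps, freq_ghz):
    """Latency of the wire in clock cycles at a given operating frequency."""
    period_ps = 1000.0 / freq_ghz
    return -(-delay_ps // period_ps)  # ceiling division
```

A 2 mm wire at 100 ohm/mm and 200 fF/mm comes out at roughly 30 ps, while doubling the length quadruples the delay; comparing this against the clock period is exactly the latency-per-wire-length question the cited tools answer.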
Orion[71] is a power-performance interconnection network simulator capable of providing power and performance statistics. The Orion model estimates the power consumed by
router elements (crossbars, FIFOs and arbiters) by calculating the switching capacitances of individual circuit elements. Orion contains a library of architectural-level parameterized power models.
The more recent Orion 2.0, presented in [76], is an enhanced NoC power and area simulator offering improved accuracy compared to the original Orion framework. Additions in Orion 2.0 include flip-flop and clock dynamic and leakage power models and link power models, leveraging the models developed in [74]. The Virtual Channel (VC) allocator microarchitecture uses a VC allocation model based on the microarchitecture and pipeline proposed in [77]. Application-specific, technology-level fine tuning of parameters using different Vth values and transistor widths increases the accuracy of the power estimation.
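The core of such capacitance-based estimation is the dynamic power relation P = alpha * C * Vdd^2 * f summed over circuit elements. A minimal sketch with invented element values, not Orion's actual models:

```python
def dynamic_power_mw(elements, vdd, freq_hz):
    """Sum switching power over router elements (e.g. crossbar, FIFOs,
    arbiter). Each element is a (activity factor alpha, switched
    capacitance in farads) pair; P = alpha * C * Vdd^2 * f."""
    watts = sum(alpha * cap * vdd ** 2 * freq_hz for alpha, cap in elements)
    return watts * 1e3  # watts -> milliwatts

# Hypothetical router element list: (activity factor, capacitance)
router = [(0.3, 2.0e-12),   # crossbar
          (0.5, 1.5e-12),   # input FIFOs
          (0.2, 0.2e-12)]   # arbiter
```

The quadratic dependence on Vdd is why the technology-level Vth and transistor-width tuning mentioned above matters so much for accuracy.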
Work in [72] explores the use of heterogeneous interconnects optimized for delay, bandwidth or power by varying design parameters such as buffer sizes, wire width and the number of repeaters on the interconnects. The work uses Energy-Delay^2 (ED^2) as the optimization parameter. An evaluation of different configurations of heterogeneous interconnects shows that an optimal configuration (for delay, bandwidth, power, or power and bandwidth) of wires can reduce the total processor ED^2 value by up to 11% compared to a NoC with a homogeneous interconnect in a typical processor.
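The ED^2 metric weights delay more heavily than energy, so a configuration may justifiably spend extra energy if it buys enough speed. A small illustration with invented numbers (the 11% figure above comes from [72]; the values below do not):

```python
def ed2(energy_j, delay_s):
    """Energy-Delay^2 product: lower is better. Squaring the delay favours
    configurations that trade some energy for speed."""
    return energy_j * delay_s ** 2

# Hypothetical homogeneous baseline vs. a heterogeneous wire configuration
# that spends 5% more energy but runs 8% faster.
baseline = ed2(energy_j=1.00, delay_s=1.00)
hetero = ed2(energy_j=1.05, delay_s=0.92)
saving = 1.0 - hetero / baseline   # fractional ED^2 reduction
```

Under a plain energy metric the heterogeneous configuration would look worse; under ED^2 it wins, which is the reason [72] adopts the metric.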
Courtay et al.[73] have developed a high-level delay and power estimation tool for link exploration that offers statistics similar to Intacte's. The tool allows changing architectural-level parameters, such as the signal coding technique, to analyze the effects on wire delay and power.
Work in [74] proposes delay and power models for buffered interconnects. The models can be constructed from sources such as Liberty[78], LEF/ITF[79], ITRS[80] and PTM[81]. The buffered delay models take into account the effects of the input and output slews of circuit elements when calculating intrinsic delays. The power models include leakage and dynamic power dissipation of gates. The area models include technology-dependent coefficients, estimated by linear regression per technology node, to predict repeater areas.
Intacte[82] is used for interconnect delay and power estimates. The design variables for Intacte's interconnect optimization are wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd) and threshold voltage (Vth). Intacte can be used to arrive at the power-optimal number, size and spacing of repeaters for a given wire length to achieve a desired frequency. Intacte outputs the total power dissipated, including short circuit and leakage power.
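The classical closed forms behind repeater optimization are Bakoglu's delay-optimal insertion formulas. These optimize for delay rather than power and are not Intacte's actual formulation, but they show the shape of the problem:

```python
import math

def optimal_repeaters(r_wire, c_wire, r_0, c_0):
    """Bakoglu's delay-optimal repeater insertion for a wire with total
    resistance r_wire and capacitance c_wire, driven by a minimum inverter
    with output resistance r_0 and input capacitance c_0:
      k = sqrt(0.4 * Rw * Cw / (0.7 * R0 * C0))  -> number of repeaters
      h = sqrt(R0 * Cw / (Rw * C0))              -> size vs. minimum inverter
    """
    k = math.sqrt(0.4 * r_wire * c_wire / (0.7 * r_0 * c_0))
    h = math.sqrt(r_0 * c_wire / (r_wire * c_0))
    return k, h
```

Since Rw and Cw both scale linearly with length, the optimal repeater count grows linearly with wire length while the optimal size stays fixed; a power-aware optimizer such as Intacte typically backs off from these delay-optimal values to save energy.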
A high-level power estimation tool accounting for interconnect effects is presented in [83]. The work presents an interconnect length estimation model based on Rent's rule[84] and a high-level area (gate count) prediction method. Different place and route engines and cell libraries can be used with the proposed model after minor adaptations.
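Rent's rule itself is the empirical power law T = t * G^p relating a block's external terminal count T to its gate count G; wire length estimators apply this relation recursively over a design hierarchy. A direct evaluation, with typical textbook coefficients rather than those of [83]:

```python
def rent_terminals(t, gates, p):
    """Rent's rule: T = t * G^p, where t is the average number of terminals
    per gate and p (the Rent exponent, typically around 0.5-0.75) captures
    how richly interconnected the design is."""
    return t * gates ** p
```

A higher Rent exponent implies more external wiring per unit of logic, which translates directly into longer estimated interconnect and higher predicted power.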
2.3.2 Router Power and Architecture Exploration Tools
Most router exploration tools model ICN elements at a higher level of abstraction (switches, links and buffers) and help in power/performance trade-off studies[85][86]. These are used to research the design of router architectures[87] and ICN topologies[34] with varying area/performance trade-offs, either for general purpose SoCs or to cater to specific applications.
A high-level power estimation methodology for NoC routers, based on the number of traversing flits as the unit of abstraction, has been proposed in [85]. The macro model of the framework incurs a minor absolute cycle error compared to gate-level analysis. Providing a fast and cycle-accurate power profile at an early stage of router design enables power optimizations, such as power-aware compilers, core mapping and scheduling techniques for CMPs, to be incorporated into the final design. The power macro model uses the state information of the FSM in a router that reserves channels during packet forwarding for wormhole flow control, which enhances its accuracy. Being based on regression analysis, the power macro model can be migrated to different technology libraries.
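In its simplest form, a flit-count macro model of this kind amounts to fitting P = a * (flit rate) + b against a few gate-level power samples, with b absorbing the idle and leakage component. A minimal least-squares sketch, illustrative rather than the actual model of [85]:

```python
def fit_flit_power_model(flit_rates, power_samples):
    """Ordinary least-squares fit of P = a * flit_rate + b. Refitting the
    two coefficients against a new set of gate-level samples is what makes
    a regression-based macro model portable across technology libraries."""
    n = len(flit_rates)
    sx, sy = sum(flit_rates), sum(power_samples)
    sxx = sum(x * x for x in flit_rates)
    sxy = sum(x * y for x, y in zip(flit_rates, power_samples))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def predict_power(a, b, flit_rate):
    """Fast per-cycle estimate: count flits, evaluate the fitted line."""
    return a * flit_rate + b
```

Evaluating the fitted line per cycle is orders of magnitude cheaper than gate-level simulation, which is the entire point of the macro model.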
An architectural-level power model for interconnection network routers has been presented in [88]. The work specifically considers the Alpha 21364 and InfiniBand routers
for modelling case studies. Memory arrays, crossbars and arbiters form the basic building blocks of all router models in this framework. Each of these building blocks has been modelled in detail to estimate switching capacitance. Switching activity is estimated based on traffic models assuming certain arrival rates at the input ports. The power numbers for both the Alpha 21364 and InfiniBand routers match the vendors' estimates within a minor error margin.
The high-level power model presented in [86] estimates power consumption in semi-global and global interconnects, considering switching power and the power due to vias and repeaters. The model estimates switching power within an error of 6%, with a speedup of three to four orders of magnitude; the error in via power is under 3%. A segment length distribution model is presented for cases where Rent's rule is insufficient, and has been validated by analyzing the netlists of a set of complex designs.
A wormhole router implementing a minimal adaptive routing algorithm with near-optimal performance and feasible design complexity is proposed in [87]. The work also estimates the optimal FIFO size in an adaptive router with a fixed-priority scheme; the optimal size is derived to be equal to the packet length in flits.
2.3.3 Complete NoC Exploration
Several frameworks have been proposed for complete NoC exploration[89][90][91]. These frameworks can be used to derive a first-cut analysis of the effect of certain NoC configurations at an early design phase. Such frameworks are the first steps toward roadmapping the future of on-chip networks.
A technology-aware NoC topology exploration tool has been presented in [89]. The NoC exploration is optimized for the energy consumption of the entire SoC. The work characterizes 2D meshes and tori, along with higher dimensions, multiple hierarchies and express channels, for energy spent in the network. The work presents analytical models based on NoC parameters, such as average hop count and average flit traversal energy, to
predict the most energy-efficient topology for future technologies.
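In such analytical models, a packet's network energy is essentially its flit count times its hop count times the per-hop router and link traversal energies. A sketch with assumed per-hop energy values, not figures from [89]:

```python
def packet_energy_pj(n_flits, hops, e_router_pj, e_link_pj):
    """Analytical packet energy: every flit pays one router traversal and
    one link traversal at each hop along the route."""
    return n_flits * hops * (e_router_pj + e_link_pj)

def mesh_avg_hops(k):
    """Approximate average hop count of a k x k mesh under uniform random
    traffic (about 2k/3). Lower-diameter topologies such as tori or express
    channels reduce this term, and hence network energy, directly."""
    return 2.0 * k / 3.0
```

Comparing topologies then reduces to comparing average hop counts weighted by per-hop energies, which is what makes this style of model fast enough for roadmapping future technologies.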
A holistic approach to designing energy-efficient cluster interconnects has been proposed in [90]. The work uses a cycle-accurate simulator with designs of an InfiniBand Architecture (IBA) compliant interconnect fabric. The system is modelled as comprising switches, network interface cards and links. The study reveals that the links and switch buffers consume the major portion of the system power, and proposes dynamic voltage scaling and dynamic link shutdown as viable methods to save power during operation. A system-level roadmapping toolchain for interconnection networks has been presented in [91]. The framework, titled Polaris, iterates through available NoC designs to identify a power-optimal one based on network traffic, architectures and process characteristics.
Several complete NoC simulators have been developed and are in use by the NoC research community[92][93][94]. The Network-on-Chip Simulator, Noxim[92], was developed at the University of Catania, Italy. Several NoC parameters, such as network size, buffer size, packet size distribution, routing algorithm, selection strategy, packet injection rate, traffic time distribution, traffic pattern and hot-spot traffic distribution, can be input to this framework. The simulator allows NoC evaluation based on throughput, flit delay and power consumption. The Nostrum NoC Simulation Environment (NNSE)[94] is part of the Nostrum project[65] and contains a SystemC based simulator. Inputs to this simulator are network size, topology, routing policy and traffic patterns. Based on these configuration parameters, a simulator is built and executed to produce the desired set of results in a variety of graphs.
2.3.4 CMP Exploration Tools
Wattch[95] was one of the first architectural-level frameworks for analyzing and optimizing microprocessor power dissipation. Wattch is orders of magnitude faster than layout-level power tools, with accuracy within 10% of verified industry tools on leading-edge designs. It is an architecture-level, parameterizable simulator framework that can accurately quantify potential power consumption in microprocessors. The Wattch
framework quantifies the power consumption of all the major units of the processor, parameterizes them, and integrates these power estimates into a high-level simulator. Wattch models the main processor units as array structures, fully associative content-addressable memories, combinational logic and wires, or clocking elements. Individual capacitances of each of these elements are estimated and power is calculated. Work presented in [95] integrates Wattch into the SimpleScalar architectural simulator[96].
A tool like Ruby[97] allows one to simulate a complete distributed memory hierarchy with an on-chip network, as in Orion. However, it needs to be augmented with a detailed interconnect model that accounts for the physical area of the tiles and their placements. Network processor exploration and power estimation tools utilize models for smaller components and quote the integrated power for the system[98][99][100]. They use the cycle-accurate register, cache and arbiter models introduced earlier. NePSim[99] is an open-source integrated simulation infrastructure. Typical network processors can be
simulated with the cy