noc.design.and.optimization.of.multicore.media.processors.thesis
TRANSCRIPT
-
8/12/2019 NoC.design.and.Optimization.of.Multicore.media.processors.thesis
1/194
NoC Design & Optimization of Multicore Media
Processors
A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
Basavaraj T
DEPARTMENT OF ELECTRICAL AND COMMUNICATION
ENGINEERING
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA
July 2013
Abstract
Network on Chips (NoCs) [1][2][3][4] are critical elements of modern System on Chip (SoC) as well as Chip Multiprocessor (CMP) designs. NoCs help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. With power having become a serious design constraint [5], there is a great need for designing NoCs that meet the target communication requirements while minimizing power, using all the tricks available at the architecture, microarchitecture and circuit levels of the design. This thesis presents a holistic, QoS based, power optimal design solution for a NoC inside a CMP, taking into account link microarchitecture and processor tile configurations.
Guaranteeing QoS by NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. The Label Switching based Network-on-Chip (LS-NoC) uses a centralized LS-NoC Management framework that engineers traffic into QoS guaranteed routes. LS-NoC uses label switching, enables bandwidth reservation, allows physical link sharing and leverages the advantages of both packet and circuit switching techniques. A flow identification algorithm takes into account the bandwidth available in individual links to establish QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs.
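Conceptually, identifying QoS guaranteed routes over links with residual bandwidth is a max-flow computation (Appendix C describes the Ford-Fulkerson variant used in the thesis). The sketch below is an illustrative Edmonds-Karp implementation run on a hypothetical four-node fragment; the node names and capacities are invented for the example, not taken from the thesis:

```python
from collections import deque

def max_flow(cap, src, dst):
    """Edmonds-Karp max-flow: repeatedly push along shortest augmenting paths.
    cap: {node: {neighbour: capacity}}; returns total routable bandwidth."""
    # Residual graph: forward edges plus zero-capacity reverse edges.
    g = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            g.setdefault(v, {}).setdefault(u, 0)
    flow, total = {}, 0
    while True:
        # BFS for a path with spare residual capacity.
        parent, q = {src: None}, deque([src])
        while q and dst not in parent:
            u = q.popleft()
            for v, c in g[u].items():
                if v not in parent and c - flow.get((u, v), 0) > 0:
                    parent[v] = u
                    q.append(v)
        if dst not in parent:
            return total
        # Walk back to find the bottleneck, then push flow along the path.
        path, v = [], dst
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(g[u][w] - flow.get((u, w), 0) for u, w in path)
        for u, w in path:
            flow[(u, w)] = flow.get((u, w), 0) + push
            flow[(w, u)] = flow.get((w, u), 0) - push
        total += push

# Hypothetical fragment: source S reaches sink D through routers R0 and R1.
# Capacities are residual link bandwidths in Gbit/s (invented numbers).
links = {'S': {'R0': 80, 'R1': 80}, 'R0': {'D': 40}, 'R1': {'D': 40}}
print(max_flow(links, 'S', 'D'))  # 80: two 40 Gbit/s pipes can be provisioned
```

A pipe request would be admitted only if the max-flow between its endpoints meets the requested bandwidth; the augmenting paths found along the way are the candidate routes.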
A multicast, broadcast capable label switched router for the LS-NoC has been designed, verified, synthesized, placed and routed, and timing analyzed. A 5 port, 256
bit data bus, 4 bit label router occupies 0.431 mm2 in 130nm and delivers peak bandwidth of 80 Gbits/s per link at 312.5 MHz. The LS Router is estimated to consume 43.08 mW.
Bandwidth and latency guarantees of LS-NoC have been demonstrated on streaming applications like HiperLAN/2 and the Object Recognition Processor, Constant Bit Rate traffic patterns and video decoder traffic representing Variable Bit Rate traffic. LS-NoC was found to have a competitive Area×Power/Throughput figure of merit compared with state-of-the-art NoCs providing QoS. We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements.
Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage, threshold voltage, activity and coupling factors. An optimal link configuration, in terms of the number of pipeline stages for a given link length and desired operating frequency, is arrived at. Optimal configurations of all links in the NoC are identified and a power-performance optimal NoC is presented. We present a latency, power and performance trade-off study of NoCs using link microarchitecture exploration. The design and implementation of a framework for such a design space exploration study is also presented. We present the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters.
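As a back-of-the-envelope illustration of how a pipeline depth falls out of link length and target frequency: each wire segment between flops must fit in one clock period. The 0.25 ns/mm repeated-wire delay below is an assumed round number, not a value from the thesis:

```python
import math

def pipeline_stages(length_mm, freq_ghz, wire_delay_ns_per_mm=0.25):
    """Minimum number of pipeline stages so that each wire segment's
    delay fits within one clock period. The per-mm delay of an
    optimally repeated wire is an assumed constant here."""
    total_delay_ns = length_mm * wire_delay_ns_per_mm
    period_ns = 1.0 / freq_ghz
    return max(1, math.ceil(total_delay_ns / period_ns))

# The 2D Torus case study's longest link is 8.15 mm (Figure 3.7); at a
# 2 GHz target and the assumed 0.25 ns/mm it needs 5 stages, while the
# 2.5 mm longest Mesh link needs only 2.
print(pipeline_stages(8.15, 2.0), pipeline_stages(2.5, 2.0))
```

In the actual flow this calculation is done per link with extracted wire parameters, which is why each topology ends up with its own pipelining table.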
A SystemC based NoC exploration framework is used to explore the impact of various architectural and microarchitectural parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., and allows experimentation with various microarchitectural options for the individual links, like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results from using this framework to study a 4x4 CMP are presented. The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks [6].
One of the key findings is that the average latency of a link can be reduced by increasing
pipeline depth to a certain extent, as it enables link operation at higher link frequencies.
There exists an optimum degree of pipelining which minimizes the energy-delay product of the link. In a 2D Torus, when the longest link is pipelined by 4 stages, the least latency (1.56 times the minimum) is achieved while power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between various pipeline configurations achieving the same frequency at constant voltages.
Also, in some cases, we find that switching to a higher pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the mesh performs the best amongst the three topologies (Mesh, Torus and Folded Torus) considered in the case studies.
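The existence of an interior energy-delay optimum can be reproduced with a toy model: deeper pipelining shortens the critical wire segment (raising the achievable clock) but adds flop delay to every cycle and flop energy to every traversal. All constants below are illustrative, chosen only to show the shape of the trade-off:

```python
def link_latency_ns(stages, wire_delay_ns=2.0, flop_delay_ns=0.1,
                    router_cycles=4):
    """Clock period is set by the slowest segment: wire_delay/stages
    plus flop overhead. A flit pays router_cycles + stages cycles."""
    period = wire_delay_ns / stages + flop_delay_ns
    return (router_cycles + stages) * period

def link_energy_pj(stages, wire_energy_pj=5.0, flop_energy_pj=0.5):
    """Wire energy is roughly fixed; each extra stage adds flop energy."""
    return wire_energy_pj + stages * flop_energy_pj

def best_edp(max_stages=8):
    """Pipeline depth minimizing the energy-delay product of the link."""
    return min(range(1, max_stages + 1),
               key=lambda p: link_energy_pj(p) * link_latency_ns(p))

print(best_edp())  # 4 under these toy constants
```

Shallower configurations lose on delay, deeper ones on flop energy and per-stage latency, so the product bottoms out at an intermediate depth.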
The effects of communication overheads on the performance, power and energy of a multiprocessor chip are presented, using L1 and L2 cache sizes as primary exploration parameters and accurate interconnect, processor, on-chip and off-chip memory modelling. On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. Large caches imply larger tile area, which results in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Smaller caches potentially have a higher number of misses and more frequent off-tile communication. Energy efficient tile design is thus a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal configuration for the CMP.
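This tension admits a simple caricature: miss rate falls with L2 size (a square-root rule of thumb) while the tile edge, and hence inter-tile link latency, grows with cache area, so average memory access time has an interior minimum. Every constant below is invented for illustration and is not calibrated to the thesis's Cacti/Sapphire numbers:

```python
import math

def tile_metrics(l2_kb, base_miss=0.10, base_kb=64, kb_to_mm2=0.01):
    """Toy tile model: miss rate follows a sqrt rule of thumb; the tile
    edge (and hence inter-tile link length) grows with cache area."""
    miss_rate = base_miss * math.sqrt(base_kb / l2_kb)
    tile_area_mm2 = 2.0 + l2_kb * kb_to_mm2   # fixed core area + cache
    link_mm = math.sqrt(tile_area_mm2)        # edge of a square tile
    return miss_rate, link_mm

def mem_time_per_access(l2_kb, hit_ns=2.0, miss_penalty_ns=60.0,
                        link_ns_per_mm=0.25, hops=2):
    """Average memory time: hit cost + inter-tile transit + miss penalty."""
    miss, link = tile_metrics(l2_kb)
    on_chip = hops * link * link_ns_per_mm
    return hit_ns + on_chip + miss * miss_penalty_ns

# Sweep L2 sizes: small caches pay the miss penalty, large caches pay
# longer link latency, so an intermediate size wins.
best = min([64, 128, 256, 512, 1024, 2048, 4096], key=mem_time_per_access)
print(best)  # 1024 under these toy constants
```

The thesis's experiments perform the same sweep with cycle accurate simulation in place of this closed-form stand-in.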
Trade-offs are explored using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K point FFT). Link latencies are estimated for a 16 core CMP simulation on Sapphire. Each tile has a single processor, L1 and L2 caches and a router. Different sizes of L1 and L2 lead to different tile clock speeds, tile miss rates and tile area and hence
Acknowledgements
I thank my advisor, Prof. Bharadwaj Amrutur for his invaluable guidance throughout my
Ph.D. I thank all of you who have shared many precious moments with me and enriched
my journey through life.
Publications
Journals

• Basavaraj Talwar and Bharadwaj Amrutur, “Traffic Engineered NoC for Streaming Applications”, Microprocessors and Microsystems, 37 (2013), 333-344.
Conferences
• Basavaraj Talwar and Bharadwaj Amrutur, “A System-C based Microarchitectural Exploration Framework for Latency, Power and Performance Trade-offs of On-Chip Interconnection Networks”, First International Workshop on Network on Chip Architectures, Nov. 2008.
• Basavaraj Talwar, Shailesh Kulkarni and Bharadwaj Amrutur, “Latency, Power and Performance Trade-offs in Network-on-Chips by Link Microarchitecture Exploration”, 22nd Intl. Conference on VLSI Design, Jan. 2009.
Contents
Abstract
Acknowledgements
1 Introduction
  1.1 Network-on-Chip
  1.2 Switching Policies
    1.2.1 Circuit Switching
    1.2.2 Packet Switching
    1.2.3 Label Switching
  1.3 QoS in NoCs
  1.4 QoS Guaranteed NoC Design
  1.5 Contributions of the Thesis
    1.5.1 Link Microarchitecture Exploration
    1.5.2 Optimal CMP Tile Configuration
    1.5.3 QoS in NoCs
  1.6 Organization of the Thesis
2 Related Work
  2.1 Traffic Engineered NoC for Streaming Applications
    2.1.1 QoS in Packet Switched Networks
    2.1.2 QoS in Circuit Switched Networks
    2.1.3 QoS by Space Division Multiplexing
    2.1.4 Static routing in NoCs
    2.1.5 MPLS and Label Switching in NoCs
    2.1.6 Label Switched NoC
  2.2 Link Microarchitecture and Tile Area Exploration
    2.2.1 NoC Design Space Exploration
  2.3 Simulation Tools
    2.3.1 Link Exploration Tools
    2.3.2 Router Power and Architecture Exploration Tools
    2.3.3 Complete NoC Exploration
    2.3.4 CMP Exploration Tools
    2.3.5 Communication in CMPs - Performance Exploration
  2.4 Summary
3 Link Microarchitecture Exploration
  3.1 Motivation for a Microarchitectural Exploration Framework
  3.2 NoC Microarchitectural Exploration Framework
    3.2.1 Traffic Generation and Distribution Models
    3.2.2 Router Model
    3.2.3 Power Model
  3.3 Case Study: Mesh, Torus & Folded-Torus
    3.3.1 NoC Topologies
    3.3.2 Round Trip Flit Latency & NoC Throughput
    3.3.3 NoC Power/Performance/Latency Tradeoffs
    3.3.4 Power-Performance Tradeoff With Frequency Scaling
    3.3.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.4 Case Study: Torus, Reduced Torus & Tree based NoC
    3.4.1 NoC Topologies
    3.4.2 NoC Throughput
    3.4.3 NoC Power/Performance/Latency Tradeoffs
    3.4.4 Power-Performance Tradeoff With Frequency Scaling
    3.4.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.5 Conclusion
4 Tile Exploration
  4.1 Motivation
  4.2 Observations and Contributions
  4.3 Background
  4.4 Communication Time and Energy Efficiency
  4.5 Experimental Setup
    4.5.1 Experimental Methodology
  4.6 Effect of Link Latency on Performance of a CMP
  4.7 Communication in CMPs
  4.8 Program Completion Time
  4.9 Ideal Interconnects, Custom Floorplanning, L2 Banks and Process Mapping
  4.10 Remarks & Conclusion
5 Label Switched NoC
  5.1 Streaming Applications in Media Processors
    5.1.1 HiperLAN/2
    5.1.2 Object Recognition Processor
  5.2 LS-NoC - Motivation
  5.3 LS-NoC - The Concept
  5.4 LS-NoC - Working
  5.5 Label Switched Router Design
    5.5.1 Pipes & Labels
    5.5.2 Label Swapping
  5.6 Simulation and Functional Verification
  5.7 Synthesis Results
  5.8 Conclusion
6 LS-NoC Management
  6.1 LS-NoC Management
    6.1.1 NoC Manager
    6.1.2 Traffic Engineering in LS-NoC
  6.2 Flow Based Pipe Identification
  6.3 Fault Tolerance in LS-NoC
  6.4 Overhead of NoC Manager
    6.4.1 Computational Latency
    6.4.2 Configuration Latency
    6.4.3 Scalability of LS-NoC
  6.5 Number of Pipes in an NoC
    6.5.1 Minimum, Maximum and Typical Pipes in a Network
  6.6 Conclusion
7 Label Switched NoC
  7.1 HiperLAN/2 baseband processing + Object Recognition Processor SoC
  7.2 Video Streaming Applications
  7.3 Discussion
    7.3.1 Design Philosophy of LS-NoC
    7.3.2 LS-NoC Application
    7.3.3 LS-NoC Evaluation
  7.4 Conclusion
8 Conclusion and Future Work
  8.1 Link Microarchitecture Exploration
  8.2 Optimal CMP Tile Configuration
  8.3 Label Switched NoC for Streaming Applications
  8.4 Future Work
A Interface and Outputs of the SystemC Framework
B Testing & Validation of LS-NoC
  B.1 Implementation of LS-NoC Router
  B.2 Testing and Validation of LS-NoC Router
    B.2.1 Individual Router
    B.2.2 Router in 8×8 Mesh
  B.3 Synthesis & Place and Route
C The Flow Algorithm
  C.1 Ford-Fulkerson’s MaxFlow Algorithm
  C.2 Input Graph
  C.3 Edges in the Input Graph
Bibliography
List of Tables
3.1 ICN exploration framework parameters.
3.2 Traffic Generation/Distribution Model and Experiment Setup for the Mesh, Torus & Folded-Torus case study.
3.3 Links and pipelining details of NoCs.
3.4 DLA traffic, Frequency crossover points in 2D Mesh.
3.5 Comparison of 3 topologies for DLA traffic.
3.6 Experimental Setup.
3.7 Links and pipelining details of NoCs.
3.8 Power optimal frequency trip points in various NoCs.
3.9 Comparison of 3 topologies. Maximum interconnect network performance and power consumption for varying pipe stages.
4.1 Configuration parameters of processors, caches & interconnection network used in experiments.
4.2 Scaled processor power over L1 configurations.
4.3 Primary and Secondary cache parameters (access time, area) obtained from Cacti. L2 access latencies as a function of L1 access times are also shown.
4.4 Max operating frequencies and dynamic energy per access of various L1/L2 caches. Values were calculated using Cacti power models using 32nm PTM.
4.5 Lengths of links between L1/L2 caches & routers and between routers of neighbouring tiles for a regular mesh placement. No. of pipeline stages required to meet the maximum frequency are also shown.
4.6 FFT. Power spent in links (in mW).
4.7 Total messages in transit (in Millions).
4.8 Clustered tile placement floorplan for L1: 256KB and L2: 512KB. Lengths of links between neighbouring routers and number of pipeline stages are shown. Frequency: 1.38 GHz.
5.1 Communication characteristics between HiperLAN/2 nodes.
5.2 Routing table of an n port (n = 5) router with an lw bit (lw = 4) label, indexed by labels used in the label switched NoC. Size of the routing table = 2^lw × n × lw.
5.3 Simulation parameters used for functional verification of the label switched router design.
5.4 Synthesis Parameters.
5.5 Synthesis results for 2 Router and Mesh networks. Area of a Router is 0.431 mm2.
6.1 NoC Manager Overhead.
7.1 Pipes set up for HiperLAN/2 baseband processing SoC and Object Recognition Processor SoC (Figure 7.1(a)). PEC[0-7]→PEC[0-7]: every PEC communicates with every other PEC.
7.2 Standard test videos used in experiments.
7.3 Evaluation of the proposed Label Switched Router and NoC. CS: Circuit switched, PS: Packet switched.
A.1 ICN exploration framework parameters and their default values.
C.1 Routing tables at R0 I0, R0 I2 and R1 I4 nodes after pipes P0 and P1 have been set up.
List of Figures
1.1 Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.
2.1 Floorplan used in estimating wire lengths. Wire lengths estimated from these floorplans are used as input to Intacte to arrive at a power optimal configuration and latency in clock cycles. Horizontal R-R: Link between neighbouring routers in the horizontal direction, Vertical R-R: Link between neighbouring routers in the vertical direction.
3.1 Architecture of the SystemC framework.
3.2 Flow of the ICN exploration framework.
3.3 Flit header format. DSTID/SRCID: Destination/Source ID, SQ: Sequence Number, RQ & RP: Request and Response Flags, and a 13 bit flit id.
3.4 Example flit header formats considered in this experiment. (DST/SRCID: Destination/Source ID, HC: Hop Count, CHNx: Direction at hop x).
3.5 Schematic of 3 compared topologies (L to R: Mesh, Torus, Folded Torus). Routers are shaded and Processing Elements (PE) are not.
3.6 Normalized average round trip latency in cycles vs. Traffic injection rate in all the 3 NoCs.
3.7 Max. frequency of links in 3 topologies. Lengths of longest links in Mesh, Torus and Folded 2D Torus are 2.5mm, 8.15mm and 5.5mm.
3.8 Total NoC throughput in 3 topologies, DLA traffic.
3.9 Avg. round trip flit latency in 3 NoCs, DLA traffic.
3.10 2D Mesh Power/Throughput/Latency trade-offs for DLA traffic. Normalized results are shown.
3.11 2D Mesh Power/Throughput/Latency trade-offs for SLA traffic.
3.12 DLA Traffic, 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.13 DLA Traffic, Folded 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.14 Frequency scaling on 3 topologies, DLA Traffic.
3.15 Dynamic voltage scaling on 2D Mesh, DLA Traffic. Frequency scaled curve for P=8 is also shown.
3.16 Schematic representation of the three compared topologies (L to R: 2D Torus, Tree, Reduced 2D Torus). Shaded rectangles are Routers and white boxes are source/sink Processing Element (PE) nodes.
3.17 Floorplans of the three compared topologies.
3.18 Maximum attainable frequency by links in the respective topologies. Estimated length of the longest link in a 2D Torus is 7mm. Estimated longest link in the Tree based and Reduced 2D Torus is 3.5mm.
3.19 Variation of total NoC throughput with varying pipeline stages in all three topologies.
3.20 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.21 Reduced 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.22 Variation of NoC power with throughput for each topology.
3.23 Effects of dynamic voltage scaling on the power and performance of a 2D Torus. Highest frequencies of operation for P=1, 2, 4 and 7 are 0.93GHz, 1.68GHz, 2.92GHz and 4.22GHz. Power consumption of the frequency scaled NoC is shown for comparison.
4.1 Error in performance measurement between real and ideal interconnect experiments.
4.2 Schematic of a multiprocessor architecture comprising tiles and an interconnecting network. Each tile is made up of a processor, L1 and L2 caches.
4.3 Flowchart illustrating the steps in the experimental procedure.
4.4 Tile floorplans for different (L1, L2) sizes. From left: (8KB, 64KB), (64KB, 1MB), (128KB, 4MB).
4.5 Mesh floorplans used in experiments. From left: conventional 2D Mesh topology, a clustered topology, a clustered topology with L2 bank and thread mapping, and a mesh topology with L2 bank and thread mapping.
4.6 Benchmark execution time vs. Communication time - DRAM access time and On-chip transit time vs. L2 cache size vs. Program completion time.
4.7 Program energy vs. Communication time.
4.8 64K point FFT benchmark execution time vs. Total time spent in on-chip message transit. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.9 64K point FFT execution time vs. Total time spent in DRAM (off-chip) accesses. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.10 Total messages over all the links during the execution of the benchmark and Average transit time of a message.
4.11 FFT. Total instructions executed and power spent in the memory hierarchy and on-chip links during the execution.
4.12 FFT Benchmark. Energy per Instruction and Instructions per second² per Watt.
4.13 Y1: PCT, Y2: on-chip transit and off-chip comm. times.
4.14 FFT benchmark results. (Program Completion Time, comm.: communication)
4.15 FFT benchmark results.
4.16 Program Completion Times.
4.17 Alternative Tile Placements, custom process scheduling example and ideal interconnect comparison results. Benchmark: FFT, L1: 256K, L2: 512K.
5.1 (a) Process graph of a HiperLAN/2 baseband processing SoC[7] and (b) NoC of the Object recognition processor[8].
5.2 A 64 Node, 8×8 2D LS-NoC along with NoC Manager interface to routing tables.
5.3 Pipe establishment and label swapping example in a 3×3 LS-NoC.
5.4 Label Switched Router with single cycle flit traversal. Valid signal identifies Data and Label as valid. PauseIn and PauseOut are flow control signals for downstream and upstream routers. Routing table has output port and label swap information. Arbiter receives input from all the input ports along with the flow control signal from the downstream router.
5.5 Label conflict at R1 resolved using Label swapping. il: Input Label, Dir: Direction, ol: Output Label.
6.1 Surveillance system showing the application of LS-NoC in the Video computation server.
6.2 (a) A 2 router, 6 communicating nodes linear network. (b) Multiple source, multiple sink flow calculation in a network.
6.3 (a) Number of pipes in a linear network (Fig. 6.2(a)), lw = 3 bits, varying constraints. Constraint 1: Max 1 pipe per sink. (b) Max. number of pipes in 2D Mesh (Fig. 5.2).
7.1 (a) Process blocks of HiperLAN/2 baseband processing SoC and Object recognition processor mapped on to an 8×8 LS-NoC. Pipe 1: PEC0 → PEC6, Pipe 2: MP → PEC3. (b) Flows set up for CBR & VBR traffic.
7.2 Latency of HiperLAN/2 and ORP pipes in LS-NoC over varying injection rates of non-streaming application nodes. Latencies of non-provisioned paths are titled (U).
7.3 (a) Latency of CBR traffic over various injection rates of non-streaming nodes in LS-NoC. (b) Latency of VBR traffic over various injection rates of non-streaming nodes in LS-NoC.
7.4 LS-NoC being used alongside a best effort NoC.
B.1 Modules in LS-NoC router design shown along with testbench, implemented in Verilog.
B.2 Test cases used to verify an individual LS-NoC router.
B.3 8×8 mesh used for testing LS-NoC.
B.4 Traffic test cases used to verify proper functioning of LS-NoC router.
B.5 Flowchart illustrating steps in Synthesis and Place & Route of the LS-NoC router.
B.6 Placed and routed output - Single Router.
C.1 Steps in the flow algorithm example. (a) Input Graph. Maximum flows have to be identified between nodes X & Y. (b) Available capacities of links after flows X→A→C→Y & X→B→C→Y are set up. (c) Residual network showing available capacities of links in the forward direction and utilized capacity in the reverse. (d) Residual network after adding the flow X→A→C→B→D→E→Y. (e) Final output of the maxflow algorithm showing 3 flows from X to Y.
C.2 (a) A 2 router, 6 source+sink system used for validation of the LS-NoC router design. Graph representation of the system used as input to the flow algorithm is shown in (b).
C.3 The NoC after two pipes, P0 and P1, have been established. P0: R0S0 → R1D2 and P1: R0S2 → R1D0.
Chapter 1
Introduction
1.1 Network-on-Chip
Network on Chips[1][2][3][4] are critical elements of modern Chip Multiprocessors (CMPs)
and System on Chips (SoCs). Network on Chips (NoCs) help manage high complexity of
designing large chips by decoupling computation from communication. SoCs and CMPs
have a multiplicity of communicating entities like programmable processing elements,
hardware acceleration engines, memory blocks as well as off-chip interfaces. Using an
NoC enables modular design of communicating blocks and network interfaces. NoCs
help achieve a well structured design enabling higher performance while servicing larger
bandwidths compared to bus based systems[1]. Links in NoCs designed with controlled
electrical parameters can use aggressive signaling circuits to reduce power and delay[9].
Network resources are utilized more efficiently in NoCs as compared to global wires[10].
Communication patterns between communicating entities are application dependent.
As a result, NoCs are expected to cater to diverse connections varying in forms of connec-
tivity, burstiness, latency and bandwidth requirements. NoCs servicing communication
requirements in CMPs or SoCs are expected to meet Quality of Service (QoS) demands such
as maximum or average latency, typical or peak bandwidth and required throughput of
executing applications. Further, with power having become a serious design constraint[5],
there is a great need for designing NoCs that meet the target communication requirements
while minimizing power, using various strategies at the architecture, microarchitecture
and circuit levels of the design.
1.2 Switching Policies
Switching policies configure paths in routers to facilitate data transfer between input and
output ports. Programming of internal switches in routers to connect input ports to out-
put ports and determination of when and which data units are transferred is accomplishedusing switching policies. Flow control mechanisms synchronize data transfer between
router and traffic sources and between two routers. Switching policies and flow control
mechanisms influence the design of internal switches, routing and arbitration units, and
the amount of buffers in a router. The major types of switching policies are introduced
here.
1.2.1 Circuit Switching
Circuit switching is a reservation based switching policy in which network resources are
allocated to a communication path before data is transferred. At the end of data transfer,
reserved resources are de-allocated and are available for future circuits. As circuits are
used on a reservation basis, circuit switching requires a simple router design with few
or no buffers.
Circuits are established using path identifying probe packets that reserve resources
as they propagate towards the destination. The circuit establishment is complete after
an acknowledgment message is received by the source. Data is transferred along the cir-
cuit without further monitoring or control. After the transfer is complete, the circuit is
torn down and resources freed using a tail packet. Popular examples of circuit switched
networks are Autonomous Error-Tolerant Cell[11], Asynchronous SoC[12], Crossroad[13],
dTDMA[14], Point to point network on real time systems[15], Programmable NoC for
FPGA-based systems[16], ProtoNoC[17], Space Division Multiplexing based NoC[18],
SoCBuS[19], Reconfigurable Circuit Switched NoC[7], etc.
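The probe / acknowledgment / tail-packet sequence described above can be sketched as a toy reservation protocol. The `Router` class, port names and rollback behaviour below are illustrative assumptions, not taken from any of the cited designs:

```python
class Router:
    """Minimal circuit-switched router model: one reservable crosspoint per port."""
    def __init__(self, name):
        self.name = name
        self.reserved = set()          # output ports currently held by a circuit

    def reserve(self, port):
        if port in self.reserved:      # port locked by another circuit
            return False
        self.reserved.add(port)
        return True

    def release(self, port):
        self.reserved.discard(port)

def establish_circuit(path):
    """Probe phase: walk the path, reserving each hop's output port.
    Returns True once the 'acknowledgment' reaches the source; rolls back
    partial reservations on contention."""
    taken = []
    for router, out_port in path:
        if not router.reserve(out_port):
            for r, p in taken:         # probe blocked: free partial reservation
                r.release(p)
            return False
        taken.append((router, out_port))
    return True                        # circuit ready; data needs no further control

def teardown_circuit(path):
    """Tail packet: free every reserved resource for future circuits."""
    for router, out_port in path:
        router.release(out_port)
```

A second circuit requesting any port held by an established circuit is refused until the first circuit is torn down, mirroring the lock-down behaviour of reservation-based switching.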
1.2.2 Packet Switching
In packet switching, the message to be transmitted is partitioned and transmitted as
fixed-length packets. Routing and control is handled on a per packet basis. The packet
header includes routing and other control information needed for the packet to reach
the destination. Packet switching increases network resource utilization as communica-
tion channels share resources along the path. Buffers and arbitration units in routers
manage resource conflicts and storage demands in communication paths. Packet switch-
ing networks aid IP block re-use and are scalable[20]. Packet-switching is more flexible
than circuit switching though it requires buffering and introduces unpredictable latency
(jitter). Popular packet switched networks are Asynchronous NoC[21], FAUST[22], Ar-
teris NoC[23], Butterfly Fat Tree[24], DyAD[25], Eclipse[26], MANGO[27], Proteo[28],
QNoC[29], SPIN[30], etc. Some NoC designs can adaptively work in circuit or packet
switched modes based on traffic requirements. A few examples are Æthereal[31], Hetero-
geneous IP Block Interconnection[32], dynamically reconfigurable NoC[33], Octagon[34],
etc.
1.2.3 Label Switching
Label switching is used by technologies such as ATM[35][36] and Multiprotocol Label
Switching (MPLS)[37] as a packet relaying technique. Individual packets carry route in-
formation in the form of labels. A label denotes a common route that a set of data packets
traverse. Therefore, a minimalistic label identifies the source hop and the destination hop
along with the intermediate transit routers. Along with routing information, labels can
be used to specify service priorities to packets. This feature of labels enables use of dif-
ferentiated services for packets using common labels. Routers along the path use the
label to identify the next hop, forwarding information, traffic priority, Quality of Service
guarantees and the next label to be assigned. Label switching inherently supports traffic
engineering, as labels can be chosen based on desired next hop or required QoS services.
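The per-hop lookup described above can be sketched as a label table. The table layout, field names and label values here are invented for illustration (real MPLS-style tables carry considerably more state):

```python
# Per-router label table: incoming label -> (output port, service class, outgoing label).
# The label-swap step mirrors MPLS-style forwarding; table contents are hypothetical.
LABEL_TABLE = {
    5: ("east",  "GT", 9),   # guaranteed-throughput flow, relabelled to 9 downstream
    7: ("north", "BE", 7),   # best-effort flow, label unchanged
}

def forward(packet):
    """Resolve next hop and service class from the packet's label alone,
    then swap in the label expected by the next hop's table."""
    out_port, service_class, next_label = LABEL_TABLE[packet["label"]]
    packet["label"] = next_label
    return out_port, service_class
```

Because the label alone selects route, priority and the next label, routers need neither source nor destination addresses in the packet header.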
A few proposals of label switched NoCs are MPLS NoC[38], Nexus[39] and Blackbus[40].
1.3 QoS in NoCs
NoCs servicing CMPs and SoCs are expected to meet Quality of Service (QoS) demands
of executing applications. Latency sensitive applications demand a guaranteed average
and maximum latency on communication traffic. Jitter sensitive applications may tolerate
longer latencies but require fixed delay along communication paths. Further, among
application classes, some have higher priority than others. For example, application
data usually has higher priority than acknowledgment packets or control information.
The two basic approaches in NoC designs to enable QoS guarantees are: creation of
reserved connections between source and destinations via circuit switching or support for
prioritized routing (in case of packet switched, connectionless paths).
Circuit switched NoCs guarantee high data transfer rates in an energy efficient manner
by reducing intra-route data storage[41]. Circuit switched NoCs provide guaranteed QoS
for worst case traffic scenarios leading to higher network resource requirements[42]. These
are well suited for streaming traffic generated by media processors where communication
requirements are well known a priori. One drawback is under-utilization of network
resources, as resources are reserved for peak bandwidth while the average
requirement may be lower.
Packet switched networks provide efficient interconnect utilization and high throughputs[43]
while providing fairness amongst best effort flows. However, network resources in packet
switched networks need to be over-provisioned to support QoS for various traffic classes
and have high buffer requirements in routers. Packet switching networks usually provide
QoS by differentiated services to traffic by classifying them into various classes[29]. Pri-
oritized services are provided to traffic belonging to each class. Due to the sharing of
network resources, packet switched networks can be configured to provide Guaranteed
Throughput (GT) for a few classes of traffic and Best Effort (BE) services for remaining
classes.
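A minimal sketch of class-based differentiated service, assuming just two classes (GT and BE) and strict priority between them; real designs such as QNoC use more classes and fairer schedulers:

```python
from collections import deque

# One queue per traffic class. "GT" (Guaranteed Throughput) is always served
# before "BE" (Best Effort); the two-class split is the smallest form of the
# differentiated services described above.
queues = {"GT": deque(), "BE": deque()}

def arbitrate():
    """Return the next flit to transmit, strictly prioritizing GT traffic."""
    for cls in ("GT", "BE"):
        if queues[cls]:
            return queues[cls].popleft()
    return None   # both queues empty
```

The sketch also exposes the drawback noted below: if all traffic is enqueued in one class, the prioritization is vacuous and the scheme degenerates to plain FIFO service.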
With traffic engineering enabled label switching networks, communication loads can
be distributed over the NoC, resulting in fair allocation of network resources. Network
resource guarantees enable paths with little or no jitter while keeping network utilization
fairly high. Further, the design of routers is simplified compared to conventional wormhole
routers[40].
1.4 QoS Guaranteed NoC Design
Media processors with streaming traffic such as HiperLAN/2 Baseband Processors[7],
Real-time Object Recognition Processors[8] and H.264 encoders[44][45] demand adequate bandwidth and bounded latencies between communicating entities. They also have
well known communication patterns and bandwidth requirements. Adequate throughput,
latency and bandwidth guarantees between process blocks have to be provided for such
applications. Nature of streaming applications in media processors and characteristics of
streaming traffic are illustrated in Section 5.1 of Chapter 5.
Guaranteeing QoS by NoCs involves guaranteeing bandwidth and throughput for con-
nections and deterministic latencies in communication paths. This thesis proposes a QoS
guaranteeing NoC using label switching where bandwidth can be reserved while links are
shared. The traffic is engineered during route setup and it leverages advantages of both
packet and circuit switching techniques. We propose a QoS based Label Switched NoC
(LS-NoC) router design. We present a latency, power and performance optimal intercon-
nect design methodology considering low level circuit and system parameters. Further,
optimal tile configurations are identified using effects of application communication traffic
on performance and energy in chip multiprocessors (Figure 4.2).
A label switched, QoS guaranteeing NoC, that retains advantages of both packet
switched and circuit switched networks is the main focus of this thesis. Congestion free
communication pipes are identified by a centralized Manager with complete network vis-
ibility. Label Switched NoC (LS-NoC) sets up communication channels (pipes) between
communicating nodes that are independent of existing pipes and are contention free at the
routers. Deterministic delays and bandwidth are guaranteed in newly established pipes,
taking into account established flows. Residual bandwidth in links reserved by a pipe can
be utilized by other pipes, thus enabling sharing of physical links between pipes without
compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of
spatial separation of the communicating entities.

Figure 1.1: Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.
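The link-sharing admission decision described above can be sketched as a check run by a centralized manager. The link names, bandwidth units and single-route check below are simplifying assumptions, not the thesis's actual flow identification algorithm:

```python
# Residual capacity per directed link, in MB/s (hypothetical values). A new pipe
# is admitted only if every link on its route can carry the requested bandwidth;
# admission then debits the links, so later pipes may still share the same
# physical links using the leftover capacity.
residual = {("R0", "R1"): 800, ("R1", "R2"): 800}

def admit_pipe(route, bandwidth):
    """Admit a bandwidth-guaranteed pipe along `route` (a list of links), or refuse."""
    if any(residual[link] < bandwidth for link in route):
        return False
    for link in route:
        residual[link] -= bandwidth    # reserve; the remainder stays shareable
    return True
```

A refused request leaves all link capacities untouched, so QoS guarantees of already established pipes are never compromised.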
Interconnect delay and power contribute significantly towards the final performance
and power numbers of a CMP[46]. Design variables for interconnect exploration include
wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage
(Vdd), threshold voltage (Vth), and activity and coupling factors. A power and performance
optimal link microarchitecture can be arrived at by optimizing these low level link
parameters. A methodology to arrive at the optimal link configuration in terms of number
of pipeline stages (cycle latency) for a given length of link and desired operating fre-
quency is presented. Optimal configurations of all links in the NoC are identified and a
power-performance optimal NoC thus achieved.
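The link-configuration step can be sketched under the simplifying assumption that wire delay grows linearly with segment length; the delay and frequency numbers are placeholders, not results from the thesis framework, where a repeater-inserted wire would be characterized by circuit-level simulation:

```python
import math

def min_pipeline_stages(link_mm, delay_ps_per_mm, freq_ghz):
    """Smallest number of pipeline stages such that each wire segment's delay
    fits in one clock cycle at the desired operating frequency.

    Placeholder model: segment delay scales linearly with segment length.
    """
    cycle_ps = 1000.0 / freq_ghz              # clock period in picoseconds
    total_delay = link_mm * delay_ps_per_mm   # end-to-end wire delay
    return max(1, math.ceil(total_delay / cycle_ps))
```

For example, a 6 mm link at an assumed 100 ps/mm needs two stages at 2 GHz but three at 5 GHz, illustrating how the target frequency drives the cycle latency of each link.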
Primary and secondary cache sizes have a major bearing on the amount of on-chip
and off-chip communication in a Chip Multiprocessor (CMP). On-chip and off-chip com-
munication times have significant impact on execution time and the energy efficiency of
CMPs. From a performance point of view, cache accesses should suffer minimum delay
and off-tile communication due to cache misses should be negligible. Large caches dissi-
pate more leakage energy and may exceed area budgets though they reduce cache misses
and decrease off-tile communication. Larger caches result in longer inter-tile communi-
cation link lengths and latencies, thus adversely impacting communication time. Small
caches reduce occupied tile area, have higher activity and dissipate less leakage
energy. The drawback of smaller caches is a potentially higher number of misses and more
frequent off-tile communication. This illustrates the trade-off between cache size, miss rate,
NoC communication latency and power. Energy efficient tile design is a configuration
exploration and trade-off study using different cache sizes and tile areas to identify a
power-performance optimal cache size and NoC configuration for the CMP.
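One way to frame the trade-off study is selecting the tile configuration with the minimum energy-delay product. The candidate points below are invented for illustration, not measured results from the thesis experiments:

```python
# Hypothetical (cache_kB, execution_time_s, energy_J) points illustrating the
# trade-off: larger caches cut miss-driven communication time but add leakage energy.
configs = [
    (32,  1.40, 2.0),
    (64,  1.10, 2.1),
    (128, 1.00, 2.6),
]

def best_config(points):
    """Pick the power-performance optimal point by minimum energy-delay product."""
    return min(points, key=lambda p: p[1] * p[2])
```

In this invented sweep, the middle configuration wins: the smallest cache loses on execution time, the largest on leakage energy, matching the qualitative argument above.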
1.5 Contributions of the Thesis
Work in this thesis presents methodologies for label switched QoS guaranteed NoC design,
link microarchitecture exploration and optimal Chip Multiprocessor (CMP) tile configu-
rations. Contributions from this thesis are listed here:
1.5.1 Link Microarchitecture Exploration
• Optimal Link Design and Exploration Framework: We present a simulation framework
developed in SystemC which allows the designer to explore NoC design across low
level link parameters such as pipelining, link width, wire pitch, supply voltage, op-
erating frequency and NoC architectural parameters such as router type and topol-
ogy of the interconnection network. We use the simulation framework to identify
power-performance (Energy-Delay) optimal link configuration in a given NoC over
particular traffic patterns. Such an optimum exists because increasing pipelining
allows for shorter length wire segments which can be operated either faster or with
lower power at the same speed.
• Optimum Pipe Depth: Contrary to intuition, we find that increasing pipeline depth
can actually help reduce latency in absolute time units, by allowing shorter links
& hence higher frequency of operation. In some cases, we find that switching to
a higher pipelining configuration can actually help reduce power as the links can
be designed with smaller repeaters. Larger NoC power savings can be achieved by
voltage scaling along with frequency scaling. Hence it is important to include the
link microarchitecture parameters as well as circuit parameters like supply voltage
during architecture design exploration of NoCs.
1.5.2 Optimal CMP Tile Configuration
• Optimal Cache Size: The performance-power optimal L1/L2 configuration of a tile
is close to the configuration that spends least amount of time in on-chip and off-chip
communication.
• Effect of Floorplanning and Process Mapping: Communication aware floorplanning can reduce up to 2.6% of the energy spent in the execution of an instruction and up to
11% savings in communication power during the execution of the program. Mapping
L2 banks in the same core as the processes accessing it reduces time spent in commu-
nication and hence the overall program completion time and also has a bearing on
the Total Energy spent in the execution of the program. Experiments have revealed
that as much as 2% of energy per instruction can be saved by communication-aware
process scheduling compared to conventional thread mapping policies in a 2D Mesh
architecture.
1.5.3 QoS in NoCs
• A Label Switching NoC providing QoS guarantees: We present a LS-NoC to service
QoS demands of streaming traffic in media processors. A centralized NoC Man-
ager capable of traffic engineering establishes bandwidth guaranteed communication
channels between nodes. LS-NoC guarantees deterministic path latencies, satisfies
bandwidth requirements and delivers constant throughput. Delay and throughput
guaranteed paths (pipes) are established between source and destinations along con-
tention free, bandwidth provisioned routes. Pipes are identified by labels unique to
each source node. Labels need fewer bits compared to node identification numbers
- potentially decreasing memory usage in routing tables.
• NoC Manager with traffic engineering capabilities: The NoC Manager utilizes flow
identification algorithms to identify contention free, bandwidth provisioned paths
in LS-NoC called pipes. The LS-NoC Manager has complete visibility of the state
of LS-NoC. Bandwidth requirements of the application are taken into account to
provision routes between communicating nodes by the flow identification algorithm.
The flow based pipe establishment algorithm is topology independent and hence the
NoC Manager supports applications mapped to both regular chip multiprocessors
(CMPs) and customized SoCs with non-conventional NoC topologies. Additionally,
fault tolerance is achieved by the NoC Manager by considering link status during
pipe establishment.
• Design of a Label Switched Router: The Label Switched (LS) Router used in LS-
NoC achieves single cycle traversal delay in the absence of contention and is multicast and
broadcast capable. Source nodes in the LS-NoC can work asynchronously as cycle
level scheduling is not required in the LS Router. The LS Router supports multiple clock
domain operation. Dual clock buffers can be used at output ports in the LS-NoC
router. This eases clock domain crossovers and reduces the need for a single globally
synchronous clock. As a result, clock tree design is less complex and clock power is
potentially saved.
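The flow identification used by the NoC Manager above is a max-flow computation (a worked example appears in Appendix C). A minimal Edmonds–Karp sketch on a dict-of-dicts capacity graph, with an invented topology, is:

```python
from collections import deque

def max_flow(cap, src, dst):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity graph; each augmenting
    path found corresponds to one candidate contention-free pipe route."""
    # Residual graph: forward capacities plus zero-capacity reverse edges.
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {src: None}
        q = deque([src])
        while q and dst not in parent:            # BFS for a shortest augmenting path
            u = q.popleft()
            for v, c in res.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if dst not in parent:                     # no augmenting path remains
            return flow
        path, v = [], dst
        while parent[v] is not None:              # walk back to the source
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:                         # push flow, update residuals
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck
```

On a small graph where two unit-capacity paths join at a shared node, the algorithm correctly finds both, as in the Appendix C example's incremental path discovery.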
1.6 Organization of the Thesis
Chapter 2 highlights several works from current literature related to the broad areas
of QoS guaranteed NoCs, link microarchitecture, design space exploration of NoCs and
effects of communication on energy and performance trade-offs in CMPs.
Chapter 3 presents a latency, power and performance trade-off study of NoCs through
link microarchitecture exploration using microarchitectural and circuit level parameters.
The NoC exploration framework used in the trade-off studies is described. The interface to
the SystemC framework and sample output logs generated are presented in Appendix A.
Effects of on-chip and off-chip communication due to various CMP tile configurations
are explored in Chapter 4. The need to use detailed interconnection network models to
identify optimal energy and performance configurations is also highlighted. On-chip and
off-chip communication effects on power and performance of CMPs are explored. Effects of
communication on program execution times and program execution energy are presented.
Further, energy-performance results for tile configurations and effects of custom L2 bank
mapping and thread mapping on power and performance of CMPs are presented.
Design and implementation of a label switching, traffic engineering capable NoC de-
livering guaranteed QoS for streaming traffic in media processors has been presented in
Chapter 5. Traffic characteristics of streaming applications are also presented in the chap-
ter. Functional verification of the LS-NoC router using various test cases is presented
in Appendix B. Chapter 6 illustrates the LS-NoC management framework and the flow
identification algorithm used to establish pipes. An example of the use of the flow algorithm
is presented in Appendix C. Streaming application test cases and various types of
video traffic are used to establish LS-NoC as a QoS guaranteeing framework in Chapter
7. The thesis concludes in Chapter 8 after outlining possible future extensions of the
proposed work.
Chapter 2
Related Work
Several publications have highlighted the need for solutions to pressing problems in various
domains in the broad area of Network-on-Chips[47][48][49][50]. This chapter introduces
relevant works in the broad areas of QoS guaranteed Network-on-Chips, design space
exploration of NoCs and effects of communication on energy and performance trade-offs
in CMPs.
2.1 Traffic Engineered NoC for Streaming Applica-
tions
Providing QoS guarantees in on-chip communication networks has been identified as one
of the major research problems in NoCs[48]. QoS solutions in packet switched networks use
priority based services, while circuit switched NoCs use some form of resource reservation.
We introduce a few well known QoS solutions from the literature and compare our work with
the state of the art. Packet switched NoCs use differentiated services for traffic classes
[29][22][21][8] to provide latency and bandwidth guarantees. Circuit switched NoCs use
resource reservation mechanisms to guarantee QoS[34][51][41][19]. Resource reservation
mechanisms involve identifying a sufficiently resource rich path, reserving resources along
the path, configuration, actual communication and path tear down. A fairly extensive
survey of NoC proposals has been presented in [50]. Relevant QoS NoCs are discussed in
this section.
2.1.1 QoS in Packet Switched Networks
QoS NoC (QNoC) presented by Bolotin et al.[29] is a customized QoS NoC architecture
based on a 2D Mesh to satisfy QoS by allocating frequently communicating nodes close-by,
doing away with unnecessary links, tailoring link width to meet bandwidth requirements
and balancing link utilization. Inter-module communication traffic is classified into four
classes of service: signaling, real-time, RD/WR and block-transfer. FAUST[22] is a recon-
figurable baseband platform based on an asynchronous NoC providing a programmable
communication framework linking heterogeneous resources. FAUST uses 2 level priority
based virtual circuit design in its Network Interface (NI) to provide QoS guarantees. Asyn-
chronous NoCs[21] use clock-free interconnect to improve reliability and delay-insensitive
arbiters to solve routing conflicts. A QoS Router with both soft (Soft GT) and hard (Hard
GT) guarantees for globally asynchronous, locally synchronous (GALS) NoCs is presented
in [52]. Leftover bandwidth in routers servicing Hard GT is utilized by Soft GT connec-
tions and best effort traffic. NoCs presented in [21], [52] and [53] employ multiple priority
levels to provide differentiated services and guarantee QoS. The MANGO [27][54] NoC
provides hard GT by prioritizing each GT connection and adopts Asynchronous Latency
Guarantee (ALG) scheduling to prevent starvation of packets with lower priority.
One of the major drawbacks of priority based QoS schemes is that an increase in traffic
in one priority class affects the delay of traffic belonging to other classes. A priority
network loses the differentiated services advantage if all traffic belongs to the same
priority level. Further, deadlock-free routing algorithms using virtual circuits with a
priority approach may lead to degradation in NoC throughput. In cases where connections
cannot be overlapped with each other (eg. MANGO NoC), increased number of hard GT
connections will lead to increased cost in network resources.
Another class of packet switched NoCs using priority based QoS solutions are applica-
tion specific SoCs. A tree based hierarchical packet-switched NoC for a real-time object
recognition processor is implemented in [8]. The tree topology NoC with three crossbar
switches interconnects 12 IPs and supports both bursty (for image traffic) and non-bursty (for
control and synchronization signals) traffic. Network resources in this NoC are tailored
to meet throughput and bandwidth demands of the application, and hence the design is
not a generic solution for servicing QoS in a CMP environment.
2.1.2 QoS in Circuit Switched Networks
Resource reservation between communicating nodes involves identifying a path using
point-to-point links, a path probing service network, or an intelligent, traffic aware
distributed or centralized manager. Hu et al.[15] introduce point-to-point (P2P) commu-
nication synthesis to meet timing demands between communicating nodes using bus width
synthesis. Circuit switched bus based QoS solutions such as Crossroad[13], dTDMA[14]
and Heterogeneous IP Block Interconnection (HIBI)[32] rely on communication localiza-
tion to satisfy timing demands. NEXUS[39] is a resource reservation based QoS NoC
for globally asynchronous, locally synchronous (GALS) architectures. NEXUS uses an
asynchronous crossbar to connect synchronous modules through asynchronous channels
and clock-domain converters.
P2P networks do not share communication links between multiple nodes leading to
inefficient utilization of network resources. This increases wiring resources inside the
chip and results in poor scalability. Crossbar based solutions using protocol handshakes
(for example, 4-way handshakes in NEXUS[39] and ProtoNoC[17]) force communicating
nodes to wait until the handshake is complete and the path is established. Non-interference of
communication channels is achieved by over-provisioning resources in the crossbar. This
leads to complex and poorly scalable networks. Connecting frequently communicating
nodes on a single bus will increase demand on the bus and lead to larger waiting times at
the nodes. Static routing along shortest paths does not guarantee latency bound routes
due to arbitration delays in the network.
Amongst the NoCs that use probe based circuit establishment solutions are Intel's
8×8 circuit switched NoC[41], SoCBUS[19][55] and distributed programming model in
Æthereal[51]. In these NoCs, probe packets are used to reconnoiter shortest communica-
tion paths and configure routing tables if a path (circuit) is available. Routers are locked
down and no other circuits can use the port during the lifetime of an established circuit.
If the shortest X-Y path is not available, the probe packets initiate route discovery mech-
anisms in other paths. The method involves some dynamic behaviour, as the probe might
repeat route discovery steps or retry after a random period of time if circuit setup does
not succeed. This leads to nondeterministic and sometimes large route setup times which
may be unacceptable for real time application performance.
Centralized Circuit Management
Reserved communication channels can be identified and configured using an application
aware hardware or software entity[51][34]. Such a traffic manager can provide programma-
bility of routes.
The Æthereal NoC[51] aims at providing hard guaranteed QoS using Time Division
Multiplexing (TDM) to avoid contention in a synchronous network. The centralized
programming model in Æthereal NoC[51] uses a root process to identify free slots and
configure network interfaces. Time slot tables are used in routers to reserve output ports
per input port in a particular time slot. To avoid collisions and the loss of data, con-
secutive time slots are then reserved in routers along the circuit path. The number of
paths established in the NoC is restricted by the scheduling constraints during time slot
reservation. Increasing the number of time slots in TDM based NoCs increases router size.
In cases where a communication channel cannot be found due to slots exhaustion, the
traffic division over multiple physical paths may be required[56]. Traffic division involves
reordering packets at the target node leading to increased memory and computational
costs.
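The consecutive-slot reservation used by Æthereal-style TDM can be sketched as follows; the slot-table size, router names and modular slot arithmetic are illustrative assumptions rather than details of the Æthereal implementation:

```python
NUM_SLOTS = 4

# One slot table per router along a path: slot index -> owning connection (None = free).
tables = {r: [None] * NUM_SLOTS for r in ("R0", "R1", "R2")}

def reserve_slots(path, start_slot, conn):
    """Reserve consecutive time slots along `path`: hop i uses slot start+i
    (modulo the table size), so data forwarded one hop per slot never collides.
    Rolls back to nothing on any conflict."""
    needed = [(r, (start_slot + i) % NUM_SLOTS) for i, r in enumerate(path)]
    if any(tables[r][s] is not None for r, s in needed):
        return False                  # scheduling constraint: slot already owned
    for r, s in needed:
        tables[r][s] = conn
    return True
```

The scheduling constraint mentioned above is visible here: a second connection that needs an already-owned slot at any hop is refused, even though most of the table is still empty.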
TDM techniques using slot tables in Æthereal[51] and sequencers in Adaptive System-
on-Chip[12] require a single synchronous clock distributed over the chip. Accurate global
synchronous clock distribution is expensive in terms of power. Global synchronicity can be
achieved in a distributed manner using tokens such that every router synchronizes every
slot with all of its neighbors [57]. This method will bring down the operating speed of the
NoC, as the slowest router will dictate the speed of the NoC. Further, power management
techniques such as multiple clock domains are not feasible with this approach. AElite[58]
and dAElite[59] have been proposed as improved next generation Æthereal NoCs. AElite
inherits the guaranteed services model from Æthereal. To overcome the global synchronic-
ity problem, AElite proposes use of asynchronous and mesochronous links as a possibility.
As noted in the paper[58], using mesochronous links alone may not be sufficient if routers
and NIs are plesiochronous[60]. One of the drawbacks of AElite was the number of slots
occupied by header flits: a header flit in AElite occupied one in three slots, an overhead
of up to 33%. dAElite circumvents the header flit overhead by routing
based on the time of packet injection and packet receiving. One of the disadvantages of
dAElite is an increase in the number of link wires, due to the configuration network and
also because of separate wires for end-to-end credit communication.
The Octagon NoC[34] implements a centralized best fit scheduler to configure and
manage non-overlapping connections. The scheduler cannot establish a new connection
through a port if it is blocked by another connection. This results in increased connection
establishment time at the routers and also packet losses.
2.1.3 QoS by Space Division Multiplexing
As an alternative to TDM techniques, Spatial Division Multiplexing (SDM) techniques
for QoS have been proposed in [23], [61] and [62]. SDM techniques involve sharing fractions
of links between connections simultaneously, based on bandwidth requirements of the
corresponding connections. An approach comparable to a static version of SDM called
Lane-Division-Multiplexing has been proposed in [7]. Lane-Division-Multiplexing is based
on a reconfigurable circuit switched router composed of a crossbar and data converters.
A disadvantage of the solution in [7] is that it does not support channel sharing and BE
traffic. An additional network is required for configuring the switches and for carrying
the BE traffic. Sharing a subset of wires between connections as in [63] leads to a more
complex switch design with huge delay. SDM and TDM techniques have been combined
in [64], allowing the number of supported connections to grow by increasing the number
of sub-channels in the link or the number of time slots. This increases path
establishment probability in the NoC.
In SDM based techniques, the sender serializes data on the allocated wires and the receiver
deserializes the data before forwarding it to the IP block. One of the issues in SDM based
circuits is the complexity of implementing serializers and deserializers.
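The serialize/deserialize step can be sketched in a few lines; the flit width and the number of allocated wires ("lanes") below are illustrative:

```python
def serialize(word_bits, lanes):
    """Split a flit into per-cycle groups of `lanes` bits (the subset of wires
    this connection was allocated); narrower allocations take more cycles."""
    return [word_bits[i:i + lanes] for i in range(0, len(word_bits), lanes)]

def deserialize(groups):
    """Receiver side: reassemble the flit before handing it to the IP block."""
    return [b for g in groups for b in g]
```

An 8-bit flit over a 2-wire allocation takes 4 cycles to cross the link, showing how per-connection bandwidth scales with the wire fraction granted.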
2.1.4 Static routing in NoCs
Most NoCs use traffic oblivious static routing[51] to establish communication channels
between nodes. Dimension ordered routing[41][53][17][51][34] or routes decided at de-
sign time[65] are not flexible and cannot circumvent congested paths. Routing in FPGAs
also present a similar scenario where routes between communicating nodes are bandwidth
and latency guaranteed, but are static. These routes occupy network resources along the
path for the entire lifetime of the application. QoS is guaranteed in this case by over
provisioning resources along the route.
2.1.5 MPLS and Label Switching in NoCs
Use of Multi-Protocol Label Switching for QoS[38] in NoCs and advantages of identifying
communication channels using labels have been investigated in [39],[40]. A conventional
NoC is connected to an MPLS backbone using Label Edge Routers (LERs)[38]. The
MPLS backbone applies traffic engineering and priority-based QoS services to communication channels identified by labels. The work is a direct mapping of the Internet MPLS implementation onto NoCs; the router and NoC design approach is not optimized for a hardware implementation. Results from Network Simulator-2 (NS-2) are at a functional level and may not reflect the exact performance achievable inside a chip.
Use of labels to identify communication channels instead of source and destination
identification numbers reduces the amount of metadata transmitted in the NoC. Unique
addressing at source allows label reuse and enables efficient use of the label space. Implementation of label-based addressing in streaming applications has resulted in significant
reduction in router area[40]. The work employs a method similar to label switching to achieve non-global label addressing, hence reducing the label bit width. A C × N → C routing strategy is described in conjunction with the label addressing scheme. The work in [40] presents a simple data transfer scheme and does not concentrate on rendering QoS between communicating nodes. The route establishment process is not explicitly described; one can assume that standard routing algorithms are used.
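The appeal of label-based addressing is that each router forwards on a short, locally unique label rather than on full source/destination addresses. A hypothetical per-router label table makes this concrete; the names and structure here are illustrative, not taken from [38] or [40]:

```python
class LabelSwitchRouter:
    """Forwarding based on an (input port, label) lookup. Labels are swapped
    at each hop, so they only need to be unique per link rather than
    network-wide, keeping the label field narrow and reducing per-flit
    metadata."""

    def __init__(self):
        self.table = {}  # (in_port, in_label) -> (out_port, out_label)

    def install(self, in_port, in_label, out_port, out_label):
        """Install one forwarding entry (done once, at route setup time)."""
        self.table[(in_port, in_label)] = (out_port, out_label)

    def forward(self, in_port, in_label):
        """Per-flit forwarding decision: a single table lookup."""
        return self.table[(in_port, in_label)]

router = LabelSwitchRouter()
router.install(in_port=0, in_label=5, out_port=2, out_label=9)
```

Because the lookup key is small and installed ahead of time, the per-flit datapath reduces to a table read, which is what enables the router area reductions reported in [40].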
2.1.6 Label Switched NoC
In the proposed work, we describe a Label Switched QoS guaranteeing NoC that retains
advantages of both packet switched and circuit switched networks. Contention at output
ports is tackled using communication pipes. Pipes are communication routes established along a bandwidth-rich, contention-free router path. Pipes are identified by a centralized Manager with complete network visibility.
The NoC Manager utilizes flow identification algorithms[66][67] (Algorithm 1) to establish pipes. The flow identification algorithm guarantees a deterministic delay in identifying and configuring pipes, and takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. This guarantees QoS-serviced communication paths between communicating nodes. Multiple pipes can be set up in a single link if the QoS requirements of all the pipes are satisfied, enabling physical links to be shared between pipes without compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of the spatial separation of communicating entities.
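The flavour of this flow identification step can be sketched as a search over links with sufficient residual bandwidth. The following is an illustrative reconstruction of the idea, not Algorithm 1 itself:

```python
from collections import deque

def find_pipe(links, src, dst, bw):
    """Find a route from src to dst in which every link has at least `bw`
    units of residual bandwidth. `links` maps node -> {neighbour: residual}.
    A breadth-first search visits each node at most once, so the time to
    identify a pipe is bounded by the network size."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:   # walk parents back to src
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr, residual in links[node].items():
            if residual >= bw and nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None  # no bandwidth-feasible route exists

def reserve(links, path, bw):
    """Deduct the pipe's bandwidth so later pipes only see the residue;
    this is what lets several pipes share a physical link safely."""
    for u, v in zip(path, path[1:]):
        links[u][v] -= bw
```

Reserving against residual bandwidth, rather than exclusive link ownership, is the property that distinguishes pipes from conventional circuit switching.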
2.2 Link Microarchitecture and Tile Area Exploration
2.2.1 NoC Design Space Exploration
Current research in architectural-level exploration of NoCs in SoCs concentrates on understanding the impact of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and multicore chips) using suitable traffic models[68]. The paper illustrates a consistent comparison and evaluation methodology based on a set of quantifiable critical parameters for NoCs. The work suggests that evaluation of NoCs must take applications into account: the usual critical evaluation parameters are not exhaustive, and different applications may require additional parameters such as testability, dependability and reliability.
Work in [69] emphasizes the need for co-design of interconnects, processing elements and memory blocks to understand the effects on overall system characteristics. Results from this work show that the architecture of the interconnect interacts closely with the design and architecture of the cores and caches. The work studies the area-bandwidth-performance trade-off in on-chip interconnects. The increase in the area demands of shared caches in CMPs is also documented: not using detailed interconnect models during CMP design leads to non-optimal, larger shared L2 caches inside the chip.
2.3 Simulation Tools
Simulation tools have been developed to aid designers in interconnection network (ICN)
space exploration[70][71]. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies.
2.3.1 Link Exploration Tools
Work on link exploration tools makes a case for microarchitectural wire management in future processors, where communication is a prominent contributor to power and performance. Dedicated wire exploration tools, such as those presented in [71], [72], [73], [74] and [75], estimate the delay of a wire for a particular wire length and operating frequency.
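As an illustration of the kind of estimate these tools produce, the first-order delay of an unbuffered wire follows the Elmore model for a distributed RC line. The coefficient values below are typical textbook assumptions, not figures from the cited tools:

```python
def wire_delay_ps(r_per_mm, c_ff_per_mm, length_mm):
    """Elmore delay of an unbuffered distributed RC wire: 0.38 * R * C.
    Because both R and C grow with length, delay grows quadratically,
    which is why long links need repeaters or pipelining."""
    r_total = r_per_mm * length_mm                # total resistance, ohms
    c_total = c_ff_per_mm * length_mm * 1e-15     # total capacitance, farads
    return 0.38 * r_total * c_total * 1e12        # seconds -> picoseconds

def cycles_needed(delay_ps, freq_ghz):
    """Latency of the wire in clock cycles at a given operating frequency."""
    period_ps = 1000.0 / freq_ghz
    return -(-delay_ps // period_ps)  # ceiling division
```

A 2 mm wire at 100 ohm/mm and 200 fF/mm comes out at roughly 30 ps, while doubling the length quadruples the delay; comparing this against the clock period is exactly the latency-per-wire-length question the cited tools answer.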
Orion[71] is a power-performance interconnection network simulator capable of providing power and performance statistics. The Orion model estimates the power consumed by
router elements (crossbars, FIFOs and arbiters) by calculating the switching capacitances of individual circuit elements. Orion contains a library of architectural-level parameterized power models.
The more recent Orion 2.0, presented in [76], is an enhanced NoC power and area simulator offering improved accuracy compared to the original Orion framework. Additions in Orion 2.0 include flip-flop and clock dynamic and leakage power models and link power models, leveraging the models developed in [74]. The Virtual Channel (VC) allocator microarchitecture uses a VC allocation model based on the microarchitecture and pipeline proposed in [77]. Application-specific, technology-level fine tuning of parameters using different Vth values and transistor widths increases the accuracy of the power estimation.
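The core of such capacitance-based estimation is the dynamic power relation P = alpha * C * Vdd^2 * f summed over circuit elements. A minimal sketch with invented element values, not Orion's actual models:

```python
def dynamic_power_mw(elements, vdd, freq_hz):
    """Sum switching power over router elements (e.g. crossbar, FIFOs,
    arbiter). Each element is a (activity factor alpha, switched
    capacitance in farads) pair; P = alpha * C * Vdd^2 * f."""
    watts = sum(alpha * cap * vdd ** 2 * freq_hz for alpha, cap in elements)
    return watts * 1e3  # watts -> milliwatts

# Hypothetical router element list: (activity factor, capacitance)
router = [(0.3, 2.0e-12),   # crossbar
          (0.5, 1.5e-12),   # input FIFOs
          (0.2, 0.2e-12)]   # arbiter
```

The quadratic dependence on Vdd is why the technology-level Vth and transistor-width tuning mentioned above matters so much for accuracy.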
Work in [72] explores the use of heterogeneous interconnects optimized for delay, bandwidth or power by varying design parameters such as buffer sizes, wire width and the number of repeaters on the interconnects. The work uses Energy-Delay^2 (ED^2) as the optimization parameter. An evaluation of different configurations of heterogeneous interconnects shows that an optimal configuration (for delay, bandwidth, power, or power and bandwidth) of wires can reduce the total processor ED^2 value by up to 11% compared to a NoC with a homogeneous interconnect in a typical processor.
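The ED^2 metric weights delay more heavily than energy, so a configuration may justifiably spend extra energy if it buys enough speed. A small illustration with invented numbers (the 11% figure above comes from [72]; the values below do not):

```python
def ed2(energy_j, delay_s):
    """Energy-Delay^2 product: lower is better. Squaring the delay favours
    configurations that trade some energy for speed."""
    return energy_j * delay_s ** 2

# Hypothetical homogeneous baseline vs. a heterogeneous wire configuration
# that spends 5% more energy but runs 8% faster.
baseline = ed2(energy_j=1.00, delay_s=1.00)
hetero = ed2(energy_j=1.05, delay_s=0.92)
saving = 1.0 - hetero / baseline   # fractional ED^2 reduction
```

Under a plain energy metric the heterogeneous configuration would look worse; under ED^2 it wins, which is the reason [72] adopts the metric.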
Courtay et al.[73] have developed a high-level delay and power estimation tool for link exploration that offers statistics similar to Intacte's. The tool allows changing architectural-level parameters, such as the signal coding technique, to analyze the effects on wire delay and power.
Work in [74] proposes delay and power models for buffered interconnects. The models can be constructed from sources such as Liberty[78], LEF/ITF[79], ITRS[80] and PTM[81]. The buffered delay models take into account the effects of the input and output slews of circuit elements when calculating intrinsic delays. The power models include leakage and dynamic power dissipation of gates. The area models include technology-dependent coefficients, estimated by linear regression per technology node, to predict repeater areas.
Intacte[82] is used for interconnect delay and power estimates. The design variables for Intacte's interconnect optimization are wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd) and threshold voltage (Vth). Intacte can be used to arrive at the power-optimal number, size and spacing of repeaters for a given wire length to achieve a desired frequency. Intacte outputs the total power dissipated, including short circuit and leakage power.
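The classical closed forms behind repeater optimization are Bakoglu's delay-optimal insertion formulas. These optimize for delay rather than power and are not Intacte's actual formulation, but they show the shape of the problem:

```python
import math

def optimal_repeaters(r_wire, c_wire, r_0, c_0):
    """Bakoglu's delay-optimal repeater insertion for a wire with total
    resistance r_wire and capacitance c_wire, driven by a minimum inverter
    with output resistance r_0 and input capacitance c_0:
      k = sqrt(0.4 * Rw * Cw / (0.7 * R0 * C0))  -> number of repeaters
      h = sqrt(R0 * Cw / (Rw * C0))              -> size vs. minimum inverter
    """
    k = math.sqrt(0.4 * r_wire * c_wire / (0.7 * r_0 * c_0))
    h = math.sqrt(r_0 * c_wire / (r_wire * c_0))
    return k, h
```

Since Rw and Cw both scale linearly with length, the optimal repeater count grows linearly with wire length while the optimal size stays fixed; a power-aware optimizer such as Intacte typically backs off from these delay-optimal values to save energy.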
A high-level power estimation tool accounting for interconnect effects is presented in [83]. The work presents an interconnect length estimation model based on Rent's rule[84] and a high-level area (gate count) prediction method. Different place and route engines and cell libraries can be used with the proposed model after minor adaptations.
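Rent's rule itself is the empirical power law T = t * G^p relating a block's external terminal count T to its gate count G; wire length estimators apply this relation recursively over a design hierarchy. A direct evaluation, with typical textbook coefficients rather than those of [83]:

```python
def rent_terminals(t, gates, p):
    """Rent's rule: T = t * G^p, where t is the average number of terminals
    per gate and p (the Rent exponent, typically around 0.5-0.75) captures
    how richly interconnected the design is."""
    return t * gates ** p
```

A higher Rent exponent implies more external wiring per unit of logic, which translates directly into longer estimated interconnect and higher predicted power.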
2.3.2 Router Power and Architecture Exploration Tools
Most router exploration tools model ICN elements at a higher level of abstraction (switches, links and buffers) and help in power/performance trade-off studies[85][86]. These are used to research the design of router architectures[87] and ICN topologies[34] with varying area/performance trade-offs, either for general purpose SoCs or to cater to specific applications.
A high-level power estimation methodology for NoC routers, based on the number of traversing flits as the unit of abstraction, has been proposed in [85]. The macro model of the framework incurs a minor absolute cycle error compared to gate-level analysis. Providing a fast and cycle-accurate power profile at an early stage of router design enables power optimizations, such as power-aware compilers, core mapping and scheduling techniques for CMPs, to be incorporated into the final design. The power macro model uses the state information of the FSM in a router that reserves channels during packet forwarding for wormhole flow control, which enhances its accuracy. Being based on regression analysis, the power macro model can be migrated to different technology libraries.
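In its simplest form, a flit-count macro model of this kind amounts to fitting P = a * (flit rate) + b against a few gate-level power samples, with b absorbing the idle and leakage component. A minimal least-squares sketch, illustrative rather than the actual model of [85]:

```python
def fit_flit_power_model(flit_rates, power_samples):
    """Ordinary least-squares fit of P = a * flit_rate + b. Refitting the
    two coefficients against a new set of gate-level samples is what makes
    a regression-based macro model portable across technology libraries."""
    n = len(flit_rates)
    sx, sy = sum(flit_rates), sum(power_samples)
    sxx = sum(x * x for x in flit_rates)
    sxy = sum(x * y for x, y in zip(flit_rates, power_samples))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def predict_power(a, b, flit_rate):
    """Fast per-cycle estimate: count flits, evaluate the fitted line."""
    return a * flit_rate + b
```

Evaluating the fitted line per cycle is orders of magnitude cheaper than gate-level simulation, which is the entire point of the macro model.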
An architectural-level power model for interconnection network routers has been presented in [88]. The work specifically considers the Alpha 21364 and InfiniBand routers
for modelling case studies. Memory arrays, crossbars and arbiters form the basic building blocks of all router models in this framework. Each of these building blocks has been modelled in detail to estimate switching capacitance. Switching activity is estimated based on traffic models assuming certain arrival rates at the input ports. The power numbers for both the Alpha 21364 and InfiniBand routers match the vendors' estimates within a minor error margin.
The high-level power model presented in [86] estimates power consumption in semi-global and global interconnects, considering switching power and the power due to vias and repeaters. The model estimates switching power within an error of 6%, with a speedup of three to four orders of magnitude; the error in via power is under 3%. A segment length distribution model is presented for cases where Rent's rule is insufficient, and has been validated by analyzing the netlists of a set of complex designs.
A wormhole router implementing a minimal adaptive routing algorithm with near-optimal performance and feasible design complexity is proposed in [87]. The work also estimates the optimal FIFO size in an adaptive router with a fixed-priority scheme; the optimal size is derived to be equal to the packet length in flits.
2.3.3 Complete NoC Exploration
Several frameworks have been proposed for complete NoC exploration[89][90][91]. These frameworks can be used to derive a first-cut analysis of the effect of certain NoC configurations at an early design phase. Such frameworks are the first steps toward roadmapping the future of on-chip networks.
A technology-aware NoC topology exploration tool has been presented in [89]. The NoC exploration is optimized for the energy consumption of the entire SoC. The work characterizes 2D meshes and tori, along with higher dimensions, multiple hierarchies and express channels, for energy spent in the network. The work presents analytical models based on NoC parameters, such as average hop count and average flit traversal energy, to
predict the most energy-efficient topology for future technologies.
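In such analytical models, a packet's network energy is essentially its flit count times its hop count times the per-hop router and link traversal energies. A sketch with assumed per-hop energy values, not figures from [89]:

```python
def packet_energy_pj(n_flits, hops, e_router_pj, e_link_pj):
    """Analytical packet energy: every flit pays one router traversal and
    one link traversal at each hop along the route."""
    return n_flits * hops * (e_router_pj + e_link_pj)

def mesh_avg_hops(k):
    """Approximate average hop count of a k x k mesh under uniform random
    traffic (about 2k/3). Lower-diameter topologies such as tori or express
    channels reduce this term, and hence network energy, directly."""
    return 2.0 * k / 3.0
```

Comparing topologies then reduces to comparing average hop counts weighted by per-hop energies, which is what makes this style of model fast enough for roadmapping future technologies.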
A holistic approach to designing energy-efficient cluster interconnects has been proposed in [90]. The work uses a cycle-accurate simulator with designs of an InfiniBand Architecture (IBA) compliant interconnect fabric. The system is modelled as comprising switches, network interface cards and links. The study reveals that the links and switch buffers consume the major portion of the system power, and proposes dynamic voltage scaling and dynamic link shutdown as viable methods to save power during operation. A system-level roadmapping toolchain for interconnection networks has been presented in [91]. The framework, titled Polaris, iterates through available NoC designs to identify a power-optimal one based on network traffic, architectures and process characteristics.
Several complete NoC simulators have been developed and are in use by the NoC research community[92][93][94]. The Network-on-Chip Simulator, Noxim[92], was developed at the University of Catania, Italy. Several NoC parameters, such as network size, buffer size, packet size distribution, routing algorithm, selection strategy, packet injection rate, traffic time distribution, traffic pattern and hot-spot traffic distribution, can be input to this framework. The simulator allows NoC evaluation based on throughput, flit delay and power consumption. The Nostrum NoC Simulation Environment (NNSE)[94] is part of the Nostrum project[65] and contains a SystemC based simulator. Inputs to this simulator are network size, topology, routing policy and traffic patterns. Based on these configuration parameters, a simulator is built and executed to produce the desired set of results in a variety of graphs.
2.3.4 CMP Exploration Tools
Wattch[95] was one of the first architectural-level frameworks for analyzing and optimizing microprocessor power dissipation. Wattch is orders of magnitude faster than layout-level power tools, with accuracy within 10% of verified industry tools on leading-edge designs. It is an architecture-level, parameterizable simulator framework that can accurately quantify potential power consumption in microprocessors. The Wattch
framework quantifies the power consumption of all the major units of the processor, parameterizes them, and integrates these power estimates into a high-level simulator. Wattch models the main processor units as array structures, fully associative content-addressable memories, combinational logic and wires, or clocking elements. Individual capacitances of each of these elements are estimated and power is calculated. Work presented in [95] integrates Wattch into the SimpleScalar architectural simulator[96].
A tool like Ruby[97] allows one to simulate a complete distributed memory hierarchy with an on-chip network, as in Orion. However, it needs to be augmented with a detailed interconnect model that accounts for the physical area of the tiles and their placements. Network processor exploration and power estimation tools utilize models for smaller components and quote the integrated power for the system[98][99][100]. They use the cycle-accurate register, cache and arbiter models introduced earlier. NePSim[99] is an open-source integrated simulation infrastructure. Typical network processors can be
simulated with the cy