
  • 8/12/2019 NoC.design.and.Optimization.of.Multicore.media.processors.thesis

    1/194

    NoC Design & Optimization of Multicore Media

    Processors

    A Thesis

    Submitted for the Degree of 

    Doctor of Philosophy

    in the Faculty of Engineering

    by

    Basavaraj T

    DEPARTMENT OF ELECTRICAL AND COMMUNICATION

    ENGINEERING

    INDIAN INSTITUTE OF SCIENCE

    BANGALORE – 560 012, INDIA

    July 2013


    Abstract

Network on Chips [1][2][3][4] are critical elements of modern System on Chip (SoC) as well as Chip Multiprocessor (CMP) designs. Network on Chips (NoCs) help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. With power having become a serious design constraint [5], there is a great need for designing a NoC which meets the target communication requirements while minimizing power, using all the techniques available at the architecture, microarchitecture and circuit levels of the design. This thesis presents a holistic, QoS based, power optimal design solution for a NoC inside a CMP, taking into account link microarchitecture and processor tile configurations.

Guaranteeing QoS in NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. The Label Switching based Network-on-Chip (LS-NoC) uses a centralized LS-NoC Management framework that engineers traffic into QoS guaranteed routes. LS-NoC uses label switching, enables bandwidth reservation, allows physical link sharing and leverages advantages of both packet and circuit switching techniques. A flow identification algorithm takes into account the bandwidth available in individual links to establish QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs.
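The flow identification step can be pictured as a max-flow computation over the residual bandwidths of individual links (the thesis's flow algorithm, described in Appendix C, builds on Ford-Fulkerson). The sketch below is illustrative only: the 4-node graph, the capacity values and the function name are assumptions, not the LS-NoC Manager's actual implementation.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow: repeatedly find a shortest augmenting path
    and saturate it. `capacity` is a dict-of-dicts of available link
    bandwidths (e.g. in Gbit/s); the function returns the total flow."""
    # Build a residual graph, adding zero-capacity reverse edges.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for an augmenting path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no more bandwidth-feasible routes
        # Trace the path and find its bottleneck bandwidth.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        # Reserve bandwidth: shrink forward, grow reverse capacity.
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

# Toy fragment: capacities are per-link available bandwidth.
links = {"S": {"A": 2, "B": 1}, "A": {"T": 1, "B": 1},
         "B": {"T": 2}, "T": {}}
print(max_flow(links, "S", "T"))  # -> 3
```

Each augmenting path corresponds to a candidate pipe; reserving its bottleneck bandwidth updates the residual capacities seen by subsequent connection requests.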

A multicast and broadcast capable label switched router for the LS-NoC has been designed, verified, synthesized, placed and routed, and timing analyzed. A 5 port, 256


bit data bus, 4 bit label router occupies 0.431 mm² in 130 nm and delivers a peak bandwidth of 80 Gbit/s per link at 312.5 MHz. The LS router is estimated to consume 43.08 mW. Bandwidth and latency guarantees of LS-NoC have been demonstrated on streaming applications like HiperLAN/2 and an Object Recognition Processor, Constant Bit Rate traffic patterns, and video decoder traffic representing Variable Bit Rate traffic. LS-NoC was found to have a competitive (Area × Power)/Throughput figure of merit compared with state-of-the-art NoCs providing QoS. We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements.
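The quoted peak link bandwidth follows directly from the data bus width and clock, and the figure of merit can be checked with simple arithmetic. In this sketch, using the single-link peak bandwidth as the throughput term is purely an illustrative assumption; the thesis's comparison uses measured NoC throughput.

```python
# Peak per-link bandwidth of the LS router: bus width x clock.
bus_bits = 256
clock_hz = 312.5e6
peak_gbps = bus_bits * clock_hz / 1e9
print(peak_gbps)  # -> 80.0

# (Area x Power) / Throughput figure of merit with the reported
# router area and power; treating the per-link peak bandwidth as
# the throughput term is an assumption for illustration.
area_mm2, power_mw = 0.431, 43.08
fom = area_mm2 * power_mw / peak_gbps  # mm^2 . mW / (Gbit/s)
print(round(fom, 4))  # -> 0.2321
```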

Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage, threshold voltage, activity and coupling factors. An optimal link configuration, in terms of the number of pipeline stages for a given link length and desired operating frequency, is arrived at. Optimal configurations of all links in the NoC are identified and a power-performance optimal NoC is presented. We present a latency, power and performance trade-off study of NoCs using link microarchitecture exploration. The design and implementation of a framework for such a design space exploration study is also presented. We present the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters.
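The existence of an optimal pipeline depth can be illustrated with a toy model: deeper pipelining shortens each wire segment, raising the link frequency and cutting queuing delay, but adds flop overhead to the traversal time and latch energy to each flit. All constants below (flop overhead, wire delay, energy weights, offered load) are invented for illustration and are not the thesis's Intacte-derived numbers.

```python
# Toy link model: n pipeline stages over a repeated wire.
T_FLOP = 50e-12            # flop overhead per stage (s), assumed
T_WIRE = 2e-9              # end-to-end wire delay of the link (s), assumed
E_WIRE, E_FLOP = 1.0, 0.1  # normalized energy: wire, per pipeline stage
RATE = 0.8e9               # offered load (flits/s), assumed

def edp(n):
    cycle = T_FLOP + T_WIRE / n  # clock period set by one wire segment
    freq = 1.0 / cycle           # link service rate (flits/s)
    if freq <= RATE:
        return None              # link saturates: depth unusable
    # Total delay = pipeline traversal + simple M/M/1-style queuing.
    latency = n * cycle + 1.0 / (freq - RATE)
    energy = E_WIRE + n * E_FLOP # energy per flit
    return energy * latency      # energy-delay product

results = {n: edp(n) for n in range(1, 9) if edp(n) is not None}
best = min(results, key=results.get)
print(best)  # -> 5
```

Under these assumed constants the energy-delay product falls as depth rises (queuing delay shrinks), then climbs again as flop overhead and latch energy dominate, giving an interior optimum.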

A SystemC based NoC exploration framework is used to explore the impact of various architectural and microarchitectural level parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., and allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results using this framework to study a 4x4 CMP are presented. The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks [6].
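A minimal sketch of how such a framework's configuration space can be enumerated. The parameter names and the swept values below are invented for illustration; Appendix A lists the framework's actual parameters and defaults.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NoCConfig:
    topology: str          # e.g. "mesh", "torus", "folded_torus"
    routing: str           # e.g. "xy" dimension-ordered routing
    link_length_mm: float  # per-link wire length
    wire_width_um: float
    wire_pitch_um: float
    pipeline_stages: int
    vdd_v: float
    freq_ghz: float

# Enumerate microarchitectural knobs for a fixed 4x4 mesh, mirroring
# the latency/power/throughput sweeps described above.
sweep = [
    NoCConfig("mesh", "xy", 2.5, w, p, n, v, f)
    for (w, p) in [(0.4, 0.8), (0.6, 1.2)]   # wire width/pitch pairs
    for n in (1, 2, 4, 8)                    # pipeline depths
    for v in (0.9, 1.1)                      # supply voltages
    for f in (1.0, 2.0)                      # clock frequencies
]
print(len(sweep))  # -> 32
```

Each configuration would then be fed to the simulator, with latency, power and throughput recorded per point.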

One of the key findings is that the average latency of a link can be reduced by increasing the pipeline depth to a certain extent, as it enables link operation at higher link frequencies.


There exists an optimum degree of pipelining which minimizes the energy-delay product of the link. In a 2D Torus, the least latency (1.56 times the minimum) is achieved when the longest link is pipelined by 4 stages, at which point power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between various pipeline configurations achieving the same frequency at constant voltages. In some cases, we also find that switching to a deeper pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. We further find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the mesh seems to perform the best amongst the three topologies (Mesh, Torus and Folded Torus) considered in the case studies.

The effects of communication overheads on the performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as primary exploration parameters, are presented using accurate interconnect, processor, on-chip and off-chip memory modelling. On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. Large caches imply a larger tile area, which results in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Smaller caches potentially have a higher number of misses and more frequent off-tile communication. Energy efficient tile design is a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal configuration for the CMP.
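The coupling between cache size, tile area and link pipelining can be captured in a toy model: link length tracks the tile edge (roughly the square root of tile area), and the number of pipeline stages is whatever fits the wire delay into the target clock. The areas, delay constant and flop overhead below are assumed values for illustration, not the cacti/Intacte numbers used in the thesis.

```python
import math

def link_stages(tile_area_mm2, target_ghz=2.0,
                delay_ns_per_mm=0.25, flop_overhead_ns=0.05):
    """Pipeline stages needed on an inter-tile link, toy model."""
    length_mm = math.sqrt(tile_area_mm2)  # tile edge ~ link length
    cycle_ns = 1.0 / target_ghz
    # Each segment must fit its wire delay plus flop overhead
    # into one clock cycle.
    per_stage_ns = cycle_ns - flop_overhead_ns
    return math.ceil(length_mm * delay_ns_per_mm / per_stage_ns)

for area in (1.0, 4.0, 9.0):  # small / medium / large tiles (mm^2)
    print(area, link_stages(area))
# -> 1.0 1
#    4.0 2
#    9.0 2
```

Bigger caches push tile area up and so lengthen links, forcing extra pipeline stages (and latency) at a fixed clock; this is the trade-off the tile exploration quantifies.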

Trade-offs are explored using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K point FFT). Link latencies are estimated for a 16 core CMP simulation on Sapphire. Each tile has a single processor, L1 and L2 caches and a router. Different sizes of L1 and L2 lead to different tile clock speeds, tile miss rates and tile area, and hence


    Acknowledgements

I thank my advisor, Prof. Bharadwaj Amrutur, for his invaluable guidance throughout my Ph.D. I thank all of you who have shared many precious moments with me and enriched my journey through life.


    Publications

Journals

•   Basavaraj Talwar and Bharadwaj Amrutur, “Traffic Engineered NoC for Streaming Applications”, Microprocessors and Microsystems, 37 (2013), 333-344.

    Conferences

•   Basavaraj Talwar and Bharadwaj Amrutur, “A System-C based Microarchitectural Exploration Framework for Latency, Power and Performance Trade-offs of On-Chip Interconnection Networks”, First International Workshop on Network on Chip Architectures, Nov. 2008.

•   Basavaraj Talwar, Shailesh Kulkarni and Bharadwaj Amrutur, “Latency, Power and Performance Trade-offs in Network-on-Chips by Link Microarchitecture Exploration”, 22nd Intl. Conference on VLSI Design, Jan. 2009.


    Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Network-on-Chip
  1.2 Switching Policies
    1.2.1 Circuit Switching
    1.2.2 Packet Switching
    1.2.3 Label Switching
  1.3 QoS in NoCs
  1.4 QoS Guaranteed NoC Design
  1.5 Contributions of the Thesis
    1.5.1 Link Microarchitecture Exploration
    1.5.2 Optimal CMP Tile Configuration
    1.5.3 QoS in NoCs
  1.6 Organization of the Thesis

2 Related Work
  2.1 Traffic Engineered NoC for Streaming Applications
    2.1.1 QoS in Packet Switched Networks
    2.1.2 QoS in Circuit Switched Networks
    2.1.3 QoS by Space Division Multiplexing
    2.1.4 Static routing in NoCs
    2.1.5 MPLS and Label Switching in NoCs
    2.1.6 Label Switched NoC
  2.2 Link Microarchitecture and Tile Area Exploration
    2.2.1 NoC Design Space Exploration
  2.3 Simulation Tools
    2.3.1 Link Exploration Tools
    2.3.2 Router Power and Architecture Exploration Tools
    2.3.3 Complete NoC Exploration
    2.3.4 CMP Exploration Tools
    2.3.5 Communication in CMPs - Performance Exploration


  2.4 Summary

3 Link Microarchitecture Exploration
  3.1 Motivation for a Microarchitectural Exploration Framework
  3.2 NoC Microarchitectural Exploration Framework
    3.2.1 Traffic Generation and Distribution Models
    3.2.2 Router Model
    3.2.3 Power Model
  3.3 Case Study: Mesh, Torus & Folded-Torus
    3.3.1 NoC Topologies
    3.3.2 Round Trip Flit Latency & NoC Throughput
    3.3.3 NoC Power/Performance/Latency Tradeoffs
    3.3.4 Power-Performance Tradeoff With Frequency Scaling
    3.3.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.4 Case Study: Torus, Reduced Torus & Tree based NoC
    3.4.1 NoC Topologies
    3.4.2 NoC Throughput
    3.4.3 NoC Power/Performance/Latency Tradeoffs
    3.4.4 Power-Performance Tradeoff With Frequency Scaling
    3.4.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.5 Conclusion

4 Tile Exploration
  4.1 Motivation
  4.2 Observations and Contributions
  4.3 Background
  4.4 Communication Time and Energy Efficiency
  4.5 Experimental Setup
    4.5.1 Experimental Methodology
  4.6 Effect of Link Latency on Performance of a CMP
  4.7 Communication in CMPs
  4.8 Program Completion Time
  4.9 Ideal Interconnects, Custom Floorplanning, L2 Banks and Process Mapping
  4.10 Remarks & Conclusion

5 Label Switched NoC
  5.1 Streaming Applications in Media Processors
    5.1.1 HiperLAN/2
    5.1.2 Object Recognition Processor
  5.2 LS-NoC - Motivation
  5.3 LS-NoC - The Concept
  5.4 LS-NoC - Working
  5.5 Label Switched Router Design
    5.5.1 Pipes & Labels


    5.5.2 Label Swapping
  5.6 Simulation and Functional Verification
  5.7 Synthesis Results
  5.8 Conclusion

6 LS-NoC Management
  6.1 LS-NoC Management
    6.1.1 NoC Manager
    6.1.2 Traffic Engineering in LS-NoC
  6.2 Flow Based Pipe Identification
  6.3 Fault Tolerance in LS-NoC
  6.4 Overhead of NoC Manager
    6.4.1 Computational Latency
    6.4.2 Configuration Latency
    6.4.3 Scalability of LS-NoC
  6.5 Number of Pipes in an NoC
    6.5.1 Minimum, Maximum and Typical Pipes in a Network
  6.6 Conclusion

7 Label Switched NoC
  7.1 HiperLAN/2 baseband processing + Object Recognition Processor SoC
  7.2 Video Streaming Applications
  7.3 Discussion
    7.3.1 Design Philosophy of LS-NoC
    7.3.2 LS-NoC Application
    7.3.3 LS-NoC Evaluation
  7.4 Conclusion

8 Conclusion and Future Work
  8.1 Link Microarchitecture Exploration
  8.2 Optimal CMP Tile Configuration
  8.3 Label Switched NoC for Streaming Applications
  8.4 Future Work

A Interface and Outputs of the SystemC Framework

B Testing & Validation of LS-NoC
  B.1 Implementation of LS-NoC Router
  B.2 Testing and Validation of LS-NoC Router
    B.2.1 Individual Router
    B.2.2 Router in 8×8 Mesh
  B.3 Synthesis & Place and Route


C The Flow Algorithm
  C.1 Ford-Fulkerson’s MaxFlow Algorithm
  C.2 Input Graph
  C.3 Edges in the Input Graph

Bibliography


    List of Tables

3.1 ICN exploration framework parameters.
3.2 Traffic Generation/Distribution Model and Experiment Setup for the Mesh, Torus & Folded-Torus case study.
3.3 Links and pipelining details of NoCs.
3.4 DLA traffic, Frequency crossover points in 2D Mesh.
3.5 Comparison of 3 topologies for DLA traffic.
3.6 Experimental Setup.
3.7 Links and pipelining details of NoCs.
3.8 Power optimal frequency trip points in various NoCs.
3.9 Comparison of 3 topologies. Maximum interconnect network performance and power consumption for varying pipe stages.
4.1 Configuration parameters of processors, caches & interconnection network used in experiments.
4.2 Scaled processor power over L1 configurations.
4.3 Primary and Secondary cache parameters (access time, area) obtained from cacti. L2 access latencies as a function of L1 access times are also shown.
4.4 Max operating frequencies, Dynamic energy per access of various L1/L2 caches. Values were calculated using cacti power models using 32nm PTM.
4.5 Lengths of links between L1/L2 caches & routers and between routers of neighbouring tiles for a regular mesh placement. No. of pipeline stages required to meet the maximum frequency are also shown.
4.6 FFT. Power spent in links (in mW).
4.7 Total messages in transit (in Millions).
4.8 Clustered tile placement floorplan for L1: 256KB and L2: 512KB. Lengths of links between neighbouring routers and number of pipeline stages are shown. Frequency: 1.38 GHz.
5.1 Communication characteristics between HiperLAN/2 nodes.
5.2 Routing table of an n port (n = 5) router with an lw bit (lw = 4) label, indexed by labels used in the label switched NoC. Size of the routing table = 2^lw × n × lw.
5.3 Simulation parameters used for functional verification of the label switched router design.


5.4 Synthesis Parameters.
5.5 Synthesis results for 2 Router and Mesh networks. Area of a Router is 0.431 mm².
6.1 NoC Manager Overhead.
7.1 Pipes set up for HiperLAN/2 baseband processing SoC and Object Recognition Processor SoC (Figure 7.1(a)). PEC[0-7]→PEC[0-7]: every PEC communicates with every other PEC.
7.2 Standard test videos used in experiments.
7.3 Evaluation of the proposed Label Switched Router and NoC. CS: Circuit switched, PS: Packet switched.
A.1 ICN exploration framework parameters and their default values.
C.1 Routing tables at R0 I0, R0 I2 and R1 I4 nodes after pipes P0 and P1 have been set up.


    List of Figures

1.1 Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.
2.1 Floorplan used in estimating wire lengths. Wire lengths estimated from these floorplans are used as input to Intacte to arrive at a power optimal configuration and latency in clock cycles. Horizontal R-R: Link between neighboring routers in the horizontal direction, Vertical R-R: Link between neighbouring routers in the vertical direction.
3.1 Architecture of the SystemC framework.
3.2 Flow of the ICN exploration framework.
3.3 Flit header format. DSTID/SRCID: Destination/Source ID, SQ: Sequence Number, RQ & RP: Request and Response Flags and a 13 bit flit id.
3.4 Example flit header formats considered in this experiment. (DST/SRCID: Destination/Source ID, HC: Hop Count, CHNx: Direction at hop x).
3.5 Schematic of 3 compared topologies (L to R: Mesh, Torus, Folded Torus). Routers are shaded and Processing Elements (PE) are not.
3.6 Normalized average round trip latency in cycles vs. traffic injection rate in all the 3 NoCs.
3.7 Max. frequency of links in 3 topologies. Lengths of longest links in Mesh, Torus and Folded 2D Torus are 2.5mm, 8.15mm and 5.5mm.
3.8 Total NoC throughput in 3 topologies, DLA traffic.
3.9 Avg. round trip flit latency in 3 NoCs, DLA traffic.
3.10 2D Mesh Power/Throughput/Latency trade-offs for DLA traffic. Normalized results are shown.
3.11 2D Mesh Power/Throughput/Latency trade-offs for SLA traffic.
3.12 DLA Traffic, 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.13 DLA Traffic, Folded 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.14 Frequency scaling on 3 topologies, DLA Traffic.
3.15 Dynamic voltage scaling on 2D Mesh, DLA Traffic. Frequency scaled curve for P=8 is also shown.


3.16 Schematic representation of the three compared topologies (L to R: 2D Torus, Tree, Reduced 2D Torus). Shaded rectangles are Routers and white boxes are source/sink Processing Element (PE) nodes.
3.17 Floorplans of the three compared topologies.
3.18 Maximum attainable frequency by links in the respective topologies. Estimated length of the longest link in a 2D Torus is 7mm. Estimated longest link in the Tree based and Reduced 2D Torus is 3.5mm.
3.19 Variation of total NoC throughput with varying pipeline stages in all three topologies.
3.20 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.21 Reduced 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.22 Variation of NoC power with throughput for each topology.
3.23 Effects of dynamic voltage scaling on the power and performance of a 2D Torus. Highest frequencies of operation for P=1, 2, 4 and 7 are 0.93GHz, 1.68GHz, 2.92GHz and 4.22GHz. Power consumption of the frequency scaled NoC is shown for comparison.
4.1 Error in performance measurement between real and ideal interconnect experiments.
4.2 Schematic of a multiprocessor architecture comprising tiles and an interconnecting network. Each tile is made up of a processor, L1 and L2 caches.
4.3 Flowchart illustrating the steps in the experimental procedure.
4.4 Tile floorplans for different (L1, L2) sizes. From left: (8KB, 64KB), (64KB, 1MB), (128KB, 4MB).
4.5 Mesh floorplans used in experiments. From left: Conventional 2D Mesh topology, a clustered topology, a cluster topology with L2 bank and thread mapping, and a mesh topology with L2 bank and thread mapping.
4.6 Benchmark execution time vs. Communication time - DRAM access time and On-chip transit time vs. L2 cache size vs. Program completion time.
4.7 Program energy vs. Communication time.
4.8 64K point FFT benchmark execution time vs. Total time spent in on-chip message transit. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.9 64K point FFT execution time vs. Total time spent in DRAM (off-chip) accesses. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.10 Total messages over all the links during the execution of the benchmark and Average transit time of a message.
4.11 FFT. Total instructions executed and power spent in the memory hierarchy and on-chip links during the execution.


4.12 FFT Benchmark. Energy per Instruction and Instructions per second² per Watt.
4.13 Y1: PCT, Y2: on-chip transit and off-chip comm. times.
4.14 FFT benchmark results. (Program Completion Time, comm.: communication)
4.15 FFT benchmark results.
4.16 Program Completion Times.
4.17 Alternative Tile Placements, custom process scheduling example and ideal interconnect comparison results. Benchmark: FFT, L1: 256K, L2: 512K.
5.1 (a) Process graph of a HiperLAN/2 baseband processing SoC [7] and (b) NoC of the Object recognition processor [8].
5.2 A 64 Node, 8 × 8 2D LS-NoC along with NoC Manager interface to routing tables.
5.3 Pipe establishment and label swapping example in a 3×3 LS-NoC.
5.4 Label Switched Router with single cycle flit traversal. Valid signal identifies Data and Label as valid. PauseIn and PauseOut are flow control signals for downstream and upstream routers. Routing table has output port and label swap information. Arbiter receives input from all the input ports along with the flow control signal from the downstream router.
5.5 Label conflict at R1 resolved using Label swapping. il: Input Label, Dir: Direction, ol: Output Label.
6.1 Surveillance system showing the application of LS-NoC in the Video computation server.
6.2 A 2 router, 6 communicating nodes linear network. (b) Multiple source, multiple sink flow calculation in a network.
6.3 (a) Number of pipes in a linear network (Fig. 6.2(a)), lw = 3 bits, varying constraints. Constraint 1: Max 1 pipe per sink. (b) Max. number of pipes in 2D Mesh (Fig. 5.2).
7.1 (a) Process blocks of HiperLAN/2 baseband processing SoC and Object recognition processor mapped on to an 8 × 8 LS-NoC. Pipe 1: PEC0 → PEC6, Pipe 2: MP → PEC3. (b) Flows set up for CBR & VBR traffic.
7.2 Latency of HiperLAN/2 and ORP pipes in LS-NoC over varying injection rates of non-streaming application nodes. Latencies of non-provisioned paths are titled (U).
7.3 (a) Latency of CBR traffic over various injection rates of non-streaming nodes in LS-NoC. (b) Latency of VBR traffic over various injection rates of non-streaming nodes in LS-NoC.
7.4 LS-NoC being used alongside a best effort NoC.
B.1 Modules in LS-NoC router design shown along with testbench, implemented in Verilog.
B.2 Test cases used to verify an individual LS-NoC router.


    LIST OF FIGURES    xvi

B.3 8×8 mesh used for testing LS-NoC. 152

B.4 Traffic test cases used to verify proper functioning of LS-NoC router. 153

B.5 Flowchart illustrating steps in Synthesis and Place & Route steps of the LS-NoC router. 154

B.6 Placed and routed output - Single Router. 154

C.1 Steps in the flow algorithm example. (a) Input Graph. Maximum flows have to be identified between nodes X & Y. (b) Available capacities of links after flows X→A→C→Y & X→B→C→Y are set up. (c) Residual network showing available capacities of links in the forward direction and utilized capacity in the reverse. (d) Residual network after adding the flow: X→A→C→B→D→E→Y. (e) Final output of the maxflow algorithm showing 3 flows from X to Y. 156

C.2 (a) A 2 router, 6 source+sink system used for validation of the LS-NoC router design. Graph representation of the system used as input to the flow algorithm is shown in (b). 157

C.3 The NoC after two pipes, P0 and P1 have been established. P0: R0S0→R1D2 and P1: R0S2→R1D0. 158


    Chapter 1

    Introduction

    1.1 Network-on-Chip

    Network on Chips[1][2][3][4] are critical elements of modern Chip Multiprocessors (CMPs)

    and System on Chips (SoCs). Network on Chips (NoCs) help manage high complexity of 

    designing large chips by decoupling computation from communication. SoCs and CMPs

    have a multiplicity of communicating entities like programmable processing elements,

    hardware acceleration engines, memory blocks as well as off-chip interfaces. Using an

    NoC enables modular design of communicating blocks and network interfaces. NoCs

    help achieve a well structured design enabling higher performance while servicing larger

    bandwidths compared to bus based systems[1]. Links in NoCs designed with controlled

electrical parameters can use aggressive signaling circuits to reduce power and delay[9].

    Network resources are utilized more efficiently in NoCs as compared to global wires[10].

    Communication patterns between communicating entities are application dependent.

    As a result, NoCs are expected to cater to diverse connections varying in forms of connec-

tivity, burstiness, latency and bandwidth requirements. NoCs servicing communication re-

    quirements in CMPs or SoCs are expected to meet Quality of Service (QoS) demands such

    as maximum or average latency, typical or peak bandwidth and required throughput of 

    executing applications. Further, with power having become a serious design constraint[5],


    there is a great need for designing NoC which meets the target communication require-

ments, while minimizing power using various strategies at the architecture, microarchitecture and circuit levels of the design.

    1.2 Switching Policies

    Switching policies configure paths in routers to facilitate data transfer between input and

    output ports. Programming of internal switches in routers to connect input ports to out-

put ports and determination of when and which data units are transferred is accomplished using switching policies. Flow control mechanisms synchronize data transfer between

    router and traffic sources and between two routers. Switching policies and flow control

    mechanisms influence the design of internal switches, routing and arbitration units, and

    the amount of buffers in a router. The major types of switching policies are introduced

    here.

    1.2.1 Circuit Switching

    Circuit switching is a reservation based switching policy in which network resources are

    allocated to a communication path before data is transferred. At the end of data transfer,

    reserved resources are de-allocated and are available for future circuits. As circuits are

    used on a reservation basis, circuit switching requires a simple router design with a few

    or no buffers.

    Circuits are established using path identifying probe packets that reserve resources

    as they propagate towards the destination. The circuit establishment is complete after

    an acknowledgment message is received by the source. Data is transferred along the cir-

    cuit without further monitoring or control. After the transfer is complete, the circuit is

    torn down and resources freed using a tail packet. Popular examples of circuit switched

    networks are Autonomous Error-Tolerant Cell[11], Asynchronous SoC[12], Crossroad[13],

    dTDMA[14], Point to point network on real time systems[15], Programmable NoC for

    FPGA-based systems[16], ProtoNoC[17], Space Division Multiplexing based NoC[18],


    SoCBuS[19], Reconfigurable Circuit Switched NoC[7], etc.
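The reserve-transfer-teardown lifecycle described above can be sketched as a toy model. The class, link names and the rollback-on-failure policy below are illustrative assumptions, not the protocol of any of the cited NoCs:

```python
# Toy model of circuit switching: a probe walks a fixed route, reserving
# each link; if any link is already held by another circuit, the partial
# reservation is rolled back (a failed setup attempt). A tail packet is
# modeled by teardown(), which frees all links of the circuit.

class CircuitNetwork:
    def __init__(self, links):
        # links: iterable of (router_a, router_b) undirected link endpoints
        self.busy = {frozenset(l): None for l in links}

    def setup(self, circuit_id, route):
        """Try to reserve every link along route (a list of routers)."""
        hops = [frozenset(p) for p in zip(route, route[1:])]
        reserved = []
        for hop in hops:
            if self.busy[hop] is not None:     # link held by another circuit
                for r in reserved:             # roll back the partial probe
                    self.busy[r] = None
                return False
            self.busy[hop] = circuit_id
            reserved.append(hop)
        return True                            # models the ack to the source

    def teardown(self, circuit_id):
        # the tail packet frees every link reserved by this circuit
        for hop, owner in self.busy.items():
            if owner == circuit_id:
                self.busy[hop] = None

net = CircuitNetwork([("R0", "R1"), ("R1", "R2"), ("R2", "R3")])
assert net.setup("c0", ["R0", "R1", "R2"])      # circuit established
assert not net.setup("c1", ["R1", "R2", "R3"])  # blocked on link R1-R2
net.teardown("c0")
assert net.setup("c1", ["R1", "R2", "R3"])      # succeeds after teardown
```

The rollback mirrors why circuit-switched routers need few or no buffers: no data moves until the whole path is held.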

    1.2.2 Packet Switching

    In packet switching, the message to be transmitted is partitioned and transmitted as

    fixed-length packets. Routing and control is handled on a per packet basis. The packet

    header includes routing and other control information needed for the packet to reach

    the destination. Packet switching increases network resource utilization as communica-

    tion channels share resources along the path. Buffers and arbitration units in routers

    manage resource conflicts and storage demands in communication paths. Packet switch-

    ing networks aid IP block re-use and are scalable[20]. Packet-switching is more flexible

    than circuit switching though it requires buffering and introduces unpredictable latency

    (jitter). Popular packet switched networks are Asynchronous NoC[21], FAUST[22], Ar-

    teris NoC[23], Butterfly Fat Tree[24], DyAD[25], Eclipse[26], MANGO[27], Proteo[28],

    QNoC[29], SPIN[30], etc. Some NoC designs can adaptively work in circuit or packet

    switched modes based on traffic requirements. A few examples are Æthereal[31], Hetero-

    geneous IP Block Interconnection[32], dynamically reconfigurable NoC[33], Octagon[34],

    etc.
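The partitioning of a message into fixed-length packets carrying routing headers can be illustrated with a small sketch; the 3-byte header layout (source, destination, sequence number) is a made-up example, not any specific NoC's packet format:

```python
def packetize(message: bytes, payload_size: int, src: int, dst: int):
    """Split a message into fixed-length packets, each with a small header
    used for routing and reassembly (illustrative layout: src, dst, seq)."""
    packets = []
    for seq, off in enumerate(range(0, len(message), payload_size)):
        chunk = message[off:off + payload_size]
        chunk = chunk.ljust(payload_size, b"\x00")   # pad the last packet
        header = bytes([src, dst, seq])
        packets.append(header + chunk)
    return packets

def reassemble(packets, length):
    # packets may arrive out of order; sequence numbers restore order
    body = b"".join(p[3:] for p in sorted(packets, key=lambda p: p[2]))
    return body[:length]

msg = b"network on chip"
pkts = packetize(msg, 4, src=1, dst=6)
assert len(pkts) == 4 and all(len(p) == 7 for p in pkts)
assert reassemble(reversed(list(pkts)), len(msg)) == msg
```

Per-packet headers are what let links be shared between flows, at the cost of buffering and arbitration at every router.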

    1.2.3 Label Switching

    Label switching is used by technologies such as ATM[35][36] and Multiprotocol Label

    Switching (MPLS)[37] as a packet relaying technique. Individual packets carry route in-

    formation in the form of labels. A label denotes a common route that a set of data packets

    traverse. Therefore, a minimalistic label identifies the source hop and the destination hop

    along with the intermediate transit routers. Along with routing information, labels can

    be used to specify service priorities to packets. This feature of labels enables use of dif-

    ferentiated services for packets using common labels. Routers along the path use the

    label to identify the next hop, forwarding information, traffic priority, Quality of Service

    guarantees and the next label to be assigned. Label switching inherently supports traffic

    engineering, as labels can be chosen based on desired next hop or required QoS services.


    A few proposals of label switched NoCs are MPLS NoC[38], Nexus[39] and Blackbus[40].
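As a rough illustration of the label relaying described above, each router can be modeled as a table mapping an incoming label to an output port and a swapped outgoing label; all router names, labels and the `EJECT` convention below are hypothetical:

```python
# Minimal label-swapping forwarder: every router looks up only the label,
# not the destination address, so route decisions are a single table read.

def forward(routing_tables, ingress, label):
    """Follow a label-switched path hop by hop until a router maps the
    label to its local 'EJECT' port. Returns the list of routers visited."""
    path, router = [], ingress
    while True:
        path.append(router)
        out_port, out_label = routing_tables[router][label]
        if out_port == "EJECT":
            return path
        router, label = out_port, out_label   # out_port names the next hop

tables = {
    "R0": {5: ("R1", 2)},        # label 5 in -> forward to R1 with label 2
    "R1": {2: ("R2", 2)},        # a swap here would resolve a label conflict
    "R2": {2: ("EJECT", None)},  # destination hop delivers the flit
}
assert forward(tables, "R0", 5) == ["R0", "R1", "R2"]
```

Because labels only need to be unique per router (not network-wide), they can be shorter than node identifiers, which is the routing-table memory saving noted later in the thesis.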

    1.3 QoS in NoCs

    NoCs servicing CMPs and SoCs are expected to meet Quality of Service (QoS) demands

    of executing applications. Latency sensitive applications demand a guaranteed average

    and maximum latency on communication traffic. Jitter sensitive applications may tolerate

longer latencies but require fixed delay along communication paths. Further, among

classes of applications, some have higher priority than others. For example, application data usually has higher priority than acknowledgment packets or control information.

    The two basic approaches in NoC designs to enable QoS guarantees are: creation of 

    reserved connections between source and destinations via circuit switching or support for

    prioritized routing (in case of packet switched, connectionless paths).

    Circuit switched NoCs guarantee high data transfer rates in an energy efficient manner

    by reducing intra-route data storage[41]. Circuit switched NoCs provide guaranteed QoS

    for worst case traffic scenarios leading to higher network resource requirements[42]. These

    are well suited for streaming traffic generated by media processors where communication

    requirements are well known a priori. One of the drawbacks here is under utilization

    of network resources as resources are reserved for peak bandwidth while the average

requirement might be lower.

    Packet switched networks provide efficient interconnect utilization and high throughputs[43]

    while providing fairness amongst best effort flows. However, network resources in packet

    switched networks need to be over-provisioned to support QoS for various traffic classes

    and have high buffer requirements in routers. Packet switching networks usually provide

    QoS by differentiated services to traffic by classifying them into various classes[29]. Pri-

    oritized services are provided to traffic belonging to each class. Due to the sharing of 

    network resources, packet switched networks can be configured to provide Guaranteed

    Throughput (GT) for a few classes of traffic and Best Effort (BE) services for remaining

    classes.

    With traffic engineering enabled label switching networks, communication loads can


    be distributed over the NoC resulting in fair allocation of network resources. Network

resource guarantees enable paths with little or no jitter while keeping network utilization fairly high. Further, the design of routers is simplified compared to conventional wormhole

    routers[40].

    1.4 QoS Guaranteed NoC Design

    Media processors with streaming traffic such as HiperLAN/2 Baseband Processors[7],

Real-time Object Recognition Processors[8] and H.264 encoders[44][45] demand adequate bandwidth and bounded latencies between communicating entities. They also have

    well known communication patterns and bandwidth requirements. Adequate throughput,

    latency and bandwidth guarantees between process blocks have to be provided for such

    applications. Nature of streaming applications in media processors and characteristics of 

    streaming traffic are illustrated in Section 5.1 of Chapter 5.

    Guaranteeing QoS by NoCs involves guaranteeing bandwidth and throughput for con-

    nections and deterministic latencies in communication paths. This thesis proposes a QoS

    guaranteeing NoC using label switching where bandwidth can be reserved while links are

    shared. The traffic is engineered during route setup and it leverages advantages of both

    packet and circuit switching techniques. We propose a QoS based Label Switched NoC

    (LS-NoC) router design. We present a latency, power and performance optimal intercon-

    nect design methodology considering low level circuit and system parameters. Further,

    optimal tile configurations are identified using effects of application communication traffic

    on performance and energy in chip multiprocessors (Figure 4.2).

    A label switched, QoS guaranteeing NoC, that retains advantages of both packet

    switched and circuit switched networks is the main focus of this thesis. Congestion free

    communication pipes are identified by a centralized Manager with complete network vis-

    ibility. Label Switched NoC (LS-NoC) sets up communication channels (pipes) between

    communicating nodes that are independent of existing pipes and are contention free at the

    routers. Deterministic delays and bandwidth are guaranteed in newly established pipes,

    taking into account established flows. Residual bandwidth in links reserved by a pipe can


Figure 1.1: Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.

    be utilized by other pipes, thus enabling sharing of physical links between pipes without

    compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of 

    spatial separation of the communicating entities.
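The link-sharing admission rule described above can be sketched as follows: a new pipe is admitted only if every link on its route has sufficient residual bandwidth, so already-established guarantees are never violated. Capacities, routes and units below are illustrative assumptions:

```python
# Sketch of bandwidth-provisioned pipes sharing physical links. A pipe
# reserves bandwidth on each link of its route; the residual bandwidth of
# a link remains available to other pipes.

class LinkState:
    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = 0.0

    @property
    def residual(self):
        return self.capacity - self.reserved

def admit_pipe(links, route, bandwidth):
    """links: dict of link-name -> LinkState; route: list of link names."""
    if all(links[l].residual >= bandwidth for l in route):
        for l in route:
            links[l].reserved += bandwidth
        return True
    return False        # the Manager would search for another route

links = {"R0-R1": LinkState(10.0), "R1-R2": LinkState(10.0)}
assert admit_pipe(links, ["R0-R1", "R1-R2"], 6.0)   # pipe 1 established
assert admit_pipe(links, ["R0-R1"], 4.0)            # shares R0-R1's residual
assert not admit_pipe(links, ["R1-R2"], 5.0)        # would break guarantees
```

This is the sense in which LS-NoC combines circuit-switched guarantees (reservation) with packet-switched efficiency (link sharing).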

    Interconnect delay and power contribute significantly towards the final performance

    and power numbers of a CMP[46]. Design variables for interconnect exploration include

wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd),

threshold voltage (Vth), activity and coupling factors. A power and performance opti-

    mal link microarchitecture can be arrived at by optimizing these low level link param-

    eters. A methodology to arrive at the optimal link configuration in terms of number

    of pipeline stages (cycle latency) for a given length of link and desired operating fre-

    quency is presented. Optimal configurations of all links in the NoC are identified and a

    power-performance optimal NoC thus achieved.

Primary and secondary cache sizes have a major bearing on the amount of on-chip and off-chip communication in a Chip Multiprocessor (CMP). On-chip and off-chip com-

    munication times have significant impact on execution time and the energy efficiency of 

    CMPs. From a performance point of view, cache accesses should suffer minimum delay

    and off-tile communication due to cache misses should be negligible. Large caches dissi-

    pate more leakage energy and may exceed area budgets though they reduce cache misses

    and decrease off-tile communication. Larger caches result in longer inter-tile communi-

    cation link lengths and latencies, thus adversely impacting communication time. Small


caches reduce occupied tile area, have higher activity and hence dissipate less leakage

energy. A drawback of smaller caches is a potentially higher number of misses and frequent off-tile communication. This illustrates the trade-off between cache size, miss rate,

    NoC communication latency and power. Energy efficient tile design is a configuration

    exploration and trade-off study using different cache sizes and tile areas to identify a

    power-performance optimal cache size and NoC configuration for the CMP.
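The cache-size trade-off can be illustrated with a back-of-the-envelope average memory access time (AMAT) model in which hit time grows with cache size while miss rate shrinks, giving an interior optimum. All coefficients below are invented for the sketch, not measured values from the thesis:

```python
# Toy AMAT model of the cache-size trade-off: larger caches miss less but
# have longer hit latency (and, on a CMP, longer inter-tile links), so the
# best size lies between the extremes.

import math

def amat(cache_kb, miss_penalty_cycles=100.0):
    hit_time = 1.0 + 0.5 * math.log2(cache_kb)   # grows with cache size
    miss_rate = 0.20 / math.sqrt(cache_kb)       # shrinks with cache size
    return hit_time + miss_rate * miss_penalty_cycles

sizes = [16, 32, 64, 128, 256, 512]
best = min(sizes, key=amat)
assert best == 256                         # interior optimum for this model
assert best not in (sizes[0], sizes[-1])   # neither extreme wins
```

A real tile-configuration study would replace these closed forms with simulated miss rates and NoC latencies, but the shape of the trade-off is the same.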

    1.5 Contributions of the Thesis

    Work in this thesis presents methodologies for label switched QoS guaranteed NoC design,

    link microarchitecture exploration and optimal Chip Multiprocessor (CMP) tile configu-

    rations. Contributions from this thesis are listed here:

    1.5.1 Link Microarchitecture Exploration

•  Optimal Link Design and Exploration Framework:  We present a simulation framework

developed in SystemC which allows the designer to explore NoC design across low

    level link parameters such as pipelining, link width, wire pitch, supply voltage, op-

    erating frequency and NoC architectural parameters such as router type and topol-

    ogy of the interconnection network. We use the simulation framework to identify

    power-performance (Energy-Delay) optimal link configuration in a given NoC over

    particular traffic patterns. Such an optimum exists because increasing pipelining

    allows for shorter length wire segments which can be operated either faster or with

    lower power at the same speed.

    •   Optimum Pipe Depth:  Contrary to intuition, we find that increasing pipeline depth

    can actually help reduce latency in absolute time units, by allowing shorter links

    & hence higher frequency of operation. In some cases, we find that switching to

    a higher pipelining configuration can actually help reduce power as the links can

    be designed with smaller repeaters. Larger NoC power savings can be achieved by

    voltage scaling along with frequency scaling. Hence it is important to include the


    link microarchitecture parameters as well as circuit parameters like supply voltage

    during architecture design exploration of NoCs.
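The pipeline-depth optimum can be seen in a toy latency model: for an unrepeated wire whose delay grows quadratically with segment length, total link latency is WIRE_RC·L²/n + n·t_flop, which is minimized at an interior number of stages n. The constants below are illustrative only; the thesis uses circuit-level models:

```python
# Toy model showing why deeper pipelining can REDUCE absolute link latency:
# each extra stage shortens the wire segment (quadratic RC delay drops)
# at the cost of one register overhead per stage.

WIRE_RC = 0.05        # ns/mm^2, quadratic RC delay of an unrepeated wire (assumed)
FLOP_OVERHEAD = 0.10  # ns of setup + clk-to-q per pipeline register (assumed)

def link_latency(length_mm, stages):
    seg = length_mm / stages
    cycle = WIRE_RC * seg * seg + FLOP_OVERHEAD   # clock period the link dictates
    return stages * cycle                          # absolute ns to cross the link

def optimal_depth(length_mm, max_stages=8):
    return min(range(1, max_stages + 1),
               key=lambda n: link_latency(length_mm, n))

assert optimal_depth(3.0) == 2                     # short link: shallow optimum
assert optimal_depth(10.0) == 7                    # long link: deep optimum
assert link_latency(10.0, 7) < link_latency(10.0, 1)
```

Adding an energy term (wire energy plus per-stage register energy) and sweeping Vdd would turn this latency sweep into the Energy-Delay exploration the framework performs.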

    1.5.2 Optimal CMP Tile Configuration

    •  Optimal Cache Size:  The performance-power optimal L1/L2 configuration of a tile

    is close to the configuration that spends least amount of time in on-chip and off-chip

    communication.

•  Effect of Floorplanning and Process Mapping:   Communication aware floorplanning can reduce up to 2.6% of the energy spent in execution of an instruction and up to

    11% savings in communication power during the execution of the program. Mapping

    L2 banks in the same core as the processes accessing it reduces time spent in commu-

    nication and hence the overall program completion time and also has a bearing on

    the Total Energy spent in the execution of the program. Experiments have revealed

    that as much as 2% of energy per instruction can be saved by communication-aware

process scheduling compared to conventional thread mapping policies in a 2D Mesh architecture.

    1.5.3 QoS in NoCs

    •  A Label Switching NoC providing QoS guarantees:  We present a LS-NoC to service

    QoS demands of streaming traffic in media processors. A centralized NoC Man-

    ager capable of traffic engineering establishes bandwidth guaranteed communication

    channels between nodes. LS-NoC guarantees deterministic path latencies, satisfies

    bandwidth requirements and delivers constant throughput. Delay and throughput

    guaranteed paths (pipes ) are established between source and destinations along con-

    tention free, bandwidth provisioned routes. Pipes are identified by  labels  unique to

    each source node. Labels need fewer bits compared to node identification numbers

    - potentially decreasing memory usage in routing tables.

    •   NoC Manager with traffic engineering capabilities:  The NoC Manager utilizes flow


    identification algorithms to identify contention free, bandwidth provisioned paths

in LS-NoC called pipes. The LS-NoC Manager has complete visibility of the state of LS-NoC. Bandwidth requirements of the application are taken into account to

    provision routes between communicating nodes by the flow identification algorithm.

    Flow based pipe establishment algorithm is topology independent and hence the

    NoC Manager supports applications mapped to both regular chip multiprocessors

    (CMPs) and customized SoCs with non-conventional NoC topologies. Additionally,

    fault tolerance is achieved by the NoC Manager by considering link status during

    pipe establishment.

    •  Design of a Label Switched Router:  The Label Switched (LS) Router used in LS-

    NoC achieves single cycle traversal delay during no contention and is multicast and

    broadcast capable. Source nodes in the LS-NoC can work asynchronously as cycle

    level scheduling is not required in the LS Router. LS router supports multiple clock

    domain operation. Dual clock buffers can be used at output ports in the LS-NoC

    router. This eases clock domain crossovers and reduces the need for a single globally

    synchronous clock. As a result, clock tree design is less complex and clock power is

    potentially saved.
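The flow identification performed by the NoC Manager is, in spirit, a max-flow computation over link capacities. A minimal Edmonds-Karp sketch over a toy graph (not the thesis topology or its actual implementation) is:

```python
# Edmonds-Karp max-flow: repeatedly BFS for a shortest augmenting path in
# the residual network, push the bottleneck capacity, and update residuals.
# This is the style of flow computation a centralized manager can use to
# find bandwidth-provisioned, contention-free routes.

from collections import deque

def max_flow(cap, s, t):
    """cap: dict-of-dicts of residual capacities, modified in place."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:              # BFS for a shortest path
            u = q.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                           # no augmenting path left
        path, v = [], t
        while parent[v] is not None:              # recover the path edges
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:                         # update residual network
            cap[u][v] -= bottleneck
            cap.setdefault(v, {}).setdefault(u, 0)
            cap[v][u] += bottleneck
        flow += bottleneck

cap = {"X": {"A": 1, "B": 1}, "A": {"C": 1}, "B": {"C": 1, "D": 1},
       "C": {"Y": 2}, "D": {"Y": 1}}
assert max_flow(cap, "X", "Y") == 2
```

Each unit of flow found this way corresponds to a candidate pipe whose bandwidth the links can support alongside existing reservations.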

    1.6 Organization of the Thesis

    Chapter 2 highlights several works from current literature related to the broad areas

    of QoS guaranteed NoCs, link microarchitecture, design space exploration of NoCs and

    effects of communication on energy and performance trade-offs in CMPs.

    Chapter 3 presents a latency, power and performance trade-off study of NoCs through

    link microarchitecture exploration using microarchitectural and circuit level parameters.

    NoC exploration framework used in the trade-off studies is described. The interface to

the SystemC framework and sample output logs generated are presented in Appendix A.

    Effects of on-chip and off-chip communication due to various CMP tile configurations

are explored in Chapter 4. The need to use detailed interconnection network models to


    identify optimal energy and performance configurations is also highlighted. On-chip and

off-chip communication effects on power and performance of CMPs are explored. Effects of communication on program execution times and program execution energy are presented.

    Further, Energy-performance results for tile configurations and effects of custom L2 bank

mapping and thread mapping on power and performance of CMPs are presented.

    Design and implementation of a label switching, traffic engineering capable NoC de-

    livering guaranteed QoS for streaming traffic in media processors has been presented in

    Chapter 5. Traffic characteristics of streaming applications are also presented in the chap-

ter. Functional verification of the LS-NoC router using various test cases is presented in Appendix B. Chapter 6 illustrates the LS-NoC management framework and the flow

identification algorithm used to establish pipes. An example of the use of the flow algorithm has

    been presented in Appendix C. Streaming application test cases and various types of 

    video traffic are used to establish LS-NoC as a QoS guaranteeing framework in Chapter

7. The thesis concludes in Chapter 8 after listing possible future extensions of the proposed work.


    Chapter 2

    Related Work

    Several publications have highlighted the need for solutions to pressing problems in various

    domains in the broad area of Network-on-Chips[47][48][49][50]. This chapter introduces

    relevant works in the broad areas of QoS guaranteed Network-on-Chips, design space

    exploration of NoCs and effects of communication on energy and performance trade-offs

    in CMPs.

    2.1 Traffic Engineered NoC for Streaming Applica-

    tions

    Providing QoS guarantees in on-chip communication networks has been identified as one

    of major research problems in NoCs[48]. QoS solutions in packet switched networks use

priority based services while circuit switched NoCs use some form of resource reservation. We introduce a few well known QoS solutions from literature and compare our work with

    the state of the art. Packet switched NoCs use differentiated services for traffic classes

    [29][22][21][8] to provide latency and bandwidth guarantees. Circuit switched NoCs use

    resource reservation mechanisms to guarantee QoS[34][51][41][19]. Resource reservation

    mechanisms involve identifying a sufficiently resource rich path, reserving resources along

    the path, configuration, actual communication and path tear down. A fairly extensive

    survey of NoC proposals has been presented in [50]. Relevant QoS NoCs are discussed in


    this section.

    2.1.1 QoS in Packet Switched Networks

QoS NoC (QNoC) presented by Bolotin et al.[29] is a customized QoS NoC architecture

    based on a 2D Mesh to satisfy QoS by allocating frequently communicating nodes close-by,

    doing away with unnecessary links, tailoring link width to meet bandwidth requirements

    and balancing link utilization. Inter-module communication traffic is classified into four

    classes of service: signaling, real-time, RD/WR and block-transfer. FAUST[22] is a recon-

    figurable baseband platform based on an asynchronous NoC providing a programmable

    communication framework linking heterogeneous resources. FAUST uses 2 level priority

    based virtual circuit design in its Network Interface (NI) to provide QoS guarantees. Asyn-

    chronous NoCs[21] use clock-free interconnect to improve reliability and delay-insensitive

    arbiters to solve routing conflicts. A QoS Router with both soft (Soft GT) and hard (Hard

    GT) guarantees for globally asynchronous, locally synchronous (GALS) NoCs is presented

    in [52]. Leftover bandwidth in routers servicing Hard GT is utilized by Soft GT connec-

    tions and best effort traffic. NoCs presented in [21], [52] and [53] employ multiple priority

    levels to provide differentiated services and guarantee QoS. The MANGO [27][54] NoC

    provides hard GT by prioritizing each GT connection and adopts Asynchronous Latency

    Guarantee (ALG) scheduling to prevent starvation of packets with lower priority.

One of the major drawbacks of priority based QoS schemes is that an increase in traffic

in one priority class affects the delay of traffic belonging to other classes. A priority

network will lose the differentiated services advantage if all traffic belongs to the same

    priority level. Further, deadlock-free routing algorithms using virtual circuits with a

    priority approach may lead to degradation in NoC throughput. In cases where connections

    cannot be overlapped with each other (eg. MANGO NoC), increased number of hard GT

    connections will lead to increased cost in network resources.
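The interference drawback can be demonstrated with a toy strict-priority arbiter for a single router output port: raising only the high-priority load inflates low-priority waiting, even though the low-priority load is unchanged. The loads, cycle count and backlog metric are illustrative assumptions:

```python
# One output port, two queues, strict-priority arbitration: a low-priority
# flit is served only in cycles where the high-priority queue is empty.

import random

def low_priority_wait(high_load, cycles=20000, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    hi_q = lo_q = 0
    waits, served = 0, 0
    for _ in range(cycles):
        if rng.random() < high_load:     # high-priority arrival
            hi_q += 1
        if rng.random() < 0.10:          # fixed 10% low-priority load
            lo_q += 1
        if hi_q:                         # strict priority: high always wins
            hi_q -= 1
        elif lo_q:
            lo_q -= 1
            served += 1
        waits += lo_q                    # accumulated backlog ~ queueing delay
    return waits / max(served, 1)

# Same low-priority load, but much worse delay under heavy high-priority traffic:
assert low_priority_wait(0.8) > low_priority_wait(0.2)
```

This is exactly the coupling between classes that reservation-based schemes avoid.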

    Another class of packet switched NoCs using priority based QoS solutions are applica-

    tion specific SoCs. A tree based hierarchical packet-switched NoC for a real-time object

    recognition processor is implemented in [8]. The tree topology NoC with three crossbar


switches interconnects 12 IPs and supports both bursty (for image traffic) and non-bursty (for

control and synchronization signals) traffic. Network resources in this NoC are tailored to meet throughput and bandwidth demands of the application and hence the design is

    not a generic solution for servicing QoS in an CMP environment.

    2.1.2 QoS in Circuit Switched Networks

    Resource reservation between communicating nodes involves identification of path us-

    ing point-to-point links or a path probing service network or an intelligent, traffic aware

distributed or centralized manager. Hu et al.[15] introduce point-to-point (P2P) commu-

    nication synthesis to meet timing demands between communicating nodes using bus width

    synthesis. Circuit switched bus based QoS solutions such as Crossroad[13], dTDMA[14]

    and Heterogeneous IP Block Interconnection (HIBI)[32] rely on communication localiza-

    tion to satisfy timing demands. NEXUS[39] is a resource reservation based QoS NoC

    for globally asynchronous, locally synchronous (GALS) architectures. NEXUS uses an

    asynchronous crossbar to connect synchronous modules through asynchronous channels

    and clock-domain converters.

    P2P networks do not share communication links between multiple nodes leading to

    inefficient utilization of network resources. This increases wiring resources inside the

    chip and results in poor scalability. Crossbar based solutions using protocol handshakes

    (for example, 4-way handshakes in NEXUS[39] and ProtoNoC[17]) force communicating

    nodes to wait till handshake is complete and path is established. Non-interference of 

    communication channels is achieved by over-provisioning resources in the crossbar. This

    leads to complex and poorly scalable networks. Connecting frequently communicating

    nodes on a single bus will increase demand on the bus and lead to larger waiting times at

    the nodes. Static routing along shortest paths does not guarantee latency bound routes

    due to arbitration delays in the network.

    Amongst the NoCs that use a probe based circuit establishment solutions are Intel’s

    8×8 circuit switched NoC[41], SoCBUS[19][55] and distributed programming model in


    Æthereal[51]. In these NoCs, probe packets are used to reconnoiter shortest communica-

tion paths and configure routing tables if a path (circuit) is available. Routers are locked down and no other circuits can use the port during the lifetime of an established circuit.

    If the shortest X-Y path is not available, the probe packets initiate route discovery mech-

    anisms in other paths. The method involves some dynamic behaviour as the probe might

    repeat route discovery steps or try after a random period of time if circuit set up does

not succeed. This leads to non-deterministic and sometimes large route setup times, which

    may be unacceptable for real time application performance.

    Centralized Circuit Management

    Reserved communication channels can be identified and configured using an application

    aware hardware or software entity[51][34]. Such a traffic manager can provide programma-

    bility of routes.

    The Æthereal NoC [51] aims at providing hard guaranteed QoS using Time Divi-

    sion Multiplexing(TDM) to avoid contention in a synchronous network. The centralized

    programming model in Æthereal NoC[51] uses a root process to identify free slots and

    configure network interfaces. Time slot tables are used in routers to reserve output ports

    per input port in a particular time slot. To avoid collisions and the loss of data, con-

    secutive time slots are then reserved in routers along the circuit path. The number of 

paths established in the NoC is restricted by the scheduling constraints during time slot

    reservation. Increasing the number of time slots in TDM based NoCs increases router size.

In cases where a communication channel cannot be found due to slot exhaustion, the

    traffic division over multiple physical paths may be required[56]. Traffic division involves

    reordering packets at the target node leading to increased memory and computational

    costs.
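The consecutive-slot reservation scheme can be sketched as follows; the slot count, link names and wrap-around rule are simplified assumptions for illustration, not Æthereal's actual implementation:

```python
# Sketch of TDM slot-table reservation: a connection crossing k links needs
# slot s on link 0, slot s+1 on link 1, and so on (consecutive slots), so
# no two connections ever contend for the same link in the same slot.

SLOTS = 4

def reserve(tables, route, conn_id):
    """tables: per-link slot tables (lists of length SLOTS).
    Try every starting slot; claim consecutive slots along the route."""
    for start in range(SLOTS):
        needed = [(link, (start + hop) % SLOTS)
                  for hop, link in enumerate(route)]
        if all(tables[link][slot] is None for link, slot in needed):
            for link, slot in needed:
                tables[link][slot] = conn_id
            return start
    return None                      # slots exhausted on this route

tables = {"L0": [None] * SLOTS, "L1": [None] * SLOTS}
assert reserve(tables, ["L0", "L1"], "c0") is not None
assert reserve(tables, ["L0", "L1"], "c1") is not None
for cid in ("c2", "c3"):             # fill the remaining starting slots
    reserve(tables, ["L0", "L1"], cid)
assert reserve(tables, ["L0", "L1"], "c4") is None   # slots exhausted
```

The final failed reservation illustrates the scheduling constraint discussed above: slot exhaustion caps the number of circuits even when raw link bandwidth remains.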

    TDM techniques using slot tables in Æthereal[51] and sequencers in Adaptive System-

    on-Chip[12] require a single synchronous clock distributed over the chip. Accurate global

    synchronous clock distribution is expensive in terms of power. Global synchronicity can be

    achieved in a distributed manner using tokens such that every router synchronizes every


    slot with all of its neighbors [57]. This method will bring down the operating speed of the

NoC, as the slowest router will dictate the speed of the NoC. Further, power management techniques such as multiple clock domains are not feasible with this approach. AElite[58]

    and dAElite[59] have been proposed as improved next generation Æthereal NoCs. AElite

    inherits the guaranteed services model from Æthereal. To overcome the global

    synchronicity problem, AElite proposes the use of asynchronous and mesochronous links.

    As noted in the paper[58], using mesochronous links alone may not be sufficient if routers

    and NIs are plesiochronous[60]. One of the drawbacks of AElite was the number of slots

    occupied by the header flits. A header flit in AElite occupied one in three slots, an overhead of up to 33%. dAElite circumvents the header flit overhead by routing

    based on the time of packet injection and packet receiving. One of the disadvantages of 

    dAElite is an increase in the number of link wires, due to the configuration network and

    also because of separate wires for end-to-end credit communication.

    The Octagon NoC[34] implements a centralized best fit scheduler to configure and

    manage non-overlapping connections. The scheduler cannot establish a new connection

    through a port if it is blocked by another connection. This results in increased connection

    establishment time at the routers and also packet losses.

    2.1.3 QoS by Space Division Multiplexing

    As an alternative to TDM techniques, Spatial Division Multiplexing (SDM) techniques

    for QoS have been proposed in [23],[61] and [62]. SDM techniques involve sharing fractions

    of links between connections simultaneously based on bandwidth requirements of the

    corresponding connections. An approach comparable to a static version of SDM called

    Lane-Division-Multiplexing has been proposed in [7]. Lane-Division-Multiplexing is based

    on a reconfigurable circuit switched router composed of a crossbar and data converters.

    A disadvantage of the solution in [7] is that it does not support channel sharing and BE

    traffic. An additional network is required for configuring the switches and for carrying

    the BE traffic. Sharing a subset of wires between connections as in [63] leads to a more

    complex switch design with a large delay. SDM and TDM techniques have been combined


    in [64] allowing for increase in number of connections supported by increasing the number

    of sub-channels in the link or by increasing the number of time slots. This increases the path establishment probability in the NoC.

    In SDM based techniques, the sender serializes data on the allocated wires and the receiver

    deserializes the data before forwarding it to the IP block. One of the issues in SDM based

    circuits is the complexity of implementing the serializers and deserializers.
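    A proportional lane assignment of the kind SDM implies might look like the following sketch; the connection names, bandwidth figures and lane count are made up, and a real design must also reconcile rounding against the physical wire count.

```python
def allocate_lanes(total_lanes, requests):
    """Split a link's wires (lanes) between connections in proportion
    to their requested bandwidth; hypothetical illustration of SDM
    link sharing, with each connection getting at least one lane."""
    total_bw = sum(requests.values())
    alloc, next_lane = {}, 0
    for conn, bw in requests.items():
        n = max(1, round(total_lanes * bw / total_bw))
        alloc[conn] = list(range(next_lane, next_lane + n))
        next_lane += n
    return alloc  # each connection serializes its flits onto its lanes

print(allocate_lanes(8, {"video": 300, "audio": 100}))
# {'video': [0, 1, 2, 3, 4, 5], 'audio': [6, 7]}
```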

    2.1.4 Static routing in NoCs

    Most NoCs use traffic oblivious static routing[51] to establish communication channels

    between nodes. Dimension ordered routing[41][53][17][51][34] or routes decided at design

    time[65] are not flexible and cannot circumvent congested paths. Routing in FPGAs

    also presents a similar scenario where routes between communicating nodes are bandwidth

    and latency guaranteed, but are static. These routes occupy network resources along the

    path for the entire lifetime of the application. QoS is guaranteed in this case by over-

    provisioning resources along the route.

    2.1.5 MPLS and Label Switching in NoCs

    Use of Multi-Protocol Label Switching for QoS[38] in NoCs and advantages of identifying

    communication channels using labels have been investigated in [39],[40]. A conventional

    NoC is connected to an MPLS backbone using Label Edge Routers (LERs)[38]. The

    MPLS backbone applies traffic engineering and priority based QoS services to communication

    channels identified by labels. The work is a direct mapping of the MPLS implementation

    in the Internet to NoCs. The router and NoC design approach is not optimized for a

    hardware implementation. Results from Network Simulator-2 (NS-2) are at a functional

    level and may not reflect the exact performance achievable inside a chip.

    Use of labels to identify communication channels instead of source and destination

    identification numbers reduces the amount of metadata transmitted in the NoC. Unique

    addressing at source allows label reuse and enables efficient use of label space. The

    implementation of label based addressing in streaming applications has resulted in significant


    reduction in router area[40]. The work employs a method similar to label switching to

    achieve non-global label addressing, hence reducing label bit width. A C × N → C routing strategy is described in conjunction with the label addressing scheme. Work presented in

    [40] presents a simple data transfer scheme and does not concentrate on rendering QoS

    between communicating nodes. The route establishment process has not been explicitly

    mentioned and one can assume that standard routing algorithms will be used.
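    For illustration, a label-swap forwarding step in the spirit of MPLS-style label switching can be sketched as below; the table contents are hypothetical and this is not the exact scheme of [39] or [40].

```python
# Illustrative label-swap forwarding step. Labels are local to each
# link, so a small label space can be reused port by port.

def forward(label_table, in_port, in_label, flit):
    """Look up (input port, incoming label) to obtain the output port
    and the outgoing label; the header carries only the label, not
    full source and destination identifiers."""
    out_port, out_label = label_table[(in_port, in_label)]
    return out_port, (out_label, flit)

table = {(0, 5): (2, 9), (1, 5): (3, 1)}  # label 5 reused on two ports
print(forward(table, 0, 5, "payload"))    # (2, (9, 'payload'))
```

    Because the lookup key includes the input port, two connections can use the same label value on different links, which is what makes the label space efficient.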

    2.1.6 Label Switched NoC

    In the proposed work, we describe a Label Switched QoS guaranteeing NoC that retains

    advantages of both packet switched and circuit switched networks. Contention at output

    ports is tackled using communication pipes. Pipes are communication routes estab-

    lished along a bandwidth rich, contention free router path. Pipes are identified by a

    centralized Manager with complete network visibility.

    The NoC Manager utilizes flow identification algorithms[66][67] (Algorithm 1) to establish

    pipes. The flow identification algorithm guarantees a deterministic delay in identifying and

    configuring pipes. It takes into account the bandwidth available

    in individual links to establish QoS guaranteed pipes. This guarantees QoS serviced

    communication paths between communicating nodes. Multiple pipes can be set up in

    a single link if QoS requirements of all the pipes are satisfied. This enables sharing of 

    physical links between pipes without compromising QoS guarantees. LS-NoC provides

    throughput guarantees irrespective of spatial separation of communicating entities.
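    The bandwidth-aware path search at the heart of pipe establishment can be sketched as follows; this is an illustrative breadth-first variant, not the exact flow identification algorithm of [66][67], and the link capacities are made up.

```python
from collections import deque

def find_pipe(links, src, dst, bw_needed):
    """Breadth-first search that only traverses links with enough
    residual bandwidth, so any path found can carry the pipe.
    `links` maps (u, v) -> residual bandwidth (illustrative units)."""
    adj = {}
    for (u, v), bw in links.items():
        if bw >= bw_needed:
            adj.setdefault(u, []).append(v)
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:                      # reconstruct path back to src
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None                           # no bandwidth-feasible route

def reserve(links, path, bw):
    # Deduct the pipe's bandwidth on each link so later pipes see only
    # the residual capacity: this is how links are shared between pipes.
    for u, v in zip(path, path[1:]):
        links[(u, v)] -= bw

links = {("A","B"): 10, ("B","C"): 4, ("A","D"): 10, ("D","C"): 10}
print(find_pipe(links, "A", "C", 6))  # ['A', 'D', 'C'] -- avoids B-C
```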

    2.2 Link Microarchitecture and Tile Area Exploration

    2.2.1 NoC Design Space Exploration

    Current research in architectural level exploration of NoC in SoCs concentrates on un-

    derstanding the impacts of varying topologies, link and router parameters on the overall

    throughput, area and power consumption of the system (SoCs and Multicore chips) using

    suitable traffic models[68]. The paper illustrates a consistent comparison and evaluation methodology based on a set of quantifiable critical parameters

    for NoCs. The work suggests that evaluation of NoCs must take applications into

    account. The usual critical evaluation parameters are not exhaustive, and differ-

    ent applications may require additional parameters such as testability, dependability, and

    reliability.

    Work in [69] emphasizes the need for co-design of interconnects, processing elements and

    memory blocks to understand the effects on overall system characteristics. Results from this work show that the architecture of the interconnect interacts closely with the design and

    architecture of the cores and caches. The work studies the area-bandwidth-performance

    trade-off of on-chip interconnects. The increase in area demands of shared

    caches in CMPs is also documented. Not using detailed interconnect models during CMP

    design leads to non-optimal, larger shared L2 caches inside the chip.

    2.3 Simulation Tools

    Simulation tools have been developed to aid designers in interconnection network (ICN)

    space exploration[70][71]. Kogel et al.[70] present a modular exploration framework to

    capture performance of point-to-point, shared bus and crossbar topologies.

    2.3.1 Link Exploration Tools

    Link exploration works make a case for microarchitectural wire management in future

    processors where communication is a prominent contributor for power and performance.

    Separate wire exploration tools such as those presented in [71], [72], [73], [74] and [75]

    give an estimate of the delay of a wire for a particular wire length and

    operating frequency.

    Orion[71] is a power-performance interconnection network simulator that is capable of

    providing power and performance statistics. The Orion model estimates the power consumed by


    router elements (crossbars, FIFOs and arbiters) by calculating switching capacitances of 

    individual circuit elements. Orion contains a library of architectural level parameterized power models.
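    The switching-capacitance estimates that such models aggregate reduce, to first order, to the textbook dynamic power equation P = α·C·Vdd²·f; the sketch below uses made-up component values, not Orion's calibrated models.

```python
def dynamic_power(switching_activity, capacitance_farads, vdd, freq_hz):
    """Textbook dynamic power P = a * C * Vdd^2 * f, the kind of
    switching-capacitance estimate aggregated over crossbars, FIFOs
    and arbiters (the numbers below are made up for illustration)."""
    return switching_activity * capacitance_farads * vdd**2 * freq_hz

# e.g. one crossbar node: activity 0.3, 50 fF, 1.0 V supply, 1 GHz
p = dynamic_power(0.3, 50e-15, 1.0, 1e9)
print(f"{p * 1e6:.1f} uW")  # 15.0 uW
```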

    The more recent Orion 2.0 presented in [76] is an enhanced NoC power and area

    simulator offering improved accuracy compared to the original Orion framework. Some of 

    the additions into Orion 2.0 include flip-flop and clock dynamic and leakage power models,

    link power models, leveraging models developed in [74]. The Virtual Channel (VC) allocator

    microarchitecture uses a VC allocation model based on the microarchitecture and pipeline

    proposed in [77]. Application-specific, technology-level fine tuning of parameters using different Vth values and transistor widths increases the accuracy of power estimation.

    Work in [72] explores the use of heterogeneous interconnects optimized for delay, band-

    width or power by varying design parameters such as buffer sizes, wire width and the number

    of repeaters on the interconnects. The work presented in the paper uses Energy-Delay² (ED²)

    as the optimization parameter. An evaluation of different configurations of heterogeneous

    interconnects is made. The evaluation shows that an optimal configuration (for delay,

    bandwidth, power or power and bandwidth) of wires can reduce the total processor ED²

    value by up to 11% compared to a NoC with homogeneous interconnect in a typical

    processor.

    Courtay et al.[73] have developed a high-level delay and power estimation tool for

    link exploration that offers similar statistics as Intacte does. The tool allows changing

    architectural level parameters such as different signal coding techniques to analyze the

    effects on wire delay/power.

    Work in [74] proposes delay and power models for buffered interconnects. The mod-

    els can be constructed from sources such as Liberty[78], LEF/ITF[79], ITRS[80], and

    PTM[81]. The buffered delay models take into account effects of input and output slews

    of circuit elements in calculating intrinsic delays. The power models include leakage and

    dynamic power dissipation of gates. The area models include technology dependent co-

    efficients that can be estimated by linear regression techniques per technology node to

    estimate repeater areas.


    Intacte[82] is used for interconnect delay and power estimates. Design variables for

    Intacte’s interconnect optimization are wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd) and threshold voltage (Vth). Intacte can be used to arrive

    at the power optimal number of repeaters, their sizes and spacing for a given wire length to achieve

    a desired frequency. Intacte outputs total power dissipated including short circuit and

    leakage power values.

    A high level power estimation tool accounting for interconnect effects is presented in

    [83]. The work presents an interconnect length estimation model based on Rent’s rule[84]

    and a high level area (gate count) prediction method. Different place and route engines and cell libraries can be used with this proposed model after some minor adaptations.
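    Rent's rule itself is the empirical relation T = t·g^p between the gate count g of a block and its terminal count T; the constants in the sketch below are illustrative, not those of [83] or [84].

```python
def rent_terminals(t, g, p):
    """Rent's rule T = t * g**p: expected terminal (pin) count T for a
    block of g gates, given a technology constant t (average pins per
    gate) and the Rent exponent p. Constants here are illustrative."""
    return t * g ** p

# A hypothetical 10k-gate block with t = 4 and p = 0.6:
print(round(rent_terminals(4, 10_000, 0.6)))  # 1005 terminals
```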

    2.3.2 Router Power and Architecture Exploration Tools

    Most router exploration tools model ICN elements at a higher level of abstraction as switches,

    links and buffers and help in power/performance trade-off studies[85][86]. These are used

    to research the design of router architectures[87] and ICN topologies[34] with varying

    area/performance trade-offs for general purpose SoCs or to cater to specific applications.

    A high level power estimation methodology for NoC routers based on number of 

    traversing flits as the unit of abstraction has been proposed in [85]. The macro model of 

    the framework incurs a minor absolute cycle error compared to gate level analysis. Provid-

    ing a fast and cycle accurate power profile at an early stage of router design enables power

    optimizations such as power-aware compilers, core mapping, and scheduling techniques

    for CMPs to be incorporated into the final design. The power macro model uses state

    information of the FSM in a router assigned to reserve channels during packet forwarding

    for wormhole flow control. This enhances the accuracy of the power macro model. The

    power macro model based on regression analysis can be migrated to different technology

    libraries.

    An architectural-level power model for interconnection network routers has been pre-

    sented in [88]. The work specifically considers the Alpha 21364 and Infiniband routers


    for modelling case studies. Memory arrays, crossbars and arbiters form the basic building

    blocks of all router models using this framework. Each of these building blocks has been modelled in detail to estimate switching capacitance. Switching activity is estimated

    based on traffic models assuming certain arrival rates at the input ports. The power

    numbers for both the Alpha 21364 and Infiniband routers have been found to match the

    vendors’ estimates within a small error margin.

    The high level power model presented in [86] to estimate power consumption in semi-

    global and global interconnects considers switching power and power due to vias and repeaters.

    The high level model estimates switching power within an error of 6%, with a speedup of three to four orders of magnitude. Error in via power is under 3%. A segment

    length distribution model has been presented for cases where Rent’s rule is insufficient.

    The segment length distribution model has been validated by analyzing netlists of a set

    of complex designs.

    A wormhole router implementing a minimal adaptive routing algorithm with near

    optimal performance and feasible design complexity is proposed in [87]. The work also

    estimates the optimal size of the FIFO in an adaptive router with a fixed priority scheme. The

    optimal size of the FIFO is derived to be equal to the length of the packet in flits in this

    work.

    2.3.3 Complete NoC Exploration

    Several frameworks have been proposed for complete NoC exploration[89][90][91]. These

    frameworks can be used as tools to derive a first-cut analysis of the effect of certain NoC con-

    figurations at an early design phase. Such frameworks are the first steps for roadmapping

    the future of on-chip networks.

    A technology aware NoC topology exploration tool has been presented in [89]. The

    NoC exploration is optimized for energy consumption of the entire SoC. The work char-

    acterizes 2D Meshes and Torii along with higher dimensions, multiple hierarchies and

    express channels, for energy spent in the network. The work presents analytical models

    based on NoC parameters such as average hop count and average flit traversal energy to


    predict the most energy-efficient topology for future technologies.
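    A first-order analytical model of this kind multiplies average hop count by per-hop energy; the sketch below uses common textbook hop-count approximations for uniform random traffic and illustrative energy values, not the calibrated models of [89].

```python
def flit_energy(avg_hops, e_router_pj, e_link_pj):
    """First-order analytical model: energy per flit is the average hop
    count times the per-hop router-plus-link energy, in picojoules.
    The energy values used below are illustrative, not measured."""
    return avg_hops * (e_router_pj + e_link_pj)

k = 8                   # a k x k network
mesh_hops = 2 * k / 3   # approx. average hops for a 2D mesh,
torus_hops = k / 2      # and for a torus (wraparound shortens paths)
print(flit_energy(mesh_hops, 1.2, 0.8))   # mesh: higher per-flit energy
print(flit_energy(torus_hops, 1.2, 0.8))  # torus: fewer average hops
```

    Even this crude model captures why the energy-optimal topology shifts as router versus link energy changes across technology nodes.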

    A holistic approach to designing energy-efficient cluster interconnects has been proposed in [90]. The work uses a cycle-accurate simulator with designs of an InfiniBand

    Architecture (IBA) compliant interconnect fabric. The system is modelled as comprising

    switches, network interface cards and links. The study reveals that the links and

    switch buffers consume the major portion of the SoC power. The work proposes dynamic

    voltage scaling and dynamic link shutdown as viable methods to save power during SoC

    operation. A system-level roadmapping toolchain for interconnection networks has been

    presented in [91]. The framework is titled Polaris and iterates through available NoC designs to identify a power optimal one based on network traffic, architectures and process

    characteristics.

    Several complete NoC simulators have been developed and are in use by the NoC

    research community[92][93][94]. The Network-on-Chip Simulator, Noxim[92], was devel-

    oped at the University of Catania, Italy. Several NoC parameters such as network size,

    buffer size, packet size distribution, routing algorithm, selection strategy, packet injection

    rate, traffic time distribution, traffic pattern, hot-spot traffic distribution can be input

    to this framework. The simulator allows NoC evaluation based on throughput, flit de-

    lay and power consumption. The Nostrum NoC Simulation Environment (NNSE) [94] is

    part of the Nostrum project[65] and contains a SystemC based simulator. Inputs to this

    simulator are network size, topology, routing policy and traffic patterns. Based on these

    configuration parameters a simulator is built and executed to produce a desired set of 

    results in a variety of graphs.

    2.3.4 CMP Exploration Tools

    Wattch[95] was one of the first architectural level frameworks for analyzing and optimizing

    microprocessor power dissipation. Wattch was orders of magnitude faster than layout-

    level power tools, and its accuracy was within 10% of verified industry tools on leading-

    edge designs. Wattch was an architecture-level, parameterizable, simulator framework

    that can accurately quantify potential power consumption in microprocessors. Wattch


    framework quantifies the power consumption of all the major units of the processor, parameterizes

    them, and integrates these power estimates into a high-level simulator. Wattch models main processor units as array structures, fully associative content-addressable

    memories, combinational logic and wires or clocking elements. Individual capacitances

    of each of these elements are estimated and power is calculated. Work presented in [95]

    integrates Wattch into SimpleScalar architectural simulator[96].

    A tool like Ruby[97] allows one to simulate a complete distributed memory hierarchy

    with an on-chip network as in Orion. However, it needs to be augmented with a detailed

    interconnect model which accounts for the physical area of the tiles and their placements.

    Network Processor exploration and power estimation tools utilize models for smaller

    components and quote the integrated power for the system[98][99][100]. They use cycle-

    accurate register, cache and arbiter models introduced previously. NePSim[99] is

    an open-source integrated simulation infrastructure. Typical network processors can be

    simulated with the cy