
5/3/2011 International Symposium on Network-on-Chip 1

DART: A Programmable Architecture for NoC Simulation on FPGAs

Danyao Wang*† Natalie Enright Jerger* J. Gregory Steffan*

*Department of Electrical & Computer Engineering, University of Toronto

†Google Inc.

Why yet another NoC simulator?

• Software simulators
– Stand-alone or integrated
– Parallel NoC simulator (DARSIM)
• FPGA-based models
– Direct-map NoC emulators (Genko et al., NoCem)
– Dynamic reconfiguration (DRNoC)
– Decoupled timing and functional model (RAMPGold, ProtoFlex, A-Ports)
• Analytical models: FIST

Why yet another NoC simulator?

Requirement             Software Simulation
Accurate                Possible
Fast to run             < 10 KIPS to 100s KIPS
Easy to implement       Yes
Easy to use & modify    Yes
Available early         Yes

At 100 KIPS, 1 s of execution at 1 GHz takes 10K sec = 2.8 hrs.

The benefit of thread-based parallelization is limited by high synchronization overhead.
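The 2.8-hour figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming one simulated instruction per target clock cycle; the function name `wall_clock_seconds` is illustrative and not part of any DART tool.

```python
def wall_clock_seconds(simulated_seconds, target_clock_hz, simulator_ips):
    """Wall-clock time needed to simulate `simulated_seconds` of target execution."""
    # Assumption: the target retires one instruction per clock cycle.
    work = simulated_seconds * target_clock_hz
    return work / simulator_ips


# 1 s of execution at 1 GHz on a 100 KIPS software simulator
seconds = wall_clock_seconds(1.0, 1e9, 100e3)
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} h")
```

The same function shows why the MIPS-range speeds of FPGA-based approaches matter: raising `simulator_ips` by a factor of 100 cuts the wall-clock time by the same factor.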

Why yet another NoC simulator?

Requirement             Software Simulation       FPGA-based Emulators
Accurate                Possible                  Possible
Fast to run             < 10 KIPS to 100s KIPS    10s to 100s MIPS
Easy to implement       Yes                       No
Easy to use & modify    Yes                       No
Available early         Yes                       Yes

Hardware changes: hours of synthesis-place-route time

Orders of magnitude faster!

DART: Hybrid Approach

• Generic NoC simulation engine
• Fixed-function nodes for basic NoC building blocks
– Router, traffic generator, link
• Software-configurable parameters in each node

[Figure: a PC connected over UART to the Control FSM and DART Simulator on the FPGA; labels: "configuration, commands" and "Simulation results"]

Simulate different NoCs without changing hardware

Why yet another NoC simulator?

Requirement             Software Simulation       FPGA-based Emulators    DART
Accurate                Possible                  Possible                Yes
Fast to run             < 10 KIPS to 100s KIPS    10s to 100s MIPS        10s MIPS
Easy to implement       Yes                       No                      No
Easy to use & modify    Yes                       No                      Yes
Available early         Yes                       Yes                     Yes


DART Simulator Architecture

Generic NoC Model

[Figure: generic NoC model with Traffic Generator, Flit Queue, and Router nodes and a Global Interconnect]

• Topology
• Routing algorithm
• Flow control
• Router microarchitecture
• Simulated traffic
• Link properties


DART Architecture

Global Timer

Synchronize all network transfers to a global time counter

DART Nodes

Node                 Parameters                        Statistics counters
Traffic Generator    • Traffic pattern                 • # of injected packets
                     • Injection intervals             • # of received packets
                     • Packet size (# of flits)        • Cumulative packet latency
Flit Queue           • Latency (flit cycles)
                     • Bandwidth (flits / cycle)
Routers              • Routing table
                     • Input buffer sizes (credits)
                     • Pipeline delay (flit cycles)

More can be added easily.

• Parameters implemented using a shift register
• Configuration byte stream generated on the PC and sent to the FPGA
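The two bullets above can be illustrated with a small software model: pack parameter fields into a byte stream on the PC side, then shift the stream bit by bit into a register chain. This is only a sketch; the field widths, the MSB-first bit order, and the names `pack_fields` and `ShiftRegister` are assumptions for illustration, not the actual DART configuration format.

```python
def pack_fields(fields):
    """Pack (value, width_in_bits) pairs into bytes, most significant bit first."""
    bits = []
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        bits.extend((value >> i) & 1 for i in reversed(range(width)))
    # Pad with zeros so the stream is a whole number of bytes.
    bits.extend([0] * (-len(bits) % 8))
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[n:n + 8]))
        for n in range(0, len(bits), 8)
    )


class ShiftRegister:
    """A chain of `length` bits loaded serially, one bit per clock."""

    def __init__(self, length):
        self.bits = [0] * length

    def shift_in(self, bit):
        # The new bit enters at position 0; the oldest bit falls off the end.
        self.bits = [bit] + self.bits[:-1]

    def load(self, stream):
        for byte in stream:
            for i in reversed(range(8)):
                self.shift_in((byte >> i) & 1)


# Hypothetical traffic generator parameters:
# injection interval = 4 (8 bits), packet size = 2 flits (4 bits).
stream = pack_fields([(4, 8), (2, 4)])
reg = ShiftRegister(16)
reg.load(stream)
```

A serial chain like this keeps the configuration wiring small, at the cost of a load time proportional to the total number of parameter bits.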


Simulating a NoC

1. Map simulated NoC to DART nodes

2. Program the routing tables to implement the simulated topology

3. Record timing of flit transfers
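Step 2 can be sketched for a mesh with XY routing, the combination used in the deck's correctness experiment. The port numbering, the row-major node numbering, and the function name `xy_routing_table` are assumptions made for this example; the real DART routing-table encoding is not given in the slides.

```python
LOCAL, EAST, WEST, NORTH, SOUTH = range(5)  # hypothetical port numbering


def xy_routing_table(node, width, height):
    """Return table[dest] -> output port for `node` in a width x height mesh."""
    # Assumption: nodes are numbered row-major, rows counted top to bottom.
    x, y = node % width, node // width
    table = []
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx > x:
            port = EAST      # correct the X coordinate first
        elif dx < x:
            port = WEST
        elif dy > y:
            port = SOUTH     # then correct the Y coordinate
        elif dy < y:
            port = NORTH
        else:
            port = LOCAL     # destination reached: eject to the local node
        table.append(port)
    return table


# Routing table of the centre node (node 4) of a 3 x 3 mesh
centre = xy_routing_table(4, 3, 3)
```

Because the topology lives entirely in these tables, a different network can be simulated by generating new tables rather than new hardware.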

Example Walk-Through

[Figure, animated over 14 slides: eight nodes numbered 0 to 7 (labelled parts: Router, Traffic Generator, Flit Queues), the Global Interconnect, and the Global Timer. A sequence counting from 0 to 7 is revealed one value per slide; on the last slide the Global Timer shows 0 through 6.]

Counters shown on the last slide:
• # injected: 1
• # received: 1
• Σ latency = 6
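The counters at the end of the walk-through can be modelled in a few lines. This is a sketch under assumptions: the class name `TrafficGenStats` and the injection and arrival times (0 and 6) are chosen only to reproduce the Σ latency = 6 shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class TrafficGenStats:
    """Per-node statistics counters, as listed on the DART Nodes slide."""
    injected: int = 0
    received: int = 0
    cumulative_latency: int = 0

    def on_inject(self):
        self.injected += 1

    def on_receive(self, inject_time, arrival_time):
        self.received += 1
        # Latency is measured against the global time counter.
        self.cumulative_latency += arrival_time - inject_time

    def average_latency(self):
        return self.cumulative_latency / self.received if self.received else 0.0


source, sink = TrafficGenStats(), TrafficGenStats()
source.on_inject()                              # packet enters at time 0 (assumed)
sink.on_receive(inject_time=0, arrival_time=6)  # packet arrives at time 6 (assumed)
```

Keeping only a running sum and a count is enough to report average packet latency without storing per-packet records.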

DART Router

• Virtualizes the ports: replaces the crossbar with a MUX
– No large switch allocators and crossbars
– Routes 1 flit per DART cycle
– N cycles for N ports
• Input ports selected based on timestamp

[Figure: two routers, each with Input ports 0 to 4; one is drawn with a Routing Table and an Arbiter, the other with Routing Logic and an Allocator]

Multiplexing in time saves area
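A minimal software model of the time-multiplexed router described above: each DART cycle it picks the input port whose head flit carries the oldest timestamp, looks up the output port, and forwards that single flit. The class name `OnePortRouter`, the tie-break by lowest port index, and the dictionary routing table are assumptions for illustration, not the hardware design.

```python
from collections import deque


class OnePortRouter:
    def __init__(self, num_ports, routing_table):
        self.inputs = [deque() for _ in range(num_ports)]
        self.routing_table = routing_table  # destination -> output port

    def accept(self, port, timestamp, dest):
        """Queue a flit descriptor (timestamp, destination) on an input port."""
        self.inputs[port].append((timestamp, dest))

    def step(self):
        """One DART cycle: route at most one flit, oldest timestamp first."""
        ready = [(q[0][0], p) for p, q in enumerate(self.inputs) if q]
        if not ready:
            return None
        _, port = min(ready)  # assumed tie-break: lowest port index wins
        timestamp, dest = self.inputs[port].popleft()
        return timestamp, port, self.routing_table[dest]


router = OnePortRouter(5, {7: 1, 8: 4})  # hypothetical destinations and ports
router.accept(0, 3, 7)
router.accept(2, 1, 8)
router.accept(4, 1, 7)
order = [router.step() for _ in range(4)]
```

With one flit waiting on each of N ports, draining them takes N calls to `step`, which matches the "N cycles for N ports" cost noted on the slide.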

DART Summary

• Configurable functional model of a NoC
– Easy to modify and reuse
– Fast by exploiting fine-grained parallelism
• Decouple simulated cycles from FPGA cycles
– Trade simulation speed for area and programmability
• Software-configurable parameters
– Familiar simulation flow and fast turn-around time

Evaluation & Results

• Overhead
• Architecture Scalability
• Implementation & Performance

Methodology

• C++ cycle-accurate architecture simulator
– Explore various DART architectures
– Evaluate performance trade-offs
• 9-node implementation on a Virtex-II Pro FPGA
• Baseline: Booksim 2.0
– Cycle-based software simulator (C++)
• Metrics
– Overhead: DART cycles/simulated cycle (CPS)
– Performance: thousands of simulated cycles per second
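The two metrics are related through the FPGA clock: simulated cycles per second equal the clock frequency divided by CPS. A small sketch, using the 50 MHz clock quoted elsewhere in the deck and a CPS value of 10 that is purely hypothetical:

```python
def simulated_kcycles_per_second(fpga_clock_hz, dart_cycles_per_sim_cycle):
    """Performance in thousands of simulated cycles per second."""
    return fpga_clock_hz / dart_cycles_per_sim_cycle / 1e3


# 50 MHz DART clock; CPS = 10 is an assumed value for illustration only.
perf = simulated_kcycles_per_second(50e6, 10.0)
print(f"{perf:.0f} K simulated cycles per second")
```

Halving the overhead (CPS) doubles the simulated cycle rate, which is why the overhead experiments focus on CPS.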


Programmability Overhead

• Measure performance overhead of global interconnect and simplified Router model
• Four combinations of two options
– Interconnect: dedicated vs. global
– Router: 5-port vs. 1-port
• Baseline: dedicated + 5-port
• Benchmarks: 9-node mesh and 64-node mesh

Overhead: 9-node DART

[Figure: overhead of the simulated 9-node DART for four configurations; an arrow marks the direction of lower overhead]
• Dedicated links + true 5-ported router
• Dedicated links + 1-ported router
• Global interconnect + 5-ported router
• Global interconnect + 1-ported router

Observations:
• Overhead (2-3x) due to global interconnect
• Overhead (2-6x) due to 1-port router
• Router overhead dominates

Overhead: 64-node DART

[Figure: overhead of the simulated 64-node DART for four configurations; an arrow marks the direction of lower overhead]
• Dedicated links + true 5-ported router
• Dedicated links + 1-ported router
• Global interconnect + 5-ported router
• Global interconnect + 1-ported router

Observations:
• Global interconnect is the bottleneck
• Simulated NoC saturates

Scalability

• Compare DART’s performance scaling to Booksim beyond 9 nodes
– 64-node DART with 8-partition global interconnect
• Benchmarks: mesh sizes from 9 to 64
• DART performance extrapolated from architecture simulator assuming 50 MHz clock

Scalability: Mesh Benchmarks

[Figure: Booksim vs. 64-node DART on the mesh benchmarks; an arrow marks the faster direction]

• DART simulation speed depends on network load only
• Higher speedups over Booksim for large NoCs

An Implementation of DART

• 9 nodes (max. that fit)
• 8-partition interconnect
• 50 MHz
• XUPV2P Development Board, Virtex-II Pro XC2VP30

Component          Utilization (LUTs)
Router (x9)        612
TrafficGen (x9)    691
FlitQueue (x36)    305
Interconnect       2,144
Control FSM        152
Total              26,385 (96%)

Real Speed-up vs. Booksim

[Figure: Booksim and DART simulation speed with the resulting speedup; an arrow marks the faster direction]

• 70x to 160x speedup
• Slower with more traffic
• Large NoC simulations can become more interactive

Future Work

• Virtualize DART nodes using multithreading
– Further trade performance for area
• Off-chip traffic generation
– Integrate with full-system evaluation framework
• Better coverage of the router design space
– Adaptive routing, speculative routing, etc.
– Investigate specialized soft processors


Summary

• Software configurable FPGA-based NoC simulator is feasible

– Area overhead vs. existing emulators is negligible

• Over 100x speedup over software NoC simulator (Booksim)

• Hardware and software tools available at http://www.eecg.toronto.edu/DART


Q & A

Thank you!

Backup Slides

• Classic Router Microarchitecture
• Global Interconnect
• DART Software Flow
• Correctness Analysis
• Interconnect Performance vs. Resource Utilization
• DART vs. Booksim Speedup


Classic Router Microarchitecture


Global Interconnect

DART Software

• DARTgen
– Placement of simulated nodes in DART partitions
– Evenly distribute nodes across partitions to balance load
– Generate configuration bytes
• DARTportal
– Communicates with the DART simulator on FPGA through serial port
– Interactive

[Figure: UART, Control FSM, and DART Simulator on the FPGA]
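One simple way to distribute nodes evenly across partitions is round-robin assignment. The slides do not say which policy DARTgen actually uses, so this is only a sketch of the stated goal (balanced node counts), and `place_nodes` is a hypothetical name.

```python
def place_nodes(num_nodes, num_partitions):
    """Assign simulated nodes to interconnect partitions round-robin."""
    partitions = [[] for _ in range(num_partitions)]
    for node in range(num_nodes):
        partitions[node % num_partitions].append(node)
    return partitions


# The deck's FPGA build: 9 nodes on an 8-partition interconnect.
small = place_nodes(9, 8)
# The simulated 64-node configuration with 8 partitions.
large = place_nodes(64, 8)
```

Round-robin balances node counts only; a placement that also accounts for which nodes exchange the most traffic could balance load better, but would need traffic information that this sketch does not model.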

Correctness (1/2)

Topology               3 x 3 mesh
Router architecture    Input queued
Routing algorithm      XY
# of VCs per port      2
VC allocation          Round-robin
Traffic pattern        Random permutation
Packet size            2 flits

• booksim: 5-cycle routing delay
• booksim2: 4-cycle routing delay + 1-cycle switch allocation delay

Correctness (2/2)

[Figure: DART vs. Booksim comparison, one panel per hop count: 0-hop packets, 1 hop, 2 hops, 3 hops, 4 hops]

Booksim has longer tail

Interconnect Scalability (1/2)

[Figure: two panels, flit injection rate = 0.1 and flit injection rate = 0.5]

Interconnect Scalability (2/2)

DART vs. Booksim Speedup

Better speedup for larger NoCs

Related Work (1/2)

• FPGA-based processor simulation
– RAMPGold: Tan et al., DAC 2010.
– ProtoFlex: Chung et al., IPDPS 2007.
– A-Ports: Pellauer et al., FPGA 2008.
• Direct NoC emulation
– Genko et al., DATE 2005.
– NoCem: Schelle and Grunwald, WARFP 2006.

Related Work (2/2)

• DRNoC: exploits dynamic reconfiguration of Xilinx FPGAs
– Krasteva et al., Reconfig. 2008.
• Virtualized simulation
– Wolkotte et al., NoCS 2007.
• DARSIM: parallel software NoC simulator
– Lis et al., MoBS 2010.

Software Simulators

• Modular design (typically in an OO language)
• Stand-alone or integrated
• Pros:
– Easy to implement new models
– Fast to develop and debug
– As detailed and accurate as desired
• Cons:
– Simulating large NoCs in detail can be slow (<10 KIPS to 100s KIPS)
– Parallelizing using threads is non-trivial (high synchronization overhead)

At 100 KIPS, 1 s of execution at 1 GHz takes 10K sec = 2.8 hrs.

FPGA-based Models

• FPGAs have become big enough
• Map entire NoC to FPGA
• Pros:
– Faster than software simulation (10s to 100s MIPS)
– Lots of parallelism
– Low-overhead synchronization
• Cons:
– Emulators can’t be reused to evaluate different NoCs
– Redesign is difficult and time-consuming
– Max simulatable NoC size limited by FPGA size

DART: Configurable Simulator on FPGA

• Emulators can’t be reused to evaluate different NoCs
– A generic NoC simulation model that is decoupled from the architecture of a specific NoC
• Redesign is difficult and time-consuming
– Software configurable, no hardware redesign needed
• Max simulatable NoC size limited by FPGA size
– Optimize simulator architecture for area by trading off some speed

Fixed framework, configurable settings, still fast!

Architecture Evaluation Methods

Requirement        Software Simulation       FPGA Prototypes    FPGA-based Emulators    DART
Accurate           Possible                  Very               Possible                Yes
Fast to run        < 10 KIPS to 100s KIPS    100s MIPS          10s to 100s MIPS        10s MIPS
Easy to build      Yes                       No                 No                      No
Easy to modify     Yes                       No                 No                      Yes
Available early    Yes                       No                 Yes                     Yes

KIPS: Thousands of Instructions per Second
MIPS: Millions of Instructions per Second

DART Simulator Model (cont’d)

• Descriptors without data payload
– Flits: 36 bits
– Credits: 12 bits
• 10-bit timestamp
– Correctly captures latency up to 1024 cycles
• Scale up to 256 nodes, 8 ports/node, 4 VCs/port
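The link between a 10-bit timestamp and the latency limit can be shown with modular arithmetic: as long as the true latency is smaller than 2^10, subtracting timestamps modulo 1024 recovers it even when the global counter wraps. This sketch assumes a wrap-around counter; the names `stamp` and `latency` are illustrative.

```python
TIMESTAMP_BITS = 10
MODULUS = 1 << TIMESTAMP_BITS  # 1024


def stamp(global_time):
    """Truncate the global time to the 10-bit field carried in a descriptor."""
    return global_time % MODULUS


def latency(inject_stamp, arrival_time):
    """Recover the latency from a truncated stamp; valid while it is below 1024."""
    return (stamp(arrival_time) - inject_stamp) % MODULUS


# The counter wraps between injection (1020) and arrival (1030).
wrapped = latency(stamp(1020), 1030)
# A latency of exactly 1024 cycles aliases to 0 and can no longer be told apart.
aliased = latency(stamp(0), 1024)
```

Under this model, latencies that reach or exceed 1024 cycles alias to smaller values, so the timestamp width bounds the latency the simulator can report correctly.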


NoC Basics

• Topology

• Routing algorithm

• Flow Control

• Router microarchitecture

Motivation

• Multi-core is here to stay
• Communication is performance bottleneck
• Network-on-Chip (NoC) advantages
– Higher bandwidth
– More efficient sharing of on-chip resources
– Easier to build, verify, fabricate
• Need high quality evaluation tools

[Figures: Intel SCC (48 cores & mesh NoC); Cell Processor (8 SPEs & ring NoC)]


The Ideal Simulator

• Accurate

• Fast

• Easy to implement, use and modify

• Available early in the design process

Existing tools don’t offer all four properties