5/3/2011 International Symposium on Network-on-Chip 1
DART: A Programmable Architecture for NoC Simulation on FPGAs
Danyao Wang*† Natalie Enright Jerger* J. Gregory Steffan*
*Department of Electrical & Computer Engineering, University of Toronto
†Google Inc.
Why yet another NoC simulator?
• Software simulators
– Stand-alone or integrated
– Parallel NoC simulator (DARSIM)
• FPGA-based models
– Direct map NoC emulators (Genko et al., NoCem)
– Dynamic reconfiguration (DRNoC)
– Decoupled timing and functional model (RAMPGold, ProtoFlex, A-Ports)
• Analytical models: FIST
Why yet another NoC simulator?
Requirement Software Simulation
Accurate Possible
Fast to run < 10 KIPS to 100s KIPS
Easy to implement Yes
Easy to use & modify Yes
Available early Yes
@ 100 KIPS: 1 s of execution @ 1 GHz = 10K sec = 2.8 hrs
The benefit of thread-based parallelization is limited by high synchronization overhead
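The 2.8-hour figure can be checked with a short sketch. It assumes one simulated instruction per target clock cycle, so 1 s at 1 GHz corresponds to 10^9 instructions; that assumption and the function name are illustrative and not taken from the slides.

```python
def simulation_wall_time_s(target_seconds, target_hz, sim_ips):
    """Wall-clock seconds needed to simulate `target_seconds` of execution.

    Assumes one instruction per target clock cycle (an assumption made
    for this sketch, not stated on the slide).
    """
    instructions = target_seconds * target_hz
    return instructions / sim_ips


# 1 s of execution at 1 GHz, simulated at 100 KIPS.
seconds = simulation_wall_time_s(1, 1e9, 100e3)
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} hrs")
```

At 10 KIPS, the low end of the quoted range, the same second of execution would take ten times longer.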
Why yet another NoC simulator?
Requirement           Software Simulation      FPGA-based Emulators
Accurate              Possible                 Possible
Fast to run           < 10 KIPS to 100s KIPS   10s to 100s MIPS
Easy to implement     Yes                      No
Easy to use & modify  Yes                      No
Available early       Yes                      Yes
Orders of magnitude faster!
Hardware changes: hours of synthesis-place-route time
DART: Hybrid Approach
• Generic NoC simulation engine
• Fixed-function nodes for basic NoC building blocks
– Router, traffic generator, link
• Software-configurable parameters in each node
[Diagram: the PC sends configuration and commands through a UART to the Control FSM and DART Simulator on the FPGA; simulation results are sent back to the PC]
Simulate different NoCs without changing hardware
Why yet another NoC simulator?
Requirement           Software Simulation      FPGA-based Emulators  DART
Accurate              Possible                 Possible              Yes
Fast to run           < 10 KIPS to 100s KIPS   10s to 100s MIPS      10s MIPS
Easy to implement     Yes                      No                    No
Easy to use & modify  Yes                      No                    Yes
Available early       Yes                      Yes                   Yes
Generic NoC Model
• Topology
• Routing algorithm
• Flow control
• Router microarchitecture
• Simulated traffic
• Link properties
[Diagram labels: Traffic Generator, Flit Queue, Router, Global interconnect]
DART Architecture
Global Timer
Synchronize all network transfers to a global time counter
DART Nodes
• Traffic Generator
– Parameters: traffic pattern, injection intervals, packet size (# of flits)
– Statistics counters: # of injected packets, # of received packets, cumulative packet latency
• Flit Queue
– Parameters: latency (flit cycles), bandwidth (flits / cycle)
• Router
– Parameters: routing table, input buffer sizes (credits), pipeline delay (flit cycles)
More can be added easily
• Parameters implemented using a shift register
• Configuration byte stream generated on the PC and sent to the FPGA
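A minimal sketch of how a configuration byte stream could be generated on the PC and shifted into a parameter register. The field widths, field order, and the names `pack_config` and `ShiftRegister` are hypothetical; the slides do not describe DART's actual configuration layout.

```python
def pack_config(fields):
    """Pack (value, width_in_bits) pairs MSB-first into a byte string."""
    bits, total = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        bits = (bits << width) | value
        total += width
    pad = (-total) % 8  # zero-pad the tail to a whole byte
    bits <<= pad
    return bits.to_bytes((total + pad) // 8, "big")


class ShiftRegister:
    """Models a parameter register that is loaded one byte at a time."""

    def __init__(self, width_bits):
        self.width_bits = width_bits
        self.value = 0

    def shift_in(self, byte):
        mask = (1 << self.width_bits) - 1
        self.value = ((self.value << 8) | byte) & mask


# Hypothetical parameters with made-up widths.
stream = pack_config([
    (3, 4),  # flit queue latency, in flit cycles
    (1, 4),  # flit queue bandwidth, in flits / cycle
    (5, 8),  # router pipeline delay, in flit cycles
])
reg = ShiftRegister(16)
for b in stream:
    reg.shift_in(b)
print(stream.hex(), hex(reg.value))
```

Because the parameters live in a register rather than in the synthesized logic, changing them only requires sending a new byte stream.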
Simulating a NoC
1. Map simulated NoC to DART nodes
2. Program the routing tables to implement the simulated topology
3. Record timing of flit transfers
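Step 2 can be illustrated with a sketch that fills per-node routing tables for a mesh using XY routing. The row-major node numbering, the port names (E, W, N, S, L), and both function names are assumptions made for illustration; they are not DART's actual table format.

```python
def xy_routing_table(node, width, height):
    """Return a list mapping each destination node to an output port.

    XY routing: correct the X coordinate first, then the Y coordinate.
    Nodes are numbered row-major; "L" means the local (ejection) port.
    """
    x, y = node % width, node // width
    table = []
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx > x:
            table.append("E")
        elif dx < x:
            table.append("W")
        elif dy > y:
            table.append("S")
        elif dy < y:
            table.append("N")
        else:
            table.append("L")
    return table


# Offset in (x, y) for each non-local output port.
STEP = {"E": (1, 0), "W": (-1, 0), "S": (0, 1), "N": (0, -1)}


def route(src, dest, width, height):
    """Follow the programmed tables hop by hop and return the node path."""
    tables = [xy_routing_table(n, width, height)
              for n in range(width * height)]
    path, node = [src], src
    while node != dest:
        dx, dy = STEP[tables[node][dest]]
        node += dx + dy * width
        path.append(node)
    return path


# 8 nodes arranged as 4 columns by 2 rows.
print(xy_routing_table(0, 4, 2))
print(route(0, 7, 4, 2))
```

A different topology or routing algorithm would change only the table contents, which is what lets the same hardware simulate different NoCs.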
Example Walk-Through
[Animation: nodes 0 to 7, drawn in two rows of four, each made of a router, a traffic generator, and flit queues, connected to the Global Interconnect and the Global Timer; the global timer steps from 0 to 7 as a packet crosses the network]
Statistics counters at the end of the walk-through:
• # injected: 1
• # received: 1
• Σlatency = 6
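The counters at the end of the walk-through (# injected, # received, Σlatency) can be modeled with a small sketch. The class and method names are hypothetical, and the sketch assumes latency is the global-timer value at reception minus the value at injection.

```python
from dataclasses import dataclass


@dataclass
class TrafficStats:
    """Per-traffic-generator statistics counters (illustrative names)."""
    injected: int = 0
    received: int = 0
    total_latency: int = 0

    def on_inject(self):
        self.injected += 1

    def on_receive(self, inject_time, now):
        # Latency is measured against the shared global timer.
        self.received += 1
        self.total_latency += now - inject_time

    def average_latency(self):
        return self.total_latency / self.received if self.received else 0.0


source, sink = TrafficStats(), TrafficStats()
source.on_inject()                     # packet injected at global time 0
sink.on_receive(inject_time=0, now=6)  # packet ejected at global time 6
print(source.injected, sink.received, sink.total_latency)
```

Keeping only cumulative counters keeps the per-node hardware small; averages can be computed on the PC after the counters are read back.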
DART Router
• Virtualizes the ports: replaces the crossbar with a MUX
– No large switch allocators and crossbars
– Routes 1 flit per DART cycle
– N cycles for N ports
• Input ports selected based on timestamp
[Diagrams: one router with input ports 0 to 4, a routing table, and an arbiter; one router with input ports 0 to 4, routing logic, and an allocator]
Multiplexing in time saves area
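A sketch of the time-multiplexed router idea: each call routes at most one flit, chosen among the heads of the input queues by timestamp. Choosing the oldest timestamp first and breaking ties by lowest port index are assumptions; the slide only states that input ports are selected based on timestamp. The flit fields and function name are illustrative.

```python
from collections import deque


def route_one_flit(input_ports, routing_table):
    """Route a single flit per call (one flit per DART cycle).

    Picks the head flit with the smallest timestamp; ties go to the
    lowest port index. Returns (input_port, output_port, flit) or None.
    """
    heads = [(q[0]["timestamp"], port)
             for port, q in enumerate(input_ports) if q]
    if not heads:
        return None
    _, port = min(heads)
    flit = input_ports[port].popleft()
    return port, routing_table[flit["dest"]], flit


ports = [deque() for _ in range(5)]
ports[0].append({"timestamp": 4, "dest": 2})
ports[3].append({"timestamp": 1, "dest": 0})
ports[3].append({"timestamp": 5, "dest": 2})
table = {0: "out0", 2: "out2"}  # destination -> output port

order = []
while True:
    result = route_one_flit(ports, table)
    if result is None:
        break
    order.append((result[0], result[1]))
print(order)
```

Serving one port per cycle means a single multiplexer replaces the crossbar, at the cost of several DART cycles per simulated cycle.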
DART Summary
• Configurable functional model of an NoC
– Easy to modify and reuse
– Fast by exploiting fine-grained parallelism
• Decouple simulated cycles from FPGA cycles
– Trade simulation speed for area and programmability
• Software-configurable parameters
– Familiar simulation flow and fast turn-around time
Evaluation & Results
Overhead Architecture Scalability
Implementation & Performance
Methodology
• C++ cycle-accurate architecture simulator
– Explore various DART architectures
– Evaluate performance trade-offs
• 9-node implementation on a Virtex-II Pro FPGA
• Baseline: Booksim 2.0
– Cycle-based software simulator (C++)
• Metrics
– Overhead: DART cycles per simulated cycle (CPS)
– Performance: thousands of simulated cycles per second
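The two metrics are related through the FPGA clock. The sketch below assumes one DART cycle per FPGA clock cycle and uses the 50 MHz clock quoted for the implementation; the CPS value of 10 is a made-up example, not a measured result.

```python
def simulated_kcycles_per_second(fpga_clock_hz, cps):
    """Thousands of simulated cycles per second.

    cps is the overhead metric: DART cycles per simulated cycle.
    Assumes one DART cycle per FPGA clock cycle.
    """
    if cps <= 0:
        raise ValueError("cps must be positive")
    return fpga_clock_hz / cps / 1e3


# Hypothetical overhead of 10 DART cycles per simulated cycle at 50 MHz.
speed = simulated_kcycles_per_second(50e6, 10)
print(speed)
```

Halving the overhead doubles the simulated cycles per second at a fixed clock, which is why the overhead is evaluated first.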
Programmability Overhead
• Measure performance overhead of the global interconnect and the simplified router model
• Four combinations of two options
– Interconnect: dedicated vs. global
– Router: 5-port vs. 1-port
• Baseline: dedicated + 5-port
• Benchmarks: 9-node mesh and 64-node mesh
Overhead: 9-node DART
[Figure: overhead of a simulated 9-node DART (lower is better) for dedicated links + true 5-ported router, dedicated links + 1-ported router, global interconnect + 5-ported router, and global interconnect + 1-ported router]
• Overhead (2-3x) due to global interconnect
• Overhead (2-6x) due to 1-port router
Router overhead dominates
Overhead: 64-node DART
[Figure: overhead of a simulated 64-node DART (lower is better) for dedicated links + true 5-ported router, dedicated links + 1-ported router, global interconnect + 5-ported router, and global interconnect + 1-ported router]
• Global interconnect is the bottleneck
• Simulated NoC saturates
Scalability
• Compare DART’s performance scaling to Booksim beyond 9 nodes
– 64-node DART with 8-partition global interconnect
• Benchmarks: mesh sizes from 9 to 64
• DART performance extrapolated from architecture simulator assuming 50 MHz clock
Scalability: Mesh Benchmarks
[Figure: Booksim vs. 64-node DART on the mesh benchmarks, with an arrow marking the faster direction]
• DART simulation speed depends on network load only
• Higher speedups over Booksim for large NoCs
An Implementation of DART
• 9 nodes (max. that fit)
• 8-partition interconnect
• 50 MHz
• XUPV2P Development Board, Virtex-II Pro XC2VP30
Component Utilization (LUTs)
Router (x9) 612
TrafficGen (x9) 691
FlitQueue (x36) 305
Interconnect 2,144
Control FSM 152
Total 26,385 (96%)
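A quick arithmetic check of the utilization table. It assumes the Router, TrafficGen, and FlitQueue rows give LUTs per instance, which the table does not state explicitly. Under that reading the listed components account for most of the total, and the remainder is not itemized on the slide.

```python
# (instance count, LUTs per instance); per-instance reading is an assumption.
components = {
    "Router": (9, 612),
    "TrafficGen": (9, 691),
    "FlitQueue": (36, 305),
    "Interconnect": (1, 2144),
    "Control FSM": (1, 152),
}
total_reported = 26385

listed = sum(count * luts for count, luts in components.values())
unlisted = total_reported - listed
print(listed, unlisted)
```

Under the same reading, the 36 flit queues together use more LUTs than any other component group.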
Real Speed-up vs. Booksim
[Figure: Booksim and DART simulation speed and the resulting speedup, with an arrow marking the faster direction]
• 70x to 160x speedup
• Slower with more traffic
Large NoC simulations can become more interactive
Future Work
• Virtualize DART nodes using multithreading
– Further trade performance for area
• Off-chip traffic generation
– Integrate with full-system evaluation framework
• Better coverage of the router design space
– Adaptive routing, speculative routing, etc.
– Investigate specialized soft processors
Summary
• Software configurable FPGA-based NoC simulator is feasible
– Area overhead vs. existing emulators is negligible
• Over 100x speedup over software NoC simulator (Booksim)
• Hardware and software tools available at http://www.eecg.toronto.edu/DART
Backup Slides
• Classic Router Microarchitecture
• Global Interconnect
• DART Software Flow
• Correctness Analysis
• Interconnect Performance vs. Resource Utilization
• DART vs. Booksim Speedup
DART Software
• DARTgen
– Placement of simulated nodes in DART partitions
– Evenly distribute nodes across partitions to balance load
– Generate configuration bytes
• DARTportal
– Communicates with the DART simulator on FPGA through serial port
– Interactive
[Diagram: FPGA containing the UART, Control FSM, and DART Simulator]
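One simple way to distribute nodes evenly across partitions is round-robin placement, sketched below. This is a stand-in for illustration; the slides do not describe the placement algorithm DARTgen actually uses, and the function name is hypothetical.

```python
def place_nodes(num_nodes, num_partitions):
    """Assign simulated nodes to partitions round-robin.

    Partition sizes differ by at most one node.
    """
    if num_partitions <= 0:
        raise ValueError("need at least one partition")
    partitions = [[] for _ in range(num_partitions)]
    for node in range(num_nodes):
        partitions[node % num_partitions].append(node)
    return partitions


# 9 simulated nodes on an 8-partition interconnect.
placement = place_nodes(9, 8)
print(placement)
```

A load-aware placer could instead weigh nodes by expected traffic, since partitions are balanced to spread load rather than node count alone.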
Correctness (1/2)
• booksim: 5-cycle routing delay
• booksim2: 4-cycle routing delay + 1-cycle switch allocation delay
Topology 3 x 3 mesh
Router architecture Input queued
Routing algorithm XY
# of VCs per port 2
VC Allocation Round-robin
Traffic pattern Random permutation
Packet size 2 flits
Correctness (2/2)
[Figure panels: 0-hop packets, 1 hop, 2 hops, 3 hops, 4 hops]
Booksim has longer tail
Interconnect Scalability (1/2)
[Figure panels: flit injection rate = 0.1; flit injection rate = 0.5]
DART vs. Booksim Speedup
Better speedup for larger NoCs
Related Work (1/2)
• FPGA-based processor simulation
– RAMPGold – Tan et al. DAC 2010.
– ProtoFlex – Chung et al. IPDPS 2007.
– A-Ports – Pellauer et al. FPGA 2008.
• Direct NoC emulation
– Genko et al. DATE 2005.
– NoCem – Schelle and Grunwald. WARFP 2006.
Related Work (2/2)
• DRNoC: exploit dynamic reconfiguration of Xilinx FPGAs – Krasteva et al. Reconfig. 2008.
• Virtualized simulation – Wolkotte et al. NoCS 2007.
• DARSIM: parallel software NoC simulator – Lis et al. MoBS 2010.
Software Simulators
• Modular design (typically in an OO language)
• Stand-alone or integrated
• Pros:
– Easy to implement new models
– Fast to develop and debug
– As detailed and accurate as desired
• Cons:
– Simulating large NoCs in detail can be slow (<10 KIPS to 100s KIPS)
– Parallelizing using threads is non-trivial (high synchronization overhead)
@ 100 KIPS: 1 s of execution @ 1 GHz = 10K sec = 2.8 hrs
FPGA-based Models
• FPGAs have become big enough
• Map entire NoC to FPGA
• Pros:
– Faster than software simulation (10s to 100s MIPS)
– Lots of parallelism
– Low-overhead synchronization
• Cons:
– Emulators can’t be reused to evaluate different NoCs
– Redesign is difficult and time-consuming
– Max simulatable NoC size limited by FPGA size
DART: Configurable Simulator on FPGA
• Emulators can’t be reused to evaluate different NoCs
– A generic NoC simulation model that is decoupled from the architecture of a specific NoC
• Redesign is difficult and time-consuming
– Software configurable, no hardware redesign needed
• Max simulatable NoC size limited by FPGA size
– Optimize simulator architecture for area by trading off some speed
Fixed framework, configurable settings, still fast!
Architecture Evaluation Methods
Requirement      Software Simulation      FPGA Prototypes  FPGA-based Emulators  DART
Accurate         Possible                 Very             Possible              Yes
Fast to run      < 10 KIPS to 100s KIPS   100s MIPS        10s to 100s MIPS      10s MIPS
Easy to build    Yes                      No               No                    No
Easy to modify   Yes                      No               No                    Yes
Available early  Yes                      No               Yes                   Yes
KIPS: Thousands of Instructions per Second
MIPS: Millions of Instructions per Second
DART Simulator Model (cont’d)
• Descriptors without data payload
– Flits: 36 bits
– Credits: 12 bits
• 10-bit timestamp
– Correctly captures latency up to 1024 cycles
• Scale up to 256 nodes, 8 ports/node, 4 VCs/port
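A sketch of why a 10-bit timestamp bounds the measurable latency. It assumes latency is recovered by modular subtraction of the send and receive timestamps, which is a common approach but is not confirmed by the slides; the names are illustrative.

```python
TS_BITS = 10
TS_MOD = 1 << TS_BITS  # 1024 distinct timestamp values


def latency(send_ts, recv_ts):
    """Latency from two wrapped 10-bit timestamps.

    Modular subtraction is exact only while the true latency is below
    TS_MOD; longer latencies alias to smaller values.
    """
    return (recv_ts - send_ts) % TS_MOD


print(latency(0, 6))             # no wraparound
print(latency(1000, 5))          # timer wrapped between send and receive
print(latency(0, 1030 % TS_MOD)) # a true latency of 1030 aliases
```

The third call shows the limit: a true latency of 1030 cycles is indistinguishable from 6 once the timer has wrapped.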
NoC Basics
• Topology
• Routing algorithm
• Flow Control
• Router microarchitecture
Motivation
• Multi-core is here to stay
• Communication is the performance bottleneck
• Network-on-Chip (NoC) advantages
– Higher bandwidth
– More efficient sharing of on-chip resources
– Easier to build, verify, fabricate
• Need high-quality evaluation tools
[Images: Intel SCC, 48 cores & mesh NoC; Cell Processor, 8 SPEs & ring NoC]