a bandwidth-optimized routing algorithm for hybrid fpga ... · to destination. the approach uses...

A Bandwidth-Optimized Routing Algorithm forHybrid FPGA Networks-on-Chip

Shivukumar B. Patil, Tianqi Liu, and Russell TessierDepartment of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, USA 01003

Abstract—In this paper, a heuristic routing algorithm that istuned for routing traffic in hybrid FPGA NoCs is presented. Thismulti-iteration routing algorithm requires a limited amount ofhardware for prescheduled stream-based routing while allowingfor bandwidth-optimized usage of NoC routing resources. By ef-ficiently scheduling data streams, the remaining NoC bandwidthcan be used for bursty, packet-switched traffic. We demonstrateour approach using a hybrid NoC and show an average 11%data stream bandwidth improvement for a collection of fivebenchmark traffic patterns.

I. INTRODUCTION

The need for both hard and soft FPGA NoCs has been well-documented and a number of NoC architectures have beendeveloped. Packet-switched (PS) NoCs (similar to those usedon multi-core microprocessors) are well-suited to bursty trafficwith unpredictable destinations. Time division multiplexed(TDM) NoCs route stream-based traffic using a schedule de-rived at compile time. In general, TDM NoCs exhibit reducedlatency and energy consumption versus packet-switched NoCssince router buffering can be largely eliminated. This savingsdoes come at the cost of some per-router storage for theschedule memory. Recently, hybrid FPGA NoCs [1][2] havebeen introduced that support both PS and TDM routing. SinceTDM routing is performed at compile time and PS routing isperformed at run time, care must be taken during TDM routingto provide low latency and high throughput for stream-basedtraffic without blocking additional PS traffic.

The selection of NoC channels and time slots for eachTDM route requires an extensive spatial and temporal searcheven if routes are constrained to shortest paths. The lack ofbuffering for TDM routes requires time slots to be allocatedsequentially. Additionally, the use of routing resources shouldbe balanced across the network to allow for sufficient timeslots for PS routing at run time. Previous work on TDMrouting algorithms for FPGA NoCs [1][2] generally focuseson algorithm speed rather than optimizing the amount ofTDM bandwidth or balancing routing resources by minimizingchannel congestion. Given the amount of compile time neededfor physical design of the FPGA logic, allocating a fewseconds of additional compile time to obtain higher-bandwidthTDM routes can be an effective goal.

Our new TDM routing approach for hybrid NoCs maximizesand balances TDM bandwidth by using a multi-iteration one-step routing algorithm. For each routed TDM flit, the algorithmperforms a breadth-first search that simultaneously considersboth physical routing channels and time slots. A key aspect

of the algorithm is its use of a Pathfinder [3] approach toripup and reroute. Wire and time-slot overuse in a specificiteration is reflected in a non-decreasing cost value that canbe used to gently guide routes away from a resource duringlater iterations. Our TDM routing algorithm is applied to ahybrid FPGA NoC for five widely-used NoC routing patterns.

II. BACKGROUND

FPGA NoCs have several characteristics that differentiatethem from multi-core microprocessor NoCs. Since FPGAcores are expected to be simple, transmitted data values areexpected to arrive in order at the receiver. This issue gener-ally results in shortest-path routing for TDM communicationand communication protocols with predictable paths (e.g.dimension-ordered routing (DOR)) for PS routing [4].

Most previous compile-time TDM approaches first identifythe routing channels used by paths and then select routingtime slots. This two-step approach is easy to implement butcan lead to suboptimal results. For example, Lu and Jantsch[5] use a depth-first search to select routing paths followedby scheduling. In Carle et al. [6], TDM routes are limited toDOR to reduce route search complexity. A previous hybridFPGA router [1] can support simultaneous TDM and PSrouting although all traffic is limited to DOR patterns. MultipleTDM routing approaches have combined channel with timeslot selection. These algorithms greedily identify feasiblesolutions. Kapre et al. [7] attempt to schedule routes usingthe first available time slot. Multi-iteration ripup and reroute ismentioned but not implemented. Shpiner et al. [8] use a single-iteration random-greedy scheduling algorithm to determineroute time slots. Evain and Diguet [9] use a space-time routeallocation algorithm to minimize the TDM schedule.

In contrast, our approach combines path and time slotselection in a single pass. TDM routes are not restricted toDOR; they can take any available shortest path from sourceto destination. The approach uses multiple rip-up and reroutepasses to eliminate the effects of net ordering. Although ourTDM routing algorithm could be applied to any FPGA NoCthat supports TDM, it is optimized for hybrid NoCs withboth TDM and packet-switching. A router from a typicalFPGA hybrid NoC [2] is shown in Fig. 1. The router containsa context memory for TDM schedule information that canconfigure the crossbar to connect a specific source port to eachoutput port (e.g. S, N , ..). Output ports with a − in the contextmemory on a specific cycle instead accept PS data. If a TDMconnection is allocated and no input data is available, PS data

Switch Allocator

.

...

Output Port Destinations

Input Modules Output Modules

VC Allocator

Control Logic

Cycle counter

S N - - -

N S E W C

.

...

.

...

.

.

Context Memory Source Input Ports

Fig. 1: Router for Hybrid FPGA NOC [2]

may be forwarded instead. Each input module contains twosets of FIFOs that serve as PS buffers and a bypass registerfor TDM data. Packet-switched routing is performed usingDOR. Since the TDM schedule length is short (eight timeslots), effective routing heuristics are needed to efficiently useavailable bandwidth.

III. ROUTING ALGORITHM IMPLEMENTATION

Our multi-iteration routing algorithm attempts to maximizeTDM routing up to the allocation requested by the user atcompile time. Unused routing time slots can be used forpacket-switched routing. Our algorithm consists of three partsillustrated in Algorithms 1, 2, and 3. Each FPGA source com-ponent (e.g. soft processor, IP core) communicates with oneor more destination components forming a connection. Eachconnection supports one or more periodic data transmissions(e.g. flits). To provide comparisons to PS routing, four flits areconsidered a packet.

Algorithm 1 shows the top-level loop of data communica-tion scheduling. An attempt is made to schedule a route foreach connection beginning at a starting time slot at the sourcerouter. After paths consisting of (inter-router channel, timeslot) assignments are made for each connection, a check forroute success is performed. A successful schedule includes nointer-router channel, time slot pairs that are shared by multipleconnections. If unsuccessful, paths for connections withoutoverlap are locked and another set of routing attempts for theremaining unrouted connections are performed beginning atthe next starting time slot (e.g. starting time slot+1). The pro-cess terminates when all connections have been successfullyscheduled with no overlap or no schedule can be found for

Data: List of multi-fanout connections (input to output)Result: List of routing paths (S) for every connection

1 /* Try all starting time slots */2 for i=0 to num_slots do3 unrouted_set = all connections; start slot = i4 for j = i, j < num_slots, j++ do5 unrouted_set = perform_set_route(unrouted_set, j)

if no shared (channel, time slot) paths then6 break7 end8 end9 end

Algorithm 1: Traffic path algorithm

Data: Set of unlocked connections, starting slot jResult: List of routing paths (S) with no (channel, slot)

pair overlapsResult: Set of connections with (channel, slot) pair

overlap1 Pathfinder_ts(j, set)2 Lock all connections with paths with no (channel, slot)

pairs that overlap3 set = set - locked connectionsAlgorithm 2: Route a set of unlocked connections per-form_set_route

them. In the latter case, unrouted connections are routed atrun-time using packet switching.

Algorithm 2 illustrates the use of our Pathfinder_ts algo-rithm that considers both spatial (channels) and temporal (timeslots) information. One application of Pathfinder_ts is per-formed per starting time slot. Connections that are successfullyrouted are locked.

Algorithm 3 forms the heart of our route scheduling ap-proach. Each source-destination connection i is built from aseries of channel, time slot pairs along a shortest path. Multi-fanout connections are supported and multiple iterations ofrip-up and reroute are performed. As a route proceeds fromsource to destination, the cost of a new channel, time slot pairis considered. The lowest cost pair is selected at each step tocreate the lowest-cost scheduled path with cost Ci.

cn(ts) = (1+congestionts×pfac)(1+historyts×hfac) (1)

The Pathfinder cost function [3] in (1) determines thecost of using an adjacent node (channel, time slot pair) inAlgorithm 3. Congestion cost (congestionts) of a node isthe number of routing paths that are currently sharing it.History cost (historyts) is a non-decreasing value which isincreased by the amount of congestion cost at the end ofeach iteration. Values pfac and hfac are congestion cost andhistory cost multiplication factors, respectively. They are setto constant values of 1.2 and 0.2, respectively, as determinedvia experimentation.

Data: List of connections (source, destination, andstarting time slot)

Result: List of routing paths (S) for each connection ofchannel, time slot pairs

1 while shared (channel, slot) pairs exist and (iteration <max_iter) or iteration = 0 do

2 for all connections i starting at time slot i.slot do3 Rip up existing connection path Pi4 Pi ← (source, i.slot)5 Initialize expansion_list = new MinHeap()6 while all destinations dij not found do7 Remove lowest cost node m with time slot j

from expansion_list8 while fanouts n of node m at time slot (j+1)

mod max_slot on shortest path exist do9 Add n to expansion_list at cost cn + Ci

10 end11 for all nodes n in path from dij to source do12 Update cn13 Add n to Pi14 end15 end16 end17 end

Algorithm 3: Pathfinder_ts algorithm including time slots

Topology 64-node, 8 × 8 2D meshTechnology 45 nm at 1.1V, 1.0 GHzChannel width 128 bits (16 bytes)Packet size 4 flitsVirtual channels 2/portBuffer size per VC 10 flits in depth

TABLE I: Routing Parameters

IV. EXPERIMENTAL APPROACH

To evaluate the performance of our scheduling algorithmon a hybrid NoC, Booksim 2.0 [10] was used to assess datathroughput, packet latency and NoC dynamic power. Thesimulator was previously modified to allow for hybrid TDMand packet-switched routing and power measurements [2] anda physical router layout was used to obtain NoC physicalparameters [2]. Each router component and the inter-routerwires were assigned dynamic energy consumption valuesdetermined via the post-synthesis simulation described in [2].These values were scaled by flit-level toggle rates determinedduring the Booksim simulations. The network was warmedup with enough packets to reach a stable state prior to resultsrecording. Unless otherwise stated, the NoC parameters shownin Table I were used for experimentation. All experiments usea routing schedule of eight time slots and 500 Pathfinder_tsiterations.

Using our modified version of Booksim, three communi-cation patterns for hybrid routing were considered [10]. Thepatterns include 1) transpose (messages from (x, y) are sent to(y, x)), 2) bit-reverse (messages from the node labeled x aresent to the node whose label is bit-reversed(x), and 3) tornado

Single Iteration Pathfinder_tsPattern BW Min/ Ave σ BW Min/ Ave σ

% Max % MaxBitRev. 53 1/8 4.3 1.8 55 3/8 4.4 1.6Tornado 28 1/5 2.2 0.8 31 2/3 2.5 0.5Transp. 56 2/8 4.5 1.9 56 3/8 4.5 1.72-Side 71 5/7 5.7 0.7 99 7/8 7.9 0.24-Side 79 5/8 6.5 0.9 100 8/8 8.0 0.0

TABLE II: Time slot routing results of single iteration shortestpath (SP) and Pathfinder_ts (PF) algorithms for single-source,single-destination traffic patterns.

(messages from (x, y) are sent to (x+ k2 −1 mod k, y+k2 −1

mod k) where k is the dimension of x and y. Results are alsoreported for 2-side (sources and destinations on parallel edgesof the NoC) and 4-side (sources and destinations along theNoC perimeter) traffic patterns [2] at throughput rates of 10Gbps for each source-destination pair. Combined execution ofAlgorithms 1, 2, and 3 requires on the order of seconds.

V. RESULTS

In this section, the routing results from using ourPathfinder_ts routing algorithm are directly compared againstresults using a single iteration shortest-path (SP) TDM ap-proach [2]. The latter algorithm is similar to the routingalgorithm described in [1]. All routes for TDM follow shortestpaths.

The routing results shown in Table II illustrate the benefitsof multi-iteration path selection and scheduling for TDMrouting versus the previous routing approach. Pathfinder_ts(PF) achieves improved bandwidth (BW%) for all five trafficpatterns. Up to 100% of required source-destination commu-nication (BW%) in the routing network can be accommodatedusing PF with an average 11% bandwidth improvement for PFversus SP. The remaining available bandwidth can be allocatedand used at run-time using longer-latency packet switchedrouting. The table also shows that the time slot usage for TDMin the routers is more balanced if PF is used. The minimum,maximum, and average time slot usage (out of 8 availableslots) across the routers is noted. By balancing slot usage forTDM at compile-time, unused slots can be more easily usedfor packet-switched routing at run-time.

Figure 2 shows average packet latency for transpose, tor-nado, and bit-reverse traffic patterns under increasing trafficinjection rates. TDM-only routing using PF and SP are usedto route connections at compile-time. The packet latency ofrouting using these approaches is compared against routingall traffic using packet switching (PS) with DOR. For PS-only traffic, latency increases sharply at the injection rates of0.04 to 0.06, due to limitations in dimension order routing.In TDM-only routing, the PF latency is better than the SPlatency for all traffic patterns. Since more TDM routing slotsare available with PF, data is assigned to a free slot moreeasily at the source, reducing source-to-destination latency. Fortornado and low-injection-rate transpose and bit-reverse, PS-only achieves lower latency in some cases since PS traffic can

0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11Injection rate (packets/node/cycle)

20

40

60

80

100

120la

tenc

y (c

ycle

s)

TDM-PF-transposeTDM-SP-transposePS-transpose

TDM-PF-tornadoTDM-SP-tornadoPS-tornado

TDM-PF-bitrevTDM-SP-bitrevPS-bitrev

Fig. 2: Average packet latency for transpose, tornado and bit-reverse traffic patterns using only compile-time TDM or run-time PS routing using DOR.

0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050TDM Injection rate (packets/node/cycle)

24

26

28

30

32

34

36

38

40

42

44

Late

ncy

(cyc

les)

SP-TDM-0.05SP-PS-0.05PF-TDM-0.05PF-PS-0.05

Fig. 3: Average NoC latency for varying amounts of TDMversus PS traffic for transpose under a moderate, fixed totalpacket injection rate of 0.05 packets/node/cycle. Note thatTDM latency increases as TDM injection rate increases dueto more contention for time slots.

be immediately transmitted by a source rather than having towait for a pre-scheduled TDM time slot.

Figure 3 shows results for experiments when the totalpacket injection rate at each source is constant at 0.05packets/node/cycle. The fraction of transmitted data that isscheduled via TDM versus the amount of PS traffic is variedfrom all PS (left) to all TDM (right). Initially, as the amountof PS traffic is high, all packets have high latency. As theTDM injection rate is increased, the PS traffic and associatedlatency are reduced. Latencies for TDM traffic routed withthe PF algorithm are consistently less than corresponding SPresults.

Figure 4 shows the total power consumption of the NoC for

6 8 10 12 14 16 18 20 22 24 26Throughput (Gbits/s)

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

Pow

er (

mW

)

TDM-PF-transposeTDM-SP-transposePS-transposeTDM-PF-tornadoTDM-SP-tornadoPS-tornadoTDM-PF-bitrevTDM-SP-bitrevPS-bitrev

Fig. 4: Total power consumption for transpose, tornado andbit-reverse traffic using PS-only or TDM-only routing.

SP-only, PF-only, and PS-only routing for three traffic patterns.For SP- and PF-only routing, the routes have similar powernumbers at low injection rate. At higher injection rates, thepower consumption of PF routing is slightly greater than thatof SP due to a larger number of time slots available for routing.

VI. CONCLUSION

In this paper, a heuristic routing algorithm that is tuned forrouting traffic in hybrid FPGA NoCs is presented. This multi-iteration router balances routed traffic across routing resourcesto reduce congestion. Multiple rip-up and reroute iterations areused to enhance low-latency, time-scheduled communication.Substantial data stream bandwidth and latency improvementfor a collection of five benchmarks is shown using our newrouting approach.1

REFERENCES[1] N. Kapre, “Marathon: Statically-Scheduled Conflict-Free Routing on

FPGA Overlay NoCs,” in FCCM, May 2016, pp. 156–163.[2] T. Liu, N. K. Dumpala, and R. Tessier, “Hybrid hard NoCs for efficient

FPGA communication,” in FPT, Dec. 2016, pp. 157–164.[3] L. McMurchie and C. Ebeling, “PathFinder: A Negotiation-Based

Performance-Driven Router for FPGAs,” in FPGA, Feb. 1995, pp. 111–117.

[4] M. S. Abdelfattah, A. Bitar, and V. Betz, “Take the highway: Designfor embedded NoCs on FPGAs,” in FPGA, Feb. 2015, pp. 98–107.

[5] Z. Lu and A. Jantsch, “TDM virtual-circuit configuration for network-on-chip,” IEEE TVLSI, vol. 16, no. 8, pp. 1021–1034, Aug. 2008.

[6] T. Carle et al., “Static mapping of real-time applications onto massivelyparallel processor arrays,” in Int’l Conf. on Application of Concurrencyto Sys. Design, Jun. 2014, pp. 112–121.

[7] N. Kapre et al., “Packet switched vs. time multiplexed FPGA overlaynetworks,” in FCCM, May 2006, pp. 1–10.

[8] A. Shpiner et al., “On the capacity of bufferless networks-on-chip,” IEEETPDS, vol. 26, no. 2, pp. 492–506, Mar. 2014.

[9] S. Evian and J.-P. Diguet, “Efficient space-time NoC path allocationbased on mutual exclusion and pre-reservation,” in ACM GLSVLSI, Mar.2007, pp. 457–460.

[10] N. Jiang et al., “A detailed and flexible cycle-accurate network-on-chipsimulator,” in ISPASS, Apr. 2013, pp. 86–96.

1This research was funded by a grant from Xilinx Corporation.

a bandwidth-optimized routing algorithm for hybrid fpga ... · to destination. the approach uses...

Documents