a bandwidth-optimized routing algorithm for hybrid fpga ... · to destination. the approach uses...

4
A Bandwidth-Optimized Routing Algorithm for Hybrid FPGA Networks-on-Chip Shivukumar B. Patil, Tianqi Liu, and Russell Tessier Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, USA 01003 Abstract—In this paper, a heuristic routing algorithm that is tuned for routing traffic in hybrid FPGA NoCs is presented. This multi-iteration routing algorithm requires a limited amount of hardware for prescheduled stream-based routing while allowing for bandwidth-optimized usage of NoC routing resources. By ef- ficiently scheduling data streams, the remaining NoC bandwidth can be used for bursty, packet-switched traffic. We demonstrate our approach using a hybrid NoC and show an average 11% data stream bandwidth improvement for a collection of five benchmark traffic patterns. I. I NTRODUCTION The need for both hard and soft FPGA NoCs has been well- documented and a number of NoC architectures have been developed. Packet-switched (PS) NoCs (similar to those used on multi-core microprocessors) are well-suited to bursty traffic with unpredictable destinations. Time division multiplexed (TDM) NoCs route stream-based traffic using a schedule de- rived at compile time. In general, TDM NoCs exhibit reduced latency and energy consumption versus packet-switched NoCs since router buffering can be largely eliminated. This savings does come at the cost of some per-router storage for the schedule memory. Recently, hybrid FPGA NoCs [1][2] have been introduced that support both PS and TDM routing. Since TDM routing is performed at compile time and PS routing is performed at run time, care must be taken during TDM routing to provide low latency and high throughput for stream-based traffic without blocking additional PS traffic. The selection of NoC channels and time slots for each TDM route requires an extensive spatial and temporal search even if routes are constrained to shortest paths. The lack of buffering for TDM routes requires time slots to be allocated sequentially. Additionally, the use of routing resources should be balanced across the network to allow for sufficient time slots for PS routing at run time. Previous work on TDM routing algorithms for FPGA NoCs [1][2] generally focuses on algorithm speed rather than optimizing the amount of TDM bandwidth or balancing routing resources by minimizing channel congestion. Given the amount of compile time needed for physical design of the FPGA logic, allocating a few seconds of additional compile time to obtain higher-bandwidth TDM routes can be an effective goal. Our new TDM routing approach for hybrid NoCs maximizes and balances TDM bandwidth by using a multi-iteration one- step routing algorithm. For each routed TDM flit, the algorithm performs a breadth-first search that simultaneously considers both physical routing channels and time slots. A key aspect of the algorithm is its use of a Pathfinder [3] approach to ripup and reroute. Wire and time-slot overuse in a specific iteration is reflected in a non-decreasing cost value that can be used to gently guide routes away from a resource during later iterations. Our TDM routing algorithm is applied to a hybrid FPGA NoC for five widely-used NoC routing patterns. II. BACKGROUND FPGA NoCs have several characteristics that differentiate them from multi-core microprocessor NoCs. Since FPGA cores are expected to be simple, transmitted data values are expected to arrive in order at the receiver. This issue gener- ally results in shortest-path routing for TDM communication and communication protocols with predictable paths (e.g. dimension-ordered routing (DOR)) for PS routing [4]. Most previous compile-time TDM approaches first identify the routing channels used by paths and then select routing time slots. This two-step approach is easy to implement but can lead to suboptimal results. For example, Lu and Jantsch [5] use a depth-first search to select routing paths followed by scheduling. In Carle et al. [6], TDM routes are limited to DOR to reduce route search complexity. A previous hybrid FPGA router [1] can support simultaneous TDM and PS routing although all traffic is limited to DOR patterns. Multiple TDM routing approaches have combined channel with time slot selection. These algorithms greedily identify feasible solutions. Kapre et al. [7] attempt to schedule routes using the first available time slot. Multi-iteration ripup and reroute is mentioned but not implemented. Shpiner et al. [8] use a single- iteration random-greedy scheduling algorithm to determine route time slots. Evain and Diguet [9] use a space-time route allocation algorithm to minimize the TDM schedule. In contrast, our approach combines path and time slot selection in a single pass. TDM routes are not restricted to DOR; they can take any available shortest path from source to destination. The approach uses multiple rip-up and reroute passes to eliminate the effects of net ordering. Although our TDM routing algorithm could be applied to any FPGA NoC that supports TDM, it is optimized for hybrid NoCs with both TDM and packet-switching. A router from a typical FPGA hybrid NoC [2] is shown in Fig. 1. The router contains a context memory for TDM schedule information that can configure the crossbar to connect a specific source port to each output port (e.g. S, N , ..). Output ports with a - in the context memory on a specific cycle instead accept PS data. If a TDM connection is allocated and no input data is available, PS data

Upload: others

Post on 21-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

  • A Bandwidth-Optimized Routing Algorithm forHybrid FPGA Networks-on-Chip

    Shivukumar B. Patil, Tianqi Liu, and Russell TessierDepartment of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, USA 01003

    Abstract—In this paper, a heuristic routing algorithm that istuned for routing traffic in hybrid FPGA NoCs is presented. Thismulti-iteration routing algorithm requires a limited amount ofhardware for prescheduled stream-based routing while allowingfor bandwidth-optimized usage of NoC routing resources. By ef-ficiently scheduling data streams, the remaining NoC bandwidthcan be used for bursty, packet-switched traffic. We demonstrateour approach using a hybrid NoC and show an average 11%data stream bandwidth improvement for a collection of fivebenchmark traffic patterns.

    I. INTRODUCTION

    The need for both hard and soft FPGA NoCs has been well-documented and a number of NoC architectures have beendeveloped. Packet-switched (PS) NoCs (similar to those usedon multi-core microprocessors) are well-suited to bursty trafficwith unpredictable destinations. Time division multiplexed(TDM) NoCs route stream-based traffic using a schedule de-rived at compile time. In general, TDM NoCs exhibit reducedlatency and energy consumption versus packet-switched NoCssince router buffering can be largely eliminated. This savingsdoes come at the cost of some per-router storage for theschedule memory. Recently, hybrid FPGA NoCs [1][2] havebeen introduced that support both PS and TDM routing. SinceTDM routing is performed at compile time and PS routing isperformed at run time, care must be taken during TDM routingto provide low latency and high throughput for stream-basedtraffic without blocking additional PS traffic.

    The selection of NoC channels and time slots for eachTDM route requires an extensive spatial and temporal searcheven if routes are constrained to shortest paths. The lack ofbuffering for TDM routes requires time slots to be allocatedsequentially. Additionally, the use of routing resources shouldbe balanced across the network to allow for sufficient timeslots for PS routing at run time. Previous work on TDMrouting algorithms for FPGA NoCs [1][2] generally focuseson algorithm speed rather than optimizing the amount ofTDM bandwidth or balancing routing resources by minimizingchannel congestion. Given the amount of compile time neededfor physical design of the FPGA logic, allocating a fewseconds of additional compile time to obtain higher-bandwidthTDM routes can be an effective goal.

    Our new TDM routing approach for hybrid NoCs maximizesand balances TDM bandwidth by using a multi-iteration one-step routing algorithm. For each routed TDM flit, the algorithmperforms a breadth-first search that simultaneously considersboth physical routing channels and time slots. A key aspect

    of the algorithm is its use of a Pathfinder [3] approach toripup and reroute. Wire and time-slot overuse in a specificiteration is reflected in a non-decreasing cost value that canbe used to gently guide routes away from a resource duringlater iterations. Our TDM routing algorithm is applied to ahybrid FPGA NoC for five widely-used NoC routing patterns.

    II. BACKGROUND

    FPGA NoCs have several characteristics that differentiatethem from multi-core microprocessor NoCs. Since FPGAcores are expected to be simple, transmitted data values areexpected to arrive in order at the receiver. This issue gener-ally results in shortest-path routing for TDM communicationand communication protocols with predictable paths (e.g.dimension-ordered routing (DOR)) for PS routing [4].

    Most previous compile-time TDM approaches first identifythe routing channels used by paths and then select routingtime slots. This two-step approach is easy to implement butcan lead to suboptimal results. For example, Lu and Jantsch[5] use a depth-first search to select routing paths followedby scheduling. In Carle et al. [6], TDM routes are limited toDOR to reduce route search complexity. A previous hybridFPGA router [1] can support simultaneous TDM and PSrouting although all traffic is limited to DOR patterns. MultipleTDM routing approaches have combined channel with timeslot selection. These algorithms greedily identify feasiblesolutions. Kapre et al. [7] attempt to schedule routes usingthe first available time slot. Multi-iteration ripup and reroute ismentioned but not implemented. Shpiner et al. [8] use a single-iteration random-greedy scheduling algorithm to determineroute time slots. Evain and Diguet [9] use a space-time routeallocation algorithm to minimize the TDM schedule.

    In contrast, our approach combines path and time slotselection in a single pass. TDM routes are not restricted toDOR; they can take any available shortest path from sourceto destination. The approach uses multiple rip-up and reroutepasses to eliminate the effects of net ordering. Although ourTDM routing algorithm could be applied to any FPGA NoCthat supports TDM, it is optimized for hybrid NoCs withboth TDM and packet-switching. A router from a typicalFPGA hybrid NoC [2] is shown in Fig. 1. The router containsa context memory for TDM schedule information that canconfigure the crossbar to connect a specific source port to eachoutput port (e.g. S, N , ..). Output ports with a − in the contextmemory on a specific cycle instead accept PS data. If a TDMconnection is allocated and no input data is available, PS data

  • Switch Allocator

    .

    ...

    Output Port Destinations

    Input Modules Output Modules

    VC Allocator

    Control Logic

    Cycle counter

    S N - - -

    N S E W C

    .

    ...

    .

    ...

    .

    .

    Context Memory Source Input Ports

    Fig. 1: Router for Hybrid FPGA NOC [2]

    may be forwarded instead. Each input module contains twosets of FIFOs that serve as PS buffers and a bypass registerfor TDM data. Packet-switched routing is performed usingDOR. Since the TDM schedule length is short (eight timeslots), effective routing heuristics are needed to efficiently useavailable bandwidth.

    III. ROUTING ALGORITHM IMPLEMENTATION

    Our multi-iteration routing algorithm attempts to maximizeTDM routing up to the allocation requested by the user atcompile time. Unused routing time slots can be used forpacket-switched routing. Our algorithm consists of three partsillustrated in Algorithms 1, 2, and 3. Each FPGA source com-ponent (e.g. soft processor, IP core) communicates with oneor more destination components forming a connection. Eachconnection supports one or more periodic data transmissions(e.g. flits). To provide comparisons to PS routing, four flits areconsidered a packet.

    Algorithm 1 shows the top-level loop of data communica-tion scheduling. An attempt is made to schedule a route foreach connection beginning at a starting time slot at the sourcerouter. After paths consisting of (inter-router channel, timeslot) assignments are made for each connection, a check forroute success is performed. A successful schedule includes nointer-router channel, time slot pairs that are shared by multipleconnections. If unsuccessful, paths for connections withoutoverlap are locked and another set of routing attempts for theremaining unrouted connections are performed beginning atthe next starting time slot (e.g. starting time slot+1). The pro-cess terminates when all connections have been successfullyscheduled with no overlap or no schedule can be found for

    Data: List of multi-fanout connections (input to output)Result: List of routing paths (S) for every connection

    1 /* Try all starting time slots */2 for i=0 to num_slots do3 unrouted_set = all connections; start slot = i4 for j = i, j < num_slots, j++ do5 unrouted_set = perform_set_route(unrouted_set, j)

    if no shared (channel, time slot) paths then6 break7 end8 end9 end

    Algorithm 1: Traffic path algorithm

    Data: Set of unlocked connections, starting slot jResult: List of routing paths (S) with no (channel, slot)

    pair overlapsResult: Set of connections with (channel, slot) pair

    overlap1 Pathfinder_ts(j, set)2 Lock all connections with paths with no (channel, slot)

    pairs that overlap3 set = set - locked connectionsAlgorithm 2: Route a set of unlocked connections per-form_set_route

    them. In the latter case, unrouted connections are routed atrun-time using packet switching.

    Algorithm 2 illustrates the use of our Pathfinder_ts algo-rithm that considers both spatial (channels) and temporal (timeslots) information. One application of Pathfinder_ts is per-formed per starting time slot. Connections that are successfullyrouted are locked.

    Algorithm 3 forms the heart of our route scheduling ap-proach. Each source-destination connection i is built from aseries of channel, time slot pairs along a shortest path. Multi-fanout connections are supported and multiple iterations ofrip-up and reroute are performed. As a route proceeds fromsource to destination, the cost of a new channel, time slot pairis considered. The lowest cost pair is selected at each step tocreate the lowest-cost scheduled path with cost Ci.

    cn(ts) = (1+congestionts×pfac)(1+historyts×hfac) (1)

    The Pathfinder cost function [3] in (1) determines thecost of using an adjacent node (channel, time slot pair) inAlgorithm 3. Congestion cost (congestionts) of a node isthe number of routing paths that are currently sharing it.History cost (historyts) is a non-decreasing value which isincreased by the amount of congestion cost at the end ofeach iteration. Values pfac and hfac are congestion cost andhistory cost multiplication factors, respectively. They are setto constant values of 1.2 and 0.2, respectively, as determinedvia experimentation.

  • Data: List of connections (source, destination, andstarting time slot)

    Result: List of routing paths (S) for each connection ofchannel, time slot pairs

    1 while shared (channel, slot) pairs exist and (iteration <max_iter) or iteration = 0 do

    2 for all connections i starting at time slot i.slot do3 Rip up existing connection path Pi4 Pi ← (source, i.slot)5 Initialize expansion_list = new MinHeap()6 while all destinations dij not found do7 Remove lowest cost node m with time slot j

    from expansion_list8 while fanouts n of node m at time slot (j+1)

    mod max_slot on shortest path exist do9 Add n to expansion_list at cost cn + Ci

    10 end11 for all nodes n in path from dij to source do12 Update cn13 Add n to Pi14 end15 end16 end17 end

    Algorithm 3: Pathfinder_ts algorithm including time slots

    Topology 64-node, 8 × 8 2D meshTechnology 45 nm at 1.1V, 1.0 GHzChannel width 128 bits (16 bytes)Packet size 4 flitsVirtual channels 2/portBuffer size per VC 10 flits in depth

    TABLE I: Routing Parameters

    IV. EXPERIMENTAL APPROACH

    To evaluate the performance of our scheduling algorithmon a hybrid NoC, Booksim 2.0 [10] was used to assess datathroughput, packet latency and NoC dynamic power. Thesimulator was previously modified to allow for hybrid TDMand packet-switched routing and power measurements [2] anda physical router layout was used to obtain NoC physicalparameters [2]. Each router component and the inter-routerwires were assigned dynamic energy consumption valuesdetermined via the post-synthesis simulation described in [2].These values were scaled by flit-level toggle rates determinedduring the Booksim simulations. The network was warmedup with enough packets to reach a stable state prior to resultsrecording. Unless otherwise stated, the NoC parameters shownin Table I were used for experimentation. All experiments usea routing schedule of eight time slots and 500 Pathfinder_tsiterations.

    Using our modified version of Booksim, three communi-cation patterns for hybrid routing were considered [10]. Thepatterns include 1) transpose (messages from (x, y) are sent to(y, x)), 2) bit-reverse (messages from the node labeled x aresent to the node whose label is bit-reversed(x), and 3) tornado

    Single Iteration Pathfinder_tsPattern BW Min/ Ave σ BW Min/ Ave σ

    % Max % MaxBitRev. 53 1/8 4.3 1.8 55 3/8 4.4 1.6Tornado 28 1/5 2.2 0.8 31 2/3 2.5 0.5Transp. 56 2/8 4.5 1.9 56 3/8 4.5 1.72-Side 71 5/7 5.7 0.7 99 7/8 7.9 0.24-Side 79 5/8 6.5 0.9 100 8/8 8.0 0.0

    TABLE II: Time slot routing results of single iteration shortestpath (SP) and Pathfinder_ts (PF) algorithms for single-source,single-destination traffic patterns.

    (messages from (x, y) are sent to (x+ k2 −1 mod k, y+k2 −1

    mod k) where k is the dimension of x and y. Results are alsoreported for 2-side (sources and destinations on parallel edgesof the NoC) and 4-side (sources and destinations along theNoC perimeter) traffic patterns [2] at throughput rates of 10Gbps for each source-destination pair. Combined execution ofAlgorithms 1, 2, and 3 requires on the order of seconds.

    V. RESULTS

    In this section, the routing results from using ourPathfinder_ts routing algorithm are directly compared againstresults using a single iteration shortest-path (SP) TDM ap-proach [2]. The latter algorithm is similar to the routingalgorithm described in [1]. All routes for TDM follow shortestpaths.

    The routing results shown in Table II illustrate the benefitsof multi-iteration path selection and scheduling for TDMrouting versus the previous routing approach. Pathfinder_ts(PF) achieves improved bandwidth (BW%) for all five trafficpatterns. Up to 100% of required source-destination commu-nication (BW%) in the routing network can be accommodatedusing PF with an average 11% bandwidth improvement for PFversus SP. The remaining available bandwidth can be allocatedand used at run-time using longer-latency packet switchedrouting. The table also shows that the time slot usage for TDMin the routers is more balanced if PF is used. The minimum,maximum, and average time slot usage (out of 8 availableslots) across the routers is noted. By balancing slot usage forTDM at compile-time, unused slots can be more easily usedfor packet-switched routing at run-time.

    Figure 2 shows average packet latency for transpose, tor-nado, and bit-reverse traffic patterns under increasing trafficinjection rates. TDM-only routing using PF and SP are usedto route connections at compile-time. The packet latency ofrouting using these approaches is compared against routingall traffic using packet switching (PS) with DOR. For PS-only traffic, latency increases sharply at the injection rates of0.04 to 0.06, due to limitations in dimension order routing.In TDM-only routing, the PF latency is better than the SPlatency for all traffic patterns. Since more TDM routing slotsare available with PF, data is assigned to a free slot moreeasily at the source, reducing source-to-destination latency. Fortornado and low-injection-rate transpose and bit-reverse, PS-only achieves lower latency in some cases since PS traffic can

  • 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11Injection rate (packets/node/cycle)

    20

    40

    60

    80

    100

    120la

    tenc

    y (c

    ycle

    s)

    TDM-PF-transposeTDM-SP-transposePS-transpose

    TDM-PF-tornadoTDM-SP-tornadoPS-tornado

    TDM-PF-bitrevTDM-SP-bitrevPS-bitrev

    Fig. 2: Average packet latency for transpose, tornado and bit-reverse traffic patterns using only compile-time TDM or run-time PS routing using DOR.

    0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050TDM Injection rate (packets/node/cycle)

    24

    26

    28

    30

    32

    34

    36

    38

    40

    42

    44

    Late

    ncy

    (cyc

    les)

    SP-TDM-0.05SP-PS-0.05PF-TDM-0.05PF-PS-0.05

    Fig. 3: Average NoC latency for varying amounts of TDMversus PS traffic for transpose under a moderate, fixed totalpacket injection rate of 0.05 packets/node/cycle. Note thatTDM latency increases as TDM injection rate increases dueto more contention for time slots.

    be immediately transmitted by a source rather than having towait for a pre-scheduled TDM time slot.

    Figure 3 shows results for experiments when the totalpacket injection rate at each source is constant at 0.05packets/node/cycle. The fraction of transmitted data that isscheduled via TDM versus the amount of PS traffic is variedfrom all PS (left) to all TDM (right). Initially, as the amountof PS traffic is high, all packets have high latency. As theTDM injection rate is increased, the PS traffic and associatedlatency are reduced. Latencies for TDM traffic routed withthe PF algorithm are consistently less than corresponding SPresults.

    Figure 4 shows the total power consumption of the NoC for

    6 8 10 12 14 16 18 20 22 24 26Throughput (Gbits/s)

    1500

    2000

    2500

    3000

    3500

    4000

    4500

    5000

    5500

    6000

    Pow

    er (

    mW

    )

    TDM-PF-transposeTDM-SP-transposePS-transposeTDM-PF-tornadoTDM-SP-tornadoPS-tornadoTDM-PF-bitrevTDM-SP-bitrevPS-bitrev

    Fig. 4: Total power consumption for transpose, tornado andbit-reverse traffic using PS-only or TDM-only routing.

    SP-only, PF-only, and PS-only routing for three traffic patterns.For SP- and PF-only routing, the routes have similar powernumbers at low injection rate. At higher injection rates, thepower consumption of PF routing is slightly greater than thatof SP due to a larger number of time slots available for routing.

    VI. CONCLUSION

    In this paper, a heuristic routing algorithm that is tuned forrouting traffic in hybrid FPGA NoCs is presented. This multi-iteration router balances routed traffic across routing resourcesto reduce congestion. Multiple rip-up and reroute iterations areused to enhance low-latency, time-scheduled communication.Substantial data stream bandwidth and latency improvementfor a collection of five benchmarks is shown using our newrouting approach.1

    REFERENCES[1] N. Kapre, “Marathon: Statically-Scheduled Conflict-Free Routing on

    FPGA Overlay NoCs,” in FCCM, May 2016, pp. 156–163.[2] T. Liu, N. K. Dumpala, and R. Tessier, “Hybrid hard NoCs for efficient

    FPGA communication,” in FPT, Dec. 2016, pp. 157–164.[3] L. McMurchie and C. Ebeling, “PathFinder: A Negotiation-Based

    Performance-Driven Router for FPGAs,” in FPGA, Feb. 1995, pp. 111–117.

    [4] M. S. Abdelfattah, A. Bitar, and V. Betz, “Take the highway: Designfor embedded NoCs on FPGAs,” in FPGA, Feb. 2015, pp. 98–107.

    [5] Z. Lu and A. Jantsch, “TDM virtual-circuit configuration for network-on-chip,” IEEE TVLSI, vol. 16, no. 8, pp. 1021–1034, Aug. 2008.

    [6] T. Carle et al., “Static mapping of real-time applications onto massivelyparallel processor arrays,” in Int’l Conf. on Application of Concurrencyto Sys. Design, Jun. 2014, pp. 112–121.

    [7] N. Kapre et al., “Packet switched vs. time multiplexed FPGA overlaynetworks,” in FCCM, May 2006, pp. 1–10.

    [8] A. Shpiner et al., “On the capacity of bufferless networks-on-chip,” IEEETPDS, vol. 26, no. 2, pp. 492–506, Mar. 2014.

    [9] S. Evian and J.-P. Diguet, “Efficient space-time NoC path allocationbased on mutual exclusion and pre-reservation,” in ACM GLSVLSI, Mar.2007, pp. 457–460.

    [10] N. Jiang et al., “A detailed and flexible cycle-accurate network-on-chipsimulator,” in ISPASS, Apr. 2013, pp. 86–96.

    1This research was funded by a grant from Xilinx Corporation.