
Crossbars with Minimally-Sized Crosspoint Buffers

Nikos Chrysos and Manolis Katevenis

Inst. of Computer Science (ICS), Foundation for Research and Technology - Hellas (FORTH) - member of HiPEAC
FORTH-ICS, Vassilika Vouton, P.O. Box 1385, Heraklion, Crete, GR-711-10 Greece

Abstract- Buffered crossbars are emerging as the architecture that will replace the bufferless core currently used in high-speed routers and switching fabrics. Their main drawback is the cost of N² crosspoint buffers, which must be implemented inside the crossbar chip. Using traditional credit-based backpressure, each such buffer may need to hold several tens of cells due to the non-negligible linecard-crossbar round-trip time. In this paper we present credit prediction, a method that renders the requirements on crosspoint buffer space independent of the linecard-crossbar round-trip time. Credit prediction uses a central scheduler, but scheduling operations at inputs and at outputs are pipelined and run in parallel, just as in traditional (distributed) buffered crossbars. In terms of performance, our scheme performs identically with traditional buffered crossbars when the linecard-fabric round-trip time is zero, while it achieves considerable buffer savings as this round-trip time increases. Effectively, with our method we can build effective buffered crossbars using just one or two cells of buffer per crosspoint.

1. INTRODUCTION

High throughput crossbar switches are desired in order to build effective switching fabrics and routers. The throughput of a crossbar switch is measured by N × λ, where N is the switch radix (i.e. the number of input or output ports), and λ is the line rate. The limiting factor for throughput is the power consumption of the crossbar chip, which directly and critically depends on the I/O throughput of this chip.

Bufferless crossbar switches use internal speedup in order to mask out the scheduling inefficiencies of the crossbar scheduler [1] [2] [3] [4], as well as the padding overhead which is pertinent in systems that can operate only on fixed-size packets (cells). With internal speedup s, the effective (usable) throughput of the switch is s times lower than the I/O throughput of the crossbar chip. In commercial bufferless crossbars, s ranges between 2-4, hence these systems have 2-4 times lower throughput than what they could have if they did not use internal speedup.

This situation changes dramatically if we add a small buffer at each crosspoint (combined input-crosspoint queueing, CICQ) [5] [6] [7] [8]. The crosspoint buffers allow solving the bipartite graph matching problem in an approximate and long-term way, rather than the exact and short-term way needed in bufferless crossbars. Effectively, CICQ (or buffered crossbar) switches can yield efficient operation even when no internal speedup is used; additionally, they can directly operate on variable-size packets [9] [10], thus eliminating the padding overhead, which is the other source of speedup. The net result is that with buffered crossbars we can build higher throughput switches.

A difficulty that buffered crossbar designers face today pertains to the cost of the N² crosspoint buffers. Using traditional credit-based flow control, the buffer space per crosspoint grows proportionally with the round-trip time (RTT) between the ingress linecards and the crossbar core. This RTT may easily reach several thousands of nanoseconds, because in modern systems the linecards and the fabric are distributed over multiple racks, which are accommodated in one or more large halls. As a consequence, the required crosspoint buffer space may exceed the buffering capabilities of a single crossbar chip. One solution to this problem is to put multiple crossbar chips in parallel (see for instance [8]), but this directly increases design complexity and cost. In this paper, we propose credit prediction, a method that renders the size of crosspoint buffers independent of the round-trip time between the ingress linecards and the crossbar core, thus drastically reducing buffer space requirements.

1.1 Contribution & Contents

Traditional buffered crossbars require one RTT worth of buffer space per crosspoint, so that input scheduler i (inside ingress linecard i) is able to continue writing new cells into crosspoint buffer i→j until it gets informed that output scheduler j (inside the crossbar) reads these cells out of that buffer. In reality, if input i is the only one requesting output j, the length of crosspoint buffer i→j will never grow beyond one (1) cell! But if for some reason output j suddenly stops serving input i, the ingress linecard may have sent one RTT worth of cells towards the crossbar before being informed of this sudden event; the one RTT buffer space at crosspoint i→j is needed in order to accommodate these cells-in-flight.
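The stall scenario above can be illustrated with a toy discrete-time simulation. This sketch is ours, not from the paper; the RTT value and the half-RTT one-way delays are illustrative assumptions.

```python
# Toy model (ours) of traditional credit-based flow control for one
# crosspoint i->j: the linecard sends whenever it holds a credit, so if
# output j stalls, one RTT worth of cells ends up in the crosspoint buffer.

RTT = 8          # linecard-crossbar round-trip time, in cell times (assumed)
B = RTT          # crosspoint buffer sized at one RTT worth of cells

credits = B      # credit counter kept at the ingress linecard
queue = 0        # occupancy of crosspoint buffer i->j
in_flight = []   # (arrival_time, kind) events on the wire, one-way delay RTT/2

def tick(t, output_serving):
    global credits, queue
    # deliver cells and returning credits whose propagation delay elapsed
    for event in [e for e in in_flight if e[0] == t]:
        in_flight.remove(event)
        if event[1] == "cell":
            queue += 1
        else:
            credits += 1
    # output j drains one cell per cell time, unless it stalls
    if output_serving and queue > 0:
        queue -= 1
        in_flight.append((t + RTT // 2, "credit"))
    # input i keeps sending as long as it holds credits
    if credits > 0:
        credits -= 1
        in_flight.append((t + RTT // 2, "cell"))

# output j stalls from t = 0: the linecard sends until its credits run out
for t in range(4 * RTT):
    tick(t, output_serving=False)

assert credits == 0 and queue == B   # exactly one RTT worth of cells parked
```

With the output stalled, the crosspoint buffer absorbs all cells granted before the stall becomes visible at the linecard, which is exactly why B must equal one RTT under this scheme.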

In this paper, we position all input and output schedulers inside a central control chip. This placement enables faster coordination between inputs and outputs, thus smaller crosspoint buffers. With our scheme, the buffer space per crosspoint grows proportionally with the scheduling round-trip time, which can be made as small as one or two cell times: it is independent of the linecard-crossbar round-trip time. Our central scheduling employs a request-grant protocol between the linecards and the control chip. Although this resembles the type of scheduling used in bufferless crossbar switches, our system maintains the scheduling simplicity and the performance efficiency of traditional (distributed) buffered crossbars. In our scheme, the per-input and per-output schedulers operate independently and in pipeline; they may even work asynchronously from each other. In terms of performance, when the linecard-crossbar RTT equals zero, our system performs

1-4244-1 206-4/07/$25.00 ©2007 IEEE


identically with traditional buffered crossbars.

The remainder of this paper is organized as follows. Section 1.2 presents related work. Section 2 describes the central scheduler architecture for buffered crossbars, and the method of credit prediction. Section 3 describes how to modify credit prediction so as to make it work when the output ports of the crossbar are subject to external backpressure. In Section 4 we present performance simulation results, and in Section 5 we conclude this paper.

1.2 Related Work

Several recent studies try to reduce the buffer space requirements of buffered crossbar switches. In [11], N per-input queues are implemented inside the buffered crossbar chip. The incoming cells are first stored in these queues and afterwards move towards the crosspoint buffers. Effectively, the N per-input queues need to have a size of one linecard-crossbar RTT each, whereas the crosspoint buffers may have a smaller size, i.e. at least one FIFO-crosspoint round-trip time. The drawback of this scheme is that a per-input queue may fill up with cells destined for a congested output, thus blocking cells destined to other, probably non-congested outputs. Reference [12] proposes a rate-controlled CICQ switch. Each ingress linecard is informed of the virtual-output-queue (VOQ) occupancy in all other linecards, and, using this global information, it computes VOQ rates. Using rate control, the required crosspoint buffer space grows proportionally with N rather than RTT. Obviously, this method makes sense only when N is smaller than RTT. Reference [13] employs a load balancing stage in front of the buffered crossbar. Cells from a given input/output pair, say i→j, can be stored in any of the crosspoints along the j-th output of the crossbar; in this way, the minimum crosspoint buffer space is max(RTT/N, 1) worth of cells. However, due to crosspoint buffer sharing, multiple inputs may concurrently try to write into the same crosspoint buffer. Furthermore, due to multipath routing, cells may be forwarded from the crossbar out-of-order, hence resequencing must be implemented inside the egress linecards. Compared to these studies, the credit prediction method that we present in this paper has the lowest crosspoint buffer space requirements, while it maintains the operation and performance of a traditional buffered crossbar system.

Credit prediction for buffered crossbars is motivated by an analogous scheme that we have proposed for switches with a small queue in front of each fabric-output port [14]. Both schemes achieve the same goal: the size of each individual queue inside the fabric is made independent of the linecard-fabric round-trip time. However, whereas the result in [14] is for switches with a single queue in front of each output, this paper studies switches with N (per-input) queues in front of each output, i.e. buffered crossbars. Figure 1 illustrates the request-grant scheduler proposed in [14]. The control unit consists of N per-output credit schedulers, and N per-input grant schedulers. Before injecting a cell into the fabric, a VOQ must first issue a request to the corresponding credit scheduler, and wait for a grant. In [14], this credit scheduler


Fig. 1. The control unit proposed in [14] applied to buffered crossbar switches (in [14], there is a single queue in front of each output of the crossbar, which is shared among the N inputs); P is the one-way propagation delay between the linecard and the crossbar; hence RTT = 2P.

uses one credit counter that maintains the available buffer-credits for the single queue in front of its corresponding output, and it will issue a grant once it allocates a buffer-credit for the corresponding cell. Because credit schedulers for different outputs work independently from each other, many of them may concurrently issue grants to the same input. The role of the per-input grant schedulers is to serialize such concurrent grants, forwarding one grant per input, per cell time.

We can modify this scheduling scheme for buffered crossbar switches if each credit scheduler maintains N credit counters, one for each crosspoint queue along its corresponding output: a credit scheduler will serve a request from input i only when the credit counter for the i-th crosspoint buffer is non-zero.

Unfortunately, applying the credit prediction proposed in [14] to buffered crossbar switches is not straightforward. When a grant, say grant g, is selected by a grant scheduler at time t, it triggers the arrival of a cell, say cell c, at an output of the crossbar; this arrival will occur at time t + RTT. If the crossbar employs a single queue in front of each output port, as in [14], we can be certain that the output queue targeted by c will generate a credit at time t + RTT + 1¹; hence, we can predict the generation of this credit from time t, when grant g is sent to the linecard, and reserve this credit for another cell already from time t + 1². In a buffered crossbar however, cell c will be stored in its corresponding crosspoint queue, and the credit reserved for it will be released only when the output scheduler decides to serve that crosspoint among the N crosspoints that will possibly be eligible at that time. At time t we do not know when this service will actually take place. The present paper shows how to predict the decisions of the output schedulers one RTT ahead of time, thus making credit prediction applicable to buffered crossbar switches³.

¹ Since the queue will be non-empty at that time.
² Without credit prediction, this credit would be available for another cell at time t + RTT + 1.
³ Observe that the central scheduler that we present in the next section differs from the one described in [14]: since the crosspoint buffers are organized per-input, credit reservations can be performed by N per-input schedulers, hence we do not need per-output credit schedulers.




Fig. 2. A 3×3 buffered crossbar switch (a) with the input schedulers distributed over the N ingress linecards; (b) with all input schedulers placed in the crossbar chip.

2. SYSTEM DESCRIPTION

In this section we present a central scheduler for buffered crossbar switches; next, we will use this scheduler to implement credit prediction. We assume virtual-output-queue (VOQ) buffered crossbars, where the main bulk of buffering is performed in VOQs maintained at the ingress linecards.

Traditional buffered crossbars distribute the per-input schedulers over the N ingress linecards. Figure 2(a) depicts the architecture with distributed input schedulers and traditional credit-based flow control [8] [10]. An alternative architecture, first proposed in [15], is to place all input schedulers inside the crossbar chip; see Fig. 2(b). In this scheme, the ingress linecards communicate with the crossbar via a request-grant protocol. Every time that a cell arrives at a VOQ, the linecard issues a request to its corresponding input scheduler. Each input scheduler maintains its outstanding requests using N per-flow request counters, where a flow is defined as a distinct input/output crossbar pair; it also maintains the available space at the crosspoint buffers along its row using N per-flow credit counters, initialized at B, the size of each crosspoint buffer measured in cells. Even though all input schedulers reside in the same chip, each one operates independently, serving one request per cell time, similar to the distributed system. (Eligible for service are flows with non-zero request and credit counters.) Upon serving a request, the input scheduler decrements by one the served request counter and the respective credit counter; in parallel, it issues a grant to its corresponding ingress linecard. Upon receiving the grant, the linecard immediately forwards the HOL cell of the granted VOQ to the crossbar. In the baseline scheme, without credit prediction, the credit counter is incremented by one when the cell departs from its crosspoint queue.

The data round-trip time in this request-grant protocol (i.e. the round-trip time that must be used when dimensioning each crosspoint buffer) is equal to the minimum delay (i.e. assuming that no contention is present) between consecutive reservations of the same credit. Thus the data round-trip time is one propagation delay, P, until the grant reaches the linecard, plus another P delay until the injected cell reaches the crossbar, plus the delay of the output link scheduling operation that releases the credit, plus the delay of the input scheduling operation that reuses the released credit. Observe that this data round-trip time includes a linecard-crossbar round-trip time (RTT), i.e. 2P; hence the buffering requirements of this scheme are similar to those of a traditional buffered crossbar. The credit prediction that we describe in the next section eliminates the RTT dependence.
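The request/credit bookkeeping of the baseline scheme can be sketched as follows. This is our own simplified model; the fixed-priority scan stands in for whatever service discipline an input scheduler actually implements, and all names are illustrative.

```python
# Sketch (ours) of one input scheduler of Fig. 2(b), baseline scheme
# without credit prediction: per-flow request and credit counters, one
# grant per cell time, eligibility = non-zero request AND credit counter.

N = 4   # switch radix (assumed)
B = 2   # crosspoint buffer size, in cells (assumed)

class InputScheduler:
    def __init__(self):
        self.requests = [0] * N   # outstanding requests, one counter per flow
        self.credits  = [B] * N   # available crosspoint space, per flow

    def request(self, j):
        """Linecard signals that a new cell for output j arrived at its VOQ."""
        self.requests[j] += 1

    def serve_one(self):
        """One scheduling operation per cell time; fixed-priority scan here."""
        for j in range(N):
            if self.requests[j] > 0 and self.credits[j] > 0:
                self.requests[j] -= 1   # request consumed
                self.credits[j]  -= 1   # crosspoint slot reserved
                return j                # grant sent back to the linecard
        return None                     # no eligible flow this cell time

    def credit_back(self, j):
        """Baseline: credit returns when the cell departs crosspoint queue."""
        self.credits[j] += 1

s = InputScheduler()
s.request(1); s.request(1); s.request(1)
grants = [s.serve_one() for _ in range(3)]
assert grants == [1, 1, None]   # only B = 2 grants until a credit returns
s.credit_back(1)
assert s.serve_one() == 1
```

The third request stalls until a credit returns, which in the baseline takes a full data round-trip time; credit prediction, described next, shortens exactly this waiting period.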

Fig. 3. The schedulers needed for credit prediction in buffered crossbars; the input schedulers reserve crosspoint buffer credits as in Fig. 2(b); the virtual output schedulers predict the selections of the output link schedulers.

2.1 Credit Prediction

In this section we incorporate "logic" inside the crossbar chip of Fig. 2(b), in order to predict the departures of cells from crosspoint queues one RTT before they actually occur. Using this knowledge, an input scheduler is able to reuse a credit that it has just now reserved for a cell, without having to wait for the RTT that it will take until the cell reaches the crossbar.

Credit prediction uses N² virtual crosspoint counters that are served by N virtual output schedulers; see Fig. 3. There is a one-to-one correspondence between the N² virtual crosspoint counters and the N² crosspoint queues, and between the N virtual output schedulers and the N output link schedulers⁴. Each virtual output scheduler operates at the same rate (one new selection per cell time) and implements the same service discipline as the (data) output link scheduler that schedules among the N crosspoint queues along the corresponding output of the crossbar.

The prediction mechanism operates as follows. Consider a grant, g_ij, selected by the grant scheduler for input i at time⁵ t. At this time, g_ij is sent to ingress linecard i, in order to trigger the injection of cell c into the crossbar; in parallel, virtual crosspoint counter i→j is incremented by one, anticipating the enqueue of cell c into crosspoint queue i→j, which will occur after an RTT (= 2P) delay in the future. At time t + 1, virtual output scheduler j selects one among the non-zero virtual crosspoint counters k→j, k ∈ [1, N], anticipating the crosspoint selection of output link scheduler j that will take place at time t + RTT + 1. After this service, the virtual output scheduler decrements the selected virtual counter by one, and increments the corresponding credit counter by one. Thus, if the selected crosspoint counter corresponds to flow i→j, the credit reserved for grant g_ij will be available to input scheduler i at time t + 2⁶. The input scheduler may use this (predicted) credit immediately at time t + 2, thus triggering the arrival of a new cell, c1, at crosspoint queue i→j for time t + RTT + 2. As we will show next, because virtual output scheduler j selects crosspoint counter i→j at time t + 1, we can be certain that output link scheduler j will read a cell from crosspoint queue i→j after an RTT delay, i.e. at time t + RTT + 1; hence, when cell c1 arrives at the crossbar, it will definitely find a free slot in crosspoint queue i→j.

⁴ The N virtual output schedulers operate independently of each other, just as the N output link schedulers.
⁵ We measure time in cell times.
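The prediction machinery for one output can be sketched as follows. This is our own illustration, not the paper's implementation: round-robin is just one admissible service discipline, and all names are ours.

```python
# Sketch (ours) of the virtual side of credit prediction for one output j:
# virtual crosspoint counters incremented at grant time, and a virtual
# output scheduler (round-robin here) that releases credits one RTT early.

N = 4                  # switch radix (assumed)
vxp = [0] * N          # virtual crosspoint counters k->j
credits_released = []  # (input, time) pairs: predicted credit releases
rr = 0                 # round-robin pointer of the virtual output scheduler

def on_grant(i, t):
    """Grant g_ij sent to linecard i at time t: anticipate the enqueue of
    cell c into crosspoint queue i->j, which happens at t + RTT."""
    vxp[i] += 1

def virtual_serve(t):
    """One selection per cell time, mirroring the selection that output
    link scheduler j will make at time t + RTT."""
    global rr
    for k in range(N):
        i = (rr + k) % N
        if vxp[i] > 0:
            vxp[i] -= 1
            rr = (i + 1) % N
            credits_released.append((i, t + 1))  # credit reusable at t + 1
            return i
    return None

on_grant(0, t=0)
on_grant(2, t=0)
assert virtual_serve(t=1) == 0
assert virtual_serve(t=2) == 2
assert virtual_serve(t=3) is None        # nothing left to predict
# without prediction, these credits would only return after a full RTT
assert credits_released == [(0, 2), (2, 3)]
```

The key point the sketch shows is that the credit for a grant issued at time t can already be handed back to the input scheduler at time t + 2, independent of the linecard-crossbar RTT.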

To see why credit prediction works, consider the following two conditions: (condition 1) the non-zero virtual crosspoint counters for output j at time t + 1 correspond to the only non-empty crosspoint queues for output j at time t + RTT + 1; (condition 2) the internal state of virtual output scheduler j at time t + 1 is the same as that of output link scheduler j at time t + RTT + 1. It is obvious that if both conditions hold, then the crosspoint counter, x, that virtual output scheduler j serves at time t + 1 (null if it serves none) will correspond to the crosspoint queue, y, that output link scheduler j serves at time t + RTT + 1, since the two schedulers implement the same service discipline.

Therefore, for credit prediction to work correctly, we need to enforce conditions 1 and 2. Initially (t = 0) all virtual crosspoint counters are zero, and all crosspoint queues are empty. From that point on, if a virtual crosspoint counter is incremented by one, a cell will definitely be enqueued into the respective crosspoint queue after an RTT delay; it follows that condition 1 holds for t = 0. We can ascertain that condition 2 also holds for t = 0, by programming each virtual output scheduler to start at the same state as its corresponding output link scheduler. Now, if virtual output scheduler j and its corresponding link scheduler serve the same crosspoints at times t and t + RTT, respectively, the two conditions will continue to hold for times t + 1 and t + RTT + 1. It follows that conditions 1 and 2 hold continuously, and therefore the two schedulers serve the "same" crosspoints with a delay lag of RTT cell times, as needed for credit prediction purposes.

Returning to the previous example, the link scheduler for output j will, in general, read cell c from crosspoint queue i→j at time t + RTT + v, v ∈ ℕ⁺. With credit prediction, virtual output scheduler j will have predicted that future event already from time t + v. Hence, at cell time t + v + 1, we can safely increment by one the credit counter for crosspoint i→j, and reuse the space reserved for cell c to generate a new i→j grant: at time t + RTT + v + 1, when the cell utilizing this new grant arrives at crosspoint queue i→j, cell c will have just departed from that queue.
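The induction argument above can be exercised with a small simulation of our own: a virtual scheduler fed at grant time and a link scheduler fed one RTT later, both round-robin and starting in the same state, must make identical selections RTT cell times apart. All parameters and names here are assumptions for illustration.

```python
# Check (ours) of conditions 1-2: under the same discipline and the same
# initial state, the output link scheduler replays the virtual output
# scheduler's decisions with a lag of exactly RTT cell times.

import random

RTT, N, T = 4, 3, 40
random.seed(1)  # any grant pattern works; seeded for reproducibility

class RoundRobin:
    def __init__(self):
        self.p = 0
    def pick(self, occ):
        """Serve the first non-zero entry at or after the pointer."""
        for k in range(N):
            i = (self.p + k) % N
            if occ[i] > 0:
                occ[i] -= 1
                self.p = (i + 1) % N
                return i
        return None  # idle selections do not advance the pointer

virt, link = RoundRobin(), RoundRobin()  # same discipline, same start state
vcnt  = [0] * N   # virtual crosspoint counters (incremented at grant time)
queue = [0] * N   # crosspoint queues (cells arrive one RTT after the grant)
grants, vsel, lsel = {}, {}, {}

for t in range(T):
    for i in grants.get(t - RTT, []):    # condition 1: delayed enqueues
        queue[i] += 1
    grants[t] = [i for i in range(N) if random.random() < 0.4]
    for i in grants[t]:
        vcnt[i] += 1
    vsel[t] = virt.pick(vcnt)            # prediction, made at time t
    lsel[t] = link.pick(queue)           # actual service, at time t

# the link scheduler's choice at t equals the virtual choice at t - RTT
assert all(lsel[t] == vsel[t - RTT] for t in range(RTT, T))
```

Because idle selections leave the round-robin pointer untouched, the two schedulers traverse identical state sequences, which is exactly condition 2 of the argument.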

2.2 Data Round-Trip Time under Credit Prediction

Using credit prediction, the data round-trip time equals the delay of a request going through input and virtual output scheduling. Because the input schedulers reside in the same chip with the (virtual) output schedulers, we can estimate the data round-trip time by 2D, where D is the delay incurred in each individual scheduling operation, i.e. input or output scheduling. Observe that in order to keep the lines of the crossbar busy, each scheduler needs to perform one operation in every cell time. To sustain this rate, scheduling operations must have a delay D ≤ 1 cell time. Thus, assuming that D = 1 cell time⁷, the system will operate robustly with just two (2) cells of buffer space per crosspoint; even 1-cell buffer space per crosspoint suffices given that D ≤ 1/2 cell time. In any case, the buffer space needed depends on the scheduling (input-output) round-trip time: it is independent of the propagation delay (P) between the linecards and the crossbar⁸.

⁶ Observe that without credit prediction, this credit would be available to input scheduler i at time t + RTT + 2.

Fig. 4. Scheduling flow 1→1 in a buffered crossbar employing credit prediction; P = 2 cell times, B = 2 cells; with dark gray we mark input scheduling operations that reserve the "same credit", α.

2.3 Scheduling Process Example

Figure 4 depicts the timing of scheduling operations under credit prediction. In this example, (i) only flow 1→1 is active (assume that the only cells in the system are back-to-back cells that arrive at input 1 and head to output 1), (ii) D = 1 cell time, (iii) P = 2 cell times, and (iv) B = 2 cells. Without credit prediction, the buffer space per crosspoint required in order to sustain full throughput for flow 1→1 is six (6) cells. In our example, we have only two credits available for crosspoint queue 1→1. Name one of these credits α and the other β. As shown in the figure, input 1 issues a new request for output 1 in every new cell time, i.e. every time that a new 1→1 cell arrives. The first request arrives at the crossbar at the beginning of cell time 3. The input scheduler uses credit α to serve that request at the end of cell time 3 (the actual scheduling operation takes place in cell time 3), and forwards a grant back to the linecard at the beginning of cell time 4; at the same time (beginning of cell time 4), it increments by one the virtual crosspoint counter 1→1. The virtual output scheduler selects that counter at the end of cell time 4, and at the beginning of cell time 5 it increments by one the credit counter of flow 1→1. After this event, credit α is made available again, and the input scheduler uses it to generate a new grant at the end of cell time 5. Thus, with credit α alone, the input scheduler serves flow 1→1 every second (odd) cell time; during the intervening (even) cell times, the input scheduler serves flow 1→1 using credit β. Effectively, flow 1→1 gets served in every new cell time, i.e.

⁷ By setting D = 1 cell time, we relax the timing constraints of the scheduling subsystem.
⁸ By comparison, the data round-trip time of a traditional buffered crossbar is 2D + 2P.



Fig. 5. Scheduling flow 1→1 in a traditional (distributed) buffered crossbar; P = 2 cell times, B = 2 cells; with dark gray we mark input scheduling operations that reserve the "same credit", α.


Fig. 6. Scheduling flows 1→1 and 2→1 in a buffered crossbar employing credit prediction; we assume that both request counters 1→1 and 2→1 are non-zero from the beginning of cell time 1; P = 2 cell times, B = 2 cells.

at full line rate. For comparison, Fig. 5 depicts the timing of scheduling operations in a traditional buffered crossbar, with D = 1 cell time, P = 2 cell times, and B = 2 cells. Here, credit α can be reserved for a new cell every 2P + 2D (= 6) cell times. Thus flow 1→1 can only reach a throughput of 2/6, i.e. 0.33.
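The two throughput figures of the example can be checked with simple arithmetic, treating each credit as reusable once per "credit cycle":

```python
# Back-of-the-envelope check of the example of Figs. 4 and 5, under the
# stated parameters: D = 1 cell time, P = 2 cell times, B = 2 cells.

D, P, B = 1, 2, 2

cycle_traditional = 2 * D + 2 * P   # grant down + cell up + two schedulings
cycle_predicted   = 2 * D           # input + virtual output scheduling only

# per-flow throughput = credits reusable per cycle, capped at line rate
thr_traditional = min(1.0, B / cycle_traditional)
thr_predicted   = min(1.0, B / cycle_predicted)

assert cycle_traditional == 6
assert abs(thr_traditional - 2 / 6) < 1e-9   # the 0.33 of the example
assert thr_predicted == 1.0                  # full line rate with B = 2
```

The same arithmetic also shows why the traditional design needs B = 2D + 2P = 6 cells per crosspoint to sustain full throughput in this example, while credit prediction needs only B = 2D = 2.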

Figure 6 depicts the scheduling operations under credit prediction when output 1 is requested by two inputs, input 1 and input 2, at the same time. Observe that the virtual scheduler for output 1 always serves the same input (crosspoint) that the link scheduler at output 1 will serve four (4) cell times later in the future. Effectively, credit prediction works properly, and the occupancy of crosspoint buffers is always at most 2 cells.

2.4 Eliminating the Output Link Schedulers

Since each output link scheduler always selects the crosspoint which its corresponding virtual output scheduler has served one RTT earlier, there is no reason to use output link schedulers. Instead, the output lines of the crossbar can directly implement the "program" computed by the virtual output schedulers. To that end, we employ N scheduling FIFOs, one for each output, that store the identifiers of the crosspoints selected by the virtual output schedulers. When the virtual scheduler for output j serves a virtual crosspoint counter, it enqueues the ID of the selected crosspoint into the scheduling FIFO for output j. When the j-th output line of the crossbar is idle, it waits for the crosspoint queue pointed by the head entry in the corresponding scheduling FIFO to become non-empty; when this occurs, the output starts reading the HOL cell from the pointed crosspoint buffer, and afterwards dequeues the head entry from the scheduling FIFO. Observe that each scheduling FIFO needs to store crosspoint identifiers for one RTT worth of cells.
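The scheduling-FIFO mechanism above can be sketched as follows; the class and method names are ours, introduced only for illustration:

```python
from collections import deque

# Per-output "scheduling FIFO" replacing the output link scheduler:
# the virtual output scheduler enqueues the ID of each crosspoint it
# serves, and the output line drains crosspoint queues strictly in
# that order.
class OutputLine:
    def __init__(self):
        self.sched_fifo = deque()   # crosspoint IDs, in service order
        self.xpoint_queues = {}     # crosspoint ID -> deque of cells

    def on_virtual_schedule(self, xpoint_id):
        # Called when the virtual output scheduler serves the
        # virtual crosspoint counter of `xpoint_id`.
        self.sched_fifo.append(xpoint_id)

    def next_cell(self):
        # Called when the output line is idle: wait until the queue
        # pointed by the head entry is non-empty, then forward its
        # HOL cell and dequeue the head entry.
        if not self.sched_fifo:
            return None
        head = self.sched_fifo[0]
        q = self.xpoint_queues.get(head)
        if not q:
            return None             # cell still in flight; keep waiting
        self.sched_fifo.popleft()
        return q.popleft()
```

Note how the output may have to wait: the virtual scheduler runs one RTT ahead of the data, so a head entry can point at a crosspoint queue whose cell has not yet arrived from the linecard.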

2.5 Placement of Input and Output Schedulers

It is possible to place the per-input schedulers together with the per-output virtual schedulers inside a separate control chip, effectively removing complexity from the buffered crossbar chip. The control chip will accept requests and will issue grants, like the schedulers that are nowadays used in bufferless crossbars. If we remove the output link schedulers, and use "scheduling FIFOs" instead, the control chip will need to communicate the per-output crosspoint selections to the crossbar; otherwise, no such communication between the control chip and the crossbar chip is needed.

3. CREDIT PREDICTION WHEN DOWNSTREAM BACKPRESSURE IS PRESENT

Our description of credit prediction assumes that the HOL cells in crosspoint queues can always be forwarded to the outputs. This may not hold, however, if these cells are blocked due to backpressure exerted on the output ports of the crossbar. In this section, we modify credit prediction to account for credit-based backpressure, exerted upon fabric-output ports from nodes in the downstream direction. It makes no difference whether these nodes are the egress linecards of the present switch or the ingress linecards of the downstream neighbor: when backpressure is present, credit prediction, as described so far, is not valid: the fact that a crosspoint queue will be non-empty at time t + RTT no longer guarantees that the output link scheduler can select that queue at that time, since departures from crosspoint queues can be blocked at any time. We can work around this problem if, instead of examining the downstream backpressure state at the output ports of the fabric, i.e. for cells that have already reached their crosspoint queue, we consult downstream backpressure before issuing new grants. In this way we can ascertain that the cells that arrive into the crossbar will always have downstream buffer space reserved, hence these will never need to block in crosspoint queues.

Each virtual output scheduler maintains a downstream credit

counter, used for external backpressure purposes. Up to now, a virtual output scheduler could select a virtual crosspoint counter (hence increment the credit counter for the corresponding crosspoint) as long as this was non-zero. To account for external backpressure, we require that the downstream credit counter also be non-zero. (The downstream credit counter is decremented after serving a virtual crosspoint counter, and is incremented when credits from the downstream node reach the virtual output scheduler^9.) In this way we guarantee that

^9 Observe that this method increases the effective round-trip time pertaining to the downstream (external) backpressure by the (internal) scheduler-linecard round-trip time.



[Figure 7 plot omitted: mean queueing delay (cell times, logarithmic scale from 0.01 to 1000) versus normalized input load (0.2 to 1); the curves of the traditional buffered crossbar and of the buffered crossbar with credit prediction coincide, under both Bernoulli and bursty (12) arrivals.]

Fig. 7. Performance for N= 32, P= 0, D= 1/2 cell times, and B= 2 cells; uniformly-destined, Bernoulli and bursty (abs= 12 cells) cell arrivals; only the queueing delay is shown, excluding all fixed scheduling delays.

for every (predicted) credit that returns to the input scheduler, there is buffer space reserved in the corresponding downstream buffer. Hence the cells that use such (predicted) credits will have downstream buffer space reserved when they reach the crossbar, and will thus never need to block in crosspoint queues.

Now observe that there are no such downstream credits reserved for the cells that the input schedulers can inject into the crossbar at start time: for each output, say output j, the N input schedulers can generate a total of N x B grants using their initial pool of credits, i.e. without consulting virtual output scheduler j first. The cells that will use these grants will not have downstream buffer credits reserved when they arrive at the crossbar. We can circumvent this problem if the downstream buffer has some extra space specifically allocated for these N x B cells^10. Obviously, the downstream credit counter maintained by each virtual output scheduler needs to be N x B smaller than the actual number of cells that fit in the corresponding downstream buffer.
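The modified virtual output scheduler can be sketched as follows. This is our own illustrative rendering (round-robin selection is assumed, as in the simulations of Section 4; all names are ours): a virtual crosspoint counter may be served only if it is non-zero and a downstream credit is available, and the downstream counter starts N x B credits below the real downstream buffer size, reserving space for the cells injected from the initial credit pool.

```python
# Sketch of a virtual output scheduler gated by downstream credits.
class VirtualOutputScheduler:
    def __init__(self, n_inputs, B, downstream_buffer_cells):
        self.vxp = [0] * n_inputs    # virtual crosspoint counters
        # Start N*B below the real buffer size: space for the cells
        # granted from the initial credit pool.
        self.downstream_credits = downstream_buffer_cells - n_inputs * B
        self.rr = 0                  # round-robin pointer

    def serve(self):
        # Serve one non-zero virtual crosspoint counter, round-robin,
        # only if a downstream credit is available.
        if self.downstream_credits == 0:
            return None
        n = len(self.vxp)
        for k in range(n):
            i = (self.rr + k) % n
            if self.vxp[i] > 0:
                self.vxp[i] -= 1
                self.downstream_credits -= 1
                self.rr = (i + 1) % n
                return i             # crosspoint selected for this output
        return None

    def on_downstream_credit(self):
        # A credit arrived from the downstream node.
        self.downstream_credits += 1
```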

4. PERFORMANCE SIMULATION RESULTS

This section compares by simulation the performance of a buffered crossbar with credit prediction to that of a traditional (distributed) buffered crossbar. We use plain, pointer-based round-robin schedulers in both systems.

4.1 Delay Performance

First, we compare the delay performance of the two systems under uniformly-destined traffic. We assume zero propagation delay (P= 0), and delay per single-resource scheduler, D= 1/2 cell times; thus the data round-trip time in both systems is 1 cell time. Accordingly, we set the crosspoint buffer size, B= 1 cell. Figure 7 depicts mean queueing delay under Bernoulli cell arrivals, and under bursty cell arrivals with average burst size (abs) equal to 12 cells. As can be seen, the two systems perform identically.

^10 Say that N= 64, that B= 2 cells, and that the cell size is 64 bytes; then, the extra buffer space needed per downstream buffer is 64 Kbits, or 8 KBytes.
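A quick check of the footnote's arithmetic, with its own values:

```python
# Extra downstream buffer space reserved for the N*B start-time cells.
N, B, cell_bytes = 64, 2, 64
extra_bytes = N * B * cell_bytes     # 64 inputs * 2 cells * 64 bytes
assert extra_bytes == 8 * 1024       # 8 KBytes
assert extra_bytes * 8 == 64 * 1024  # 64 Kbits
```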

[Figure 8 plots omitted: normalized throughput versus unbalance factor w (0 to 1); panel (a) is labeled "crosspoint buffer, B= 2 cells (NO credit prediction)", with separate curves for the rtt values examined; panel (b) is labeled "crosspoint buffer, B= 2 cells (WITH credit prediction)", with matching plots for rtt= 2, 12, 32 and 102.]

Fig. 8. Throughput performance for varying rtt under unbalanced traffic; B= 2 cells; full input load; (a) traditional buffered crossbar; (b) buffered crossbar employing credit prediction.

4.2 Throughput for B=2 cells & increasing RTT

In this experiment, we set B= 2 cells, and we measure switch throughput under unbalanced Bernoulli cell arrivals, for varying linecard-crossbar RTTs. By rtt we denote the data round-trip time of a traditional buffered crossbar. We have plots for rtt= 1 (RTT= 0, D= 1/2), rtt= 2 (RTT= 0, D= 1), rtt= 12 (RTT= 10, D= 1), rtt= 32 (RTT= 30, D= 1), and rtt= 102 (RTT= 100, D= 1). As in [6], w controls traffic imbalance; traffic is uniformly-destined when w= 0, and completely unbalanced, i.e. consists only of the non-conflicting input-output (j->j) flows, when w= 1.
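The unbalanced traffic model can be sketched as follows; this is our reading of the model of [6] (the function name is ours): input i concentrates a fraction w of its load on output i and spreads the remainder uniformly over all N outputs.

```python
# Per-flow offered rate in the unbalanced traffic model of [6]
# (our reading): the "own" output i gets the unbalanced share w
# plus its uniform share; every other output gets the uniform
# share (1-w)/N only.
def rate(i, j, N, w, load=1.0):
    if i == j:
        return load * (w + (1.0 - w) / N)
    return load * (1.0 - w) / N

N = 32
# w = 0: uniformly-destined traffic, 1/N of the load per output.
assert abs(rate(0, 0, N, 0.0) - 1.0 / N) < 1e-12
# w = 1: completely unbalanced, only the j->j flows remain.
assert rate(0, 0, N, 1.0) == 1.0 and rate(0, 1, N, 1.0) == 0.0
# For any w, each input's total offered load sums to `load`.
assert abs(sum(rate(0, j, N, 0.37) for j in range(N)) - 1.0) < 1e-12
```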

Figure 8(a) depicts the performance of a traditional buffered crossbar, which is denoted by NO credit prediction. As can be seen, the performance of this system is satisfactory only for rtt= 1 or 2 cell times; for larger rtt values, performance declines. For rtt= 12 or 32 cell times, the system achieves full throughput under uniformly-destined traffic (w= 0). Under uniform traffic, the load for any particular output, j, comes evenly from all inputs, thus all (32) crosspoint buffers along the respective (j-th) output of the crossbar are utilized. This combined buffer space equals 64 cells (32 x 2 cells), and can accommodate any rtt < 64 cell times. But with increasing w, the load for output j gradually concentrates on input j, thus less buffer space is being utilized per output; in the extreme case, when w= 1, the 2-cell crosspoint buffer j->j carries the total output load, which explains why rtt12 and rtt32 yield normalized throughputs of 0.166 and 0.062, respectively, i.e. equal to 2/rtt. For rtt= 102, the available buffer space does not suffice to sustain full switch throughput for any w value: rtt102 yields a throughput of 2 x 32/102 (= 0.627) when traffic is uniformly-destined (w= 0), and a throughput of 2/102 (= 0.019) when traffic is completely unbalanced (w= 1).

Figure 8(b) depicts the performance of the buffered crossbar with credit prediction. As can be seen, credit prediction yields high throughput for any rtt.
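The throughput figures quoted above follow from a simple cap, which we can verify numerically. Our assumed model: without credit prediction, a flow's throughput is bounded by the crosspoint buffer space actually in use along its output, divided by the data round-trip time.

```python
# Throughput cap of a traditional buffered crossbar: the buffers in
# use along an output can cover at most `buffers_in_use * B` cells
# per rtt cell times.
def capped_throughput(buffers_in_use, B, rtt):
    return min(1.0, buffers_in_use * B / rtt)

N, B = 32, 2
# Uniform traffic (w=0): all N crosspoint buffers of an output are used.
assert abs(capped_throughput(N, B, 102) - 64 / 102) < 1e-12   # ~0.627
# Completely unbalanced (w=1): a single 2-cell buffer carries the load.
assert abs(capped_throughput(1, B, 12) - 2 / 12) < 1e-12      # ~0.166
assert abs(capped_throughput(1, B, 32) - 2 / 32) < 1e-12      # 0.0625
assert abs(capped_throughput(1, B, 102) - 2 / 102) < 1e-12    # ~0.019
```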

4.3 Throughput for "B = RTT" & increasing RTT

Next, we repeat the previous experiment, but now we set the crosspoint buffer size, B, equal to one rtt worth of traffic. By doing so, the system with no credit prediction operates robustly. As can be seen in Fig. 9, with increasing rtt, and thus with increasing B, we witness similar throughput

[Figure 9 plots omitted: normalized throughput versus unbalance factor w (0 to 1); panel (a) is labeled "crosspoint buffer size, B = rtt (NO credit prediction)" and panel (b) "crosspoint buffer size, B = rtt (WITH credit prediction)", each with curves for rtt= 2, 12 and 32.]

Fig. 9. Throughput performance for varying rtt under unbalanced traffic; in each plot, B equals one rtt worth of traffic; (a) traditional buffered crossbar; (b) buffered crossbar employing credit prediction.

improvements in both systems, although with credit prediction the improvement is marginally better.

5. CONCLUSIONS

We presented credit prediction, a method that renders the buffer space requirements of buffered crossbars independent of the round-trip time between the ingress linecards and the crossbar core. With our scheme, the crosspoint buffer size depends only on the scheduling round-trip time, which can be made as small as one or two cell times if scheduling is performed centrally. We also showed how to make credit prediction work when the output ports of the crossbar are subject to external backpressure. Performance simulation results demonstrated that our scheme performs identically with traditional buffered crossbar systems when the linecard-crossbar round-trip time equals zero, while it achieves significant buffer savings as this round-trip time increases. Effectively, with our scheme we can build effective buffered crossbar switches, using crosspoint buffers as small as one or two cells each.

6. ACKNOWLEDGMENTS

This work was supported by the European Commission in the context of the SARC (Scalable Computer Architecture) integrated project #27648 (FP6), and the HiPEAC network of excellence. This work was also supported by an IBM Ph.D. Fellowship.

REFERENCES

[1] T. Anderson, S. Owicki, J. Saxe, C. Thacker: "High-Speed Switch Scheduling for Local-Area Networks", ACM Trans. on Computer Systems, vol. 11, no. 4, Nov. 1993, pp. 319-352.

[2] R. LaMaire, D. Serpanos: "Two-Dimensional Round-Robin Schedulers for Packet Switches with Multiple Input Queues", IEEE/ACM Trans. on Networking, vol. 2, no. 5, Oct. 1994, pp. 471-482.

[3] N. McKeown: "The iSLIP Scheduling Algorithm for Input-Queued Switches", IEEE/ACM Trans. on Networking, vol. 7, no. 2, April 1999.

[4] P. Krishna, N. Patel, A. Charny, R. Simcoe: "On the Speedup Required for Work-Conserving Crossbar Switches", IEEE Journal on Selected Areas in Communications (JSAC), vol. 17, no. 6, June 1999, pp. 1057-1066.

[5] D. Stephens, H. Zhang: "Implementing Distributed Packet Fair Queueing in a Scalable Switch Architecture", Proc. IEEE INFOCOM Conf., San Francisco, CA, March 1998, pp. 282-290.

[6] R. Rojas-Cessa, E. Oki, H. Jonathan Chao: "CIXOB-k: Combined Input-Crosspoint-Output Buffered Switch", Proc. IEEE GLOBECOM'01, vol. 4, pp. 2654-2660.

[7] N. Chrysos, M. Katevenis: "Weighted Fairness in Buffered Crossbar Scheduling", Proc. IEEE HPSR'03, Torino, Italy, pp. 17-22; http://archvlsi.ics.forth.gr/bufxbar/

[8] F. Abel, C. Minkenberg, R. Luijten, M. Gusat, I. Iliadis: "A Four-Terabit Packet Switch Supporting Long Round-Trip Times", IEEE Micro Magazine, vol. 23, no. 1, Jan./Feb. 2003, pp. 10-24.

[9] K. Yoshigoe, K. Christensen: "A Parallel-Polled Virtual Output Queued Switch with a Buffered Crossbar", Proc. IEEE Workshop High Perf. Switching & Routing (HPSR 2001), Dallas, TX, USA, May 2001, pp. 271-275.

[10] M. Katevenis, G. Passas, D. Simos, I. Papaefstathiou, N. Chrysos: "Variable Packet Size Buffered Crossbar (CICQ) Switches", Proc. IEEE ICC'04, Paris, France, vol. 2, pp. 1090-1096; http://archvlsi.ics.forth.gr/bufxbar

[11] K. Yoshigoe: "The CICQ Switch with Virtual Crosspoint Queues for Large RTT", Proc. IEEE ICC'06, Istanbul, Turkey, June 2006.

[12] K. Yoshigoe: "Rate-based Flow-control for the CICQ Switch", Proc. IEEE LCN'05, Sydney, Australia, November 2005, pp. 44-50.

[13] R. Rojas-Cessa, Z. Dong, Z. Guo: "Load-Balanced Combined Input-Crosspoint Buffered Packet Switch and Long Round-Trip Times", IEEE Communications Letters, July 2005.

[14] N. Chrysos, M. Katevenis: "Scheduling in Switches with Small Internal Buffers", Proc. IEEE Globecom'05, St. Louis, MO, USA, 28 Nov. - 2 Dec. 2005; http://archvlsi.ics.forth.gr/bpbenes

[15] N. Chrysos: "Design Issues of Variable-Packet-Size, Multiple-Priority Buffered Crossbars", TR-325, Inst. of Computer Science, FORTH, Crete, Greece, October 2003.
