
Input Queue Switch Technologies

Speaker : Kuo-Cheng Lu

N300/CCL/ITRI

Outline

• Overview of switching fabric technologies

• Input queue scheduling algorithms

• Scheduling for quality of service

• Multicast scheduling

• Tiny Tera project

• PRIZMA project

• Conclusion

Evolution of Router/Switch

Fabric : two main categories

Why is the input queue switch less efficient?

• Head of Line Blocking (limited throughput)

• Input contention (difficult to control cell delay)

[Figure: HOL Blocking in an input queue switch vs. an output queue switch]

[Figure: Input Contention in an input queue switch vs. an output queue switch]

Solving HOL blocking with VOQ

• Virtual Output Queuing (VOQ) achieves 100% throughput with a suitable scheduling algorithm

[Figure: Input Queue Switch (VOQ) vs. Output Queue Switch cell flow]

*Still can't get the desired cell output sequence due to input contention!

Control cell delay by speedup

• Moderate speedup with a suitable scheduling algorithm to control cell delay

• Needs an output buffer

– CIOQ (combined input-output queue) switch

Remark

• VOQ can avoid HOL blocking and provide 100% throughput, but needs a complex scheduling algorithm

• m-time speedup (m<N) can reduce HOL blocking and input/output contention; with a suitable scheduling algorithm it approaches the performance of an output queue switch

• n-time speedup (n=2) with VOQ can emulate output queuing (Nick McKeown)

Input Queue Scheduling

Input Queue Scheduling Algorithms

• First Goal : 100% throughput under admissible input traffic

• Second Goal : Control cell transfer delay

• Methods : find a matching

– maximum matching

– maximal matching

– maximum/maximal weight matching

– stable matching

...using VOQ and/or moderate speedup!

• *Admissible : Sum_J(Lambda(I,J)) < 1 for every input I, and Sum_I(Lambda(I,J)) < 1 for every output J

• *Stable : Q(I,J) < infinity => (by definition) 100% throughput
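The admissibility condition translates directly into row and column sums of the rate matrix. A minimal sketch (the example matrices are illustrative, not from the slides):

```python
# Admissibility check: lam[i][j] is the arrival rate of the VOQ from
# input i to output j; traffic is admissible when every row sum (per
# input) and every column sum (per output) is strictly less than 1.
def admissible(lam):
    n = len(lam)
    rows_ok = all(sum(lam[i][j] for j in range(n)) < 1 for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) < 1 for j in range(n))
    return rows_ok and cols_ok

uniform = [[0.2] * 4 for _ in range(4)]     # each row/column sums to 0.8
overloaded = [[0.3] * 4 for _ in range(4)]  # each row/column sums to 1.2
print(admissible(uniform))     # True
print(admissible(overloaded))  # False
```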

Maximum or Maximal Matching

• Maximum matching

– Maximizes instantaneous throughput

– Starvation possible

– Time complexity is very high

• Maximal matching

– Can't add any connection to the current match without altering existing connections

– More practical (e.g. WFA, PIM, iSLIP, DRR, RRM)

[Figure: a request graph, a maximum matching, and a maximal matching]
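The difference between the two can be seen in a few lines of code. A sketch of a greedy maximal matcher (the request set is a made-up example):

```python
# A maximal matching adds connections greedily: once added, a connection
# is never removed, and we stop when no further edge can be added.
def maximal_matching(requests):
    """requests: set of (input, output) pairs. Returns a maximal match."""
    matched_in, matched_out, match = set(), set(), []
    for i, j in sorted(requests):          # fixed scan order for determinism
        if i not in matched_in and j not in matched_out:
            match.append((i, j))
            matched_in.add(i)
            matched_out.add(j)
    return match

# Classic example: greedy picks (0,0) first, which blocks the size-2
# matching {(0,1), (1,0)} - the result is maximal but not maximum.
reqs = {(0, 0), (0, 1), (1, 0)}
print(maximal_matching(reqs))  # [(0, 0)] : maximal, size 1; maximum is size 2
```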

Parallel Iterative Matching

[Figure: one PIM cell time on a 4x4 switch: Requests, then Grant by random selection at each output, then Accept/Match by random selection at each input; iterations #1 and #2 shown]

PIM Performance

E[C] <= log2(N) + 4/3

E[U_i] <= N^2 / 4^i

where:

C = # of iterations required to resolve connections

N = # of ports

U_i = # of unresolved connections after iteration i
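A software sketch of the PIM request/grant/accept loop (the request set and port count are illustrative; real PIM resolves grants and accepts with hardware random arbiters):

```python
import random

# One PIM cell time: in each iteration, every unmatched output grants a
# random request from an unmatched input, and every input with grants
# accepts one at random; repeat until no new matches are added or the
# iteration budget runs out.
def pim(requests, n, iterations=4, rng=random):
    match = {}                               # input -> output
    matched_out = set()
    for _ in range(iterations):
        # Grant: each unmatched output picks one requesting unmatched input.
        grants = {}                          # input -> list of granting outputs
        for j in range(n):
            if j in matched_out:
                continue
            reqs = [i for (i, jj) in requests if jj == j and i not in match]
            if reqs:
                grants.setdefault(rng.choice(reqs), []).append(j)
        if not grants:
            break
        # Accept: each input with one or more grants picks one at random.
        for i, outs in grants.items():
            j = rng.choice(outs)
            match[i] = j
            matched_out.add(j)
    return match

random.seed(0)
m = pim({(0, 0), (0, 1), (1, 0), (2, 2)}, n=3)
print(m)  # a valid match pairing inputs with distinct requested outputs
```

Note how input 2, the only requester of output 2, is always matched in the first iteration; contention only arises where requests overlap.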

iSLIP

iSLIP with multiple iterations

iSLIP Properties

• De-synchronization is the key to achieving high throughput

• Random under low load

• TDM under high load

• Lowest priority to MRU

• 1 iteration: fair to outputs

• Converges in at most N iterations; on average <= log2N

• Implementation: N priority encoders

• Up to 100% throughput for uniform traffic in one iteration (c.f. PIM can only achieve 63% throughput in one iteration)
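The rotating grant/accept pointers can be sketched as follows (a single-iteration model; the port count and the all-to-one request pattern are illustrative). Pointers advance only past a matched port, which is what de-synchronizes the outputs:

```python
# One-iteration iSLIP: round-robin grant pointers at the outputs and
# accept pointers at the inputs; a pointer moves one position beyond the
# port it just matched, and only on a successful accept.
def islip_one_iteration(requests, n, grant_ptr, accept_ptr):
    # Grant: each output picks the first requesting input at/after its pointer.
    grants = {}                                   # input -> list of outputs
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if (i, j) in requests:
                grants.setdefault(i, []).append(j)
                break
    # Accept: each input picks the first granting output at/after its pointer.
    match = {}
    for i, outs in grants.items():
        for k in range(n):
            j = (accept_ptr[i] + k) % n
            if j in outs:
                match[i] = j
                accept_ptr[i] = (j + 1) % n
                grant_ptr[j] = (i + 1) % n
                break
    return match

n = 4
gp, ap = [0] * n, [0] * n
reqs = {(0, 0), (1, 0), (2, 0), (3, 0)}   # all inputs backlogged for output 0
for cell_time in range(3):
    print(islip_one_iteration(reqs, n, gp, ap))
# prints {0: 0}, then {1: 0}, then {2: 0}: round-robin service of output 0
```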

iSLIP Performance

iSLIP Implementation

[Figure: iSLIP implementation: N programmable priority encoders exchange Grant and Accept signals, with per-port state and log2N-bit decisions]

DRRM (Dual Round Robin Matching)

DRRM Implementation

DRRM Performance

Wrap Wave Front Arbiter

[Figure: requests resolved into a match in N steps instead of 2N-1]

Wave Front Arbiter

[Figure: a 4x4 request matrix resolved into a match by the wavefront sweep]
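A sketch of the plain (non-wrapped) wavefront sweep over a 0/1 request matrix; the 2N-1 diagonals walked here are what the wrapped variant collapses to N steps:

```python
# Wavefront arbiter: sweep the anti-diagonals of the request matrix; a
# cell (i, j) wins if it has a request and neither its row nor column
# was taken by an earlier diagonal. Cells on one diagonal never share a
# row or column, so in hardware each diagonal resolves in one step.
def wavefront(R):
    n = len(R)
    row_free = [True] * n
    col_free = [True] * n
    match = []
    for d in range(2 * n - 1):            # 2N-1 diagonals without wrapping
        for i in range(n):
            j = d - i
            if 0 <= j < n and R[i][j] and row_free[i] and col_free[j]:
                match.append((i, j))
                row_free[i] = False
                col_free[j] = False
    return match

R = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(wavefront(R))  # [(0, 0), (1, 2), (2, 1)]
```

The sweep favors cells near the top-left diagonal; rotating the starting diagonal each cell time restores fairness, which is the role of the wrapped variant.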

WFA - Implementation

2x2 WFA

Maximum/Maximal Weight Matching

• 100% throughput for admissible traffic (uniform or non-uniform)

• Maximum Weight Matching

– OCF (Oldest Cell First): w = cell waiting time

– LQF (Longest Queue First): w = input queue occupancy

– LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port

• Maximal Weight Matching (practical algorithms)

– iOCF

– iLQF

– iLPF (comparators in the critical path of iLQF are removed)
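The LQF weighting can be illustrated with a greedy maximal weight matcher (a simplification of the iterative iLQF hardware; the occupancy matrix is made up):

```python
# Greedy maximal weight matching in the LQF spirit: sort candidate
# edges by weight (VOQ occupancy) and add each edge whose input and
# output are both still free. The real iLQF resolves this iteratively
# with request/grant/accept arbiters; this shows only the criterion.
def greedy_lqf(queue_len):
    """queue_len[i][j] = occupancy of the VOQ from input i to output j."""
    n = len(queue_len)
    edges = sorted(
        ((queue_len[i][j], i, j) for i in range(n) for j in range(n)
         if queue_len[i][j] > 0),
        reverse=True)
    used_in, used_out, match = set(), set(), []
    for w, i, j in edges:
        if i not in used_in and j not in used_out:
            match.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return match

Q = [[5, 0, 1],
     [4, 2, 0],
     [0, 0, 3]]
print(greedy_lqf(Q))  # [(0, 0), (2, 2), (1, 1)] : long queues served first
```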

iLQF - Implementation

iLPF - Implementation

*10 ns arbitration time for a 32x32 switch using a 0.25-micron CMOS process (Nick McKeown, IEEE Infocom 1998)

Scheduling for QoS

• WPIM (Weighted PIM)

• Prioritized iSLIP

• FARR (Fair Access Round Robin)

– a kind of maximal weight matching

– US Patent #5517495

• Output Queue Emulation (Nick McKeown, IEEE JSAC 1999)

– Speedup of 2 - 1/N is necessary and sufficient for OQ emulation

– using the CCF algorithm (Critical Cell First)

• Rate Controller Approach (Anna Charny, IWQOS'98)

– putting rate controllers at the input and output channels and using the OCF arbitration policy can provide a deterministic delay guarantee for speedup > 2 (the delay is a function of the leaky bucket parameters of a flow, the speedup, and N)

– incorporating the rate controllers into the arbiter with speedup >= 6 can approach OQ delay performance

WPIM

• Each iteration consists of four stages:

– Request : every unmatched input port sends a request to the destination of each active VOQ

– Mask : each output creates a per-input mask bit indicating whether that input has already transmitted as many cells as its credit to the output in the current frame

– Grant : from the requests that remain after the masking stage, the output port selects one at random and sends a grant signal to its originating input port

– Accept : every unmatched input port that receives one or more grants selects one with equal probability and notifies the corresponding output port

• Modification :

– allow each output port to clear all its mask bits when all of its incoming requests are masked and the output port remains unmatched

Prioritized iSLIP

• Request :

– input i selects the highest-priority nonempty queue for output j, with level Lij

• Grant :

– output j finds the highest request level L(j) = max(Lij); the output maintains a separate pointer for each level, and for inputs at the same level the arbiter uses the pointer GjL(j) and the normal iSLIP scheme to choose the input

• Accept :

– same scheme as Grant

FARR

• Each input selects the HOL cell of the highest-priority queue for each VOQ and sends the requests with extended timestamps (a timestamp prepended with its priority)

• Repeat the following steps R times:

– (1) each unmatched output that has any request from unmatched inputs grants the request with the smallest extended timestamp

– (2) each unmatched input that receives any grants accepts the one with the smallest extended timestamp

– (3) any accepted grants are added to the match
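The extended timestamp is just a priority prepended to a timestamp, which in software amounts to tuple comparison. A minimal sketch, assuming a lower Pri value means higher priority (the request values are illustrative):

```python
# Extended timestamps modeled as (priority, timestamp) tuples: Python's
# tuple comparison makes priority dominate, with the timestamp breaking
# ties, so min() picks the request FARR would grant.
requests = [
    {"input": 0, "pri": 1, "time": 18},
    {"input": 1, "pri": 0, "time": 23},
    {"input": 2, "pri": 0, "time": 12},
]

def ext_ts(r):
    return (r["pri"], r["time"])          # priority prepended to timestamp

winner = min(requests, key=ext_ts)
print(winner["input"])  # 2 : pri 0 beats pri 1, then time 12 beats time 23
```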

[Figure: FARR example with VOQ#0 and VOQ#1 of input #0 holding cells such as (Pri=0, time=23), (Pri=1, time=18), (Pri=0, time=12) destined for o/p#0 and o/p#1]

Output Queue Emulation (1/2)

• Definitions

– TL(C) : Time to Leave of cell C

– OC(C) : Output Cushion of cell C

– IT(C) : Input Thread of cell C

– L(C) : Slackness of cell C = OC(C) - IT(C)

[Example: a cell with TL(C)=3, OC(C)=2, IT(C)=1 has slackness L(C)=1]

Sorting according to TL(C)

PIAO (Push In Arbitrary Out) Queue

Output Queue Emulation (2/2)

• Using a PIAO queue as the input queue, CCF (Critical Cell First) as the input queue insertion policy, and stable matching can mimic output queuing with speedup = 2

– put the arriving cell at position OC(C)+1 of the input PIAO queue

– slackness is always >= 0

– when a cell reaches its time to leave (i.e. OC(C)=0), this means either

– (1) the cell is already at its output and may depart on time, or

– (2) the cell is simultaneously at the head of its input priority list (because its input thread is zero) and at the head of its output priority list (because it has reached its time to leave)
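The CCF insertion rule can be sketched directly. A minimal sketch using a 0-indexed Python list, so the slide's 1-indexed position OC(C)+1 becomes list index OC(C) (cell names are illustrative):

```python
# CCF insertion on a PIAO input queue: the arriving cell is pushed in at
# position OC(C)+1 (1-indexed), i.e. list index OC(C) here, so cells
# with a small output cushion sit near the head of the queue.
def ccf_insert(piao_queue, cell, output_cushion):
    pos = min(output_cushion, len(piao_queue))  # clamp to queue length
    piao_queue.insert(pos, cell)
    return piao_queue

q = ["a", "b", "c"]
ccf_insert(q, "x", 2)   # output cushion 2 -> inserted at index 2
print(q)  # ['a', 'b', 'x', 'c']
```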

Remarks

• iSLIP can get 100% throughput under uniform Bernoulli traffic (Nick McKeown, IEEE Transactions on Networking, April 1999)

• Any maximum weight matching (e.g. OCF, LQF) algorithm delivers 100% throughput under admissible traffic (Balaji Prabhakar, IEEE Infocom 2000)

• Any maximal matching (e.g. PIM, iSLIP) with 2-time speedup delivers 100% throughput under admissible traffic

• Speedup of 2 is sufficient for OQ emulation (Nick McKeown, IEEE JSAC 1999)

• For a bounded cell delay guarantee, exact OQ emulation may be too costly! Probabilistic or soft emulation is more practical (Mounir Hamdi, IEEE Comm. Mag. 2000)

Summary of Fabric Architectures

==Centralized Shared Memory Switch==
Throughput = 100% / Delay = Guaranteed / Mem. BW = 2NR

==Output Queue Switch==
Throughput = 100% / Delay = Guaranteed / Speedup = N / Mem. BW = (N+1)R

==Input Queue Switch==
Throughput = 58.6% / Delay = ? / Speedup = 1 / Mem. BW = 2R

==Virtual Output Queue==
Throughput = 100% / Delay = ? (due to input blocking) / Speedup = 1 / Mem. BW = 2R

==Combined Input and Output Queue Switch==
Throughput = 100% / Delay = Guaranteed / Speedup = 2 / Mem. BW = 3R

iSLIP, PIM, WFA => 100% throughput (uniform) (maximal size matching)

LQF, LPF => 100% throughput (non-uniform) (maximal weight matching)

Theoretical result for any arrivals (can emulate the behavior of an OQ switch)

PIM => 100% throughput (uniform); needs random-access FIFOs (look-ahead)

==Parallel Packet Switch==
Throughput = 100% / Delay = Guaranteed / Speeddown = 3(R/k) / Mem. BW = 3 x 3(R/k) (CIOQ)

Theoretical result for any arrivals (can emulate the behavior of an OQ switch); only meaningful for R > mem. BW

[Figure: a cell-level demultiplexer spreads line rate R over k OQ or CIOQ switches (SW#1 ... SW#k) each running at R/k, recombined by a cell-level multiplexer]

Multicast method #1

Copy networks

Copy network + unicast switching

Increased hardware, increased input contention

Multicast method #2

Use copying properties of crossbar fabric

No fanout-splitting: easy, but low throughput

Fanout-splitting: higher throughput, but not as simple. Leaves "residue".

The effect of fanout-splitting

Performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic

Placement of residue

Key question: How should outputs grant requests?

(and hence decide placement of residue)

Residue and throughput

Result: Concentrating residue brings more new work forward, and hence leads to higher throughput.

But there are fairness problems to deal with.

This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.

ESLIP - Cisco 12000

Stanford University Tiny Tera Project

IBM's PRIZMA Project

• 16 input ports
• 16 output ports
• 1.6 - 1.8 Gbps per port
• QoS: up to four priorities
• Built-in support for modular growth in number of ports
• Built-in support for modular growth in port speed
• Built-in support for modular growth in aggregate throughput
• Built-in support for automatic load-sharing
• Self-routing switch element
• Dynamically shared output-buffered element
• Built-in multicast and broadcast
• Aggregate data rate 28 Gbit/s per module
• 3.8 million transistors on chip
• 624 I/O pins

Architecture Descriptions

• A kind of CIOQ (VOQ + Output Queuing)

• Schedulers are distributed, with complexity O(N)

– The arbiters at the input side perform input contention resolution

– The output-buffered switch element performs classical output contention resolution

• By means of the flow-control/VOQ interaction between the switch element and the input queues, the less expensive input-queue memory is used to cope with burstiness

Core of PRIZMA

• Conventional shared memory vs. PRIZMA

Performance of PRIZMA

• 16x16 switch element (N=16)

• Shared memory size M = 256 cells

• *Delay-throughput performance improves notably as the degree of memory sharing is reduced

– when VOQ is used, there is no HOL blocking, and the performance is determined only by the output queue space available for every output to resolve contention!

Scalability of PRIZMA