A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.
![Page 1: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/1.jpg)
Slide 1
A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$
(ack: Richard Kessler)
Intel*, UPV!, & HP$
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002
![Page 2: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/2.jpg)
Slide 2
Alpha 21364 Network
[Figure: a network of 21364 chips (each including a router), with Rambus memory (M) and I/O (IO) attached to every node. Chip inset labels: 21264 core, L2 cache tags, L2 cache data, router, MC1, MC2.]
![Page 3: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/3.jpg)
Slide 3
The Alpha 21364 8x7 Router
[Figure: crossbar connecting input ports to output ports.]
Distributed Arbitration Algorithm Controls the Crossbar
• 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 output ports: 4 network, 2 memory/cache, 1 I/O
• Router pipeline length = 13/14 cycles
• Virtual cut-through
![Page 4: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/4.jpg)
Slide 4
Problem: Maximize # Matches
Input Port 0: 1 2 3
Input Port 1: 1 2 3
Input Port 2: 1 2 3
Input Port 3: 1 2 3
Input Port 4: 1 6 3
Input Port 5: 0 2 3
Input Port 6: 4 2 3
Input Port 7: 5 2 3
Numbers in table cells: destination output port. Figure annotation: "older packet at input port".
• Oldest Packet First: one match
• Smarter algorithm (shaded boxes): 7 matches (perfect)
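The two match counts on this slide can be checked with a short sketch. It rests on assumptions the slide text does not spell out: Input Port 0 holds packets for outputs 1, 2, and 3, and the rightmost entry of each row is the oldest packet (the reading under which oldest-first yields a single match). The "smarter algorithm" is modeled here as a generic maximum bipartite matching, not as any arbiter from the talk.

```python
# Queued packets per input port, given as destination output ports.
# Assumption: the rightmost entry in each row is the oldest packet.
QUEUES = [
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 6, 3],
    [0, 2, 3],
    [4, 2, 3],
    [5, 2, 3],
]


def oldest_first_matches(queues):
    """Each input offers only its oldest packet; each output accepts one."""
    return len({q[-1] for q in queues if q})


def max_matches(queues):
    """Maximum bipartite matching via augmenting paths."""
    owner = {}  # output port -> input port currently matched to it

    def try_assign(inp, seen):
        for out in queues[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or displace its owner if the owner can move.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(queues)))


print(oldest_first_matches(QUEUES))  # 1
print(max_matches(QUEUES))           # 7
```

Seven is also the upper bound here, since the router has only seven output ports.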
![Page 5: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/5.jpg)
Slide 5
Simpler Algorithms Have Fewer Matches
[Figure: # arbitration matches per cycle (y-axis, 0 to 7) versus % occupied input packet buffers in a 21364 router (x-axis, 0 to 30). Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (SPAA); a "complexity" arrow annotates the legend. Assumes all output ports are free.]
![Page 6: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/6.jpg)
Slide 6
Complexity may not pay off
[Figure: # arbitration matches per cycle (y-axis, 0 to 7) versus fraction of output ports occupied (x-axis, 0 to 0.75), at 30% input buffer occupancy. Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (21364); a "complexity" arrow annotates the legend.]
![Page 7: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/7.jpg)
Slide 7
Key Results
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under very heavy load
![Page 8: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/8.jpg)
Slide 8
Wave Front Arbiter (WFA)
Proposed by Tamir & Chi, 1993
– used in the SGI Spider/Origin switch
Implement via “connection” matrix
[Figure: connection matrix with rows for input ports 0 to 3 and columns for output ports; numeric labels 1 to 7 appear along the matrix. Each cell (i, j) has inputs Request, N, and W and outputs Grant, S, and E.]
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
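The three cell equations can be exercised in software with a minimal sketch. It assumes the wave always starts at the top-left cell and does not wrap around; the function name and the request-matrix encoding are illustrative, not taken from the talk. Row-major order is enough here because each cell depends only on the cell above it (N) and the cell to its left (W).

```python
def wavefront_arbitrate(request):
    """Apply Grant = Request & N & W to every cell of the connection matrix.

    request[i][j] is True when input port i has a packet for output port j.
    Returns the list of granted (input, output) pairs.
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols          # N signal entering the top of each column
    grants = []
    for i in range(rows):
        west = True                # W signal entering the left edge of row i
        for j in range(cols):
            grant = request[i][j] and north[j] and west
            if grant:
                grants.append((i, j))
            north[j] = north[j] and not grant  # S = N & NOT(Grant)
            west = west and not grant          # E = W & NOT(Grant)
    return grants


requests = [
    [True, True, False],
    [True, False, False],
    [False, True, True],
]
print(wavefront_arbitrate(requests))  # [(0, 0), (2, 1)]
```

Because a grant clears the signals for its row and column, each input and each output receives at most one grant per pass.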
![Page 9: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/9.jpg)
Slide 9
WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stage (1) 1.5 cycles, stage (2) 1.5 cycles, stage (3) 1 cycle]
![Page 10: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/10.jpg)
Slide 10
WFA Limitations
- Higher number of estimated cycles
– 4 cycles in 0.18 micron
- Harder to pipeline effectively
– micropipelining waves (2) is difficult because initial cell changes every cycle
– restarting (1) before (2) completes is complex
– large in-flight packet table due to large number of nominations (up to 54)
– may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: two successive arbitrations, each with stages (1) 1.5, (2) 1.5, and (3) 1 cycles, starting 3 cycles apart]
![Page 11: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/11.jpg)
Slide 11
Parallel Iterative Matching (PIM)
Steps in One Iteration (PIM1)
– Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
– Grant: unmatched output port selects an input port packet randomly
– Accept: unselected input port selects a grant randomly
[Figure: three panels (Nominate, Grant, Accept), each showing input ports 0 and 1 and output ports 0 and 1. Output Port 0 unused in this arbitration round.]
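One PIM iteration can be sketched as follows. This is a simplified model under stated assumptions: requests are given per input port as a list of wanted output ports, all ports start unmatched, and the function name `pim1` and the injected random generator are illustrative choices rather than anything from the talk.

```python
import random


def pim1(requests, rng):
    """One Nominate/Grant/Accept iteration.

    requests[i] lists the output ports that input port i has packets for.
    Returns a dict mapping each matched input port to its output port.
    """
    # Nominate: every input nominates to every output it has a packet for.
    nominations = {}
    for inp, outs in enumerate(requests):
        for out in outs:
            nominations.setdefault(out, []).append(inp)

    # Grant: each nominated output picks one nominating input at random.
    grants = {}
    for out, inps in nominations.items():
        grants.setdefault(rng.choice(inps), []).append(out)

    # Accept: each input holding grants picks one at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Input 0 wants outputs 0 and 1; input 1 wants only output 1.
match = pim1([[0, 1], [1]], random.Random(0))
print(match)
```

When both outputs grant to input 0, it accepts only one of them, so the other output stays idle for that round even though input 1 is still waiting. This is the kind of lost match that further iterations would recover.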
![Page 12: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/12.jpg)
Slide 12
PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stage (1) 1.5 cycles, stage (2) 1.5 cycles, stage (3) 1 cycle]
![Page 13: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/13.jpg)
Slide 13
PIM1 Limitations
- Higher number of estimated cycles
– 4 cycles in 0.18 micron
- Harder to pipeline effectively
– restarting (1) before (2) completes is complex
– same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
– large in-flight packet table due to large number of nominations (up to 54)
– may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: two successive arbitrations, each with stages (1) 1.5, (2) 1.5, and (3) 1 cycles, starting 3 cycles apart]
![Page 14: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/14.jpg)
Slide 14
Simple, Pipelined Arbitration Algorithm (SPAA)
used in the Alpha 21364 Router
Algorithm
– Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
– Grant: each output port selects an input port packet based on the least-recently selected one
– Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: panels showing input ports 0 and 1 and output ports 0 and 1; panel labels as extracted: Nominate, Grant, Accept, Reset.]
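The Nominate, Grant, and Reset steps can be sketched as a single arbitration round. The data layout is an assumption made for illustration: the `last_selected` timestamps stand in for whatever least-recently-selected state the hardware keeps, and ties are broken by port number.

```python
def spaa_round(nominations, last_selected, now):
    """One SPAA-style round.

    nominations: {input_port: output_port}, exactly one nomination per input.
    last_selected: {(output, input): round in which output last chose input};
        updated in place. Missing entries count as never selected.
    Returns (granted, reset): granted maps winning inputs to outputs, and
    reset lists the inputs that must renominate in a later round.
    """
    # Nominate: group the single nomination of each input by output port.
    by_output = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)

    # Grant: each output picks its least-recently selected contender.
    granted = {}
    for out, inps in by_output.items():
        winner = min(inps, key=lambda i: (last_selected.get((out, i), -1), i))
        last_selected[(out, winner)] = now
        granted[winner] = out

    # Reset: losers clear their state and renominate later.
    reset = sorted(set(nominations) - set(granted))
    return granted, reset


history = {}
noms = {0: 1, 1: 1, 2: 0}            # inputs 0 and 1 collide on output 1
print(spaa_round(noms, history, 0))  # ({0: 1, 2: 0}, [1])
print(spaa_round(noms, history, 1))  # ({1: 1, 2: 0}, [0])
```

Since each input nominates only one output, no Accept step is needed: a grant is final as soon as the output port issues it.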
![Page 15: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/15.jpg)
Slide 15
SPAA’s Simplicity
Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Pipeline diagram: stages (1), (2), and (3), 1 cycle each]
![Page 16: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/16.jpg)
Slide 16
SPAA’s Advantages
+ Fewer cycles
– 3 cycles in 0.18 micron
+ Speculatively read out input buffer prior to output port arbitration
– because only one packet is nominated to one output port
+ Easier to pipeline
– restart (1) for free input ports before (2) completes
– only one packet nominated to one output port
– small number (16) of in-flight packets
– avoids any centralized matrix
– speculative read allows data flits to follow header flits
[Pipeline diagram: two successive arbitrations, each with stages (1), (2), and (3) of 1 cycle each, starting 1 cycle apart]
![Page 17: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/17.jpg)
Slide 17
Summary: Simpler is Better

| | WFA | PIM1 | SPAA (Alpha 21364) |
|---|---|---|---|
| # Matches per cycle | High | Medium | Lower |
| # Cycles (0.18 micron) | 4 | 4 | 3 |
| Restart rate | Every 3 cycles | Every 3 cycles | Every cycle |
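The cycle-count and restart-rate figures can be combined into a rough count of how many arbitrations finish in a fixed window. This is an idealized back-of-the-envelope sketch: it assumes a new arbitration starts at every restart interval and always runs to completion, which ignores collisions and the per-port conditions under which a restart is actually possible. The function name and the 30-cycle window are illustrative.

```python
def arbitrations_completed(latency, restart_interval, window):
    """Count arbitrations that start at cycle 0, restart_interval, ...
    and finish within `window` cycles."""
    if window < latency:
        return 0
    return (window - latency) // restart_interval + 1


WINDOW = 30  # cycles, chosen only for illustration
print(arbitrations_completed(4, 3, WINDOW))  # WFA / PIM1: 9
print(arbitrations_completed(3, 1, WINDOW))  # SPAA: 28
```

Under these assumptions, the restart rate matters more than the single cycle of latency saved. The extra arbitration attempts are what can offset SPAA's lower number of matches per attempt.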
![Page 18: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/18.jpg)
Slide 18
Saturation Behavior
• Reasons: Hot spots & tree saturation
• 21364’s router shows cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load
[Figure: 64 Node Network, Random Traffic. Average packet latency in nanoseconds (y-axis, 0 to 300) versus delivered flits/router/nanosecond (x-axis, 0 to 0.8). Curve: SPAA-base, with the saturation point marked.]
![Page 19: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/19.jpg)
Slide 19
Rotary Rule
21364’s in-built throttling
+ maximum outstanding cache miss requests per processor = 16
Rotary Rule: more throttling
+ 21364 is a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also, clears network congestion
+ relies on anti-starvation mechanism
WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
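For SPAA, the change in output port priority might look like the sketch below. Several things are assumed for illustration: input ports 0 to 3 are taken to be the network ports (the numbering is hypothetical), the rule is modeled as a strict preference for network ports with least-recently-selected order as the tie-break, and the anti-starvation mechanism that the slide says the rule relies on is left out.

```python
# Hypothetical numbering: inputs 0-3 are network ports, 4-7 are local ports
# (memory, cache, I/O).
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def rotary_grant(contenders, last_selected):
    """Pick the winning input port for one output port.

    contenders: input ports nominating a packet to this output.
    last_selected: {input_port: round in which this output last chose it}.
    Network ports beat local ports; ties go to the least-recently selected.
    """
    return min(
        contenders,
        key=lambda i: (i not in NETWORK_INPUTS, last_selected.get(i, -1), i),
    )


# Network port 2 wins over local port 5 even though 2 was chosen recently.
print(rotary_grant([5, 2], {2: 10}))       # 2
# Among local ports only, the least-recently selected one wins.
print(rotary_grant([5, 6], {5: 3, 6: 1}))  # 6
```

Giving in-flight network traffic priority over newly injected local traffic acts as throttling: under heavy load, local ports inject less. Without some anti-starvation mechanism, the strict preference modeled here could block local ports indefinitely.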
![Page 20: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/20.jpg)
Slide 20
Simulation Methodology
Asim modeling infrastructure
– detailed timing model of 21364 network
– selected design points validated against RTL
Traffic Patterns
– 70% three coherence hops, 30% two coherence hops
– random destinations
– other traffic combinations in paper and simulated internally
![Page 21: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/21.jpg)
Slide 21
64 Node Network: Base Case
[Figure: Random Traffic. Average packet latency in nanoseconds (y-axis, 0 to 300) versus delivered flits/router/nanosecond (x-axis, 0 to 0.8). Curves: PIM1, WFA-base, SPAA-base; the knee is marked.]
• SPAA outperforms WFA & PIM1
– 24% higher throughput at knee
![Page 22: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/22.jpg)
Slide 22
64 Node Network: With Rotary Rule
[Figure: Random Traffic. Average packet latency in nanoseconds (y-axis, 0 to 300) versus delivered flits/router/nanosecond (x-axis, 0 to 0.8). Curves: PIM1, WFA-base, WFA-rotary, SPAA-base, SPAA-rotary.]
• Rotary Rule helps both SPAA & WFA
![Page 23: A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router](https://reader034.vdocuments.net/reader034/viewer/2022051418/568157c3550346895dc547f7/html5/thumbnails/23.jpg)
Slide 23
Summary & Conclusions
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under heavy load