Dynamic Pipelining: Making IP-Lookup Truly Scalable
Jahangir Hasan T. N. Vijaykumar
Presented by Sailesh Kumar
2 - Sailesh Kumar - 05/05/23
A Simple Router
[Figure: arriving packets → IP lookup → VOQs → crossbar]
The routing table contains (prefix, dest.) pairs
IP lookup finds the dest. with the longest matching prefix
At OC-768, an IP lookup must complete every 2 ns, so lookup can become a bottleneck
3 - Sailesh Kumar - 05/05/23
This Paper's Contribution
This paper presents an IP-lookup ASIC architecture which addresses the following 5 scalability challenges:
» Memory size - grows slowly with #prefixes
» Lookup throughput - matches the line rate
» Implementation cost - complexity, chip area, etc.
» Power dissipation - grows slowly with #prefixes and line rate
» Routing-table update cost - O(1)
No existing lookup architecture effectively addresses all 5 challenges!
4 - Sailesh Kumar - 05/05/23
Previous Work
Several IP-lookup schemes have been proposed
Memory access time > packet inter-arrival time
» Must use pipelining
Several papers have proposed using pipelining:

                                     Space  Throughput  Updates  Power  Area
TCAMs                                 Yes      Yes        Yes
HLP [Varghese et al. - ISCA'03]                Yes        Yes
DLP [Basu, Narlikar - Infocom'05]                                  Yes   Yes
This paper                            Yes      Yes        Yes      Yes   Yes
5 - Sailesh Kumar - 05/05/23
IP Address Lookup
Routing tables at router input ports contain (prefix, next hop) pairs
The address in a packet is compared to the stored prefixes, starting at the left
The prefix that matches the largest number of address bits is the desired match
The packet is forwarded to the specified next hop

routing table
prefix        next hop
01*           5
110*          3
1011*         5
0001*         0
10*           7
0001 0*       1
0011 00*      2
1011 001*     3
1011 010*     5
0101 1*       7
0100 1100*    4
1011 0011*    8
1001 1000*    10
0101 1001*    9
0100 110*     6

address: 1011 0010 1000
Taken from CSE 577 Lecture Notes
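The longest-match rule above can be sketched as a plain linear scan over the slide's routing table (an illustrative sketch only; real routers use tries or TCAMs, as the following slides show):

```python
# Longest-prefix match by linear scan over the slide's routing table.
# Prefixes and addresses are bit strings; the slide's grouping spaces are dropped.
ROUTING_TABLE = [  # (prefix, next hop)
    ("01", 5), ("110", 3), ("1011", 5), ("0001", 0), ("10", 7),
    ("00010", 1), ("001100", 2), ("1011001", 3), ("1011010", 5),
    ("01011", 7), ("01001100", 4), ("10110011", 8),
    ("10011000", 10), ("01011001", 9), ("0100110", 6),
]

def longest_prefix_match(address):
    """Return the next hop of the longest prefix that matches `address`."""
    best_len, best_hop = -1, None
    for prefix, hop in ROUTING_TABLE:
        if address.startswith(prefix) and len(prefix) > best_len:
            best_len, best_hop = len(prefix), hop
    return best_hop

print(longest_prefix_match("101100101000"))  # 1011 001* wins -> 3
```

For the example address 1011 0010 1000, the prefixes 10*, 1011*, and 1011 001* all match; the last is longest, so the packet goes to next hop 3.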
6 - Sailesh Kumar - 05/05/23
Address Lookup Using Tries
Prefixes stored in "alphabetical order" in a tree
Prefixes "spelled" out by following a path from the top
» green dots mark prefix ends
To find the best prefix, spell out the address in the tree
The last green dot marks the longest matching prefix
address: 1011 0010 1000
[Figure: unibit trie with edges labeled 0/1; spelling out 1011 0010 1000, the last green dot passed is the prefix with next hop 3]
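The trie walk just described can be sketched in Python (a minimal unibit-trie illustration; the node layout is hypothetical, and a small three-prefix table is used for brevity):

```python
# Unibit trie: one bit per level; a "green dot" is a node with a next hop set.
class Node:
    def __init__(self):
        self.child = {}       # '0' / '1' -> Node
        self.next_hop = None  # set where a prefix ends (a "green dot")

def insert(root, prefix, hop):
    node = root
    for bit in prefix:
        node = node.child.setdefault(bit, Node())
    node.next_hop = hop

def lookup(root, address):
    """Spell out the address; return the last prefix end (green dot) seen."""
    node, best = root, None
    for bit in address:
        if node.next_hop is not None:
            best = node.next_hop
        node = node.child.get(bit)
        if node is None:
            return best
    return node.next_hop if node.next_hop is not None else best

root = Node()
for p, h in [("0", "P1"), ("1", "P2"), ("101", "P3")]:
    insert(root, p, h)
print(lookup(root, "1011"))  # last green dot on the path is 101* -> P3
```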
7 - Sailesh Kumar - 05/05/23
Leaf Pushing

routing table
prefix   next hop
0*       P1
1*       P2
101*     P3

[Figure: trie for P1, P2, P3, before and after pushing P2 down to all leaves below it]

Every internal node might need to store next-hop information
Leaf pushing moves all prefixes to the leaves, so the lookup simply ends at a leaf instead of tracking the longest match; with proper encoding it also reduces the node size
Leaf pushing pushes P2 to all leaves below it
This complicates updates, as all affected leaves need to be updated
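Leaf pushing as described here can be sketched as a recursive pass over the trie (an illustrative sketch; nodes are hypothetical `(next_hop, children)` tuples, not the paper's encoding):

```python
# Leaf-pushing sketch: push each node's next hop down to the leaves below it,
# completing missing children on the way, so every lookup ends at one leaf.
def leaf_push(node, inherited=None):
    hop, children = node
    hop = hop if hop is not None else inherited
    if not children:
        return (hop, {})                       # a leaf keeps the pushed hop
    full = {b: children.get(b, (None, {})) for b in "01"}  # complete the node
    return (None, {b: leaf_push(c, hop) for b, c in full.items()})

def lookup(node, address):
    hop, children = node
    for bit in address:
        if not children:                       # reached a leaf: done
            break
        hop, children = children[bit]
    return hop

# 0* -> P1, 1* -> P2, 101* -> P3; P2 is pushed to every leaf below it.
trie = (None, {"0": ("P1", {}), "1": ("P2", {"0": (None, {"1": ("P3", {})})})})
pushed = leaf_push(trie)
print(lookup(pushed, "1011"))  # walk ends at the leaf for 101* -> P3
```

Note that updating 1* now means rewriting every leaf that inherited P2, which is exactly the update problem the slide points out.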
8 - Sailesh Kumar - 05/05/23
Multibit Trie
Match several bits in one step instead of a single bit
» equivalent to turning sub-trees of the binary trie into single nodes
Each node may be associated with several prefixes
For a stride of s, tree depth is reduced by a factor of s
address: 101 100 101 000
[Figure: stride-3 multibit trie built from the binary trie; each node holds the prefixes that end within it]
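A fixed-stride multibit lookup can be sketched as follows (an illustrative structure, using stride 2 for brevity; it assumes the prefixes are already aligned to stride boundaries, which controlled prefix expansion provides):

```python
# Stride-2 multibit trie: each step consumes STRIDE bits at once,
# cutting the number of memory accesses by a factor of STRIDE.
STRIDE = 2

class MNode:
    def __init__(self):
        self.hop = {}    # chunk -> next hop for a prefix ending in this node
        self.child = {}  # chunk -> child MNode

def insert(root, prefix, hop):
    node = root
    while len(prefix) > STRIDE:
        node = node.child.setdefault(prefix[:STRIDE], MNode())
        prefix = prefix[STRIDE:]
    node.hop[prefix] = hop

def lookup(root, address):
    node, best = root, None
    while node is not None and len(address) >= STRIDE:
        chunk, address = address[:STRIDE], address[STRIDE:]
        best = node.hop.get(chunk, best)   # remember the longest match so far
        node = node.child.get(chunk)
    return best

root = MNode()
for p, h in [("00", "P1"), ("01", "P1"), ("10", "P2"), ("11", "P2"),
             ("1010", "P3"), ("1011", "P3")]:
    insert(root, p, h)
print(lookup(root, "101100101000"))  # 1011* -> P3, found in two stride-2 steps
```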
9 - Sailesh Kumar - 05/05/23
Controlled Prefix Expansion

routing table
prefix   next hop
0*       P1
1*       P2
101*     P3

Expanded for a stride-2 multibit trie:
prefix   next hop
00*      P1
01*      P1
10*      P2
11*      P2
1010*    P3
1011*    P3

[Figure: stride-2 multibit trie for the expanded table]

Controlled prefix expansion aligns prefixes to the stride boundaries
In the worst case, controlled prefix expansion causes a non-deterministic increase in the routing-table size
Some schemes use variable strides to improve the average case, but the worst case remains the same
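The expansion step can be sketched as a small helper (a hypothetical illustration; it reproduces the slide's stride-2 table):

```python
# Controlled prefix expansion sketch: pad every prefix out to the next
# stride boundary; each padded prefix keeps the original's next hop.
def expand(prefixes, stride=2):
    expanded = {}
    # Process shorter prefixes first so longer ones override any overlaps.
    for prefix, hop in sorted(prefixes, key=lambda p: len(p[0])):
        pad = (-len(prefix)) % stride          # bits to reach a stride boundary
        for i in range(2 ** pad):
            suffix = format(i, "b").zfill(pad) if pad else ""
            expanded[prefix + suffix] = hop
    return expanded

table = [("0", "P1"), ("1", "P2"), ("101", "P3")]
print(expand(table))
# {'00': 'P1', '01': 'P1', '10': 'P2', '11': 'P2', '1010': 'P3', '1011': 'P3'}
```

Each 1-bit prefix doubles into two 2-bit prefixes here; in the worst case a length-1 prefix expanded to stride s becomes 2^(s-1) entries, which is the non-deterministic table inflation the slide warns about.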
10 - Sailesh Kumar - 05/05/23
Need for Pipelined Tries
Tomorrow's routers will run at 160 Gbps: one packet every 2 ns
At most one memory access per 2 ns (maybe less)
Moreover, there may be millions of prefixes
In the worst case, memory requirements will be very high
» and a large memory is slower
We need an architecture which
» uses multiple smaller memories
» accesses them in a pipelined manner
11 - Sailesh Kumar - 05/05/23
Pipelined Trie-based IP-lookup
Tree data structure, prefixes in leaves (leaf pushing)
Process the IP address level by level to find the longest match
Each level in a different stage → overlap multiple packets

[Figure: leaf-pushed trie with prefixes P1-P7; P4 = 10010*]
12 - Sailesh Kumar - 05/05/23
Closest Previous Work
Data Structure Level Pipelining (DLP) - level-to-stage mapping
Maps trie level to stage, but this is a static mapping
Updates change the prefix distribution, but the mapping persists
In the worst case any stage can hold all the prefixes
» Large worst-case memory for each stage
No bound on worst-case updates → could be made O(1) using Tree Bitmap, but the constant is huge: 1852 memory accesses per update [SIGCOMM Comp. Comm. Review '04]

[Figure: worst-case trie of prefixes 0*, 00*, 000*, … (P1, P2, P3, …) under level-to-stage mapping]
Figure taken from Hasan et al.
13 - Sailesh Kumar - 05/05/23
Memory Bound per Stage
The figure below shows the worst-case prefix distribution: 1 million prefixes, each 32 bits long
In this case the largest stage is 5 MB, and the total memory is 80 MB,
as opposed to the 6 MB total size of the prefixes themselves
Figure taken from Hasan et al.
Moreover, a 5 MB memory can't be accessed faster than about 6 ns
14 - Sailesh Kumar - 05/05/23
Hardware-Level Pipelining - HLP
HLP pipelines the memory accesses at the hardware level
Multiple words of memory are read together in a pipelined manner
» Throughput limited only by the memory-array access time
Such memories can improve IP-lookup throughput
Figure taken from Sherwood et al.
As such it is not scalable: a higher degree of pipelining leads to prohibitive chip area and power dissipation
15 - Sailesh Kumar - 05/05/23
Key Idea
HLP doesn't scale well in chip area and power
DLP scales well in power but doesn't scale well in
» Memory size (due to the static level-to-stage mapping)
» Throughput, as one stage can't go faster than ~6 ns
Combine the two (SDP):
» Use DLP, but with a better mapping so that each stage is smaller
» Use HLP at every stage to accelerate it further
16 - Sailesh Kumar - 05/05/23
Key Idea: Use Dynamic Mapping
Map node height to stage (instead of level to stage)
Height changes with updates; it captures the distribution of the prefixes below
Hence the name dynamic mapping

[Figure: the same worst-case trie of prefixes 0*, 00*, 000*, … under height-to-stage mapping]
Figure taken from Hasan et al.

However, the worst-case memory requirements remain the same, e.g. when all prefixes are 32 bits long
17 - Sailesh Kumar - 05/05/23
Key Idea: Use Jump Nodes
Use jump nodes so that the worst-case memory requirements can be reduced
This also restores the relation between height and prefix distribution
One can argue that jump nodes would reduce the memory requirements of DLP too - no, and we will soon see why!

[Figure: single-child chains for prefixes such as 010..1* and 1010* collapsed into jump nodes above P4 and P5]
Figure taken from Hasan et al.
18 - Sailesh Kumar - 05/05/23
Another Example of Jump Nodes
[Figure: trie before and after leaf pushing, then after adding jump nodes "Jump 100" and "Jump 11"]
Note that this trie will need more than one node operation for table updates, contrary to what the paper claims!
19 - Sailesh Kumar - 05/05/23
Tries with Jump Nodes
Key properties:
(1) Number of leaves = number of prefixes
» No replication; avoids the inflation of prefix expansion and leaf pushing
(2) Updates do not propagate to subtrees
» Because there is no replication
(3) Each internal node has 2 children
» Jump nodes collapse away single-child nodes
20 - Sailesh Kumar - 05/05/23
Total versus Per-Stage Memory
Jump nodes bound the total size by 2N nodes
Would DLP + jump nodes → small per-stage memory?
No, DLP is still a static mapping → large worst-case per-stage memory
The total is bounded, but not the per-stage size

[Figure: worst-case trie - a full binary tree of height log2 N above chains of length W - log2 N, with N leaves]
Figure taken from Hasan et al.
21 - Sailesh Kumar - 05/05/23
SDP's Per-Stage Memory Bound
Proposition: map all nodes of height h to the (W-h)-th pipeline stage
Result: size of the k-th stage = min( N/(W-k) , 2^k ) nodes
22 - Sailesh Kumar - 05/05/23
Key Observation #1
A node of height h has at least h prefixes in its subtree
» There is at least one path of length h from the node to some leaf
» Each of the h-1 internal nodes along this path branches off to at least 1 additional leaf
» So the subtree has at least (h-1) + 1 = h leaves = h prefixes
Figure taken from Hasan et al.
23 - Sailesh Kumar - 05/05/23
Key Observation #2
No more than N/h nodes of height h, for any prefix distribution
» Assume more than N/h nodes of height h
» Each accounts for at least h prefixes (obs. #1)
» Then the total number of prefixes would exceed N
» By contradiction, obs. #2 is true
24 - Sailesh Kumar - 05/05/23
Main Result of the Proposition
Map all nodes of height h to the (W-h)-th pipeline stage
» The k-th stage has only N/(W-k) nodes, from obs. #2
» The 1-bit trie has binary fanout → at most 2^k nodes in the k-th stage
» Size of the k-th stage = min( N/(W-k) , 2^k ) nodes
Results in ~20 MB for 1 million prefixes - 4x better than DLP
[Figure: per-stage memory, static pipelining (DLP) vs. dynamic pipelining (SDP)]
Figure taken from Hasan et al.
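The proposition's bound can be checked with back-of-the-envelope arithmetic (a sketch; N, W, and the ~6-byte node size are assumptions chosen to mirror the slide's 1-million-prefix, 32-bit example):

```python
# Per-stage bound: stage k holds at most min(N // (W - k), 2**k) nodes.
N, W, NODE_BYTES = 10**6, 32, 6   # 1M prefixes, 32-bit IPv4, assumed node size

per_stage = [min(N // (W - k), 2 ** k) for k in range(W)]
total_nodes = sum(per_stage)
total_mb = total_nodes * NODE_BYTES / 2**20
print(f"{total_nodes} nodes, ~{total_mb:.0f} MB")  # ~20 MB, vs 80 MB for DLP
```

Early stages are capped by the trie's fanout (2^k), later stages by N/(W-k); the crossover is around k = 16, and summing both regimes gives roughly 3.4 million nodes, i.e. the ~20 MB the slide quotes.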
25 - Sailesh Kumar - 05/05/23
Optimum Incremental Updates
1 update → the height, and hence the stage, of many nodes may change
Must migrate all affected nodes → inefficient update?
No: only the ancestors' heights can be affected, so not many nodes need to move
Each ancestor is in a different stage = 1 node-write per stage = 1 write bubble for any update
Updating SDP is not just O(1) but exactly 1 write bubble
Figure taken from Hasan et al.
26 - Sailesh Kumar - 05/05/23
Incremental Updates
[Figure: example trie with nodes numbered 1-17 and their height-based mapping to pipeline stages Pipe 0 - Pipe 5]
27 - Sailesh Kumar - 05/05/23
Incremental Updates
[Figure: the same trie after an update; node 7 becomes a jump node ("7, Jump") and moves to a new stage along with node 4]
The implementation complexity may be quite high, because the jump nodes might need to be computed on the fly (e.g. for node 7)
28 - Sailesh Kumar - 05/05/23
Efficient Memory Management
Tree Bitmap and segmented hole compaction require multiple memory accesses for updates
A multibit trie with variable strides requires even more complex memory management
SDP:
» No variable striding or compression → all nodes have the same size
» No fragmentation or compaction upon updates
» Memory management is trivial, with zero fragmentation
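The "trivial memory management" claim can be illustrated with a fixed-size free list (a hypothetical sketch, not the paper's hardware design):

```python
# With all nodes the same size, a stage's memory is just an array of slots
# plus a free list: O(1) alloc/free and zero fragmentation by construction.
class NodePool:
    def __init__(self, capacity):
        self.free = list(range(capacity))   # every slot starts out free

    def alloc(self):
        return self.free.pop()              # O(1); any free slot fits any node

    def release(self, slot):
        self.free.append(slot)              # O(1); no compaction ever needed

pool = NodePool(capacity=8)
a, b = pool.alloc(), pool.alloc()
pool.release(a)
print(len(pool.free))  # 7 slots free again
```

With variable-size nodes, by contrast, freeing leaves holes that must be compacted, which is what costs Tree Bitmap and segmented hole compaction their extra memory accesses per update.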
29 - Sailesh Kumar - 05/05/23
Scaling SDP for Throughput
Each SDP stage can be further pipelined in hardware
HLP [ISCA'03] pipelined only in hardware, without DLP
» Too deep at high line rates
Combine HLP + SDP for a feasibly deep hardware pipeline
» Throughput matches future line rates
[Figure: per-stage sizes N/(W-k) and 2^k, and the number of HLP stages per SDP stage]
Figure taken from Hasan et al.
30 - Sailesh Kumar - 05/05/23
Experiments
Figure taken from Hasan et al.
31 - Sailesh Kumar - 05/05/23
Experiments
Figure taken from Hasan et al.
32 - Sailesh Kumar - 05/05/23
Experiments
Figure taken from Hasan et al.
33 - Sailesh Kumar - 05/05/23
Discussion / Questions