Dynamic Pipelining: Making IP-Lookup Truly Scalable
Jahangir Hasan T. N. Vijaykumar
Presented by Sailesh Kumar
2 - Sailesh Kumar - 05/05/23
A Simple Router
[Figure: arriving packets → IP lookup → VOQs → crossbar]
The routing table contains (prefix, dest.) pairs
IP lookup finds the dest. with the longest matching prefix
At OC-768, an IP lookup must complete every 2 ns, so lookup can become a bottleneck
3 - Sailesh Kumar - 05/05/23
This Paper's Contribution
This paper presents an IP-lookup ASIC architecture which addresses the following 5 scalability challenges:
» Memory size - grows slowly with #prefixes
» Lookup throughput - matches the line rate
» Implementation cost - complexity, chip area, etc.
» Power dissipation - grows slowly with #prefixes and line rate
» Routing-table update cost - O(1)
No existing lookup architecture effectively addresses all 5 challenges!
4 - Sailesh Kumar - 05/05/23
Previous Work
Several IP-lookup schemes have been proposed
Memory access time > packet inter-arrival time
» Must use pipelining
Several papers have proposed using pipelining:

                                     Space  Throughput  Updates  Power  Area
TCAMs                                 Yes      Yes        Yes
HLP [Varghese et al. - ISCA'03]                Yes        Yes
DLP [Basu, Narlikar - Infocom'05]                                  Yes   Yes
This paper                            Yes      Yes        Yes      Yes   Yes
5 - Sailesh Kumar - 05/05/23
IP Address Lookup
Routing tables at router input ports contain (prefix, next hop) pairs
The address in a packet is compared to the stored prefixes, starting at the left
The prefix that matches the largest number of address bits is the desired match
The packet is forwarded to the specified next hop

routing table
prefix        next hop
01*           5
110*          3
1011*         5
0001*         0
10*           7
0001 0*       1
0011 00*      2
1011 001*     3
1011 010*     5
0101 1*       7
0100 1100*    4
1011 0011*    8
1001 1000*    10
0101 1001*    9
0100 110*     6

address: 1011 0010 1000
Taken from CSE 577 Lecture Notes
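The longest-match rule above can be sketched as a plain linear scan over the slide's routing table (an illustrative sketch only; real routers use tries or TCAMs, as the following slides show):

```python
# Longest-prefix match by linear scan over the slide's routing table.
# Prefixes and addresses are bit strings; the slide's grouping spaces are dropped.
ROUTING_TABLE = [  # (prefix, next hop)
    ("01", 5), ("110", 3), ("1011", 5), ("0001", 0), ("10", 7),
    ("00010", 1), ("001100", 2), ("1011001", 3), ("1011010", 5),
    ("01011", 7), ("01001100", 4), ("10110011", 8),
    ("10011000", 10), ("01011001", 9), ("0100110", 6),
]

def longest_prefix_match(address):
    """Return the next hop of the longest prefix that matches `address`."""
    best_len, best_hop = -1, None
    for prefix, hop in ROUTING_TABLE:
        if address.startswith(prefix) and len(prefix) > best_len:
            best_len, best_hop = len(prefix), hop
    return best_hop

print(longest_prefix_match("101100101000"))  # 1011 001* wins -> 3
```

For the example address 1011 0010 1000, the prefixes 10*, 1011*, and 1011 001* all match; the last is longest, so the packet goes to next hop 3.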
6 - Sailesh Kumar - 05/05/23
Address Lookup Using Tries
Prefixes stored in "alphabetical order" in a tree
Prefixes "spelled" out by following a path from the top
» green dots mark prefix ends
To find the best prefix, spell out the address in the tree
The last green dot marks the longest matching prefix
address: 1011 0010 1000
[Figure: unibit trie with edges labeled 0/1; spelling out 1011 0010 1000, the last green dot passed is the prefix with next hop 3]
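The trie walk just described can be sketched in Python (a minimal unibit-trie illustration; the node layout is hypothetical, and a small three-prefix table is used for brevity):

```python
# Unibit trie: one bit per level; a "green dot" is a node with a next hop set.
class Node:
    def __init__(self):
        self.child = {}       # '0' / '1' -> Node
        self.next_hop = None  # set where a prefix ends (a "green dot")

def insert(root, prefix, hop):
    node = root
    for bit in prefix:
        node = node.child.setdefault(bit, Node())
    node.next_hop = hop

def lookup(root, address):
    """Spell out the address; return the last prefix end (green dot) seen."""
    node, best = root, None
    for bit in address:
        if node.next_hop is not None:
            best = node.next_hop
        node = node.child.get(bit)
        if node is None:
            return best
    return node.next_hop if node.next_hop is not None else best

root = Node()
for p, h in [("0", "P1"), ("1", "P2"), ("101", "P3")]:
    insert(root, p, h)
print(lookup(root, "1011"))  # last green dot on the path is 101* -> P3
```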
7 - Sailesh Kumar - 05/05/23
Leaf Pushing

routing table
prefix   next hop
0*       P1
1*       P2
101*     P3

[Figure: trie for P1, P2, P3, before and after pushing P2 down to all leaves below it]

Every internal node might need to store next-hop information
Leaf pushing moves all prefixes to the leaves, so the lookup simply ends at a leaf instead of tracking the longest match; with proper encoding it also reduces the node size
Leaf pushing pushes P2 to all leaves below it
This complicates updates, as all affected leaves need to be updated
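Leaf pushing as described here can be sketched as a recursive pass over the trie (an illustrative sketch; nodes are hypothetical `(next_hop, children)` tuples, not the paper's encoding):

```python
# Leaf-pushing sketch: push each node's next hop down to the leaves below it,
# completing missing children on the way, so every lookup ends at one leaf.
def leaf_push(node, inherited=None):
    hop, children = node
    hop = hop if hop is not None else inherited
    if not children:
        return (hop, {})                       # a leaf keeps the pushed hop
    full = {b: children.get(b, (None, {})) for b in "01"}  # complete the node
    return (None, {b: leaf_push(c, hop) for b, c in full.items()})

def lookup(node, address):
    hop, children = node
    for bit in address:
        if not children:                       # reached a leaf: done
            break
        hop, children = children[bit]
    return hop

# 0* -> P1, 1* -> P2, 101* -> P3; P2 is pushed to every leaf below it.
trie = (None, {"0": ("P1", {}), "1": ("P2", {"0": (None, {"1": ("P3", {})})})})
pushed = leaf_push(trie)
print(lookup(pushed, "1011"))  # walk ends at the leaf for 101* -> P3
```

Note that updating 1* now means rewriting every leaf that inherited P2, which is exactly the update problem the slide points out.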
8 - Sailesh Kumar - 05/05/23
Multibit Trie
Match several bits in one step instead of a single bit
» equivalent to turning sub-trees of the binary trie into single nodes
Each node may be associated with several prefixes
For a stride of s, tree depth is reduced by a factor of s
address: 101 100 101 000
[Figure: stride-3 multibit trie built from the binary trie; each node holds the prefixes that end within it]
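A fixed-stride multibit lookup can be sketched as follows (an illustrative structure, using stride 2 for brevity; it assumes the prefixes are already aligned to stride boundaries, which controlled prefix expansion provides):

```python
# Stride-2 multibit trie: each step consumes STRIDE bits at once,
# cutting the number of memory accesses by a factor of STRIDE.
STRIDE = 2

class MNode:
    def __init__(self):
        self.hop = {}    # chunk -> next hop for a prefix ending in this node
        self.child = {}  # chunk -> child MNode

def insert(root, prefix, hop):
    node = root
    while len(prefix) > STRIDE:
        node = node.child.setdefault(prefix[:STRIDE], MNode())
        prefix = prefix[STRIDE:]
    node.hop[prefix] = hop

def lookup(root, address):
    node, best = root, None
    while node is not None and len(address) >= STRIDE:
        chunk, address = address[:STRIDE], address[STRIDE:]
        best = node.hop.get(chunk, best)   # remember the longest match so far
        node = node.child.get(chunk)
    return best

root = MNode()
for p, h in [("00", "P1"), ("01", "P1"), ("10", "P2"), ("11", "P2"),
             ("1010", "P3"), ("1011", "P3")]:
    insert(root, p, h)
print(lookup(root, "101100101000"))  # 1011* -> P3, found in two stride-2 steps
```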
9 - Sailesh Kumar - 05/05/23
Controlled Prefix Expansion

routing table
prefix   next hop
0*       P1
1*       P2
101*     P3

Expanded for a stride-2 multibit trie:
prefix   next hop
00*      P1
01*      P1
10*      P2
11*      P2
1010*    P3
1011*    P3

[Figure: stride-2 multibit trie for the expanded table]

Controlled prefix expansion aligns prefixes to the stride boundaries
In the worst case, controlled prefix expansion causes a non-deterministic increase in the routing-table size
Some schemes use variable strides to improve the average case, but the worst case remains the same
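The expansion step can be sketched as a small helper (a hypothetical illustration; it reproduces the slide's stride-2 table):

```python
# Controlled prefix expansion sketch: pad every prefix out to the next
# stride boundary; each padded prefix keeps the original's next hop.
def expand(prefixes, stride=2):
    expanded = {}
    # Process shorter prefixes first so longer ones override any overlaps.
    for prefix, hop in sorted(prefixes, key=lambda p: len(p[0])):
        pad = (-len(prefix)) % stride          # bits to reach a stride boundary
        for i in range(2 ** pad):
            suffix = format(i, "b").zfill(pad) if pad else ""
            expanded[prefix + suffix] = hop
    return expanded

table = [("0", "P1"), ("1", "P2"), ("101", "P3")]
print(expand(table))
# {'00': 'P1', '01': 'P1', '10': 'P2', '11': 'P2', '1010': 'P3', '1011': 'P3'}
```

Each 1-bit prefix doubles into two 2-bit prefixes here; in the worst case a length-1 prefix expanded to stride s becomes 2^(s-1) entries, which is the non-deterministic table inflation the slide warns about.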
10 - Sailesh Kumar - 05/05/23
Need for Pipelined Tries
Tomorrow's routers will run at 160 Gbps: one packet every 2 ns
At most one memory access per 2 ns (maybe less)
Moreover, there may be millions of prefixes
In the worst case, memory requirements will be very high
» and a large memory is slower
We need an architecture which
» uses multiple smaller memories
» accesses them in a pipelined manner
11 - Sailesh Kumar - 05/05/23
Pipelined Trie-based IP-lookup
Tree data structure, prefixes in leaves (leaf pushing)
Process the IP address level by level to find the longest match
Each level in a different stage → overlap multiple packets

[Figure: leaf-pushed trie with prefixes P1-P7; P4 = 10010*]
12 - Sailesh Kumar - 05/05/23
Closest Previous Work
Data Structure Level Pipelining (DLP) - level-to-stage mapping
Maps trie level to stage, but this is a static mapping
Updates change the prefix distribution, but the mapping persists
In the worst case any stage can hold all the prefixes
» Large worst-case memory for each stage
No bound on worst-case updates → could be made O(1) using Tree Bitmap, but the constant is huge: 1852 memory accesses per update [SIGCOMM Comp. Comm. Review '04]

[Figure: worst-case trie of prefixes 0*, 00*, 000*, … (P1, P2, P3, …) under level-to-stage mapping]
Figure taken from Hasan et al.
13 - Sailesh Kumar - 05/05/23
Memory Bound per Stage
The figure below shows the worst-case prefix distribution: 1 million prefixes, each 32 bits long
In this case the largest stage is 5 MB, and the total memory is 80 MB,
as opposed to the 6 MB total size of the prefixes themselves
Figure taken from Hasan et al.
Moreover, a 5 MB memory can't be accessed faster than about 6 ns
14 - Sailesh Kumar - 05/05/23
Hardware-Level Pipelining - HLP
HLP pipelines the memory accesses at the hardware level
Multiple words of memory are read together in a pipelined manner
» Throughput limited only by the memory-array access time
Such memories can improve IP-lookup throughput
Figure taken from Sherwood et al.
As such it is not scalable: a higher degree of pipelining leads to prohibitive chip area and power dissipation
15 - Sailesh Kumar - 05/05/23
Key Idea
HLP doesn't scale well in chip area and power
DLP scales well in power but doesn't scale well in
» Memory size (due to the static level-to-stage mapping)
» Throughput, as one stage can't go faster than ~6 ns
Combine the two (SDP):
» Use DLP, but with a better mapping so that each stage is smaller
» Use HLP at every stage to accelerate it further
16 - Sailesh Kumar - 05/05/23
Key Idea: Use Dynamic Mapping
Map node height to stage (instead of level to stage)
Height changes with updates; it captures the distribution of the prefixes below
Hence the name dynamic mapping

[Figure: the same worst-case trie of prefixes 0*, 00*, 000*, … under height-to-stage mapping]
Figure taken from Hasan et al.

However, the worst-case memory requirements remain the same, e.g. when all prefixes are 32 bits long
17 - Sailesh Kumar - 05/05/23
Key Idea: Use Jump Nodes
Use jump nodes so that the worst-case memory requirements can be reduced
This also restores the relation between height and prefix distribution
One can argue that jump nodes would reduce the memory requirements of DLP too - no, and we will soon see why!

[Figure: single-child chains for prefixes such as 010..1* and 1010* collapsed into jump nodes above P4 and P5]
Figure taken from Hasan et al.
18 - Sailesh Kumar - 05/05/23
Another Example of Jump Nodes
[Figure: trie before and after leaf pushing, then after adding jump nodes "Jump 100" and "Jump 11"]
Note that this trie will need more than one node operation for table updates, contrary to what the paper claims!
19 - Sailesh Kumar - 05/05/23
Tries with Jump Nodes
Key properties:
(1) Number of leaves = number of prefixes
» No replication; avoids the inflation of prefix expansion and leaf pushing
(2) Updates do not propagate to subtrees
» Because there is no replication
(3) Each internal node has 2 children
» Jump nodes collapse away single-child nodes
20 - Sailesh Kumar - 05/05/23
Total versus Per-Stage Memory
Jump nodes bound the total size by 2N nodes
Would DLP + jump nodes → small per-stage memory?
No, DLP is still a static mapping → large worst-case per-stage memory
The total is bounded, but not the per-stage size

[Figure: worst-case trie - a full binary tree of height log2 N above chains of length W - log2 N, with N leaves]
Figure taken from Hasan et al.
21 - Sailesh Kumar - 05/05/23
SDP's Per-Stage Memory Bound
Proposition: map all nodes of height h to the (W-h)-th pipeline stage
Result: size of the k-th stage = min( N/(W-k) , 2^k ) nodes
22 - Sailesh Kumar - 05/05/23
Key Observation #1
A node of height h has at least h prefixes in its subtree
» There is at least one path of length h from the node to some leaf
» Each of the h-1 internal nodes along this path branches off to at least 1 additional leaf
» So the subtree has at least (h-1) + 1 = h leaves = h prefixes
Figure taken from Hasan et al.
23 - Sailesh Kumar - 05/05/23
Key Observation #2
No more than N/h nodes of height h, for any prefix distribution
» Assume more than N/h nodes of height h
» Each accounts for at least h prefixes (obs. #1)
» Then the total number of prefixes would exceed N
» By contradiction, obs. #2 is true
24 - Sailesh Kumar - 05/05/23
Main Result of the Proposition
Map all nodes of height h to the (W-h)-th pipeline stage
» The k-th stage has only N/(W-k) nodes, from obs. #2
» The 1-bit trie has binary fanout → at most 2^k nodes in the k-th stage
» Size of the k-th stage = min( N/(W-k) , 2^k ) nodes
Results in ~20 MB for 1 million prefixes - 4x better than DLP
[Figure: per-stage memory, static pipelining (DLP) vs. dynamic pipelining (SDP)]
Figure taken from Hasan et al.
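The proposition's bound can be checked with back-of-the-envelope arithmetic (a sketch; N, W, and the ~6-byte node size are assumptions chosen to mirror the slide's 1-million-prefix, 32-bit example):

```python
# Per-stage bound: stage k holds at most min(N // (W - k), 2**k) nodes.
N, W, NODE_BYTES = 10**6, 32, 6   # 1M prefixes, 32-bit IPv4, assumed node size

per_stage = [min(N // (W - k), 2 ** k) for k in range(W)]
total_nodes = sum(per_stage)
total_mb = total_nodes * NODE_BYTES / 2**20
print(f"{total_nodes} nodes, ~{total_mb:.0f} MB")  # ~20 MB, vs 80 MB for DLP
```

Early stages are capped by the trie's fanout (2^k), later stages by N/(W-k); the crossover is around k = 16, and summing both regimes gives roughly 3.4 million nodes, i.e. the ~20 MB the slide quotes.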
25 - Sailesh Kumar - 05/05/23
Optimum Incremental Updates
1 update → the height, and hence the stage, of many nodes may change
Must migrate all affected nodes → inefficient update?
No: only the ancestors' heights can be affected, so not many nodes need to move
Each ancestor is in a different stage = 1 node-write per stage = 1 write bubble for any update
Updating SDP is not just O(1) but exactly 1 write bubble
Figure taken from Hasan et al.
26 - Sailesh Kumar - 05/05/23
Incremental Updates
[Figure: example trie with nodes numbered 1-17 and their height-based mapping to pipeline stages Pipe 0 - Pipe 5]
27 - Sailesh Kumar - 05/05/23
Incremental Updates
[Figure: the same trie after an update; node 7 becomes a jump node ("7, Jump") and moves to a new stage along with node 4]
The implementation complexity may be quite high, because the jump nodes might need to be computed on the fly (e.g. for node 7)
28 - Sailesh Kumar - 05/05/23
Efficient Memory Management
Tree Bitmap and segmented hole compaction require multiple memory accesses for updates
A multibit trie with variable strides requires even more complex memory management
SDP:
» No variable striding or compression → all nodes have the same size
» No fragmentation or compaction upon updates
» Memory management is trivial, with zero fragmentation
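The "trivial memory management" claim can be illustrated with a fixed-size free list (a hypothetical sketch, not the paper's hardware design):

```python
# With all nodes the same size, a stage's memory is just an array of slots
# plus a free list: O(1) alloc/free and zero fragmentation by construction.
class NodePool:
    def __init__(self, capacity):
        self.free = list(range(capacity))   # every slot starts out free

    def alloc(self):
        return self.free.pop()              # O(1); any free slot fits any node

    def release(self, slot):
        self.free.append(slot)              # O(1); no compaction ever needed

pool = NodePool(capacity=8)
a, b = pool.alloc(), pool.alloc()
pool.release(a)
print(len(pool.free))  # 7 slots free again
```

With variable-size nodes, by contrast, freeing leaves holes that must be compacted, which is what costs Tree Bitmap and segmented hole compaction their extra memory accesses per update.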
29 - Sailesh Kumar - 05/05/23
Scaling SDP for Throughput
Each SDP stage can be further pipelined in hardware
HLP [ISCA'03] pipelined only in hardware, without DLP
» Too deep at high line rates
Combine HLP + SDP for a feasibly deep hardware pipeline
» Throughput matches future line rates
[Figure: per-stage sizes N/(W-k) and 2^k, and the number of HLP stages per SDP stage]
Figure taken from Hasan et al.
30 - Sailesh Kumar - 05/05/23
Experiments
Figure taken from Hasan et al.
31 - Sailesh Kumar - 05/05/23
Experiments
Figure taken from Hasan et al.
32 - Sailesh Kumar - 05/05/23
Experiments
Figure taken from Hasan et al.
33 - Sailesh Kumar - 05/05/23
Discussion / Questions