TRANSCRIPT
CAMP: Fast and Efficient IP Lookup Architecture
Sailesh Kumar, Michela Becchi, Patrick Crowley, Jonathan Turner
Washington University in St. Louis
Michela Becchi - 04/20/23
Context: Trie-based IP lookup & circular pipeline architectures

Prefix dataset:
» 0* → P1
» 000* → P2
» 0010* → P3
» 0011* → P4
» 011* → P5
» 10* → P6
» 11* → P7
» 110* → P8

[Figure: the prefixes are stored in a binary trie with edges labeled 0/1. The trie levels are assigned to pipeline stages 1–4, and the stages are connected into a circular pipeline (1 → 2 → 3 → 4 → 1). An example IP address (111010…) is looked up by traversing the trie through the pipeline.]
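As a concrete sketch of the trie-based lookup above, the following Python snippet (my own illustration, not code from the paper) builds a binary trie from the slide's prefix dataset and performs a longest-prefix match:

```python
# Minimal binary trie for longest-prefix matching, populated with the
# prefix dataset from the slide (P1..P8).
class TrieNode:
    def __init__(self):
        self.children = {}   # '0' / '1' -> child TrieNode
        self.value = None    # next-hop label if a prefix ends here

def insert(root, prefix, value):
    node = root
    for bit in prefix:
        node = node.children.setdefault(bit, TrieNode())
    node.value = value

def longest_prefix_match(root, address_bits):
    node, best = root, None
    for bit in address_bits:
        if node.value is not None:
            best = node.value          # remember deepest prefix seen
        node = node.children.get(bit)
        if node is None:
            break
    else:
        if node.value is not None:
            best = node.value
    return best

root = TrieNode()
for prefix, value in [("0", "P1"), ("000", "P2"), ("0010", "P3"),
                      ("0011", "P4"), ("011", "P5"), ("10", "P6"),
                      ("11", "P7"), ("110", "P8")]:
    insert(root, prefix, value)

print(longest_prefix_match(root, "111010"))  # matches 11* (P7), not 110*
```

In a pipelined implementation each trie level would live in a different memory stage; the sequential walk above corresponds to one stage visit per level.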
CAMP: Circular Adaptive and Monotonic Pipeline
Problems:
» Optimize the global memory requirement
» Avoid bottleneck stages
» Make the per-stage utilization uniform

Idea:
» Exploit a circular pipeline:
– Each stage can be a potential entry/exit point
– Possible wrap-around
» Split the trie into sub-trees and map each of them independently to the pipeline
CAMP (cont’d)
Implications:
» PROS:
– Flexibility: the maximum prefix length is decoupled from the pipeline depth
– Upgradeability: memory bank updates involve only partial remapping
» CONS:
– A stage can simultaneously be an entry point for one request and a transit stage for another, so conflicts can originate: a scheduling mechanism is required and efficiency may degrade
Trie splitting

» Define an initial stride x
» Use a direct index table with 2^x entries for the first x levels
» Expand short prefixes to length x
» Map the sub-trees independently to the pipeline

E.g., with initial stride x = 2, the direct index table is:
» 00* → enter at pipeline stage 1
» 01* → enter at pipeline stage 2
» 10* → no match
» 11* → enter at pipeline stage 3

[Figure: the trie below the first x = 2 levels is split into three sub-trees; the short prefix P1 (0*) is expanded to length 2, and each sub-tree is mapped to the pipeline starting at a different stage.]
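To make the direct-index-table construction concrete, here is a small Python sketch (my own illustration; the boolean flag stands in for the entry-stage pointer of the slide) that expands short prefixes across the first x bits and marks which table entries lead into a sub-tree:

```python
# Build a direct index table for initial stride x=2 over the slide's
# prefix set. Each entry records the best next hop among prefixes of
# length <= x (short-prefix expansion) and whether a sub-tree of
# longer prefixes hangs below it.
from itertools import product

prefixes = {"0": "P1", "000": "P2", "0010": "P3", "0011": "P4",
            "011": "P5", "10": "P6", "11": "P7", "110": "P8"}
x = 2  # initial stride

table = {}
for bits in product("01", repeat=x):
    key = "".join(bits)
    # longest matching prefix of length <= x
    hop = None
    for plen in range(x, 0, -1):
        if key[:plen] in prefixes:
            hop = prefixes[key[:plen]]
            break
    # is there a sub-tree (prefixes longer than x) below this entry?
    has_subtree = any(p.startswith(key) and len(p) > x for p in prefixes)
    table[key] = (hop, has_subtree)

print(table)
```

Note that entry 10* resolves entirely within the table (next hop P6, no sub-tree below), matching the "10* → no match" row of the slide's table.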
Dealing with conflicts

Idea: use a request queue in front of each stage.
Intuition: without request queues,
» a request may wait up to n cycles before entering the pipeline
» a waiting request causes all subsequent requests to wait as well, even if they are not competing for the same stages

Issue: ordering
» Limited to requests with different entry stages (addressed to different destinations)
» An optional output reorder buffer can be used

[Figure: a direct lookup table for the initial prefix bits dispatches each destination address to the request queue of its entry stage (stage 1 … stage n); next hops exit through an optional reordering buffer.]
Pipeline efficiency

Metrics:
» Pipeline utilization: fraction of time the pipeline is busy, provided that there is a continuous backlog of requests
» Lookups per Cycle (LPC): average request dispatching rate

Linear pipeline:
» LPC = 1
» Pipeline utilization generally low
– Stage utilization is not uniform

CAMP pipeline:
» High pipeline utilization
– Uniform stage utilization
» LPC close to 1 when each request traverses the complete pipeline (# pipeline stages = # trie levels)
» LPC > 1 when most requests do not make complete circles around the pipeline (# pipeline stages > # trie levels)
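The LPC behavior can be illustrated with a toy cycle-accurate model (my own sketch, not the paper's scheduler: one request per stage per cycle, in-flight requests have priority over new dispatches, and every entry queue is assumed infinitely backlogged):

```python
def simulate_lpc(n_stages, path_len, entry_points, n_cycles=10000):
    """Toy circular-pipeline model: returns completed lookups per cycle."""
    stage = [None] * n_stages      # remaining stage-visits of request here
    completed = 0
    for _ in range(n_cycles):
        advanced = [None] * n_stages
        for i in range(n_stages):
            r = stage[i]
            if r is None:
                continue
            if r == 1:
                completed += 1                     # request exits here
            else:
                advanced[(i + 1) % n_stages] = r - 1  # move to next stage
        # continuous backlog: dispatch a new request at every free entry stage
        for e in entry_points:
            if advanced[e] is None:
                advanced[e] = path_len
        stage = advanced
    return completed / n_cycles

# All stages traversed, single entry point: LPC approaches 1.
print(round(simulate_lpc(8, 8, [0]), 2))
# Shorter paths and multiple entry points: LPC exceeds 1.
print(round(simulate_lpc(8, 4, [0, 4]), 2))
```

With paths spanning all stages each request occupies every stage once, so at most one lookup can be dispatched per cycle; when paths are shorter than the ring, several requests proceed in disjoint arcs simultaneously, which is the source of LPC > 1.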
Pipeline efficiency – all stages traversed

Setup:
» 24 stages, all traversed by each packet
» Packet bursts: sequences of packets to the same entry point

Results:
» Long bursts result in high utilization and LPC
» For all burst sizes, enough queuing (32 entries) guarantees 0.8 LPC

[Chart: LPC (requests per cycle), from 0.5 to 1, vs. request queue size (1–32), for uniformly random traffic and burst lengths 2, 8, 24, 40, 64, 96.]
Pipeline efficiency – LPC > 1

Setup:
» 32 stages, rightmost 24 bits, tree bitmap of stride 3
» Average prefix length 24

Results:
» LPC between 3 and 5
» Long bursts result in lower utilization and LPC

[Chart: LPC (requests per cycle), from 0 to 5, vs. request queue size (1–32), for burst lengths 4, 8, 12, 16, 20, 24, 28, 32.]
Nodes-to-stages mapping

Objectives:
» Uniform distribution of nodes to stages
– Minimize the size of the biggest stage
» Correct operation of the circular pipeline
– Avoid multiple loops around the pipeline
» Simplified update operation
– Avoid skipping levels

[Figure: example sub-trees with nodes a–d and prefixes P1–P4; each node is labeled with the pipeline stage (1–4) it is mapped to, contrasting mappings that violate the objectives with one that satisfies them.]
Nodes-to-stages mapping (cont'd)

Problem formulation (constrained graph coloring):
» Given:
– A list of sub-trees
– A list of colors represented by numbers
» Color the nodes so that:
– Every color is used nearly equally
– A monotonic ordering relationship without gaps among colors is respected when traversing sub-trees from root to leaves

Algorithm (min-max coloring heuristic):
» Color sub-trees in decreasing order of size
» At each step:
– Try all possible colors on the root (the rest of the sub-tree is colored accordingly)
– Pick the local optimum
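The heuristic can be sketched as follows (my own illustration: each sub-tree is summarized by its per-level node counts, and coloring the root with color c assigns level j to color (c + j) mod n, i.e., monotonic colors with wrap-around):

```python
# Min-max coloring heuristic sketch. `subtrees` is a list of per-level
# node counts (index 0 = root level); `n_colors` = # pipeline stages.
def min_max_color(subtrees, n_colors):
    counts = [0] * n_colors        # nodes assigned to each color so far
    root_colors = []
    # color sub-trees in decreasing order of total size
    for levels in sorted(subtrees, key=sum, reverse=True):
        best_c, best_counts = None, None
        for c in range(n_colors):  # try every color on the root
            trial = counts[:]
            for j, nodes in enumerate(levels):
                trial[(c + j) % n_colors] += nodes
            # local optimum: minimize the size of the biggest color
            if best_counts is None or max(trial) < max(best_counts):
                best_c, best_counts = c, trial
        counts = best_counts
        root_colors.append(best_c + 1)   # colors numbered from 1
    return counts, root_colors

counts, roots = min_max_color([[1, 2, 4], [1, 2]], 4)
print(counts, roots)
```

The per-color counts returned here play the role of the "present coloring" column in the example tables on the following slides.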
Min-max coloring heuristic – example

Sub-trees T1–T4 are colored in decreasing order of size. After coloring T1, the per-color node counts are: color 1 → 1, color 2 → 2, color 3 → 4, color 4 → 5. Each candidate color is then tried on the root of the next sub-tree:

           Present    If 1 on    If 2 on    If 3 on    If 4 on
           coloring   new root   new root   new root   new root
Color 1       1          2          5          3          2
Color 2       2          3          3          6          4
Color 3       4          6          5          5          8
Color 4       5          9          7          6          6

Coloring the new root with color 3 minimizes the maximum per-color count (6), so it is picked. For the next sub-tree:

           Present    If 1 on    If 2 on    If 3 on    If 4 on
           coloring   new root   new root   new root   new root
Color 1       3          4          5          4          5
Color 2       6          8          7          8          7
Color 3       5          6          7          6          7
Color 4       6          8          7          8          7

Here colors 2 and 4 tie with a minimum maximum of 7, and the resulting final coloring is (5, 7, 7, 7).
Evaluation settings

Trends in BGP tables:
» Increasing number of prefixes
» Most prefixes are shorter than 26 bits (~24 bits)
» Route updates can concentrate in short periods of time; however, they rarely change the shape of the trie

Dataset: 50 BGP tables containing from 50K to 135K prefixes
Memory requirements

[Charts: relative size of each pipeline stage vs. stage number, for three mappings. Level-based mapping: sum of normalized upper bounds = 1.31. Height-based mapping: sum of normalized upper bounds = 1.23. CAMP: sum of normalized upper bounds = 1.024.]

» Balanced distribution across stages
» Reduced total memory requirements
– Memory overhead: 2.4% with initial stride 8, 0.02% with initial stride 12, 0.01% with initial stride 16
Updates

Techniques for handling updates:
» Single updates inserted as "bubbles" in the pipeline
» Rebalancing computed offline, involving only a subset of the tries

Scenario:
» Migration between different BGP tables
» Imbalance leads to a 4% increase in the occupancy of the largest stage

[Chart: relative size of the smallest (min) and largest (max) pipeline stage, ranging roughly from 0.043 to 0.05, vs. number of migrations (1–21).]
Summary

Analysis of a circular pipeline architecture for trie-based IP lookup.

Goals:
» Minimize memory requirement
» Maximize pipeline utilization
» Handle updates efficiently

Design:
» Decoupling the # of stages from the maximum prefix length
» LPC analysis
» Nodes-to-stages mapping heuristic

Evaluation:
» On real BGP tables
» Good memory utilization and ability to keep a 40 Gbps line rate through small memory banks
Addressing the worst case

Observations:
» We addressed practical datasets
» Worst-case tries may have long and skinny sections that are difficult to split

Idea: adaptive CAMP
» Split the trie into "parent" and "child" sub-tries
» Map the parent sub-trie onto the pipeline
» Use more pipeline stages to mitigate the effect of multiple loops around the pipeline

[Figure: a trie of depth k split at the root, where the rank of node i is defined as the size of the sub-trie rooted at i.]
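The node rank used by adaptive CAMP is simply the sub-trie size. As a sketch (my own nested-dict trie representation, not the paper's data structure):

```python
# rank(i) = number of nodes in the sub-trie rooted at i.
# A trie node is represented as a dict mapping bit -> child node.
def rank(node):
    """Return the size of the sub-trie rooted at `node`."""
    return 1 + sum(rank(child) for child in node.values())

# Tiny example: root with children "0" (which has two leaves) and "1".
trie = {"0": {"0": {}, "1": {}}, "1": {}}
print(rank(trie))  # → 5 (root, 0, 00, 01, 1)
```

Ranking nodes this way identifies the heavy "parent" portion of a skinny trie, which is then mapped to the pipeline separately from the child sub-tries.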