Memory Compression Algorithms for Networking Features
Sailesh Kumar
2 - Sailesh Kumar - 04/18/23
Outline
Regular expressions based packet content inspection (main focus)
» D2FA
» CD2FA
Packet header processing
» HEXA (History based Encoding, eXecution and Addressing)
Why care about Regular Expressions?
Widely used
» Network intrusion detection systems (NIDS)
» Layer 7 switches, load balancing
» Firewalls, filtering, authentication and monitoring
» Content-based traffic management and routing
Expensive
» Space: large amount of memory
» Bandwidth: requires 1+ state traversal per byte
Performance bottleneck
» In enterprise switches, etc.
» Security appliances
– Use DFA, 1+ GB memory, still sub-gigabit throughput
» Need to accelerate RegEx!
Can we do Better?
Well studied in compiler literature
» What’s different in networking?
» Can we do better?
Performance metric (grep)
» Traditionally, (construction + execution) time is the metric
» In the networking context, execution time is critical
» Also, there may be thousands of patterns
DFAs are fast
» But can have an exponentially large number of states
» Algorithms exist to minimize the number of states
» Still 1) low performance and 2) gigabytes of memory
How to achieve high performance?
» Use ASIC/FPGA
– On-chip memories provide ample bandwidth
– Volume and need for speed justify a custom solution
» Limited memory, need a space-efficient representation!
Introduction to Our Approach
How to represent DFAs more compactly?
» Can’t reduce the number of states
» How about reducing the number of transitions?
– 256 transitions per state
– 50+ distinct transitions per state (real-world datasets)
– Need at least 50+ words per state
Three rules: a+, b+c, c*d+
[Figure: DFA for the three rules, with 5 states and 4 transitions per state]
Look at state pairs: there are many common transitions. How to remove them?
[Figure: the 5-state DFA (4 transitions per state) next to an alternative representation of the same automaton]
Alternative representation: fewer transitions, less memory
D2FA Operation
Three rules: a+, b+c, c*d+
[Figure: the DFA (left) and the equivalent D2FA (right) for the three rules]
Input stream: a b d
The DFA and the D2FA visit the same accepting state after consuming a character
Heavy edges are called default transitions
Take a default transition whenever a labeled transition is missing
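The default-transition rule can be sketched in a few lines of Python. This is a minimal illustration, not the automaton from the slide: the 3-state `DFA` table, the `DEFAULT` tree and all function names are hypothetical. It drops every labeled transition that a state shares with its default state, then checks that the compressed automaton visits the same states as the original DFA.

```python
# Hypothetical 3-state DFA over {a, b, c, d}; not the automaton from the slide.
DFA = {
    0: {"a": 1, "b": 0, "c": 2, "d": 0},
    1: {"a": 1, "b": 0, "c": 2, "d": 1},
    2: {"a": 1, "b": 2, "c": 2, "d": 0},
}

# Assumed default transitions: a tree rooted at state 0 (the root has none).
DEFAULT = {1: 0, 2: 0}


def build_d2fa(dfa, default):
    """Keep only the labeled transitions that differ from the default state's."""
    labeled = {}
    for state, row in dfa.items():
        parent = default.get(state)
        labeled[state] = {
            ch: nxt for ch, nxt in row.items()
            if parent is None or dfa[parent][ch] != nxt
        }
    return labeled


def d2fa_step(labeled, default, state, ch):
    """Follow default transitions until a labeled transition on ch exists."""
    hops = 0
    while ch not in labeled[state]:
        state = default[state]
        hops += 1
    return labeled[state][ch], hops


def run(dfa, labeled, default, text, start=0):
    """Run DFA and D2FA in lockstep; return both sequences of visited states."""
    s_dfa = s_d2fa = start
    trace_dfa, trace_d2fa = [], []
    for ch in text:
        s_dfa = dfa[s_dfa][ch]
        s_d2fa, _ = d2fa_step(labeled, default, s_d2fa, ch)
        trace_dfa.append(s_dfa)
        trace_d2fa.append(s_d2fa)
    return trace_dfa, trace_d2fa


LABELED = build_d2fa(DFA, DEFAULT)
```

In this toy case the 12 transitions of the DFA shrink to 6 labeled transitions plus 2 default transitions, at the cost of one extra hop whenever a labeled transition is missing.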
D2FA Operation
[Figure: the DFA and the D2FA for the three rules]
Any set of default transitions will suffice if there are no cycles of default transitions
Thus, we need to construct trees of default transitions
So, how do we construct space-efficient D2FAs while keeping default paths bounded?
[Figure: two alternative default transition trees over states 1 to 5]
The above two sets of default transition trees are also correct
However, we may traverse 2 default transitions to consume a character
Thus, we need to do more work => lower performance
D2FA Construction
Present a systematic approach to construct a D2FA
Begin with a state-minimized DFA
Construct the space reduction graph
» Undirected graph, vertices are states of the DFA
» Edges exist between vertices with common transitions
» Weight of an edge = # of common transitions - 1
[Figure: the 5-state DFA and its space reduction graph, whose edges carry weights of 2 or 3]
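As a rough sketch of the space reduction graph, the snippet below counts common transitions for every pair of states and assigns the weight defined above. The toy `DFA` and the function name are hypothetical, and keeping only edges with positive weight is an assumption (an edge of weight 0 would save nothing).

```python
from itertools import combinations

# Hypothetical 3-state DFA over {a, b, c, d}; not the automaton from the slide.
DFA = {
    0: {"a": 1, "b": 0, "c": 2, "d": 0},
    1: {"a": 1, "b": 0, "c": 2, "d": 1},
    2: {"a": 1, "b": 2, "c": 2, "d": 0},
}


def space_reduction_graph(dfa):
    """Return {(u, v): weight} with weight = number of common transitions - 1."""
    edges = {}
    for u, v in combinations(sorted(dfa), 2):
        common = sum(1 for ch in dfa[u] if dfa[u][ch] == dfa[v].get(ch))
        if common > 1:  # assumption: skip edges that would save nothing
            edges[(u, v)] = common - 1
    return edges


GRAPH = space_reduction_graph(DFA)
```

Here states 0 and 1 agree on a, b and c, so their edge has weight 2: making one the default of the other removes three labeled transitions and adds one default transition.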
D2FA Construction
Convert certain edges into default transitions
» A default transition reduces w transitions (w = weight of the edge)
» If we pick high-weight edges => more space reduction
» Find a maximum weight spanning forest
» Tree edges become the default transitions
Problem: the spanning tree may have a very large diameter
» Longer default paths => lower performance
[Figure: the DFA and its space reduction graph, with the selected tree edges and the root marked]
# of transitions removed = 2+3+3+3 = 11
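A minimal sketch of this step, assuming a made-up space reduction graph on 5 states (the `EDGES` weights are illustrative, not the ones in the figure): Kruskal's algorithm run on edges sorted by decreasing weight yields a maximum weight spanning forest, and a breadth-first search from a chosen root orients the tree edges into default transitions.

```python
from collections import defaultdict, deque

# Hypothetical space reduction graph on 5 states: {(u, v): weight}.
EDGES = {(0, 1): 3, (0, 2): 2, (1, 2): 3, (1, 3): 2,
         (2, 3): 3, (2, 4): 2, (3, 4): 3}


def max_weight_spanning_forest(num_states, edges):
    """Kruskal's algorithm, visiting the heaviest edges first."""
    parent = list(range(num_states))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    for (u, v), w in sorted(edges.items(), key=lambda kv: (-kv[1], kv[0])):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two different trees
            parent[ru] = rv
            chosen.append((u, v, w))
    return chosen


def orient_towards_root(tree_edges, root):
    """Turn undirected tree edges into default transitions pointing at root."""
    adj = defaultdict(list)
    for u, v, _ in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    default, seen, queue = {}, {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                default[nb] = node
                queue.append(nb)
    return default


FOREST = max_weight_spanning_forest(5, EDGES)
DEFAULT = orient_towards_root(FOREST, root=2)
REMOVED = sum(w for _, _, w in FOREST)
```

With these weights the forest is the path 0-1-2-3-4, which removes 12 transitions but has diameter 4; rooting it at the middle state 2 keeps every default path at 2 edges or fewer.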
D2FA Construction
We need to construct bounded diameter trees
» NP-hard
» A small diameter bound leads to low tree weight
– Less space-efficient D2FA
» Time-space trade-off
We propose a heuristic algorithm based upon Kruskal’s algorithm to create compact bounded diameter D2FAs
» Details in the SIGCOMM 2006 paper
[Figure: the DFA and its space reduction graph]
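The following is a simplified stand-in for such a heuristic, not the algorithm from the paper: it runs Kruskal's algorithm on the heaviest edges first and rejects any edge that would push the diameter of the merged tree above a bound. The `EDGES` weights and all function names are hypothetical, and the diameter is recomputed by breadth-first search, which is only practical for small examples.

```python
from collections import defaultdict, deque

# Hypothetical space reduction graph on 5 states: {(u, v): weight}.
EDGES = {(0, 1): 3, (0, 2): 2, (1, 2): 3, (1, 3): 2,
         (2, 3): 3, (2, 4): 2, (3, 4): 3}


def bfs(adj, src):
    """Distances (in edges) from src to every node in its tree."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def diameter(adj, node):
    """Diameter of the tree containing node, found with two BFS passes."""
    first = bfs(adj, node)
    far = max(first, key=first.get)
    return max(bfs(adj, far).values())


def bounded_diameter_forest(edges, bound):
    """Greedy Kruskal variant that rejects edges violating the diameter bound."""
    adj = defaultdict(set)
    chosen = []
    for (u, v), w in sorted(edges.items(), key=lambda kv: (-kv[1], kv[0])):
        if v in bfs(adj, u):
            continue  # u and v are already in the same tree
        adj[u].add(v)
        adj[v].add(u)  # tentatively merge the two trees
        if diameter(adj, u) <= bound:
            chosen.append((u, v, w))
        else:
            adj[u].discard(v)
            adj[v].discard(u)  # undo: the merged tree would be too deep
    return chosen


TIGHT = bounded_diameter_forest(EDGES, bound=2)
LOOSE = bounded_diameter_forest(EDGES, bound=4)
```

On this toy graph a bound of 2 keeps three edges of total weight 9, while a bound of 4 recovers the unbounded result of weight 12, which illustrates the time-space trade-off named above.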
Results
We ran experiments on
» Cisco RegEx rules
» Linux application protocol classifier rules
» Bro rules
» Snort rules (subset of rules)
Size of DFA versus D2FA (no default path length bound applied)

           Original DFA                D2FA, normal spanning tree                  D2FA, refined spanning tree
Dataset    # of states  Total # of     # of         %          Max. default        # of         %          Max. default
                        transitions    transitions  reduction  length              transitions  reduction  length
Cisco590   17713        4.5M           36k          99.2       57                  36k          99.2       17
Cisco103   21050        5.3M           53k          99.0       54                  53k          99.0       19
Cisco7     4260         1.0M           28k          97.4       61                  28k          97.4       23
Linux56    13953        3.5M           58k          98.3       30                  58k          98.3       21
Linux10    13003        3.3M           285k         91.3       20                  285k         91.3       17
Snort11    41949        10.7M          168k         98.4       9                   168k         98.4       6
Bro648     6216         1.5M           7k           99.5       17                  7k           99.5       8
Space-Time Tradeoff
[Figure: normalized D2FA size (log scale, 0.001 to 1) versus bound on default path length (1 to 7) for Cisco103, Linux56, Snort11 and Bro648]
Longer default path => more work but less space
Space-efficient region
» Default paths have length 4+
» Requires 4+ memory accesses per character
We propose a memory architecture which enables us to consume one character per clock cycle
Outline
Regular expressions based packet content inspection (main focus)
» D2FA
» CD2FA
Packet header processing
» HEXA (History based Encoding, eXecution and Addressing)
D2FA versus DFA
D2FAs are compact but require multiple memory accesses
» Up to 20x more memory accesses
» Not desirable in an off-chip architecture
Can D2FAs match the performance of DFAs?
» YES!
» Content Addressed D2FAs (CD2FA)
CD2FAs require only one memory access per byte
» Match the performance of a DFA in a cacheless system
» In systems with a data cache, CD2FAs are 2-3x faster
CD2FAs are 10x more compact than DFAs
Introduction to CD2FA, ANCS’06
How to avoid multiple memory accesses of D2FAs?
» Avoid the lookup to decide if the default path needs to be taken
» Avoid default path traversal
Solution: assign labels to each state; labels contain:
» Characters for which it has labeled transitions
» Information about all of its default states
» Characters for which its default states have labeled transitions
[Figure: default path V -> U -> R. R has transitions for all characters, U for c and d, V for a and b]
Content labels
» R has label R: find node R at location R
» U has label (cd,R): find node U at hash(c,d,R)
» V has label (ab,cd,R): find node V at hash(a,b,hash(c,d,R))
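Introduction to CD2FA
[Figure: two default paths with their content labels. V (label ab,cd,R) -> U (label cd,R) -> R, stored at hash(a,b,hash(c,d,R)), hash(c,d,R) and R. X (label pq,lm,Z) -> Y (label lm,Z) -> Z, with X stored at hash(p,q,hash(l,m,Z)). Transition memory holds entries such as (R, a), (R, b), …, (Z, a), (Z, b), …, (V, a), (V, b), (X, p), (X, q)]
Current state: V (label = ab,cd,R)
Input characters shown: d, a
The transition entry read from memory gives the next state X together with its label (pq,lm,Z)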
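The lookup idea can be sketched as follows, under loudly stated assumptions: labels are modelled as nested tuples, `zlib.crc32` stands in for the hash function, hashing is taken to be collision free, and every transition target in `memory` is made up for illustration (only the states and labels R, U, V, X, Y, Z follow the figure). The point is that the label alone tells us which state on the default path owns the transition, so a single memory read suffices.

```python
import zlib

# Assumed label encoding: a root label is its memory location (an int);
# any other label is (characters with labeled transitions, parent label).
R = 0
U = ("cd", R)
V = ("ab", U)
Z = 1
Y = ("lm", Z)
X = ("pq", Y)


def address(label):
    """Nested hashing, e.g. V lives at hash(a,b,hash(c,d,R))."""
    if isinstance(label, int):
        return label
    chars, parent = label
    return zlib.crc32(f"{chars}|{address(parent)}".encode())


def owner(label, ch):
    """Inspect the label (no memory access) to find the state that owns ch."""
    while not isinstance(label, int) and ch not in label[0]:
        label = label[1]
    return label


def step(memory, label, ch):
    """Exactly one memory read: the entry stored with the owning state."""
    return memory[(address(owner(label, ch)), ch)]


# Hypothetical transition memory; each entry stores the next state's label.
memory = {
    (address(V), "a"): X,
    (address(V), "b"): X,
    (address(U), "c"): U,
    (address(U), "d"): R,
    (address(R), "e"): U,  # the root has a transition for every character
}
```

For example, in state V the input d is not among V's own characters a and b, so the label redirects the single read to U's location, and the input e falls through to the root R.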
Construction of CD2FA
We seek to keep the content labels small
Twin objectives:
» Ensure that states have few labeled transitions
» Ensure that default paths are as short as possible
The D2FA construction heuristic based upon a maximum weight spanning tree creates long default paths
» Limiting default paths => less space-efficient D2FAs
We propose a new heuristic called CRO to construct D2FAs
» Runs in 3 phases: Construction, Reduction and Optimization
» With a default path bound of 2 edges, the CRO algorithm constructs up to 10x more space-efficient D2FAs
» CD2FAs are constructed from these D2FAs
Memory Mapping in CD2FA
[Figure: the default paths V -> U -> R and X -> Y -> Z with their content labels, mapped to memory locations hash(a,b,hash(c,d,R)), hash(c,d,R) and hash(p,q,hash(l,m,Z)); a COLLISION is marked between hashed locations]
We have assumed that hashing is collision free
Collision-free Memory Mapping
[Figure: four states with content labels (abc, …), (pqr, …), (lmn, …) and (def, …) must be mapped to 4 memory locations. Each label can be hashed in more than one way, e.g. hash(def, …) or hash(edf, …), and hash(lmn, …) or hash(mln, …)]
We need a systematic approach
Bipartite Graph Matching
Bipartite graph
» Left nodes are state content labels
» Right nodes are memory locations
» Map state labels to unique memory locations
» An edge for every choice of content label
» Perfect matching problem
With n left and right nodes
» Need O(log n) random edges per node
» n = 1M implies we need ~20 edges per node
If we provide slight memory over-provisioning
» We can uniquely map state labels with far fewer edges
In our experiments, we found a perfect matching without memory over-provisioning
[Figure: bipartite graph with content labels on the left and memory addresses on the right]
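One standard way to solve this mapping is bipartite matching with augmenting paths (Kuhn's algorithm); the slides do not say which matching algorithm was used, so this choice is an assumption. The candidate locations in `CHOICES` are made-up hash values for four labels, and the last line reproduces the "~20 edges" estimate on the assumption that the logarithm is base 2.

```python
import math


def match_labels(choices):
    """choices[label] lists candidate memory locations for that label.
    Returns {label: location} for a maximum matching (augmenting paths)."""
    owner = {}  # location -> label currently stored there

    def try_place(label, visited):
        for loc in choices[label]:
            if loc in visited:
                continue
            visited.add(loc)
            # Take a free location, or evict the occupant if it can move.
            if loc not in owner or try_place(owner[loc], visited):
                owner[loc] = label
                return True
        return False

    for label in choices:
        try_place(label, set())
    return {label: loc for loc, label in owner.items()}


# Hypothetical candidate locations for four content labels.
CHOICES = {"abc": [0, 1], "pqr": [0], "lmn": [1, 2], "def": [2, 3]}
MAPPING = match_labels(CHOICES)

# log2(1M) is about 19.9, consistent with ~20 edges per node.
EDGES_PER_NODE = math.ceil(math.log2(1_000_000))
```

In this example "abc" first takes location 0, is evicted to location 1 when "pqr" (which has no other choice) arrives, and all four labels end up at distinct locations.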
Memory Reduction Results
Columns: DFA = original DFA memory (MB) with no table compression; DFA-TC = original DFA memory (MB) with table compression; Creation, Reduction, Optimization = CD2FA memory (MB) after each phase (the optimization phase includes alphabet reduction); Alphabet = alphabet size; Ratio = size of CD2FA ÷ size of DFA-TC

Dataset    # of states  DFA    DFA-TC  Creation  Reduction  Optimization  Alphabet  Ratio
Cisco590   17713        9.07   6.23    8.87      0.80       0.39          98        0.062
Cisco103   21050        10.77  9.56    9.72      1.87       0.86          106       0.089
Cisco7     4260         2.18   1.14    1.76      0.44       0.23          126       0.201
Linux56    13953        7.14   3.62    3.73      1.17       0.61          123       0.168
Linux10    13003        6.65   3.35    7.27      3.01       1.48          118       0.441
Snort11    37167        19.03  3.55    6.31      1.28       0.36          37        0.101
Bro648     6216         3.18   1.26    0.77      0.08       0.05          83        0.039
Throughput Results
16-bit wide, 250MHz DDR RLDRAM (access size 8B)
[Figure: throughput (Gbps, 0 to 5) of DFA-TC, DFA and CD2FA with no cache, a 1 KB 1-way Dcache and a 4 KB 1-way Dcache]
3x faster with a 4 KB cache
Outline
Regular expressions based packet content inspection (main focus)
» D2FA
» CD2FA
Packet header processing
» HEXA (History based Encoding, eXecution and Addressing)
HEXA, ICNP’07
HEXA (History-based Encoding, eXecution and Addressing)
» Challenges the assumption that graph structures must store log2(n)-bit pointers to identify successor nodes
» Requires only 2-bit versus 20-bit pointers (for 1 million nodes)
Useful for
» IP lookup tries (directed acyclic graph)
» Simple finite automata such as Aho-Corasick string matchers
Tries - Traditional Implementation
Prefixes
1*     P1
00*    P2
11*    P3
011*   P4
0100*  P5

Addr  data (prefix flag, left child, right child)
1     0, 2, 3
2     0, 4, 5
3     1, NULL, 6
4     1, NULL, NULL
5     0, 7, 8
6     1, NULL, NULL
7     0, 9, NULL
8     1, NULL, NULL
9     1, NULL, NULL

[Figure: (a) the prefix table and (b) the binary trie built from it, with nodes numbered 1 to 9 and prefixes P1 to P5 marked on their nodes]
There are nine nodes; we will need 4-bit node identifiers
A node will require 9 bits
» Two 4-bit child pointers
» One flag indicates if the node is a prefix
Total memory = 9 x 9 bits
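The memory arithmetic can be checked with a short sketch. The prefix table comes from the slide; the helper name and the choice to reserve one pointer code for NULL are assumptions.

```python
import math

# Prefix table from the slide.
PREFIXES = {"1": "P1", "00": "P2", "11": "P3", "011": "P4", "0100": "P5"}


def trie_nodes(prefixes):
    """Every leading substring of a prefix is a trie node ('' is the root)."""
    return {p[:i] for p in prefixes for i in range(len(p) + 1)}


NODES = trie_nodes(PREFIXES)
# Assumption: one extra pointer code is reserved to represent NULL.
POINTER_BITS = math.ceil(math.log2(len(NODES) + 1))
NODE_BITS = 2 * POINTER_BITS + 1  # two child pointers + one prefix flag
TOTAL_BITS = len(NODES) * NODE_BITS
```

This gives 9 nodes, 4-bit pointers, 9 bits per node and 81 bits in total, matching the 9 x 9 bits stated above.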
HEXA based Implementation
[Figure: the prefix table (1* P1, 00* P2, 11* P3, 011* P4, 0100* P5) and its binary trie with nodes 1 to 9]
Define the HEXA identifier of a node as the path which leads to it from the root

Node  HEXA identifier
1     -
2     0
3     1
4     00
5     01
6     11
7     010
8     011
9     0100

Notice that these identifiers are unique
Thus, they can potentially be mapped to unique memory addresses
HEXA based Implementation
[Figure: the prefix table and trie, with HEXA identifiers 1: -, 2: 0, 3: 1, 4: 00, 5: 01, 6: 11, 7: 010, 8: 011, 9: 0100]
Use hashing to map the HEXA identifier to a memory address
If we have a minimal perfect hash function f (a function that maps elements to unique locations)

f(-) = 4     f(0) = 7     f(1) = 9
f(00) = 2    f(01) = 8    f(11) = 1
f(010) = 5   f(011) = 3   f(0100) = 6

Then we can store the trie as shown below

Addr  node mem  Prefix
1     1,0,0     P3
2     1,0,0     P2
3     1,0,0     P4
4     0,1,1
5     0,1,0
6     1,0,0     P5
7     0,1,1
8     0,1,1
9     1,0,1     P1

Here we use only 3 bits per node in the fast path
Example: for IP addr. 11…, we visit identifiers -, 1 and 11 and reach P3, the prefix we were looking for
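A minimal sketch of a HEXA lookup using the values on this slide. The tables `F`, `NODE_MEM` and `PREFIX_MEM` are copied from the slide; the function name is hypothetical, and reading the three node bits as (prefix flag, has left child, has right child) is an interpretation that is consistent with the trie.

```python
# Minimal perfect hash from the slide: HEXA identifier -> memory address
# (the empty string stands for the root identifier "-").
F = {"": 4, "0": 7, "1": 9, "00": 2, "01": 8, "11": 1,
     "010": 5, "011": 3, "0100": 6}

# Fast-path memory: address -> (is_prefix, has_left, has_right).
NODE_MEM = {1: (1, 0, 0), 2: (1, 0, 0), 3: (1, 0, 0),
            4: (0, 1, 1), 5: (0, 1, 0), 6: (1, 0, 0),
            7: (0, 1, 1), 8: (0, 1, 1), 9: (1, 0, 1)}

# Prefix memory, kept outside the 3-bit fast path.
PREFIX_MEM = {1: "P3", 2: "P2", 3: "P4", 6: "P5", 9: "P1"}


def longest_prefix_match(bits):
    """Walk the trie; the bits consumed so far are the node's HEXA identifier."""
    path, best, i = "", None, 0
    while True:
        addr = F[path]  # no stored pointer: the address comes from the history
        is_prefix, has_left, has_right = NODE_MEM[addr]
        if is_prefix:
            best = PREFIX_MEM[addr]
        if i == len(bits):
            return best
        has_child = has_right if bits[i] == "1" else has_left
        if not has_child:
            return best
        path += bits[i]
        i += 1
```

For an address starting with 11, the walk reads addresses 4, 9 and 1 and returns P3; no child pointer is ever fetched, only the 3 bits stored per node.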
Devising One-to-one Mapping
Finding a minimal perfect hash function is difficult
» One-to-one mapping is essential for HEXA to work
Use discriminator bits
» Attach c bits that we can modify to every HEXA identifier
» Thus a node can have 2^c choices of identifiers
» We now need to store these c bits for every child instead of a single flag
With multiple choices of HEXA identifiers for a node, reduce the problem to bipartite graph matching
» We need to find a perfect matching in the graph to map nodes to unique memory locations
Devising One-to-one Mapping
[Figure: bipartite graph for the 9 trie nodes. Left side: input labels or HEXA identifiers (-, 0, 1, 00, 01, 11, 010, 011, 0100). Each identifier is prefixed with a 2-bit discriminator (00, 01, 10 or 11), giving four choices of HEXA identifier and therefore four choices of memory location among locations 0 to 8 on the right side]
Examples of choices of memory locations
» Node -: h(00) = 0, h(01) = 4, h(10) = 1, h(11) = 5
» Node 0: h(000) = 1, h(010) = 5, h(100) = 2, h(110) = 6
» Candidate locations shown for the remaining seven nodes: 0,4,1,5; 2,6,3,7; 1,5,2,6; 8,3,0,4; 1,5,6,2; 0,4,5,1; 0,3,4,6
Perfect matching: pick appropriate discriminators
HEXA based Implementation
[Figure: the prefix table and trie, with HEXA identifiers 1: -, 2: 0, 3: 1, 4: 00, 5: 01, 6: 11, 7: 010, 8: 011, 9: 0100]
For every left and right child, store its discriminator instead of a single flag

Addr  node mem  Prefix
1     1,xx,xx   P3
2     1,xx,xx   P2
3     1,xx,xx   P4
4     0,xx,xx
5     0,xx,xx
6     1,xx,xx   P5
7     0,xx,xx
8     0,xx,xx
9     1,xx,xx   P1

Here we use only 5 bits per node in the fast path
Results
3 choices are enough to find a perfect matching
» Thus 2-bit discriminators (the value 00 is reserved for no child)
– Significant reduction: 2 bits per node versus log2(n) bits
[Figure: fast path trie memory (MB, 0 to 1) versus stride (1 to 6), without HEXA and with HEXA]
32 Eatherton tries, each containing 100-120k prefixes
Incremental Updates
IP table updates are very frequent
» When a node is removed and another added, we must ensure that only a few memory operations are needed
In the new bipartite graph, a new perfect matching can be found
» Quickly (O(n) time in the worst case, typically constant time)
» The new matching is only slightly different from the previous matching
– Typically around 10 different edges; experimental worst case: 18
– Thus at most 18 memory operations are needed for an update
Questions???