
Memory Compression Algorithms for Networking Features

Sailesh Kumar

2 - Sailesh Kumar - 04/18/23

Outline

Regular expressions based packet content inspection (main focus)
» D2FA
» CD2FA

Packet header processing
» HEXA (History based Encoding, eXecution and Addressing)


Why care about Regular Expressions?

Widely used
» Network intrusion detection systems (NIDS)
» Layer 7 switches, load balancing
» Firewalls, filtering, authentication and monitoring
» Content-based traffic management and routing

Expensive
» Space: large amount of memory
» Bandwidth: requires 1+ state traversal per byte

Performance bottleneck
» In enterprise switches, etc.
» Security appliances
– Use DFA, 1+ GB memory, still sub-gigabit throughput
» Need to accelerate RegEx!


Can we do Better?

Well studied in compiler literature
» What's different in networking?
» Can we do better?

Performance metric (grep)
» Traditionally, (construction + execution) time is the metric
» In networking context, execution time is critical
» Also, there may be thousands of patterns

DFAs are fast
» But can have an exponentially large number of states
» Algorithms exist to minimize the number of states
» Still 1) low performance and 2) gigabytes of memory

How to achieve high performance?
» Use ASIC/FPGA
– On-chip memories provide ample bandwidth
– Volume and need for speed justify a custom solution
» Limited memory, need space-efficient representation!


Introduction to Our Approach

How to represent DFAs more compactly?
» Can't reduce number of states
» How about reducing number of transitions?
– 256 transitions per state
– 50+ distinct transitions per state (real world datasets)
– Need at least 50+ words per state

[Figure: DFA with five states (1-5) for the three rules a+, b+c and c*d+; 4 transitions per state]

Look at state pairs: there are many common transitions. How to remove them?


Introduction to Our Approach



[Figure: the five-state DFA (4 transitions per state) next to an alternative representation of the same automaton]

Fewer transitions, less memory


D2FA Operation

[Figure: the DFA (left) and the equivalent D2FA (right); heavy edges in the D2FA are default transitions]

Heavy edges are called default transitions. Take a default transition whenever a labeled transition is missing.

Input stream: a b d. The DFA and the D2FA visit the same accepting state after consuming a character.
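The default-transition rule above can be sketched in a few lines of Python. The three-state automaton below is a hypothetical example over the alphabet {a, b, c}, not the five-state automaton from the slides, and the names `DFA`, `LABELED`, `DEFAULT`, `d2fa_step`, `run_dfa` and `run_d2fa` are illustrative assumptions.

```python
# Hypothetical DFA over {a, b, c}; state 2 is reached after reading "...ab".
DFA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}

# Equivalent D2FA: states 1 and 2 keep only the transitions that differ
# from state 0 and fall back to state 0 through a default transition.
LABELED = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"b": 2},
    2: {},
}
DEFAULT = {1: 0, 2: 0}  # state 0 is the root and has no default transition


def d2fa_step(state, ch):
    """Consume one character; return (next_state, default_hops_taken)."""
    hops = 0
    while ch not in LABELED[state]:
        state = DEFAULT[state]  # a default transition consumes no input
        hops += 1
    return LABELED[state][ch], hops


def run_dfa(text, state=0):
    visited = []
    for ch in text:
        state = DFA[state][ch]
        visited.append(state)
    return visited


def run_d2fa(text, state=0):
    visited, total_hops = [], 0
    for ch in text:
        state, hops = d2fa_step(state, ch)
        visited.append(state)
        total_hops += hops
    return visited, total_hops
```

On any input, `run_d2fa` visits the same states as `run_dfa` after each character, while storing 4 labeled plus 2 default transitions instead of 9. The extra hops it reports are the additional memory accesses that the later slides try to bound.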



D2FA Operation

[Figure: the DFA and the D2FA from the previous slide]

Any set of default transitions will suffice if there are no cycles of default transitions. Thus, we need to construct trees of default transitions.

So, how do we construct space-efficient D2FAs while keeping default paths bounded?

[Figure: two alternative default transition trees over states 1-5]

The above two sets of default transition trees are also correct. However, we may traverse 2 default transitions to consume a character. Thus, we need to do more work => lower performance.


D2FA Construction

Present a systematic approach to construct D2FAs
Begin with a state-minimized DFA
Construct space reduction graph
» Undirected graph, vertices are states of DFA
» Edges exist between vertices with common transitions
» Weight of an edge = # of common transitions - 1

[Figure: the five-state DFA and its space reduction graph; edge weights are 2 or 3]
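As a minimal sketch of the space reduction graph, the function below compares every pair of states and assigns the weight defined above. The DFA used here is a hypothetical three-state example, not the one in the figure, and `space_reduction_graph` is an illustrative name.

```python
from itertools import combinations


def space_reduction_graph(dfa):
    """dfa maps state -> {char: next_state}; returns {(u, v): weight}."""
    edges = {}
    for u, v in combinations(sorted(dfa), 2):
        # A transition is common when both states go to the same
        # next state on the same character.
        common = sum(1 for ch, nxt in dfa[u].items() if dfa[v].get(ch) == nxt)
        if common >= 1:
            # One default transition replaces `common` labeled ones,
            # so the net saving is common - 1.
            edges[(u, v)] = common - 1
    return edges


# Hypothetical DFA over {a, b, c}.
EXAMPLE_DFA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}
EXAMPLE_EDGES = space_reduction_graph(EXAMPLE_DFA)
```

Here states 0 and 2 agree on all three characters, so their edge has weight 2, while the other two pairs agree on two characters and get weight 1. The pairwise comparison is quadratic in the number of states, which is acceptable for a sketch but would need care for DFAs with tens of thousands of states.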


D2FA Construction

Convert certain edges into default transitions
» A default transition reduces w transitions (w = wt. of edge)
» If we pick high weight edges => more space reduction
» Find maximum weight spanning forest
» Tree edges become the default transitions

Problem: spanning tree may have very large diameter
» Longer default paths => lower performance

[Figure: the DFA, its space reduction graph and a spanning tree with a chosen root; # of transitions removed = 2+3+3+3 = 11]
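A maximum weight spanning forest can be found with Kruskal's algorithm run on edges sorted by decreasing weight. The sketch below assumes states are numbered 0..n-1 and uses hypothetical edge weights (the assignment of weights to edges in the figure is not recoverable); `max_weight_spanning_forest` and `default_transitions` are illustrative names.

```python
from collections import deque


def max_weight_spanning_forest(num_states, edges):
    """Kruskal on {(u, v): weight}, heaviest edges first."""
    parent = list(range(num_states))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for (u, v), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two different trees
            parent[ru] = rv
            tree.append((u, v, w))
    return tree


def default_transitions(tree, root):
    """Orient the tree containing `root` so every state points toward it."""
    adj = {}
    for u, v, _ in tree:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    default, seen, queue = {}, {root}, deque([root])
    while queue:
        s = queue.popleft()
        for t in adj.get(s, []):
            if t not in seen:
                seen.add(t)
                default[t] = s  # t's default transition goes to s
                queue.append(t)
    return default


# Hypothetical weights on a complete graph over five states.
EDGES = {
    (0, 1): 3, (0, 2): 2, (0, 3): 2, (0, 4): 3, (1, 2): 3,
    (1, 3): 2, (1, 4): 2, (2, 3): 3, (2, 4): 2, (3, 4): 3,
}
TREE = max_weight_spanning_forest(5, EDGES)
SAVED = sum(w for _, _, w in TREE)
```

With these weights the tree is the path 4-0-1-2-3, which removes 12 transitions. Rooting it at state 1 gives default paths of at most 2 hops, whereas rooting it at state 4 gives a default path of 4 hops, which is the diameter problem noted above.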


D2FA Construction

We need to construct bounded diameter trees
» NP-hard
» A small diameter bound leads to low tree weight
– Less space-efficient D2FA
» Time-space trade-off

We propose a heuristic algorithm based upon Kruskal's algorithm to create compact bounded diameter D2FAs
» Details in SIGCOMM 2006 paper

[Figure: the DFA and its space reduction graph, as on the previous slides]
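One simple way to respect a diameter bound, shown here only as a sketch and not as the heuristic from the SIGCOMM 2006 paper, is to run Kruskal's algorithm and reject any edge whose addition would push the joined tree's diameter past the bound. The function names and the edge weights are hypothetical.

```python
from collections import deque


def eccentricity(adj, start):
    """Longest distance (in edges) from `start` within its own tree."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in adj[s]:
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return max(dist.values())


def bounded_diameter_forest(num_states, edges, bound):
    adj = {s: [] for s in range(num_states)}
    comp = list(range(num_states))            # component id of each state
    diameter = {s: 0 for s in range(num_states)}  # keyed by component id
    tree = []
    for (u, v), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        cu, cv = comp[u], comp[v]
        if cu == cv:
            continue  # would create a cycle
        # Diameter of the tree obtained by joining the two trees via (u, v).
        joined = max(diameter[cu], diameter[cv],
                     eccentricity(adj, u) + 1 + eccentricity(adj, v))
        if joined > bound:
            continue  # edge would violate the diameter bound
        adj[u].append(v)
        adj[v].append(u)
        for s in range(num_states):
            if comp[s] == cv:
                comp[s] = cu
        diameter[cu] = joined
        tree.append((u, v, w))
    return tree


# Hypothetical weights on a complete graph over five states.
EDGES = {
    (0, 1): 3, (0, 2): 2, (0, 3): 2, (0, 4): 3, (1, 2): 3,
    (1, 3): 2, (1, 4): 2, (2, 3): 3, (2, 4): 2, (3, 4): 3,
}
TIGHT = bounded_diameter_forest(5, EDGES, bound=2)
LOOSE = bounded_diameter_forest(5, EDGES, bound=4)
```

With a bound of 4 the result is the unconstrained spanning tree (weight 12). With a bound of 2 the greedy pass ends with two trees of total weight 9, so fewer transitions are removed, which illustrates the time-space trade-off.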


Results

We ran experiments on
» Cisco RegEx rules
» Linux application protocol classifier rules
» Bro rules
» Snort rules (subset of rules)

Size of DFA versus D2FA (no default path length bound applied)

          Original DFA        D2FA, normal spanning tree            D2FA, refined spanning tree
DFA       # of    Total # of  # of         %          Max. default  # of         %          Max. default
          states  transitions transitions  reduction  length        transitions  reduction  length
Cisco590  17713   4.5M        36k          99.2       57            36k          99.2       17
Cisco103  21050   5.3M        53k          99.0       54            53k          99.0       19
Cisco7    4260    1.0M        28k          97.4       61            28k          97.4       23
Linux56   13953   3.5M        58k          98.3       30            58k          98.3       21
Linux10   13003   3.3M        285k         91.3       20            285k         91.3       17
Snort11   41949   10.7M       168k         98.4       9             168k         98.4       6
Bro648    6216    1.5M        7k           99.5       17            7k           99.5       8


Space-Time Tradeoff

[Figure: normalized D2FA size (log scale, 0.001 to 1) versus bound on default path length (1 to 7) for Cisco103, Linux56, Snort11 and Bro648]

Longer default path => more work but less space

Space-efficient region
» Default paths have length 4+
» Requires 4+ memory accesses per character

We propose a memory architecture which enables us to consume one character per clock cycle


Outline


» D2FA
» CD2FA



D2FA versus DFA

D2FAs are compact but require multiple memory accesses
» Up to 20x increased memory accesses
» Not desirable in off-chip architecture

Can D2FAs match the performance of DFAs?
» YES!
» Content Addressed D2FAs (CD2FA)

CD2FAs require only one memory access per byte
» Match the performance of a DFA in a cacheless system
» In systems with a data cache, CD2FAs are 2-3x faster

CD2FAs are 10x more compact than DFAs


Introduction to CD2FA, ANCS’06

How to avoid multiple memory accesses of D2FAs?
» Avoid lookup to decide if default path needs to be taken
» Avoid default path traversal

Solution: assign labels to each state; labels contain:
» Characters for which it has labeled transitions
» Information about all of its default states
» Characters for which its default states have labeled transitions

[Figure: content labels. Root R has transitions on all characters and content label R; find node R at location R. State U has labeled transitions on c and d and defaults to R; its content label is cd,R; find node U at hash(c,d,R). State V has labeled transitions on a and b and defaults to U; its content label is ab,cd,R; find node V at hash(a,b,hash(c,d,R)).]
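The addressing scheme in the figure can be sketched as follows. The label encoding (nested tuples), the table size, the fixed root location and the SHA-256 based hash are all assumptions made for illustration; the slides do not specify them.

```python
import hashlib

TABLE_SIZE = 1024  # hypothetical number of memory locations
ROOT_ADDR = 0      # hypothetical fixed location of the root state R


def h(chars, default_addr):
    """Hypothetical hash of (labeled characters, address of default state)."""
    data = f"{chars}|{default_addr}".encode()
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") % TABLE_SIZE


def address(label):
    """Memory location of a state, computed from its content label alone."""
    if label == "R":
        return ROOT_ADDR
    chars, default_label = label
    return h(chars, address(default_label))


def owner_address(label, ch):
    """Address of the state on the default path that has a labeled
    transition on `ch`, found without touching memory."""
    while label != "R":
        chars, default_label = label
        if ch in chars:
            return address(label)
        label = default_label
    return ROOT_ADDR  # the root has transitions on all characters


U = ("cd", "R")  # label cd,R
V = ("ab", U)    # label ab,cd,R
```

Because the label of V already lists the characters handled by V, by U and by R, `owner_address(V, "d")` goes straight to U's location. A single memory access to the entry for that state and character then returns the next state's content label, which is the point of the CD2FA.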


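Introduction to CD2FA

[Figure: two default trees. States V (label ab,cd,R) and U (label cd,R) lead by default transitions to root R; states X (label pq,lm,Z) and Y (label lm,Z) lead to root Z. Both roots have transitions on all characters. Memory holds transition entries (R, a), (R, b), …, (Z, a), (Z, b), …, and entries for the other states at hashed locations such as hash(c,d,R), hash(a,b,c,d,R) and hash(p,q,l,m,Z): (X, p), (X, q), (V, a), (V, b).]

Current state: V (label = ab,cd,R)
» The input character selects one entry, and the entry read from memory gives the next state together with its content label, e.g. X (label = pq,lm,Z)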


Construction of CD2FA

We seek to keep the content labels small

Twin objectives:
» Ensure that states have few labeled transitions
» Ensure that default paths are as small as possible

D2FA construction heuristic based upon maximum weight spanning tree creates long default paths
» Limit default paths => less space-efficient D2FAs

Proposed new heuristic called CRO to construct D2FAs
» Runs in 3 phases: Construction, Reduction and Optimization
» Default path bound = 2 edges => CRO algorithm constructs up to 10x more space-efficient D2FAs
» CD2FAs are constructed from these D2FAs


Memory Mapping in CD2FA

[Figure: the two default trees rooted at R and Z, with states placed in memory at hash(a,b,hash(c,d,R)), hash(c,d,R) and hash(p,q,hash(l,m,Z)), next to the root entries (R, a), (R, b), …, (Z, a), (Z, b), …. Two states hash to the same location: COLLISION.]

We have assumed that hashing is collision free


Collision-free Memory Mapping

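[Figure: four states with labeled transitions on {a, b, c}, {p, q, r}, {l, m, n} and {d, e, f}, and 4 memory locations. Each content label can be written in several character orders, which gives several candidate locations, e.g. hash(abc, …), hash(pqr, …), hash(def, …) or hash(edf, …), and hash(lmn, …) or hash(mln, …).]

Four states, 4 memory locations

We need a systematic approach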


Bipartite Graph Matching

Bipartite graph
» Left nodes are state content labels
» Right nodes are memory locations
» Map state labels to unique memory locations
» An edge for every choice of content label
» Perfect matching problem

With n left and right nodes
» Need O(log n) random edges per node
» n = 1M implies we need ~20 edges per node

If we provide slight memory over-provisioning
» We can uniquely map state labels with far fewer edges

In our experiments, we found a perfect matching without memory over-provisioning

[Figure: bipartite graph with content labels on the left and memory addresses on the right]
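A perfect matching of this kind can be found with the standard augmenting-path method (Kuhn's algorithm). The sketch below is a generic textbook version, not necessarily the procedure used in the experiments; `perfect_matching` and the example choice lists are hypothetical.

```python
def perfect_matching(choices, num_locations):
    """choices[i] lists the candidate memory locations of content label i.
    Returns {label: location} or None if some label cannot be placed."""
    owner = [None] * num_locations  # owner[loc] = label stored at loc

    def try_place(label, visited):
        for loc in choices[label]:
            if loc in visited:
                continue
            visited.add(loc)
            # Take a free location, or evict the current owner if that
            # owner can be moved to one of its other choices.
            if owner[loc] is None or try_place(owner[loc], visited):
                owner[loc] = label
                return True
        return False

    for label in range(len(choices)):
        if not try_place(label, set()):
            return None
    return {owner[loc]: loc
            for loc in range(num_locations) if owner[loc] is not None}


# Four labels, four locations; label 1 can only live at location 0.
MATCHING = perfect_matching([[0, 1], [0], [1, 2], [2, 3]], 4)
```

Label 0 is first placed at location 0 and then moved to location 1 when label 1 arrives, which is the augmenting step. The recursive form is fine for small examples; for around a million states an iterative variant or Hopcroft-Karp would be a more practical choice.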


Memory Reduction Results

                   Original DFA memory (MB)   CD2FA memory (MB)
Dataset   # of     No table     With table    Creation  Reduction  Optimization phase and  Alphabet  Size of CD2FA ÷
          states   compression  compression   phase     phase      alphabet reduction      size      size of DFA-TC
Cisco590  17713    9.07         6.23          8.87      0.80       0.39                    98        0.062
Cisco103  21050    10.77        9.56          9.72      1.87       0.86                    106       0.089
Cisco7    4260     2.18         1.14          1.76      0.44       0.23                    126       0.201
Linux56   13953    7.14         3.62          3.73      1.17       0.61                    123       0.168
Linux10   13003    6.65         3.35          7.27      3.01       1.48                    118       0.441
Snort11   37167    19.03        3.55          6.31      1.28       0.36                    37        0.101
Bro648    6216     3.18         1.26          0.77      0.08       0.05                    83        0.039


Throughput Results

16-bit wide, 250MHz DDR RLDRAM (access size 8B)

[Figure: throughput (Gbps, 0 to 5) of DFA-TC, DFA and CD2FA with no cache, a 1 KB 1-way Dcache and a 4 KB 1-way Dcache]

CD2FA is 3x faster with a 4 KB cache


Outline


» D2FA
» CD2FA



HEXA, ICNP’07

HEXA (History-based Encoding, eXecution and Addressing)
» Challenges the assumption that graph structures must store log2(n)-bit pointers to identify successor nodes
» Requires only 2-bit versus 20-bit pointers (for 1 million nodes)

Useful for
» IP lookup tries (directed acyclic graph)
» Simple finite automata such as Aho-Corasick string matchers


Tries - Traditional Implementation

(a) Prefix database

1*     P1
00*    P2
11*    P3
011*   P4
0100*  P5

(b) [Figure: binary trie with nodes 1-9 for the five prefixes; left edges are labeled 0 and right edges 1]

Addr  data (prefix flag, left child, right child)
1     0, 2, 3
2     0, 4, 5
3     1, NULL, 6
4     1, NULL, NULL
5     0, 7, 8
6     1, NULL, NULL
7     0, 9, NULL
8     1, NULL, NULL
9     1, NULL, NULL

There are nine nodes; we will need 4-bit node identifiers

A node will require 9 bits
» Two 4-bit child pointers
» One flag indicates if node is a prefix

Total memory = 9 x 9 bits


HEXA based Implementation

[Figure: the prefix database (a) and the nine-node trie (b) from the previous slide]

Define the HEXA identifier of a node as the path which leads to it from the root

1. -      4. 00     7. 010
2. 0      5. 01     8. 011
3. 1      6. 11     9. 0100

Notice that these identifiers are unique. Thus, they can potentially be mapped to unique memory addresses.



[Figure: the prefix database (a) and the nine-node trie (b) from the previous slide]

Use hashing to map the HEXA identifier to a memory address

1. -      4. 00     7. 010
2. 0      5. 01     8. 011
3. 1      6. 11     9. 0100

If we have a minimal perfect hash function f (a function that maps elements to unique locations), then we can store the trie as shown below

f(-) = 4     f(00) = 2    f(010) = 5
f(0) = 7     f(01) = 8    f(011) = 3
f(1) = 9     f(11) = 1    f(0100) = 6

Addr  node mem  Prefix
1     1,0,0     P3
2     1,0,0     P2
3     1,0,0     P4
4     0,1,1
5     0,1,0
6     1,0,0     P5
7     0,1,1
8     0,1,1
9     1,0,1     P1

Here we use only 3 bits per node in the fast path

Example: for IP addr. = 1 1 …, the traversal visits identifiers -, 1 and 11 (addresses 4, 9 and 1) and ends at the prefix we were looking for
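The lookup over this table can be sketched directly from the values on the slide. The node memory fields are read here as (prefix flag, left child flag, right child flag), which is an interpretation consistent with the trie, and the root identifier "-" is represented by the empty string; the names `F`, `NODE_MEM`, `PREFIX` and `longest_prefix_match` are illustrative.

```python
# Minimal perfect hash f from the slide, keyed by HEXA identifier
# (the root identifier "-" is written as the empty string).
F = {"": 4, "0": 7, "1": 9, "00": 2, "01": 8, "11": 1,
     "010": 5, "011": 3, "0100": 6}

# Fast path: 3 bits per node (is_prefix, has_left_child, has_right_child).
NODE_MEM = {1: (1, 0, 0), 2: (1, 0, 0), 3: (1, 0, 0), 4: (0, 1, 1),
            5: (0, 1, 0), 6: (1, 0, 0), 7: (0, 1, 1), 8: (0, 1, 1),
            9: (1, 0, 1)}

# Prefix identifiers stored alongside the prefix nodes.
PREFIX = {1: "P3", 2: "P2", 3: "P4", 6: "P5", 9: "P1"}


def longest_prefix_match(addr_bits):
    """Walk the trie using only the path seen so far as the node name."""
    path, best = "", None
    while True:
        addr = F[path]  # the history of consumed bits addresses the node
        is_prefix, has_left, has_right = NODE_MEM[addr]
        if is_prefix:
            best = PREFIX[addr]
        if len(path) == len(addr_bits):
            return best
        bit = addr_bits[len(path)]
        if (bit == "0" and not has_left) or (bit == "1" and not has_right):
            return best  # no child in that direction
        path += bit
```

No child pointers are stored: the child's address is recomputed from the bits consumed so far, so each node needs only its three flags. For example, `longest_prefix_match("11")` visits addresses 4, 9 and 1 and returns P3.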


Devising One-to-one Mapping

Finding a minimal perfect hash function is difficult
» One-to-one mapping is essential for HEXA to work

Use discriminator bits
» Attach c bits that we can modify to every HEXA identifier
» Thus a node can have 2^c choices of identifiers
» We now need to store these c bits for every child instead of a single flag

With multiple choices of HEXA identifiers for a node, reduce the problem to a bipartite graph matching
» We need to find a perfect matching in the graph to map nodes to unique memory locations


Devising One-to-one Mapping

[Figure: bipartite graph for the nine trie nodes. Left: nodes 1-9 with their input labels or HEXA identifiers (-, 0, 1, 00, 01, 11, 010, 011, 0100). Middle: four choices of HEXA identifier per node, obtained by prepending a 2-bit discriminator (00, 01, 10 or 11) to the identifier. Right: choices of memory locations 0-8, given by a hash h of each choice; for the root, h(00) = 0, h(01) = 4, h(10) = 1, h(11) = 5, and for node 2 (identifier 0), h(000) = 1, h(010) = 5, h(100) = 2, h(110) = 6.]

Pick appropriate discriminators => perfect matching



[Figure: the prefix database (a) and the nine-node trie (b) from the previous slides]

1. -      4. 00     7. 010
2. 0      5. 01     8. 011
3. 1      6. 11     9. 0100

For every left and right child, store its discriminator instead of a single flag

Addr  node mem  Prefix
1     1,xx,xx   P3
2     1,xx,xx   P2
3     1,xx,xx   P4
4     0,xx,xx
5     0,xx,xx
6     1,xx,xx   P5
7     0,xx,xx
8     0,xx,xx
9     1,xx,xx   P1

Here we use only 5 bits per node in the fast path


Results

3 choices are enough to find a perfect matching
» Thus 2-bit discriminators (00 value reserved for no child)
– Significant reduction: 2 bits per node versus log2(n) bits

[Figure: fast path trie memory (MB, 0 to 1) versus stride (1 to 6), without HEXA and with HEXA; 32 Eatherton tries, each containing 100-120k prefixes]


Incremental Updates

IP table updates are very frequent
» When a node is removed and another added, we must ensure that only a few memory operations are needed

In the new bipartite graph, a new perfect matching can be found
» Quickly (O(n) time in the worst case, typically constant time)
» New matching is slightly different from the previous matching
– Typically around 10 different edges; experimental worst case: 18
– Thus fewer than 18 memory operations are needed for an update


Questions???
