Fast and Scalable Pattern Matching for Content Filtering
Sarang Dharmapurikar, John Lockwood
Motivation
● Deep packet inspection
  - Detection of Internet worms, computer viruses, SPAM, copyrighted material
  - Intrusion Detection/Prevention
  - Layer-7 switching
  - Content classification
● Needs fast string matching mechanism
● Some desirable features of the mechanism
  - String matching at line speed
  - Ability to detect strings at random locations in the payload
  - Ability to detect 1000s of strings
  - Ability to handle arbitrarily long strings
Aho-Corasick Algorithm
● Two Problems
  - At least 1 memory access per character (at the most 2)
    o Slows it down
  - Only one character at a time
    o Bottleneck
Example string set:
s1 : technical
s2 : technically
s3 : tel
s4 : telephone
s5 : phone
s6 : elephant
[Figure: Aho-Corasick automaton for s1 to s6, states q0 to q31, one character per transition]
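For reference, a minimal Python sketch of the classic one-character-at-a-time Aho-Corasick machine for the string set s1 to s6. This is the textbook construction, not code from the talk. The names `build_aho_corasick` and `search`, and the test inputs, are illustrative assumptions. Each input character costs at least one table lookup, which is the bottleneck named above.

```python
from collections import deque

# String set from the slide; the dict keys reuse the slide's labels.
PATTERNS = {
    "s1": "technical", "s2": "technically", "s3": "tel",
    "s4": "telephone", "s5": "phone", "s6": "elephant",
}

def build_aho_corasick(patterns):
    """Build goto, failure and output tables (textbook construction)."""
    goto = [{}]        # goto[state][char] -> next state
    output = [set()]   # output[state] -> labels of strings ending here
    for name, pat in patterns.items():
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(name)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]  # inherit suffix matches
    return goto, fail, output

def search(text, machine):
    """Scan text one character at a time; return (end_index, label) pairs."""
    goto, fail, output = machine
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((i, name) for name in sorted(output[state]))
    return hits

machine = build_aho_corasick(PATTERNS)
```

The resulting machine has 32 states, which agrees with the q0 to q31 labels in the figure.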
Why not use multiple engines?
[Figure: incoming connections feeding Engine1, Engine2, Engine3, Engine4]
Each engine needs plenty of memory…
On-chip memory not practical
We need a memory chip
Multiple memory chips: more pins, more power, more cost
Can we…
● Process multiple characters at a time?
● Without using multiple memory chips?
● What if we have a small amount of on-chip memory?
Our Approach
● Modify Aho-Corasick to jump ahead by k characters
  - Jump Ahead Aho-CorasicK (JACK)-FA
● Represent JACK-FA as a hash table
  - Keep only one copy in the off-chip memory
● Keep k copies of the compressed & approximate JACK-FA hash table in on-chip memory
  - Use Bloom filters for approximate representation
  - Consumes very little memory
[Figure: data stream feeding the on-chip approximate JACK-FAs, backed by the off-chip JACK-FA]
JACK-FA
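Strings split into 4-character segments:
s1 : technical -> tech nica l
s2 : technically -> tech nica lly
s3 : tel -> tel
s4 : telephone -> tele phon e
s5 : phone -> phon e
s6 : elephant -> elep hant
[Figure: JACK-FA state diagram with states q0 to q7. Transitions: q0 -tech-> q1, q0 -tele-> q2 (s3), q0 -phon-> q3, q0 -elep-> q4, q1 -nica-> q5, q2 -phon-> q6, q4 -hant-> q7 (s6). Tails: q0 "tel" (s3), q3 "e" (s5), q5 "l" (s1), q5 "lly" (s1, s2), q6 "e" (s4, s5).]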
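A rough Python sketch of how the segment transitions in the figure could be derived, assuming k = 4 and level-by-level state numbering. The helper names `split_segments` and `build_jack_trie` are hypothetical. Failure links and the extra matches they imply (for example s3 being reported on "tele") are deliberately left out, so this covers only the forward transitions and tails.

```python
STRINGS = {
    "s1": "technical", "s2": "technically", "s3": "tel",
    "s4": "telephone", "s5": "phone", "s6": "elephant",
}

def split_segments(s, k):
    """Split s into full k-character segments plus a shorter tail (may be empty)."""
    cut = len(s) - len(s) % k
    return [s[i:i + k] for i in range(0, cut, k)], s[cut:]

def build_jack_trie(strings, k):
    """Return (goto, tails, ends) for the k-character trie, without failure links."""
    goto = {}    # (state, segment) -> next state
    tails = {}   # (state, tail) -> labels of strings finishing with that tail
    ends = {}    # state -> labels of strings ending exactly at that state
    parts = {name: split_segments(s, k) for name, s in strings.items()}
    position = {name: 0 for name in strings}  # state reached so far, per string
    next_id, depth = 1, 0
    while True:
        # Number states one level at a time so ids follow the figure's order.
        active = [n for n in strings if depth < len(parts[n][0])]
        if not active:
            break
        for name in active:
            key = (position[name], parts[name][0][depth])
            if key not in goto:
                goto[key] = next_id
                next_id += 1
            position[name] = goto[key]
        depth += 1
    for name in strings:
        tail = parts[name][1]
        if tail:
            tails.setdefault((position[name], tail), set()).add(name)
        else:
            ends.setdefault(position[name], set()).add(name)
    return goto, tails, ends

goto, tails, ends = build_jack_trie(STRINGS, 4)
```

With the six strings above this yields seven segment transitions into states 1 to 7, in the same order as q1 to q7 in the figure.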
String matching with JACK-FA
[Figure, animated over six slides: the JACK-FA state diagram stepping through a data stream that contains "technically" among filler characters (w, x, y, z, a, b, c)]
Why we need k JACK-FA
[Figure: the data stream containing "technically" among filler characters (x, y, z, a, b, c), shown above the JACK-FA state diagram]
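One way to see why k machines are needed: a machine that reads 4 characters per step only ever sees blocks that start at one fixed alignment. The short Python sketch below assumes a hypothetical stream "xyztechnicallyabc" and a hypothetical helper `blocks`. It lists the 4-character blocks seen at each of the four possible offsets.

```python
def blocks(stream, k, offset):
    """Full k-character blocks seen by a machine that starts reading at `offset`."""
    return [stream[i:i + k] for i in range(offset, len(stream) - k + 1, k)]

STREAM = "xyztechnicallyabc"   # assumed example stream, not taken from the slide
K = 4

# For each offset, record whether the segments "tech" and "nica" line up.
aligned = {
    offset: ("tech" in blocks(STREAM, K, offset) and "nica" in blocks(STREAM, K, offset))
    for offset in range(K)
}
```

Only the machine at offset 3 sees "tech" followed by "nica". The machines at offsets 0, 1 and 2 see blocks such as "xyzt" and "echn" and would miss the string. Running one machine per offset covers every possible starting position.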
Speed up
[Figure: data stream containing "technically" among filler characters]
A single machine in off-chip memory
k approximate and compressed machines in on-chip memory
Use Bloom filters
Tabular Representation
[Figure: JACK-FA state diagram, repeated from the JACK-FA slide]
[state, substr]   Next State   Matching str   Failure Chain
[q0, tech]        q1           -              q0
[q0, tele]        q2           S3             q0
[q0, phon]        q3           -              q0
[q0, elep]        q4           -              q0
[q1, nica]        q5           -              q0
[q2, phon]        q6           -              q3,q0
[q4, hant]        q7           S6             q0
[q0, tel]         -            S3             -
[q3, e]           -            S5             -
[q5, lly]         -            S1, S2         -
[q5, l]           -            S1             -
[q6, e]           -            S4, S5         -
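The table maps directly onto a hash table keyed by [state, substr]. The Python sketch below copies the rows into a dict and walks an input that is assumed to start in q0 on a segment boundary. The names `TABLE`, `step` and `walk_aligned` are illustrative. The failure chain column is stored but not exercised, so this is a simplified walk rather than the full matching procedure.

```python
# (state, substr) -> (next state or None, matching strings, failure chain)
TABLE = {
    ("q0", "tech"): ("q1", (), ("q0",)),
    ("q0", "tele"): ("q2", ("S3",), ("q0",)),
    ("q0", "phon"): ("q3", (), ("q0",)),
    ("q0", "elep"): ("q4", (), ("q0",)),
    ("q1", "nica"): ("q5", (), ("q0",)),
    ("q2", "phon"): ("q6", (), ("q3", "q0")),
    ("q4", "hant"): ("q7", ("S6",), ("q0",)),
    ("q0", "tel"): (None, ("S3",), ()),
    ("q3", "e"): (None, ("S5",), ()),
    ("q5", "lly"): (None, ("S1", "S2"), ()),
    ("q5", "l"): (None, ("S1",), ()),
    ("q6", "e"): (None, ("S4", "S5"), ()),
}

def step(state, substr):
    """One hash lookup: (next_state, matches), or (None, ()) on a miss."""
    next_state, matches, _failure_chain = TABLE.get((state, substr), (None, (), ()))
    return next_state, matches

def walk_aligned(text, k=4):
    """Walk text in k-character blocks from q0; stop at the first block with no next state."""
    state, found = "q0", []
    for i in range(0, len(text), k):
        next_state, matches = step(state, text[i:i + k])
        found.extend(matches)
        if next_state is None:
            break
        state = next_state
    return found
```

For example, `walk_aligned("technically")` visits q0, q1 and q5 and reports S1 and S2 on the tail "lly", using three lookups for eleven characters.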
Implementation with Bloom Filters
[state, substr]   Next State   Matching str   Failure Chain
[q0, tech]        q1           -              q0
[q0, tele]        q2           S3             q0
[q0, phon]        q3           -              q0
[q0, elep]        q4           -              q0
[q1, nica]        q5           -              q0
[q2, phon]        q6           -              q3,q0
[q4, hant]        q7           S6             q0
[q0, tel]         -            S3             -
[q3, e]           -            S5             -
[q5, lly]         -            S1, S2         -
[q5, l]           -            S1             -
[q6, e]           -            S4, S5         -
[Figure: Bloom filters B1, B2, B3, B4 and the current state q]
[Figure: the same table, now with four sets of Bloom filters B1 to B4, one set for each of the current states q1, q2, q3, q4]
Throughput with Snort strings
● Off-chip memory: 250 MHz QDR-SRAM, 64-bit wide
● String concentration: 1 in 100 characters
● 2250 strings
● 2 to 122 character strings
Conclusions
● Fast string matching is an important module for content filtering applications
● Off-chip memory accesses slow down string matching
● A large fraction of memory accesses can be avoided
  - Using a small on-chip memory and Bloom filters
● Our accelerated Aho-Corasick algorithm can process 2250 strings with less than 50 KB of on-chip memory
  - At a speed of more than 10 Gbps
Thanks!
Questions?
Motivation
● The multi-pattern matching algorithm works for short strings (16 bytes)
  - Hash computation over long strings becomes problematic
  - Some virus signatures can be several hundred bytes long
  - Snort’s longest string is 122 bytes
[Figure: number of strings (y-axis, 0 to 180) versus string length in bytes (x-axis, 0 to 140)]
Accelerated Aho-Corasick Algorithm
● How to support arbitrarily large strings?
  - At the cost of more memory?
  - Break a long string into multiple smaller pieces
  - Stitch them in a state machine
  - Match individual segments and track the state machine
[Figure: state machine q0, q1, q2, q3 with transitions "tech", "nica", "lly"; labels "Symbols" and "Tail"]
Speed up
[Figure: data stream containing "technically" among filler characters, with labels s1, s2, s3, s4]
Multiple machines
[Figure, animated over five slides: data stream containing "technically" among filler characters, with labels s1, s2, s3, s4]
Bloom Filter
[Figure sequence over four slides, each showing hash functions H1, H2, H3, H4, ..., Hk indexing an m-bit array:
1. X is inserted: the bits selected by the hash functions are set to 1
2. Y is inserted: further bits are set to 1
3. X is queried: all selected bits are 1, so the filter reports a match
4. W is queried: all selected bits are 1, so the filter reports a match (false positive)]
Bloom filter
[Figure: a Bloom filter answering the query "Is x present in the filter?" with {No, Yes}]
Represents a set of strings
Each string consumes very few bits, e.g. 12 to 16 bits
A "Yes" answer can be a false positive
But the false positive probability is very small, e.g. 0.001
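As a rough check on these figures, the standard Bloom filter approximation gives a false positive probability of (1 - e^(-h/b))^h for b bits per stored string and h hash functions. The Python sketch below evaluates it for 12 and 16 bits per string. The function names are illustrative, and picking h close to b·ln 2 is the usual textbook choice, not something stated in the slides.

```python
import math

def best_num_hashes(bits_per_item):
    """Hash count that minimizes false positives in the standard approximation."""
    return max(1, round(bits_per_item * math.log(2)))

def false_positive_rate(bits_per_item, num_hashes):
    """Approximate false positive probability of a Bloom filter."""
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes

rates = {
    bits: false_positive_rate(bits, best_num_hashes(bits))
    for bits in (12, 16)
}
```

Under this approximation 12 bits per string gives a rate of a few in a thousand and 16 bits gives a rate below one in a thousand, which brackets the 0.001 quoted above.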