Co-Evaluation of Pattern Matching Algorithms on IoT Devices with Embedded GPUs
Charalampos Stylianopoulos, Simon Kindström, Magnus Almgren, Olaf Landsiedel, Marina Papatriantafilou
Distributed Computing and Systems
Motivation
• IoT security is a concern
• Recent attacks:
  - Show that IoT security is lacking
    * Mirai botnet
    * Attacks on a casino’s aquarium thermostat
  - Underline the need for countermeasures
Motivation
Standard security countermeasures (e.g. NIDS) can be applied:
  - on the IoT devices themselves
  - at the entry point to the network of IoT devices
Motivation
• Challenges
  - Resource-constrained devices
  - More connected devices -> more traffic to inspect
• NIDS
  - Performance bottleneck
  - Not tailored to hardware
Motivation: Pattern matching
Pattern matching = the core functionality of NIDS
  - more than 70% of running time [1]
Goal: compare all network traffic against all malicious signatures
Input stream: … http://some.site.com/get.asp?f=/etc/passwd … GET HTTP try_backdoor …
Pattern set: … /etc/passwd, admin.dll, get.asp, backdoor …
Search for all patterns, anywhere in the network stream.
[1] "Generating realistic workloads for network intrusion detection systems", Antonatos et al.
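To make the task concrete, here is a minimal Python sketch of the naive baseline: it scans the input once per pattern and reports every occurrence. The function name `find_all_patterns` is illustrative and not from the talk; the stream and patterns reuse the slide's example with the elided parts left out.

```python
def find_all_patterns(stream, patterns):
    """Naive baseline: return sorted (offset, pattern) pairs for every occurrence."""
    matches = []
    for pattern in patterns:
        start = stream.find(pattern)
        while start != -1:
            matches.append((start, pattern))
            # Resume one byte later so overlapping occurrences are also found.
            start = stream.find(pattern, start + 1)
    return sorted(matches)


stream = b"http://some.site.com/get.asp?f=/etc/passwd GET HTTP try_backdoor"
patterns = [b"/etc/passwd", b"admin.dll", b"get.asp", b"backdoor"]
matches = find_all_patterns(stream, patterns)
for offset, pattern in matches:
    print(offset, pattern)
```

The cost of this baseline grows with the number of patterns, since the whole input is rescanned for each one. Multi-pattern algorithms such as the ones compared in this talk avoid that.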
Motivation: New Devices
• Opportunities
  - IoT/embedded hardware is evolving
  - New hardware features
    * Example: ODROID single-board computers with embedded Graphics Processing Units (GPUs)
Making use of those features is an open issue
Our work
• The questions we are trying to answer in this work:
  - Which algorithms to use?
  - What are the hardware characteristics that affect the performance?
  - How to create new algorithms that make best use of those characteristics?
Our work
• Co-evaluation of pattern matching algorithms
  - Evaluate existing implementations
  - Influence the design of new ones
• Target embedded GPUs
  - Deep look into their architectural features
• Extensive evaluation
  - Different datasets, patterns
  - Energy efficiency
Outline
• Background
  - GPU computing
• Our Benchmark
• Evaluation
Background
• General-purpose GPU computing (GPGPU)
  - Besides graphics, GPUs can be used for general tasks as well
  - Highly parallel architecture
• Pattern matching on a GPU: not a new thing
  - Not much work on embedded GPUs
[1] "Gnort: High Performance Network Intrusion Detection Using Graphics Processors", Vasiliadis et al., RAID 2008
[2] "APUNet: Revitalizing GPU as Packet Processing Accelerator", Go et al., NSDI 2017
[3] "A highly-efficient memory-compression scheme for GPU-accelerated intrusion detection systems", Bellekens et al., SINCONF 2017
Background
• The platform
Source: "Energy efficient run-time mapping and thread partitioning of concurrent OpenCL applications on CPU-GPU MPSoCs"
Background
Important characteristics (unique to embedded GPUs)
• Small number of cores/threads
• No main memory on the GPU
  -> Shared main memory between CPU and GPU
• No local memory on chip
• Vectorization in each GPU thread
• Separate instruction counter per GPU thread
  -> No need to worry about divergent execution
Outline
• Background
• Our Benchmark
  - Algorithms
  - Optimizations
• Evaluation
Algorithms
Representative algorithms from two categories (state machine based, filtering based):
• CPU: Aho-Corasick (state machine based), DFC (filtering based)
• GPU: (none listed on this slide)
Algorithms (CPU)
The Aho-Corasick algorithm
• Used in many Network Intrusion Detection Systems
• Builds a State Machine (SM) from all the patterns
• Traverses the SM, reading the input byte by byte
Benefits
• Only one lookup per input byte
Limitations
• Poor cache locality
• Data dependencies
“Efficient String Matching: An Aid to Bibliographic Search”, A. Aho, M. Corasick, ACM Comm. ’75
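The state machine idea can be sketched in a few lines of Python. This is a textbook-style illustration, not the implementation evaluated in the talk, and the names `build_automaton` and `search` are hypothetical. For compactness it uses failure links, so a byte may trigger more than one table access; the one-lookup-per-byte property applies to variants that precompute a full transition table.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure and output tables of the state machine."""
    goto = [{}]    # goto[state][byte] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix
    output = [[]]  # output[state] -> patterns that end in this state
    for pattern in patterns:
        state = 0
        for byte in pattern:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        output[state].append(pattern)
    # Breadth-first pass: shallower states get their failure link first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and byte not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(byte, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(stream, automaton):
    """Traverse the state machine byte by byte; return (offset, pattern) pairs."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for offset, byte in enumerate(stream):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for pattern in output[state]:
            matches.append((offset - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton([b"/etc/passwd", b"admin.dll", b"get.asp", b"backdoor"])
print(search(b"GET /get.asp?f=/etc/passwd", automaton))
```

Each step depends on the state produced by the previous byte, which is the data dependency listed under limitations. The tables also grow with the pattern set, which is where the poor cache locality comes from.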
Algorithms (CPU)
The DFC algorithm
• Creates a filter from the patterns
• Quickly filters out parts of the input
[Figure: the pattern set (… activate, admin.dll, backdoor, get.asp …) is turned into a filter (8 KB) with one bit per two-character index (ab, ac, ad, ..., ba, bb, ..., ge, ...), here … 0 1 1 0 0 0 1 0 0 0 1 0 0. The filter fits in cache. The input stream (… this is an input) is checked against it.]
“DFC: Accelerating String Pattern Matching for Network Applications”, Choi et al., NSDI ’16
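A minimal Python sketch of such a filter follows, assuming it is a bitmap indexed by the first two bytes of each pattern (65536 bits = 8 KB, matching the size on the slide). The names `build_filter` and `candidate_offsets` are illustrative, and patterns shorter than two bytes are assumed to be handled elsewhere.

```python
FILTER_BITS = 1 << 16            # one bit per possible two-byte value
FILTER_BYTES = FILTER_BITS // 8  # 8192 bytes = 8 KB


def build_filter(patterns):
    """Set the bit that corresponds to the first two bytes of every pattern."""
    bitmap = bytearray(FILTER_BYTES)
    for pattern in patterns:
        index = int.from_bytes(pattern[:2], "big")
        bitmap[index >> 3] |= 1 << (index & 7)
    return bitmap


def candidate_offsets(stream, bitmap):
    """Return the offsets whose two-byte window passes the filter."""
    hits = []
    for offset in range(len(stream) - 1):
        index = (stream[offset] << 8) | stream[offset + 1]
        if bitmap[index >> 3] & (1 << (index & 7)):
            hits.append(offset)
    return hits


bitmap = build_filter([b"activate", b"admin.dll", b"backdoor", b"get.asp"])
print(candidate_offsets(b"this is an input", bitmap))  # no window passes
print(candidate_offsets(b"get.asp?x=admin", bitmap))   # offsets of "ge" and "ad"
```

Each window is checked independently of the previous one, so there is no chain of dependent lookups as in a state machine. A window that passes is only a candidate: "admin" passes here although "admin.dll" is not present.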
Algorithms (CPU)
The DFC algorithm (continued)
• Progressive filtering
  - in cache
• Verification
  - in memory
[Figure: an initial filter feeds pattern-length-specific filters (classes labeled from 1 B upward), which feed hash tables used for verification.]
Benefits
• Cache locality (on filtering)
• No data dependencies
Limitations
• Verification phase is costly
“DFC: Accelerating String Pattern Matching for Network Applications”, Choi et al., NSDI ’16
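The split between cheap filtering and costly verification can be sketched as follows. This is a simplification under stated assumptions: a single filter stage instead of progressive, pattern-length-specific filters, a Python set standing in for the cache-resident bitmap, and a dictionary standing in for the hash tables. The names `build_tables` and `match` are hypothetical.

```python
from collections import defaultdict


def build_tables(patterns):
    """Return (filter, verification table) for patterns of at least 2 bytes."""
    prefix_filter = set()             # stand-in for the in-cache filter
    verification = defaultdict(list)  # two-byte prefix -> full patterns
    for pattern in patterns:
        prefix = pattern[:2]
        prefix_filter.add(prefix)
        verification[prefix].append(pattern)
    return prefix_filter, verification


def match(stream, prefix_filter, verification):
    """Return (matches, number of full comparisons made during verification)."""
    matches = []
    comparisons = 0
    for offset in range(len(stream) - 1):
        window = stream[offset:offset + 2]
        if window not in prefix_filter:
            continue  # filtering: most of the input is discarded here
        for pattern in verification[window]:
            comparisons += 1  # verification: full comparison against the input
            if stream.startswith(pattern, offset):
                matches.append((offset, pattern))
    return matches, comparisons


tables = build_tables([b"/etc/passwd", b"admin.dll", b"get.asp", b"backdoor"])
matches, comparisons = match(b"GET /get.asp?f=/etc/passwd admin", *tables)
print(matches, comparisons)
```

In this input three windows pass the filter but only two lead to a match. Every window that passes costs at least one full comparison, which is why a filter that lets too much through makes the verification phase expensive.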
Algorithms
Representative algorithms from two categories (state machine based, filtering based):
• CPU: Aho-Corasick (state machine based), DFC (filtering based)
• GPU: PFAC [1] (state machine based), DFC (GPU) (filtering based), HYBRID
[1] “Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs”, Lin et al., TOC 2013
Hardware-oriented optimizations
Relevant aspects that we investigate:
• Memory mapping vs. data transfers
  - 2-5x faster with memory mapping
• Placement of the filters
  - Global memory
  - Texture memory
  - Local memory
• Vectorization
  - No significant speedup
More in the paper…
Outline
• Background
• Our Benchmark
• Evaluation
Evaluation Methodology
Hardware
  CPU       4 ARM big.LITTLE
  GPU       ARM Mali-T628 (6 shader cores)
  Memory    2 GB RAM
  Sensors   On-board energy sensors
Datasets
  - 3 publicly available traffic traces
  - 1 randomly generated data set
Malicious patterns
  - 2183 patterns (from Snort)
  - 5000 patterns (emergingthreats.net)
Evaluation Methodology
• Goal of the evaluation:
  1. How fast we can process the input (execution time)
  2. How much energy we spend on processing (energy consumption)
  3. Effect of datasets and number of patterns
  4. Influence the design of new algorithms
• Versions:
  - CPU: Aho-Corasick, DFC
  - GPU: PFAC, DFC on GPU (with/without vectorization), HYBRID (with/without vectorization)
Evaluation Results
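• Experiment 1: execution time breakdown
[Figure: execution time breakdown per version, grouped into CPU versions and GPU versions (including vectorized variants); legend entries include Post-processing and CPU->GPU.]
(Post-processing = output which and how many patterns matched, on the CPU)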
Evaluation Results
• Experiment 2: energy consumption
Evaluation Results
• Experiment 3: effect of datasets and #patterns
[Figure panels: 2183 patterns; 5000 patterns]
Evaluation Results
• Experiment 4: configuring Hybrid
  - Bigger filter = slower access time (green trend, left y-axis)
  - Higher hit ratio -> less verification (red trend, right y-axis)
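One side of this trade-off, the share of input that a filter passes on to verification, can be illustrated with a toy model in Python. Everything here is an assumption for illustration: the filter is indexed by a two-byte window reduced modulo the filter size, patterns and input are random bytes, and only the pattern count (2183) mirrors the evaluation. It does not model access time and is not how HYBRID is configured in the paper.

```python
import random


def pass_rate(stream, patterns, index_bits):
    """Fraction of two-byte windows that pass a filter with 2**index_bits entries."""
    size = 1 << index_bits
    bitmap = [False] * size
    for pattern in patterns:
        bitmap[int.from_bytes(pattern[:2], "big") % size] = True
    windows = len(stream) - 1
    passed = sum(
        bitmap[((stream[i] << 8) | stream[i + 1]) % size] for i in range(windows)
    )
    return passed / windows


rng = random.Random(0)  # fixed seed so the run is repeatable
patterns = [bytes(rng.randrange(256) for _ in range(8)) for _ in range(2183)]
stream = bytes(rng.randrange(256) for _ in range(20000))
rates = {bits: pass_rate(stream, patterns, bits) for bits in (10, 12, 14, 16)}
for bits, rate in rates.items():
    print(f"{2 ** bits // 8:5d} bytes: {rate:.3f} of windows go to verification")
```

A smaller filter folds more two-byte values onto the same bit, so more windows pass and more verification work follows. A larger filter passes fewer windows but is less likely to stay in the fastest level of the memory hierarchy, which is the access-time side of the trade-off.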
Conclusions & Future Work
• Conclusions
  - New hardware features (embedded GPUs) can alleviate the bottleneck of pattern matching
  - Architecture characteristics are important for high performance and low energy consumption
  - It is possible to design new algorithms tailored to the hardware
• Future Work
  - Overlap CPU/GPU execution (heterogeneous design)
  - More algorithms and devices (e.g. Nvidia’s Jetson Nano)
  - Integrate with existing systems (e.g. Snort)
• Code available online
Backup Slides
Background (1/3)
• Snort
  - The de facto NIDS
  - Signature based (malicious signatures are known in advance)
  - The main pipeline looks like this:
[Figure: Snort pipeline; the stage that includes pattern matching accounts for more than 70% of running time.]
Related work
• State machine based
  - Aho-Corasick
    “Efficient String Matching: An Aid to Bibliographic Search”, A. Aho, M. Corasick, ACM Comm. ’75
    Benefits
    * Only one lookup per input byte
    Limitations
    * Poor cache locality
    * Data dependencies
• Filter based
  - DFC
    “DFC: Accelerating String Pattern Matching for Network Applications”, Choi et al., NSDI ’16
    [Figure: the filter (8 KB, indexed by ab, ac, ad, ..., ba, bb, ..., ge, ...) is built up as the patterns (… activate, admin.dll, backdoor, get.asp …) are added:]
      … 0 0 0 0 0 0 0 0 0 0 0 0 0
      … 0 1 0 0 0 0 0 0 0 0 0 0 0
      … 0 1 1 0 0 0 0 0 0 0 0 0 0
      … 0 1 1 0 0 0 1 0 0 0 1 0 0
    Benefits
    * Cache locality (on filtering)
    * No data dependencies
    Limitations
    * Much of the hardware remains underutilized (e.g. vector instructions?)