Measurement Algorithms: Bloom Filters and Beyond
George Varghese
University of California, San Diego
Network Evolution?
1. Basic: stateless, transparent.
Tools: protocol design (e.g., soft-state)
2. Active: customizable, re-configurable
Tools: Code Safety (e.g., sandboxing)
3. Introspective: pattern detection/response
Tools: Streaming algorithms, statistical inference (e.g., Bloom Filters, sampling)
Hawkeye enables introspection for measurement & security
What is Introspection?
Detecting patterns in data traffic, either in real time or based on packet logs. Examples:
• Measurement Introspection: Identify resource usage patterns for better resource management.
• Security Introspection: Identify attack patterns to mitigate or prevent attacks.
• Fault Introspection: Identify fault or anomaly patterns to allow automated fault repair.
Motivated by market pull and technology push
Market Pull
• Better ROI: Optimize network resources (BGP policy, OSPF weights, light up fibers, add bandwidth) based on resource usage patterns.
• Better security: Allowing an organization to stay open for business during mass or targeted attacks is a major differentiator.
• Better Fault Detection: Many performance anomalies can be detected by better measurement primitives (e.g., Goldman-Sachs)
[Figure: Customer Sites 1, 2, and 3; reroute or add B/W]
Technology Push: Streaming Algorithms and Hardware Gates
• Algorithms: Recent major thrust in streaming algorithms in database, web analysis, theory, networks
• Hardware: Memory accesses remain expensive (< 100) and SRAM not scaling as fast as number of connections (< 32 Mbits), but gates are plentiful.
• Mapping: Many randomized streaming algorithms (e.g., Bloom Filters, Min-wise hashing) developed to find patterns in disk logs map well to network ASICs.
• Opportunity: Invent or adapt streaming algorithms for networking patterns.
Concerns about Network Introspection
• Speed: Can hardware run fast enough? Recall IP lookups in the 1990s; surprisingly complex things (branch predictors, TCP Offload) are done routinely today. Most of the algorithms described below are being implemented at 24 Gbps in Hawkeye.
• Inflexible: Hardware is not easy to change. Design hardware to identify useful “primitive” patterns that can be combined (exactly what Hawkeye does). Network Processors can offer flexibility & speed.
• End-to-end argument: Not a simple, stateless core. Not required for correctness of basic forwarding, but only as an optimization or value-add.
Introspection as Pattern Detection
• Within Packet Patterns: Prefix matches, classification, signature detection (e.g., Code Red Payload)
• Across Packet Patterns: Scheduling, Timing, Membership Checks, Heavy-hitters, large flows, partial completion, counting flows
[Figure: packet stream S1 S2 S2 S5 S2 S1 arriving at a router]
Pattern Detection Algorithm Requirements
• Low memory: On-chip SRAM limited to around 10-32 Mbits. Not constant but is not scaling with number of concurrent conversations. May need to replicate.
• Small processing: For wire speed at 40 Gbps with 40-byte packets, there are 8 nsec per packet. With 1 nsec SRAM, that is 8 memory accesses. A factor of 30 in parallelism buys 240 accesses.
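The per-packet budget above is simple arithmetic, sketched here in Python. The function name and parameters are illustrative only and do not come from the talk.

```python
def per_packet_budget(link_gbps, packet_bytes, sram_ns, parallelism=1):
    """Return (time per packet in ns, memory accesses available per packet)."""
    # bits divided by Gbit/s gives nanoseconds
    packet_time_ns = packet_bytes * 8 / link_gbps
    accesses = packet_time_ns / sram_ns * parallelism
    return packet_time_ns, accesses


# 40-byte packets at 40 Gbps with 1 ns SRAM
print(per_packet_budget(40, 40, 1.0))      # (8.0, 8.0)
# the same link with a factor of 30 in parallelism
print(per_packet_budget(40, 40, 1.0, 30))  # (8.0, 240.0)
```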
Talk Outline
• Part 1: Motivation
• Part 2: Basic Patterns and Algorithms (membership checks, heavy-hitters, many flows, partial completion)
• Part 3: Combining patterns to solve useful application problems
• Part 4: Conclusions.
Pattern 1: Membership Check
Membership Check: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) on a link that belong to a pre-specified set (e.g., a blacklist).
[Figure: packet stream S1 S6 S2 S5 S2 S8]
Set contains only S2, S5
B. Bloom, Comm. ACM, July 1970
Membership Check via Bloom Filter
[Figure: field extraction feeds Hash 1, Hash 2, and Hash 3, which index a bitmap in Stage 1, Stage 2, and Stage 3; the bitmaps are populated from the set; each stage tests "Equal to 1?"; ALERT if all bits are set.]
Trivial Bloom Filter Analysis
Assume a set of size 1000. Bound the probability that a flow F not in the set gets through 4 stages of 10,000 bits each.
• Why trouble?: F can pass a stage if it hashes to a bit set by some real member of the set.
• Single stage probability: At most 1000 of the 10,000 bits can be set. Thus the probability that F passes a stage is at most 1000/10,000 = 0.1.
• Multistage probability: To be branded, F must be unlucky in all 4 stages, with a probability of no more than 0.1^4 = 0.0001, which is very small. Can play with numbers.
Accurate Bloom Filter Analysis
Assume a set of size 1000. Bound the probability that a flow F not in the set gets through 4 stages of 10,000 bits each. The previous analysis ignores bit collisions.
• Single stage probability: The probability of F passing a stage is s = 1 – (1 – 1/10,000)^1000 ≈ 1 – e^{-0.1} ≈ 0.095.
• Multistage probability: To be branded, F must be unlucky in all 4 stages, with a probability of no more than s^4 ≈ 0.00008, which is very small.
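A minimal Python sketch of the staged Bloom filter and of the two bounds discussed above. The class and function names are made up for illustration, and SHA-256 salted with the stage number is only a stand-in for whatever hash functions the hardware would use.

```python
import hashlib


class StagedBloomFilter:
    """Bloom filter with one bitmap per stage, as in the membership-check diagram."""

    def __init__(self, stages, bits_per_stage):
        self.stages = stages
        self.bits_per_stage = bits_per_stage
        # One byte per bit keeps the sketch readable; hardware would pack bits.
        self.bitmaps = [bytearray(bits_per_stage) for _ in range(stages)]

    def _index(self, stage, key):
        # Assumption: a salted SHA-256 stands in for the per-stage hash.
        digest = hashlib.sha256(bytes([stage]) + key).digest()
        return int.from_bytes(digest[:8], "big") % self.bits_per_stage

    def add(self, key):
        for s in range(self.stages):
            self.bitmaps[s][self._index(s, key)] = 1

    def __contains__(self, key):
        # Alert only if the bit is set in every stage.
        return all(self.bitmaps[s][self._index(s, key)] for s in range(self.stages))


def false_positive_bounds(set_size, stages, bits_per_stage):
    """Return (trivial bound, bound that accounts for bit collisions)."""
    trivial = (set_size / bits_per_stage) ** stages
    per_stage = 1 - (1 - 1 / bits_per_stage) ** set_size
    return trivial, per_stage ** stages


blacklist = StagedBloomFilter(stages=4, bits_per_stage=10_000)
for source in (b"S2", b"S5"):
    blacklist.add(source)
print(b"S2" in blacklist)                      # True: members always pass
print(false_positive_bounds(1000, 4, 10_000))  # roughly 1e-4 and 8.2e-5
```

Members are never missed; only non-members can be misreported, at a rate bounded as above.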
Applications
• Replacement for a hash table: useful when storage is important, identifiers are long, false positives are acceptable, & membership check suffices
• Example 1: String Matching: exact matching of up to 4000 strings of 40 bytes each using only on-chip SRAM.
• Example 3: Reporting
Example 1: String Matching
[Figure: the string database to block (ST0, ST1, ST2, ..., STn) with anchor strings A0, A1, A2, ..., An; a hash function maps into a multistage filter.]
Sushil Singh, G. Varghese, J. Huber, Sumeet Singh, Patent Application
String Matching Continued: String Grouping
[Figure: string grouping. Anchor strings A0, A1, A2, ..., An (strings ST0, ST1, ST2, ..., STn) are mapped by a hash function into Hash Bucket-0, Hash Bucket-1, ..., Hash Bucket-m.]
String Matching Continued: Bit Trees
[Figure: bit tree over the strings in a single hash bucket. Anchors A2, A8, A11, A17 (strings ST2, ST8, ST11, ST17) are separated by the 0/1 values at bit locations L1, L2, and L3.]
Example 3: Scalable Reporting (Carousel, NSDI 2010)
• Problem: When a worm breaks out, how do we report all infected machines? Logging packets with the worm pattern can produce millions of sources and many duplicates.
• Solution: Use a sampled Bloom filter and a “more” bit. Start with no sampling. Any source IP in a worm packet is reported and placed in a Bloom filter of size B to suppress duplicates. Stop when B sources have been reported and set the “more” bit.
• Recursive Solution: If the “more” bit is set, repeat the algorithm twice, once for LSB of Hash(SourceIP) = 0 and once for 1. If there is still more, repeat it four times. Nearly optimal solution.
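The recursive idea can be sketched as follows, under several simplifying assumptions that are not from the talk: a Python set stands in for the Bloom filter of size B, the `replay` callable is a hypothetical way to re-observe the (repeating) worm traffic, and each finer level starts its report from scratch.

```python
import hashlib


def hash_low_bits(source, k):
    """Low k bits of a hash of the source (k = 0 always gives 0)."""
    value = int.from_bytes(hashlib.sha256(source).digest()[:8], "big")
    return value & ((1 << k) - 1)


def carousel_report(replay, capacity, max_bits=32):
    """Report distinct sources while tracking fewer than `capacity` per pass.

    replay: hypothetical callable returning a fresh iterator over the source
            of every worm packet (infected machines keep sending).
    Returns (set of reported sources, number of hash bits k finally used).
    """
    if capacity < 2:
        raise ValueError("capacity must be at least 2")
    for k in range(max_bits + 1):
        reported = set()
        more = False
        for partition in range(1 << k):
            seen = set()  # stands in for the Bloom filter of size B
            for source in replay():
                if hash_low_bits(source, k) != partition or source in seen:
                    continue
                seen.add(source)
                if len(seen) >= capacity:
                    more = True  # B sources reported: stop and set "more"
                    break
            reported |= seen
            if more:
                break
        if not more:
            return reported, k
    raise RuntimeError("did not converge within max_bits")


sources = [("10.0.0.%d" % i).encode() for i in range(100)]
packets = sources * 3  # every infected machine sends duplicates
reported, k = carousel_report(lambda: iter(packets), capacity=16)
print(len(reported), k)
```

Each pass only ever tracks B sources, so memory stays fixed while the number of passes grows with the number of infected machines.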
Timed Bloom Filters
• Question: How can we add notion of time to Bloom Filters without lots of memory?
• Solution: Use 2 Bloom Filters, Old and New.
Insert: insert into New.
Search: search in both New and Old.
Age every T seconds: Old := New; New := Empty.
• Property: Any entry not refreshed for 2T is deleted. An entry refreshed within T is in.
U.S Patent Application: Paul Owen & Andy Fingerhut et al
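A small Python sketch of the Old/New scheme for timed Bloom filters. The class name, sizes, and the salted SHA-256 hashes are assumptions made for illustration; the caller is expected to invoke `age()` every T seconds.

```python
import hashlib


class TimedBloomFilter:
    """Two Bloom filters, Old and New, swapped every T seconds by the caller."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.new = bytearray(bits)
        self.old = bytearray(bits)

    def _indexes(self, key):
        # Assumption: salted SHA-256 stands in for the hardware hashes.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def insert(self, key):
        for i in self._indexes(key):
            self.new[i] = 1

    def search(self, key):
        idx = list(self._indexes(key))
        return all(self.new[i] for i in idx) or all(self.old[i] for i in idx)

    def age(self):
        # Old := New; New := Empty
        self.old = self.new
        self.new = bytearray(self.bits)


tbf = TimedBloomFilter()
tbf.insert(b"S2")
tbf.age()
print(tbf.search(b"S2"))  # True: still present in Old
tbf.age()
print(tbf.search(b"S2"))  # False: not refreshed for two aging periods
```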
Pattern 2a: Heavy-hitters with Threshold
Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) that send more than a threshold T (say 1% of the traffic) on a link.
[Figure: packet stream S1 S6 S2 S5 S2 S2]
Source S2 is 30 percent of traffic sequence
Estan,Varghese, ACM TOCS 2003
Heavy Hitters with Multistage Filters
[Figure: field extraction feeds Hash 1, Hash 2, and Hash 3, which index counters in Stage 1, Stage 2, and Stage 3; each indexed counter is incremented and tested against T; ALERT if all counters are above threshold T.]
Multistage filters in Action
[Figure: counters in Stages 1, 2, and 3 filling toward the threshold. Legend: grey = other flows, yellow = small flow, green = large flow.]
Multistage Filter Analysis
Assume a 1 percent threshold. Bound the probability that a flow F of 0.1% or less gets through 6 stages of 1000 buckets each.
• Why trouble?: F can fall into a "hot" bucket if and only if the sum of the traffic of all other flows in that bucket is more than 0.9%.
• Single stage probability: At most 100/0.9 = 111 buckets can be over 0.9% before we bring on F. Thus the probability that F falls in a "hot" bucket is less than 111/1000 = 0.111.
• Multistage probability: To be branded, F must be unlucky in all 6 stages, with a probability of no more than 0.111^6 (about 1.9 × 10^-6), which is very small. Thus at most 1000 false positives with very high probability.
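A minimal sketch of a multistage filter and of the bound worked out above. The names are illustrative, salted SHA-256 stands in for the stage hashes, and treating "reaches the threshold" as `>=` is an assumption of this sketch.

```python
import hashlib


class MultistageFilter:
    """Parallel stages of counters; a flow is flagged when all its counters reach T."""

    def __init__(self, stages, buckets, threshold):
        self.buckets = buckets
        self.threshold = threshold
        self.counters = [[0] * buckets for _ in range(stages)]

    def _index(self, stage, flow):
        # Assumption: salted SHA-256 stands in for the per-stage hash.
        digest = hashlib.sha256(bytes([stage]) + flow).digest()
        return int.from_bytes(digest[:8], "big") % self.buckets

    def add(self, flow, size):
        """Count `size` bytes for `flow`; return True if every stage is at or above T."""
        hot = True
        for s, row in enumerate(self.counters):
            i = self._index(s, flow)
            row[i] += size
            if row[i] < self.threshold:
                hot = False
        return hot


def false_positive_bound(threshold_pct, flow_pct, stages, buckets):
    """Bound from the analysis: (hot buckets / buckets) ** stages."""
    hot_buckets = 100 / (threshold_pct - flow_pct)
    return (hot_buckets / buckets) ** stages


msf = MultistageFilter(stages=3, buckets=1000, threshold=500)
alerts = [msf.add(b"S2", 10) for _ in range(60)]
print(alerts.index(True))                   # 49: flagged on the 50th 10-byte packet
print(false_positive_bound(1.0, 0.1, 6, 1000))  # about 1.9e-6
```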
Pattern 2b: Top K heavy-hitters
Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) that are the top K talkers on a link.
[Figure: packet stream S1 S1 S2 S5 S2 S2]
Sources S2 and S1 are the top K talkers
Bonomi, Prabhakar, Zhang, Wu, Cisco Internal
Two simpler proposals
• SIFT (Prabhakar) and Sample-and-Hold (Estan-Varghese) both suggest sampling a packet with small probability p. Once a flow is sampled, place it in a CAM and watch all of its packets.
• Idea: Large flows are more likely to be sampled, and then we get exact counts.
• Problem: The CAM quickly gets muddied with mice, and then elephants can be lost.
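A rough sketch of the sample-and-hold idea and of the problem noted above. A Python dict with a hypothetical `capacity` limit stands in for the CAM; none of the names come from the talk.

```python
import random


def sample_and_hold(packets, p, capacity, rng):
    """Count packets per flow once the flow has been sampled into the table.

    packets:  iterable of flow identifiers, one per packet
    p:        per-packet sampling probability for flows not yet in the table
    capacity: number of table entries (stands in for the CAM size)
    """
    table = {}
    for flow in packets:
        if flow in table:
            table[flow] += 1            # held flows are counted exactly
        elif len(table) < capacity and rng.random() < p:
            table[flow] = 1             # newly sampled flow takes an entry
    return table


rng = random.Random(1)
# Two mice arrive first and fill a 2-entry table; the elephant "E" is lost.
print(sample_and_hold(["m1", "m2", "E", "E", "E"], p=1.0, capacity=2, rng=rng))
# {'m1': 1, 'm2': 1}
```

With p = 1.0 the example is deterministic; with a small p the same effect appears more slowly as mice accumulate in the table.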
Elephant Traps: Sample and Recycle
[Figure: the elephant trap (ET) holds entries keyed by Masked PIV1, each with a counter C and a timestamp Ts. Entries shown: x (C = 2, Ts = 100), y (C = 0, Ts = 120), and m, a, b, c, d (each C = 6, Ts = 20). The counter update C = C/D is also shown.]
Two Key Knobs: Sampling Probability p, Counter Threshold K
• Packet Match: Is Masked PIV1 in the ET? If so, update the entry.
• Insert entry only if: a) no match, and b) a coin toss with small probability p is successful, and c) there is room or the front entry is evicted. Otherwise, Masked PIV1 is not inserted in the ET.
• Evict if: a) FIFO is full and b) (C < K) and (t – Ts >= T_min)
• Recycle if: a) FIFO is full and b) (C >= K) or (t – Ts < T_min)
[Figure: flowchart with decision boxes Match?, CMF Select?, and CMF ADD?, and a probabilistic sampling step with probability p.]
Pattern 3: Partial Completion
Partial Completion: In a measurement interval, detect the flows (e.g., destinations) that have several Start packets (e.g., SYN) without the corresponding End (e.g., FIN).
[Figure: packet stream SYNx SYNy SYNz FINy SYNx SYNx FINz]
Destination X has 3 partial completions in sequence
Kompella,Singh,Varghese, IMC 2003
Partial Completion Filters
[Figure: field extraction feeds Hash 1, Hash 2, and Hash 3, which index counters in Stage 1, Stage 2, and Stage 3; each indexed counter is tested against T; ALERT if all counters are above threshold.]
Increment for SYN, Decrement for FIN
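A minimal sketch of a partial completion filter, replaying the SYN/FIN sequence from the pattern definition. Names and sizes are illustrative, and salted SHA-256 is an assumed stand-in for the stage hashes.

```python
import hashlib


class PartialCompletionFilter:
    """Stages of signed counters: +1 for each SYN, -1 for each FIN."""

    def __init__(self, stages=3, buckets=64):
        self.buckets = buckets
        self.counters = [[0] * buckets for _ in range(stages)]

    def _index(self, stage, dest):
        # Assumption: salted SHA-256 stands in for the per-stage hash.
        digest = hashlib.sha256(bytes([stage]) + dest).digest()
        return int.from_bytes(digest[:8], "big") % self.buckets

    def update(self, dest, kind):
        if kind not in ("SYN", "FIN"):
            raise ValueError("kind must be 'SYN' or 'FIN'")
        delta = 1 if kind == "SYN" else -1
        for s, row in enumerate(self.counters):
            row[self._index(s, dest)] += delta

    def estimate(self, dest):
        """Smallest counter for dest; alert when this exceeds the threshold."""
        return min(row[self._index(s, dest)] for s, row in enumerate(self.counters))


pcf = PartialCompletionFilter()
sequence = [("SYN", b"x"), ("SYN", b"y"), ("SYN", b"z"), ("FIN", b"y"),
            ("SYN", b"x"), ("SYN", b"x"), ("FIN", b"z")]
for kind, dest in sequence:
    pcf.update(dest, kind)
print(pcf.estimate(b"x"))  # 3: y and z cancel out even if they collide with x
```

Because completed connections contribute +1 and then -1, colliding benign destinations net out to zero, which is why the estimate for x is exact here.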
Analysis 1: Benign but Malformed Connections
Model benign but malformed connections as adding an extra SYN or FIN to an interval with probability 0.5
[Figure: timeline over Intervals 1 to 4 with labels: long-lived connection, SYNy retransmissions, FINz retransmissions, SYNx, FINx.]
Analysis 2: using Gaussian approximation
[Figure: distribution of counter values. x-axis: Counter Values (-6 to 6, plus "greater than 6"); y-axis: Probability (0 to 0.45).]
Probability of false positives = 0.0013
Probability of false negatives = 0.0013
Pattern 4: Many Flows
Many Flows: In a measurement interval, find if number of tuples exceeds a threshold.
[Figure: packet stream S1 S6 S2 S5 S2 S2]
6 packets but only 4 distinct sources
Estan, Fisk, Varghese, IMC 2003, ACM TONS to appear
Simple Bitmap counting
[Figure: a hash of the flow identifier F selects one bit of the bitmap to set.]
Estimate: based on the number of bits set
Problem: bitmap takes too much memory to count a large number of flows
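A short sketch of simple bitmap counting. The slide only says the estimate is based on the number of bits set; the formula used here, b · ln(b / z) with z the number of zero bits, is the standard linear-counting estimate and is an assumption of this sketch, as are the function name and the SHA-256 hash.

```python
import hashlib
import math


def bitmap_estimate(flows, bits):
    """Estimate the number of distinct flows with a direct bitmap of `bits` bits."""
    bitmap = bytearray(bits)
    for flow in flows:
        digest = hashlib.sha256(flow).digest()
        bitmap[int.from_bytes(digest[:8], "big") % bits] = 1
    zeros = bits - sum(bitmap)
    if zeros == 0:
        return math.inf  # bitmap saturated: too many flows for this size
    # Linear-counting correction for flows that hashed to the same bit.
    return bits * math.log(bits / zeros)


stream = [b"S1", b"S6", b"S2", b"S5", b"S2", b"S2"]
print(bitmap_estimate(stream, bits=1024))  # close to 4; duplicates do not matter
```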
Sampled Bitmap counting
Solution: keep only a sample of the bitmap
Estimate: scale up sampled count
Problem: inaccurate if too few or too many flows
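One way to read "keep only a sample of the bitmap", sketched under assumptions: flows are hashed into a larger virtual bitmap, only the first `bits` positions are stored, and the count from the stored part is scaled up by the sampling factor. The names and the linear-counting formula are illustrative, not taken from the talk.

```python
import hashlib
import math


def sampled_bitmap_estimate(flows, bits, sample_factor):
    """Estimate distinct flows while storing only 1/sample_factor of a virtual bitmap."""
    virtual_bits = bits * sample_factor
    bitmap = bytearray(bits)
    for flow in flows:
        digest = hashlib.sha256(flow).digest()
        index = int.from_bytes(digest[:8], "big") % virtual_bits
        if index < bits:          # only the sampled portion is kept in memory
            bitmap[index] = 1
    zeros = bits - sum(bitmap)
    if zeros == 0:
        return math.inf           # sampled portion saturated
    # Count the sampled portion, then scale up by the sampling factor.
    return sample_factor * bits * math.log(bits / zeros)


flows = [("flow-%d" % i).encode() for i in range(2000)]
print(sampled_bitmap_estimate(flows, bits=256, sample_factor=16))  # in the vicinity of 2000
```

With very few flows the sampled portion may see none of them, and with very many it saturates, which is the inaccuracy the slide points out.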
Multi-resolution Bitmap counting
Solution: multiple bitmaps, each covering a different range
Estimate: use first bitmap that has less than 93.1% of its bits set, count, scale
[Figure: three bitmaps covering 1-10 flows, 10-100 flows, and 100-1000 flows.]
Scalable Bitmap counting
Solution: one bitmap with an additional scale factor that is increased when all bits are set
Estimate: count bits, correct, multiply by scale factor. Can count to 1 million using a 15-bit scale factor and a 32-bit vector
[Figure: the same bitmap covers 1-10 flows at time 0, starting with scale = 1, and later covers 100-1000 flows with scale = 100.]
Scaled Multi-resolution Bitmap counting
Solution: multiple bitmaps, each covering a different range but each with a scale factor
Estimate: use first bitmap that has less than 93.1% of its bits set, count, scale
[Figure: three bitmaps covering 1-10 flows, 10-100 flows, and 100-1000 flows, labeled with scale factors 5, 2, and 8.]
F. Shahid et al, U.S. Patent Application
Pattern 5: Concurrent Approximate State Machines
State Machine: In a measurement interval, detect the flows which hit a specified state machine (Bloom filter is a special case where state machine is a membership check)
[Figure: packet stream Bx IY X Y Px x Bx]
Flow X has two packets in B frame
Bonomi, Mitzenmacher, Panigrahy, Singh, Varghese 2006
Concurrent State Machines
• Implementation: We know 3 good implementations. The best of these uses a good hash table implementation (d-left) and simply substitutes the identifier of a flow with a smaller signature for the flow.
• Results: For 64 K flows, we need roughly 1 Mbit of memory.
• Applications: First, for video congestion control. We found good results by dropping B-frames during congestion and then tail-dropping till the next I-frame. Can handle twice the loss rates with same quality. Second, for P2P identification.
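The signature idea can be sketched as a table keyed by a short hash of the flow identifier rather than the identifier itself. This is only an illustration: a plain Python dict stands in for the d-left hash table, and the state names and transitions below are hypothetical.

```python
import hashlib


class ApproxStateTable:
    """Per-flow state keyed by a short signature instead of the full flow ID."""

    def __init__(self, signature_bits=16):
        self.signature_bits = signature_bits
        self.states = {}  # signature -> state; a dict stands in for d-left hashing

    def _signature(self, flow_id):
        digest = hashlib.sha256(flow_id).digest()
        return int.from_bytes(digest[:8], "big") >> (64 - self.signature_bits)

    def step(self, flow_id, event, transitions, start):
        """Advance the flow's state machine by one event and return the new state."""
        sig = self._signature(flow_id)
        current = self.states.get(sig, start)
        nxt = transitions.get((current, event), current)  # unknown events keep the state
        self.states[sig] = nxt
        return nxt


# Hypothetical two-state machine: track whether a video flow is inside a B frame.
transitions = {("idle", "B"): "in_b", ("in_b", "I"): "idle"}
table = ApproxStateTable()
print(table.step(b"flow-x", "B", transitions, "idle"))  # in_b
print(table.step(b"flow-x", "B", transitions, "idle"))  # in_b
print(table.step(b"flow-x", "I", transitions, "idle"))  # idle
```

Two flows with the same signature share one entry, which is the source of the approximation.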
Outline of Talk
• Part 1: Motivation
• Part 2: Basic Patterns and Algorithms
• Part 3: Combining base patterns to solve useful application problems (traffic matrix, DoS, worms)
• Part 4: Conclusions.
Application 1: Traffic Matrix
• Each entry router uses a multistage filter on traffic to destination prefixes to isolate subnets to which there is large traffic.
• Aggregating across all entry routers gives the “dominant” part of traffic matrix. ATT reports 80-20 rule for prefixes.
[Figure: ISP connecting Customer Sites 1, 2, and 3; reroute or add B/W]
Application 2: DoS Attacks
• Bandwidth attacks: (e.g., Smurf). Pound the victim with large traffic of a certain type. Use the heavy-hitter pattern relative to traffic type (e.g., ICMP) to find attacked destinations.
• Partial Completion attacks: (e.g., TCP SYN-Flood). May not involve unusual bandwidth but are characterized by partial connections. Use the partial completion pattern as a front-end for the Riverhead Guard module in Jaffa.
Application 4: Worm Detection
• Manual signature extraction: slow and enormous effort for each new worm.
• Automatic signature extraction of a specific worm by automatically detecting an abstract worm
[Figure: an ISP with infected hosts (Infected 1 through Infected N), a new victim, and an inactive address]
Sumeet Singh, G. Varghese, C. Estan, S. Savage, OSDI 2004, more in next class
Abstract Worm Definition and Detection
• F1, Content Repetition: The payload of the worm is seen frequently at the router. Use the heavy-hitter pattern with a hash H of the content as index. NetSift used large multistage filters. A variant of elephant traps invented by John Huber and Sumeet Singh seems to be the best solution for Hawkeye.
• F2, Increasing Infection Levels: The same content is disbursed to an increasing number of distinct source-destination pairs. Use the many-flows pattern with the content hash H as index.
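How F1 and F2 combine can be sketched as below. To keep the example short, an exact dictionary and exact sets stand in for the heavy-hitter structure and the many-flows bitmaps, so this shows the logic only, not the memory-efficient implementation. The function name and thresholds are hypothetical.

```python
import hashlib
from collections import defaultdict


def detect_abstract_worm(packets, repeat_threshold, spread_threshold):
    """Return content hashes that are both repeated (F1) and widely spread (F2).

    packets: iterable of (source, destination, payload) tuples
    """
    repeats = defaultdict(int)   # content hash -> number of times seen (F1)
    pairs = defaultdict(set)     # content hash -> distinct (src, dst) pairs (F2)
    suspects = set()
    for src, dst, payload in packets:
        content_hash = hashlib.sha256(payload).digest()[:8]
        repeats[content_hash] += 1
        pairs[content_hash].add((src, dst))
        if (repeats[content_hash] >= repeat_threshold
                and len(pairs[content_hash]) >= spread_threshold):
            suspects.add(content_hash)
    return suspects


worm = [("10.0.0.%d" % i, "10.0.1.%d" % i, b"WORM-PAYLOAD") for i in range(5)]
benign = [("10.0.0.9", "10.0.1.9", b"popular-page")] * 5  # repeated, but one pair
print(len(detect_abstract_worm(worm + benign, repeat_threshold=3, spread_threshold=3)))  # 1
```

The benign content passes F1 but not F2, so only the worm payload is reported.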
Hashing Implementation
• Need a hash function, especially for content, that is easy to compute and random.
• NetSift used a Rabin hash function, but that requires multiplies. For Bloom Filters, one large hash can be computed and a different portion taken for each stage.
• Much nicer hash function using Galois multiplication (Xor and Shift)
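Taking one portion of a large hash per stage can be sketched as below. SHA-256 is used only because it is readily available in Python; it is neither the Rabin hash nor the Galois-multiplication hash mentioned above, and the function name is illustrative.

```python
import hashlib


def stage_indexes(key, stages, index_bits):
    """Split one large hash of `key` into `stages` indexes of `index_bits` bits each."""
    if stages * index_bits > 256:
        raise ValueError("SHA-256 provides only 256 bits")
    value = int.from_bytes(hashlib.sha256(key).digest(), "big")
    mask = (1 << index_bits) - 1
    # Stage s uses bits [s * index_bits, (s + 1) * index_bits) of the hash.
    return [(value >> (s * index_bits)) & mask for s in range(stages)]


# Four stages of 2^14 = 16384 bits each, indexed from a single hash computation.
print(stage_indexes(b"S2", stages=4, index_bits=14))
```

Because the portions do not overlap, the per-stage indexes behave like independent hashes as long as the underlying hash is well mixed.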
Conclusions
• Introspection/Pattern detection can be useful for the next generation of networks. Beyond faster-cheaper
• Can implement base patterns at high speeds.
• Base patterns can be combined to solve useful application issues (traffic matrix, DoS, worms, etc.)
• Only scratching surface: need to build a library of patterns.
Introspection at UCSD
Ramana Kompella Cristian Estan Sumeet Singh