Page 1: Measurement Algorithms: Bloom Filters and Beyond

Measurement Algorithms: Bloom Filters and Beyond

George Varghese

University of California, San Diego

Page 2: Measurement Algorithms: Bloom Filters and Beyond

Network Evolution?

1. Basic: stateless, transparent.
   Tools: protocol design (e.g., soft-state)

2. Active: customizable, re-configurable.
   Tools: code safety (e.g., sandboxing)

3. Introspective: pattern detection/response.
   Tools: streaming algorithms, statistical inference (e.g., Bloom filters, sampling)

Hawkeye enables introspection for measurement & security

Page 3: Measurement Algorithms: Bloom Filters and Beyond

What is Introspection?

Detecting patterns in data traffic, either in real time or based on packet logs. Examples:

Measurement Introspection: Identify resource usage patterns for better resource management.

Security Introspection: Identify attack patterns to mitigate or prevent attacks.

Fault Introspection: Identify fault or anomaly patterns to allow automated fault repair.

Motivated by market pull and technology push

Page 4: Measurement Algorithms: Bloom Filters and Beyond

Market Pull

• Better ROI: Optimize network resources (BGP policy, OSPF weights, light up fibers, add bandwidth) based on resource usage patterns.

• Better security: Allowing an organization to stay open for business during mass or targeted attacks is a major differentiator.

• Better Fault Detection: Many performance anomalies can be detected by better measurement primitives (e.g., Goldman-Sachs)

[Figure: Customer Site 1, Customer Site 2, and Customer Site 3; reroute or add B/W.]

Page 5: Measurement Algorithms: Bloom Filters and Beyond

Technology Push: Streaming Algorithms and Hardware Gates

• Algorithms: Recent major thrust in streaming algorithms in database, web analysis, theory, networks

• Hardware: Memory accesses remain expensive (< 100) and SRAM not scaling as fast as number of connections (< 32 Mbits), but gates are plentiful.

• Mapping: Many randomized streaming algorithms (e.g., Bloom Filters, Min-wise hashing) developed to find patterns in disk logs map well to network ASICs.

• Opportunity: Invent or adapt streaming algorithms for networking patterns.

Page 6: Measurement Algorithms: Bloom Filters and Beyond

Concerns about Network Introspection

• Speed: Can hardware run fast enough? Recall IP lookups in the 1990s; surprisingly complex things (branch predictors, TCP Offload) are being done routinely today. Most of the algorithms described below are being implemented at 24 Gbps in Hawkeye.

• Inflexible: Hardware is not easy to change. Design hardware to identify useful “primitive” patterns that can be combined (exactly what Hawkeye does). Network processors can offer flexibility & speed.

• End-to-end argument: Not a simple, stateless core. Not required for correctness of basic forwarding, but only as an optimization or value-add.

Page 7: Measurement Algorithms: Bloom Filters and Beyond

Introspection as Pattern Detection

• Within Packet Patterns: Prefix matches, classification, signature detection (e.g., Code Red Payload)

• Across Packet Patterns: Scheduling, timing, membership checks, heavy-hitters, large flows, partial completion, counting flows

[Figure: packet sequence S1 S2 S2 S5 S2 S1 arriving at a ROUTER.]

Page 8: Measurement Algorithms: Bloom Filters and Beyond

Pattern Detection Algorithm Requirements

• Low memory: On-chip SRAM is limited to around 10-32 Mbits. This is not constant, but it is not scaling with the number of concurrent conversations. May need to replicate.

• Small processing: For wire speed at 40 Gbps with 40-byte packets, there are 8 nsec per packet. With 1 nsec SRAM, that is 8 memory accesses. A factor of 30 in parallelism buys 240 accesses.
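The per-packet budget follows from simple arithmetic. A small sketch that reproduces the numbers on this slide (the function name and parameters are illustrative, not from the talk):

```python
def per_packet_budget(link_gbps, packet_bytes, sram_ns, parallelism=1):
    """Return (time per packet in ns, memory accesses available per packet)."""
    packet_bits = packet_bytes * 8
    # bits divided by Gbit/s gives nanoseconds directly
    time_ns = packet_bits / link_gbps
    # serial accesses that fit in the packet time, multiplied by parallel units
    accesses = int(time_ns / sram_ns) * parallelism
    return time_ns, accesses


time_ns, serial = per_packet_budget(link_gbps=40, packet_bytes=40, sram_ns=1)
_, parallel = per_packet_budget(link_gbps=40, packet_bytes=40, sram_ns=1, parallelism=30)
print(time_ns, serial, parallel)  # 8.0 8 240
```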

Page 9: Measurement Algorithms: Bloom Filters and Beyond

Talk Outline

• Part 1: Motivation

• Part 2: Basic Patterns and Algorithms (membership checks, heavy-hitters, many flows, partial completion)

• Part 3: Combining patterns to solve useful application problems

• Part 4: Conclusions.

Page 10: Measurement Algorithms: Bloom Filters and Beyond

Pattern 1: Membership Check

Membership Check: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) on a link that belong to a pre-specified set (e.g., a black list).

[Figure: packet sequence S1 S6 S2 S5 S2 S8; the set contains only S2 and S5.]

B. Bloom, Comm. ACM, July 1970

Page 11: Measurement Algorithms: Bloom Filters and Beyond

Membership Check via Bloom Filter

[Figure: Field Extraction feeds Hash 1, Hash 2, and Hash 3, which index a BitMap in Stage 1, Stage 2, and Stage 3; the bits are set from the Set. Each stage tests "Equal to 1?". ALERT if all bits are set.]
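A minimal sketch of the staged membership check, assuming one bitmap per stage and one hash per stage. The class name, the use of SHA-256, and the one-byte-per-bit storage are illustrative choices, not details from the talk; the stage hashes are taken as portions of one large hash.

```python
import hashlib


class BloomFilter:
    """Staged Bloom filter: one bitmap per stage, one hash per stage."""

    def __init__(self, stages=3, bits_per_stage=1024):
        if not 1 <= stages <= 8:
            raise ValueError("this sketch supports 1 to 8 stages")
        self.stages = stages
        self.bits_per_stage = bits_per_stage
        # One byte per bit keeps the sketch simple; hardware would pack bits.
        self.bitmaps = [bytearray(bits_per_stage) for _ in range(stages)]

    def _indexes(self, key):
        # One large hash (32 bytes), cut into 4-byte portions, one per stage.
        digest = hashlib.sha256(key.encode()).digest()
        for stage in range(self.stages):
            chunk = digest[4 * stage: 4 * stage + 4]
            yield stage, int.from_bytes(chunk, "big") % self.bits_per_stage

    def add(self, key):
        for stage, index in self._indexes(key):
            self.bitmaps[stage][index] = 1

    def __contains__(self, key):
        # Alert only if the bit is set in every stage.
        return all(self.bitmaps[stage][index] for stage, index in self._indexes(key))


black_list = BloomFilter()
for member in ("S2", "S5"):
    black_list.add(member)

for source in ("S1", "S6", "S2", "S5", "S2", "S8"):
    if source in black_list:
        print("ALERT", source)
```

Members of the set always raise an alert. A non-member raises one only if it collides with set bits in every stage, which is the false-positive case analyzed on the next slides.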

Page 12: Measurement Algorithms: Bloom Filters and Beyond

Trivial Bloom Filter Analysis

Assume set of size 1000. Bound probability that a flow F not in set gets through 4 stages of size 10000 each.

• Why trouble?: F can pass a stage if it hashes to a bit set by some real member of the set.

• Single stage probability: At most 1000 of the 10,000 buckets can have set bits. Thus the probability of F passing a stage is less than 1000/10,000 = 0.1

• Multistage probability: To be branded, F must be unlucky in all 4 stages, with a probability of no more than 0.1^4 = 10^-4, which is very small. Can play with numbers.

Page 13: Measurement Algorithms: Bloom Filters and Beyond

Accurate Bloom Filter Analysis

Assume set of size 1000. Bound probability that a flow F not in set gets through 4 stages of size 10000 each. The previous analysis ignores bit collisions.

• Single stage probability: Probability of F passing a stage is s = 1 – (1 – 1/10,000)^1000 ≈ 1 – e^{-0.1} ≈ 0.095

• Multistage probability: To be branded, F must be unlucky in all 4 stages, with a probability of no more than s^4, which is very small.
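The two analyses can be compared numerically. This sketch uses the slide's parameters (set of 1000, stages of 10,000 bits, and the 4 stages from the problem statement); the variable names are ours.

```python
import math

set_size = 1000
stage_bits = 10_000
stages = 4

# Trivial bound: every member sets a distinct bit.
trivial = set_size / stage_bits
# Accurate value: accounts for members colliding on the same bit.
exact = 1 - (1 - 1 / stage_bits) ** set_size
# Exponential approximation of the accurate value.
approx = 1 - math.exp(-set_size / stage_bits)

print(f"per stage: trivial={trivial:.4f} exact={exact:.4f} approx={approx:.4f}")
print(f"all {stages} stages: trivial={trivial ** stages:.2e} exact={exact ** stages:.2e}")
```

The accurate per-stage probability is slightly below the trivial bound of 0.1, so the trivial analysis is conservative.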

Page 14: Measurement Algorithms: Bloom Filters and Beyond

Applications

• Replacement for a hash table: useful when storage is important, identifiers are long, false positives are acceptable, & membership check suffices

• Example 1: String Matching: exact matching of up to 4000 strings of 40 bytes each using only on-chip SRAM.

• Example 3: Reporting

Page 15: Measurement Algorithms: Bloom Filters and Beyond

Example 1: String Matching

[Figure: String Database to Block with strings ST0, ST1, ST2, ..., STn and Anchor Strings A0, A1, A2, ..., An; Hash Function; Multi Stage Filter.]

Sushil Singh, G. Varghese, J. Huber, Sumeet Singh, Patent Application

Page 16: Measurement Algorithms: Bloom Filters and Beyond

String Matching Continued: String Grouping

[Figure: anchors A0, A1, A2, ..., An with strings ST0, ST1, ST2, ..., STn; a Hash Function maps them into Hash Bucket-0, Hash Bucket-1, ..., Hash Bucket-m.]

Page 17: Measurement Algorithms: Bloom Filters and Beyond

String Matching Continued: Bit Trees

[Figure: Strings in a single hash bucket (anchors A2, A8, A11, A17 with strings ST2, ST8, ST11, ST17) arranged in a bit tree that branches on the 0/1 values at bit locations L1, L2, and L3.]

Page 18: Measurement Algorithms: Bloom Filters and Beyond

Example 3: Scalable Reporting (Carousel, NSDI 2010)

• Problem: When a worm breaks out, how do we report all infected machines? Logging packets with the worm pattern can result in millions of sources and many duplicates.

• Solution: Use a sampled Bloom filter and a “more” bit. Start with no sampling. Any source IP in a worm packet is reported and placed in a Bloom filter of size B to suppress duplicates. Stop when B sources have been reported and set the “more” bit.

• Recursive Solution: If the “more” bit is set, repeat the algorithm twice, once for LSB of Hash(SourceIP) = 0 and once for 1. If there are still more, repeat it four times. Nearly optimal solution.
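A simplified offline sketch of the recursive idea, under several assumptions that are ours rather than the talk's: a Python set stands in for the Bloom filter of size B, the same packet list is replayed for each pass (modeling a worm that keeps sending), and the function names and SHA-256 hash are illustrative.

```python
import hashlib


def source_hash(source):
    """Stable hash of a source identifier (illustrative choice)."""
    return int.from_bytes(hashlib.sha256(source.encode()).digest()[:4], "big")


def one_pass(packets, capacity, level, partition):
    """Report distinct sources whose low `level` hash bits equal `partition`.

    Returns (reported, more); more=True means `capacity` sources were reported
    and the pass stopped early.
    """
    seen = set()  # stand-in for the Bloom filter of size B
    mask = (1 << level) - 1
    for source in packets:
        if source_hash(source) & mask != partition:
            continue  # not sampled in this pass
        if source in seen:
            continue  # duplicate suppressed
        seen.add(source)
        if len(seen) == capacity:
            return seen, True
    return seen, False


def carousel(packets, capacity, max_levels=16):
    """Double the number of partitions until no pass sets the "more" bit."""
    for level in range(max_levels + 1):
        reported, more = set(), False
        for partition in range(1 << level):
            part, part_more = one_pass(packets, capacity, level, partition)
            reported |= part
            more = more or part_more
        if not more:
            return reported
    raise RuntimeError("partitions never became small enough")


packets = [f"10.0.0.{i % 50}" for i in range(500)]
infected = carousel(packets, capacity=8)
print(len(infected))  # 50
```

Each level samples on one more bit of the hashed source, so the per-pass memory stays at B while the full set is still collected.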

Page 19: Measurement Algorithms: Bloom Filters and Beyond

Timed Bloom Filters

• Question: How can we add notion of time to Bloom Filters without lots of memory?

• Solution: Use 2 Bloom Filters, Old and New.
  Insert: Insert into New.
  Search: Search in both New and Old.
  Age every T seconds: Old := New; New := Empty.

• Property: Any entry not refreshed for 2T is deleted. An entry refreshed within T is in.
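The Old/New scheme can be sketched as follows. For brevity two Python sets stand in for the two Bloom filters, and time is passed in explicitly so the behavior is deterministic; both are assumptions of this sketch, as are the class and method names.

```python
class TimedFilter:
    """Two-generation membership structure aged every `period` seconds."""

    def __init__(self, period):
        self.period = period   # T in the slide
        self.new = set()       # stand-in for the New Bloom filter
        self.old = set()       # stand-in for the Old Bloom filter
        self.last_age = 0.0

    def _age(self, now):
        # Apply every aging step that is due: Old := New; New := Empty.
        while now - self.last_age >= self.period:
            self.old, self.new = self.new, set()
            self.last_age += self.period

    def insert(self, key, now):
        self._age(now)
        self.new.add(key)

    def search(self, key, now):
        self._age(now)
        return key in self.new or key in self.old


f = TimedFilter(period=10)
f.insert("flow-a", now=0)
print(f.search("flow-a", now=5))    # True: still in New
print(f.search("flow-a", now=15))   # True: moved to Old
print(f.search("flow-a", now=25))   # False: aged out after 2T without refresh
```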

U.S Patent Application: Paul Owen & Andy Fingerhut et al

Page 20: Measurement Algorithms: Bloom Filters and Beyond

Pattern 2a: Heavy-hitters with Threshold

Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) that send more than a threshold T (say 1% of the traffic) on a link.

[Figure: packet sequence S1 S6 S2 S5 S2 S2]

Source S2 is 30 percent of traffic sequence

Estan,Varghese, ACM TOCS 2003

Page 21: Measurement Algorithms: Bloom Filters and Beyond

Heavy Hitters with Multistage Filters

[Figure: Field Extraction feeds Hash 1, Hash 2, and Hash 3, which each Increment a counter in Stage 1, Stage 2, and Stage 3. Each stage tests its counter against T. ALERT if all counters are above threshold T.]
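A minimal sketch of a multistage filter, assuming byte counters, one hash per stage, and an alert when every counter for the flow has reached the threshold. The class name, the SHA-256 hash, and the ">=" comparison are illustrative choices.

```python
import hashlib


class MultistageFilter:
    """Counting analogue of a staged Bloom filter for heavy-hitter detection."""

    def __init__(self, stages, counters_per_stage, threshold):
        if not 1 <= stages <= 8:
            raise ValueError("this sketch supports 1 to 8 stages")
        self.stages = stages
        self.counters_per_stage = counters_per_stage
        self.threshold = threshold
        self.counters = [[0] * counters_per_stage for _ in range(stages)]

    def _indexes(self, flow):
        # One large hash, cut into 4-byte portions, one per stage.
        digest = hashlib.sha256(flow.encode()).digest()
        for stage in range(self.stages):
            chunk = digest[4 * stage: 4 * stage + 4]
            yield stage, int.from_bytes(chunk, "big") % self.counters_per_stage

    def update(self, flow, size):
        """Add `size` to the flow's counter in each stage.

        Returns True if all of the flow's counters have reached the threshold.
        """
        hot = True
        for stage, index in self._indexes(flow):
            self.counters[stage][index] += size
            if self.counters[stage][index] < self.threshold:
                hot = False
        return hot


msf = MultistageFilter(stages=3, counters_per_stage=64, threshold=100)
alerts = [msf.update("S2", 10) for _ in range(10)]
print(alerts[-1])  # True once S2 alone has sent 100 bytes
```

A small flow is flagged only if it shares a hot counter with other traffic in every stage, which is what the analysis slide bounds.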

Page 22: Measurement Algorithms: Bloom Filters and Beyond

Multistage filters in Action

[Figure: counters in Stage 1, Stage 2, and Stage 3 compared against the Threshold. Grey = other flows, Yellow = small flow, Green = large flow.]

Page 23: Measurement Algorithms: Bloom Filters and Beyond

Multistage Filter Analysis

Assume 1 percent threshold. Bound probability that a flow F of 0.1% or less gets through 6 stages of size 1000 each.

• Why trouble?: F can fall into a “hot” bucket if and only if the sum of traffic of all other flows in that bucket is more than 0.9%.

• Single stage probability: At most 100/0.9 = 111 buckets can be over 0.9% before we bring on F. Thus the probability that F falls in a “hot” bucket is less than 111/1000 = 0.111.

• Multistage probability: To be branded, F must be unlucky in all 6 stages, with a probability of no more than 0.111^6, which is very small. Thus at most 1000 false positives with very high probability.
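The bound can be checked with a few lines of arithmetic. The variable names are ours; the numbers are the slide's.

```python
threshold_pct = 1.0    # heavy-hitter threshold, percent of traffic
flow_pct = 0.1         # size of the small flow F, percent of traffic
buckets = 1000         # counters per stage
stages = 6

# Other traffic needed in a bucket before F can push it over the threshold.
others_needed = threshold_pct - flow_pct
# Total traffic is 100 percent, so at most this many buckets can be that full.
max_hot_buckets = int(100 / others_needed)
p_stage = max_hot_buckets / buckets
p_all = p_stage ** stages

print(max_hot_buckets, p_stage, f"{p_all:.2e}")  # 111 0.111 1.87e-06
```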

Page 24: Measurement Algorithms: Bloom Filters and Beyond

Pattern 2b: Top K heavy-hitters

Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) that are the top K talkers on a link.

[Figure: packet sequence S1 S1 S2 S5 S2 S2]

Sources S2 and S1 are the top K talkers

Bonomi, Prabhakar, Zhang, Wu, Cisco Internal

Page 25: Measurement Algorithms: Bloom Filters and Beyond

Two simpler proposals

• SIFT (Prabhakar) and Sample-and-Hold (Estan-Varghese) both suggest sampling a packet with small probability p. Once sampled, place in CAM and watch all packets

• Idea: large flows are more likely to be sampled, and then we get exact counts

• Problem: CAM quickly gets muddied with mice and then elephants can be lost.
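A minimal sketch of the sampling idea as stated on this slide: sample a packet with probability p, and once a flow is in the table, count every later packet exactly. A dictionary stands in for the CAM; the function name and the seeded random generator are illustrative assumptions.

```python
import random


def sample_and_hold(packets, p, rng=None):
    """Return {flow: counted size} for flows that were sampled and then held."""
    rng = rng or random.Random(0)
    table = {}  # stand-in for the CAM
    for flow, size in packets:
        if flow in table:
            # Held flows are counted exactly from the sampling point onward.
            table[flow] += size
        elif rng.random() < p:
            # Large flows send more packets, so they are more likely to get here.
            table[flow] = size
    return table


packets = [("elephant", 1)] * 1000 + [(f"mouse-{i}", 1) for i in range(1000)]
random.Random(1).shuffle(packets)
table = sample_and_hold(packets, p=0.01)
print(sorted(table.items(), key=lambda kv: -kv[1])[:3])
```

Nothing here bounds the table size, so many small flows can still occupy entries; that is the problem the last bullet points out.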

Page 26: Measurement Algorithms: Bloom Filters and Beyond

Elephant Traps: Sample and Recycle

Two Key Knobs: Sampling Probability p, Counter Threshold K

Each entry in the ET holds a Masked PIV1 value (x, y, m, a, b, c, d, ...) with a Counter and Timestamp (e.g., C = 2, Ts = 100; C = 0, Ts = 120; C = 6, Ts = 20).

Packet Match: Masked PIV1 in ET? If yes (e.g., Masked PIV1 = m), update entry.

Insert entry only if:
a) no match, and
b) a coin toss with small probability p is successful, and
c) there is room or the front entry is evicted
Otherwise, Masked PIV1 is not inserted in the ET.

Evict if:
a) FIFO is full and
b) (C < K) and (t – Ts >= T_min)

Recycle if:
a) FIFO is full and
b) (C >= K) or (t – Ts < T_min)

C = C/D

[Figure: flowchart with decision boxes Match?, Probabilistic Sampling p, CMF Select?, and CMF ADD?]

Page 27: Measurement Algorithms: Bloom Filters and Beyond

Pattern 3: Partial Completion

Partial Completion: In a measurement interval, detect the flows (e.g., destinations) which have several Start Packets (e.g., SYN) without the corresponding End (e.g., FIN).

Destination X has 3 partial completions in sequence

SYNx SYNY SYNz FINY SYNx SYNx FINZ

Kompella,Singh,Varghese, IMC 2003

Page 28: Measurement Algorithms: Bloom Filters and Beyond

Partial Completion Filters

[Figure: Field Extraction feeds Hash 1, Hash 2, and Hash 3, which index counters in Stage 1, Stage 2, and Stage 3. Increment for SYN, Decrement for FIN. Each stage tests its counter against T. ALERT if all counters are above threshold.]
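A minimal sketch of a partial completion filter: the same staged counters as a multistage filter, incremented on SYN and decremented on FIN. The class and method names, the SHA-256 hash, and the ">=" comparison are illustrative assumptions.

```python
import hashlib


class PartialCompletionFilter:
    """Staged counters: +1 for a start packet (SYN), -1 for an end packet (FIN)."""

    def __init__(self, stages=3, counters_per_stage=64, threshold=3):
        if not 1 <= stages <= 8:
            raise ValueError("this sketch supports 1 to 8 stages")
        self.stages = stages
        self.counters_per_stage = counters_per_stage
        self.threshold = threshold
        self.counters = [[0] * counters_per_stage for _ in range(stages)]

    def _indexes(self, dest):
        digest = hashlib.sha256(dest.encode()).digest()
        for stage in range(self.stages):
            chunk = digest[4 * stage: 4 * stage + 4]
            yield stage, int.from_bytes(chunk, "big") % self.counters_per_stage

    def update(self, dest, kind):
        delta = {"SYN": 1, "FIN": -1}[kind]
        for stage, index in self._indexes(dest):
            self.counters[stage][index] += delta

    def is_suspect(self, dest):
        # Alert only if every stage shows enough unmatched SYNs.
        return all(self.counters[stage][index] >= self.threshold
                   for stage, index in self._indexes(dest))


pcf = PartialCompletionFilter(threshold=3)
sequence = [("SYN", "X"), ("SYN", "Y"), ("SYN", "Z"), ("FIN", "Y"),
            ("SYN", "X"), ("SYN", "X"), ("FIN", "Z")]
for kind, dest in sequence:
    pcf.update(dest, kind)
print(pcf.is_suspect("X"))  # True: X has 3 SYNs and no FIN
```

Because completed connections cancel out, colliding benign flows contribute roughly zero to a counter, which is why this filter can use fewer counters than a heavy-hitter filter.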

Page 29: Measurement Algorithms: Bloom Filters and Beyond

Analysis 1: Benign but Malformed Connections

[Figure: timeline across Interval 1, Interval 2, Interval 3, and Interval 4 showing a Long Lived Connection, SYNy, Retransmissions, FINz, Retransmissions, and SYNx FINx.]

Model benign but malformed connections as adding an extra SYN or FIN to an interval with probability 0.5

Page 30: Measurement Algorithms: Bloom Filters and Beyond

Analysis 2: using Gaussian approximation

[Figure: distribution of counter values. x-axis: Counter Values, from -6 to 6 plus "Greater than 6"; y-axis: Probability, from 0 to 0.45.]

Probability of false positives = 0.0013

Probability of false negatives = 0.0013

Page 31: Measurement Algorithms: Bloom Filters and Beyond

Pattern 4: Many Flows

Many Flows: In a measurement interval, find if number of tuples exceeds a threshold.

[Figure: packet sequence S1 S6 S2 S5 S2 S2: 6 packets but only 4 distinct sources.]

Estan, Fisk, Varghese, IMC 2003, ACM TONS to appear

Page 32: Measurement Algorithms: Bloom Filters and Beyond

Simple Bitmap counting

Hash based on flow identifier F sets one bit in the bitmap.

Estimate: based on the number of bits set

Problem: bitmap takes too much memory to count a large number of flows

[Figure: bitmap with bits set by hashing flow identifiers.]
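A minimal sketch of bitmap counting. The slide only says the estimate is based on the number of bits set; this sketch assumes the standard linear-counting correction, n ≈ b · ln(b / z), where b is the bitmap size and z the number of zero bits. The function name and SHA-256 hash are illustrative.

```python
import hashlib
import math


def estimate_flows(flow_ids, bitmap_bits=64):
    """Estimate the number of distinct flows from a hashed bitmap."""
    bitmap = [0] * bitmap_bits
    for flow in flow_ids:
        digest = hashlib.sha256(flow.encode()).digest()
        # Repeated packets of one flow hit the same bit, so duplicates are free.
        bitmap[int.from_bytes(digest[:4], "big") % bitmap_bits] = 1
    zeros = bitmap.count(0)
    if zeros == 0:
        return float("inf")  # bitmap saturated: too many flows for this size
    # Correct for distinct flows that collided on the same bit.
    return bitmap_bits * math.log(bitmap_bits / zeros)


packets = ["S1", "S6", "S2", "S5", "S2", "S2"]
print(round(estimate_flows(packets), 2))
```

The saturated case shows the memory problem: the bitmap must be comparable in size to the number of flows being counted.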

Page 33: Measurement Algorithms: Bloom Filters and Beyond

Sampled Bitmap counting

Problem: inaccurate if too few or too many flows

Solution: keep only a sample of the bitmap


Estimate: scale up sampled count

Page 34: Measurement Algorithms: Bloom Filters and Beyond

Multi-resolution Bitmap counting

Solution: multiple bitmaps, each covering a different range

Estimate: use first bitmap that has less than 93.1% of its bits set, count, scale

1-10 flows

10-100

100-1000

Page 35: Measurement Algorithms: Bloom Filters and Beyond

Scalable Bitmap counting

Solution: one bitmap with an additional scale factor that is increased when all bits are set

Estimate: count bits, correct, multiply by scale factor. Can count to 1 million using a 15-bit scale factor and a 32-bit vector.

[Figure: the same bitmap covers 1-10 flows at time 0 (scale = 1) and is later used for 100-1000 flows (scale = 100).]

Page 36: Measurement Algorithms: Bloom Filters and Beyond

Scaled Multi-resolution Bitmap counting

Solution: multiple bitmaps, each covering a different range but each with a scale factor

Estimate: use first bitmap that has less than 93.1% of its bits set, count, scale

[Figure: bitmaps covering 1-10 flows, 10-100, and 100-1000, each with its own scale factor (Scale = 5, Scale = 2, and Scale = 8 are shown).]

F. Shahid et al, U.S. Patent Application

Page 37: Measurement Algorithms: Bloom Filters and Beyond

Pattern 5: Concurrent Approximate State Machines

State Machine: In a measurement interval, detect the flows which hit a specified state machine (Bloom filter is a special case where state machine is a membership check)

Flow X has two packets in B frame

Bx IY X Y Px x Bx

Bonomi, Mitzenmacher, Panigrahy, Singh, Varghese 2006

Page 38: Measurement Algorithms: Bloom Filters and Beyond

Concurrent State Machines

• Implementation: We know 3 good implementations. The best of these uses a good hash table implementation (d-left) and simply substitutes the identifier of a flow with a smaller signature for the flow.

• Results: For 64 K flows, we need roughly 1 Mbit of memory.

• Applications: First, for video congestion control. We found good results by dropping B-frames during congestion and then tail-dropping till the next I-frame. Can handle twice the loss rates with same quality. Second, for P2P identification.

Page 39: Measurement Algorithms: Bloom Filters and Beyond

Outline of Talk

• Part 1: Motivation

• Part 2: Basic Patterns and Algorithms

• Part 3: Combining base patterns to solve useful application problems (traffic matrix, DoS, worms)

• Part 4: Conclusions.

Page 40: Measurement Algorithms: Bloom Filters and Beyond

Application 1: Traffic Matrix

• Each entry router uses a multistage filter on traffic to destination prefixes to isolate subnets to which there is large traffic.

• Aggregating across all entry routers gives the “dominant” part of the traffic matrix. AT&T reports an 80-20 rule for prefixes.

[Figure: ISP connecting Customer Site 1, Customer Site 2, and Customer Site 3; reroute or add B/W.]

Page 41: Measurement Algorithms: Bloom Filters and Beyond

Application 2: DoS Attacks

• Bandwidth attacks (e.g., Smurf): Pound victim with large traffic of a certain type. Use heavy-hitter pattern relative to traffic type (e.g., ICMP) to find attacked destinations.

• Partial Completion attacks (e.g., TCP SYN-Flood): May not be unusual bandwidth but characterized by partial connections. Use partial completion pattern as a front-end for Riverhead Guard module in Jaffa.

Page 42: Measurement Algorithms: Bloom Filters and Beyond

Application 4: Worm Detection

• Manual signature extraction: slow and enormous effort for each new worm.

• Automatic signature extraction of a specific worm by automatically detecting an abstract worm

[Figure: ISP with Infected 1 through Infected N, a New Victim, and an Inactive Address.]

Sumeet Singh, G. Varghese, C. Estan, S. Savage, OSDI 2004, more in next class

Page 43: Measurement Algorithms: Bloom Filters and Beyond

Abstract Worm Definition and Detection

• F1, Content Repetition: Payload of worm is seen frequently at router. Use heavy-hitter pattern with hash H of content as index. NetSift used large multistage filters. A variant of elephant traps invented by John Huber and Sumeet Singh seems to be the best solution for Hawkeye.

• F2, Increasing Infection Levels: Same content is disbursed to increasing number of distinct source-destination pairs. Use many flows pattern with content hash H as index

Page 44: Measurement Algorithms: Bloom Filters and Beyond

Hashing Implementation

• Need a hash function, especially for content, that is easy to compute and random.

• NetSift used a Rabin hash function but that requires multiplies. For Bloom Filters can make 1 large hash and take portions/stage

• Much nicer hash function using Galois multiplication (Xor and Shift)

Page 45: Measurement Algorithms: Bloom Filters and Beyond

Conclusions

• Introspection/Pattern detection can be useful for the next generation of networks. Beyond faster-cheaper

• Can implement base patterns at high speeds.

• Base patterns can be combined to solve useful application issues (traffic matrix, DoS, worms, etc.)

• Only scratching surface: need to build a library of patterns.

Page 46: Measurement Algorithms: Bloom Filters and Beyond

Introspection at UCSD

Ramana Kompella Cristian Estan Sumeet Singh