Post on 27-May-2020


Introduction | Bloom Filters: the classic | Bloom Filters: variants | Conclusions

Bloom Filters: general theory and variants

G. Caravagna, [email protected]

Information Retrieval

Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered.

When using a Bloom Filter, consider the effects of false positives.

G. Caravagna, Bloom Filters, general theory and variants

Index

1 Introduction
  - The problem

2 Bloom Filters: the classic
  - Main idea
  - Mathematics

3 Bloom Filters: variants
  - Spectral Bloom Filters
  - Bloomier Bloom Filters
  - Others

4 Conclusions
  - Applications
  - References


Membership query

Definition: The Membership Problem

- Given a set S and an element y, decide whether y ∈ S.
- Equivalently, given a set S, compute its characteristic function $\chi_S$:

$\chi_S(y) = \begin{cases} 1, & \text{if } y \in S \\ 0, & \text{if } y \notin S \end{cases}$

Well-known solutions

- Linear Scan
- Deterministic Arrays
- Hash Functions


Well-known Solutions

Linear Scan
- not an option at our scale

Deterministic Arrays (compute $\chi_S$ exactly)
- the elements of S belong to a finite universe
- use a boolean array as big as the universe
- map the elements of the universe onto it

Hash Functions (approximate $\chi_S$)
- use α bits for each element of S (usually α = log n)
- sort the hashed values (sort the α-tuples)
- what may happen with collisions?

Bloom Filters use these as a starting point.
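The hash-function approach listed above can be sketched as follows: keep one α-bit fingerprint per element, sort the fingerprints, and answer queries by binary search. This is only an illustration; the names `fingerprint`, `build_index` and `maybe_member` are hypothetical, and SHA-256 is an arbitrary stand-in for the unspecified hash function.

```python
import bisect
import hashlib


def fingerprint(item: str, alpha: int) -> int:
    """Return an alpha-bit hash of item (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << alpha)


def build_index(items, alpha: int):
    """Hash every element, then sort the fingerprints: O(n log n)."""
    return sorted(fingerprint(x, alpha) for x in items)


def maybe_member(index, item: str, alpha: int) -> bool:
    """Binary search for the fingerprint of item: O(log n).

    True may be a false positive (two items can share a fingerprint);
    False is always correct.
    """
    fp = fingerprint(item, alpha)
    pos = bisect.bisect_left(index, fp)
    return pos < len(index) and index[pos] == fp


S = ["TTA", "TCT", "ATA"]
index = build_index(S, alpha=16)
print(all(maybe_member(index, x, 16) for x in S))  # members are never missed
```

A collision between the fingerprint of a non-member and that of a member is exactly what produces a wrong "yes" here, which is the question the slide leaves open.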


Reminder

Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered.

When using a Bloom Filter, consider the effects of false positives.

Bloom[1970]


Preconditions

- given a set of objects $S = \{x_1, \dots, x_n\}$
  - no restrictions on the objects
- a vector B of m bits, where $b_i \in \{0, 1\}$
  - the value of m is discussed later
- suppose we have k hash functions $h_1, \dots, h_k$
  - each $h_i$ is defined as $h_i : U \supseteq S \to [1; m]$
  - $h_i$ indexes the vector B


Building vector B

How to build B:

$\forall i.\; b_i = 1 \iff \exists (j, t).\; h_j(x_t) = i$

[Figure: two elements $x_i$ and $x_j$ are each mapped by the k hash functions to positions of the bit vector B; the positions they hit are set to 1.]


procedure build
begin
  for each s in S do          // Θ(|S|) iterations
    for each h in H do        // k iterations, O(1) for constant k
      B[h(s)] = 1;            // assume hashing is O(1)
    done
  done
end

The whole build takes Θ(|S|) = Θ(n) time (for a constant number k of hash functions) and Θ(|B|) = Θ(m) space.


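Searching in vector B

How to search for y: the filter answers "y ∈ S" according to the rule

$y \in S \iff \forall i = 1, \dots, k.\; b_{h_i(y)} = 1$

[Figure: y is hashed by the k hash functions into B; one of the probed bits is 0, hence y ∉ S.]

S is no longer needed once B is built.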


procedure search
begin
  for each h in H do                          // k iterations, O(1) for constant k
    if B[h(y)] = 0 then return "not found"
  done
  return "found"
end

The whole search takes O(k) = O(1) time and O(1) space.
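The build and search procedures can be put together in a few lines. The following is a minimal sketch rather than a reference implementation: the class name `BloomFilter`, the use of salted SHA-256 to simulate k independent hash functions, and the 0-based indexing (the slides index B from 1 to m) are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit vector B of m bits probed by k (simulated) hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Assumption: k hash functions are simulated by salting SHA-256
        # with the function number; indexes are 0-based, in [0, m).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # A single 0 bit proves absence; all 1s only means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


def build(items, m: int, k: int) -> BloomFilter:
    """The build procedure: set B[h(s)] = 1 for every s in S and h in H."""
    bf = BloomFilter(m, k)
    for s in items:
        bf.add(s)
    return bf


bf = build(["TTA", "TCT", "ATA"], m=12, k=3)
print("TTA" in bf)  # inserted elements are always found
```

Note that the filter never stores the elements themselves, which is why S can be discarded after the build.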


When computing $\chi_S(y)$ it is easy to notice that:

1. if $\exists i.\; B_{h_i(y)} = 0$ then $\chi_S(y) = 0$, thus y ∉ S
2. if $\forall i.\; B_{h_i(y)} = 1$ then $\chi_S(y) = 1$, thus y ∈ S

Of course sentence 1 is true.

What about sentence 2?


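The real problem of Bloom Filters

Definition

The main problem is false positives: there may exist an $x_j \neq y$ such that $h_{\dots}(x_j) = h_{\dots}(y)$ for some of the hash functions.

[Figure: y ∉ S hashes to a bit of B that was already set to 1 by some $x_j$ ∈ S.]

Sentence 2 is not always true.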


What we compute

What may happen with a false positive?

- we say "y ∈ S" even if this is false
- we should say "y ∈ S, probably"

Can we compute this probability?


Example

Setting: a bit vector B with 12 positions (indexed 1 to 12), initially all 0; three hash functions h1, h2, h3 defined on 3-letter DNA strings (listed in the transcript as ACGATACGATTAATACGCAAATCT); S = {TTA, TCT, ATA}.

[Figure: table of the h1, h2, h3 values of each string, with arrows into B. The per-string values are not reliably recoverable from the transcript.]

Example: building B

- Inserting an element sets to 1 the three bits it hashes to.
- While inserting we meet a collision: h1(TCT) = h2(TTA) = 9. At this point we may have introduced a false positive in B10.
- After all three insertions we may have introduced two false positives, in B10 and B11.

Example: searching into B

- CGA ∈ S? NO, because $B_{h_3(CGA)} = 0$.
- AAA ∈ S? YES, but this is a false positive. All three bits probed for AAA were set by elements of S:
  - h1(TTA) = h2(AAA) = 1
  - h2(ATA) = h3(AAA) = 2
  - h2(TCT) = h1(AAA) = 4


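Probability of a false positive

Assume that the hash functions are perfectly random.

After the build, the probability that a given bit is still 0 is

$P(b_i = 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$

The probability of a false positive is

$\left(1 - e^{-kn/m}\right)^k = (1 - p)^k = \varepsilon$

Other formulations are asymptotically equivalent.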
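The two expressions for p and the resulting ε can be checked numerically. The sketch below takes the "perfectly random" assumption literally by drawing kn uniformly random positions; the function names, the parameter values and the seed are illustrative choices, not part of the slides.

```python
import math
import random


def zero_fraction_model(m: int, n: int, k: int):
    """Return (exact, approximate) probability that a bit of B is still 0."""
    exact = (1 - 1 / m) ** (k * n)
    approx = math.exp(-k * n / m)
    return exact, approx


def simulate_zero_fraction(m: int, n: int, k: int, seed: int = 0) -> float:
    """Set k*n uniformly random bits and measure the fraction left at 0."""
    rng = random.Random(seed)
    bits = [0] * m
    for _ in range(k * n):
        bits[rng.randrange(m)] = 1
    return bits.count(0) / m


m, n, k = 10_000, 1_000, 7
exact, approx = zero_fraction_model(m, n, k)
observed = simulate_zero_fraction(m, n, k)
epsilon = (1 - approx) ** k  # false positive probability (1 - p)^k

print(f"p: exact={exact:.4f} approx={approx:.4f} observed={observed:.4f}")
print(f"epsilon = {epsilon:.4f}")
```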


Optimizing the number of hash functions

- a higher k gives more chances to find a 0-bit for y ∉ S
- a lower k increases the fraction of 0-bits in B

Minimizing ε as a function of k gives

$\tilde{k} = \ln 2 \cdot (m/n)$

With this choice p = 0.5, and ε depends only on the ratio m/n:

$\varepsilon = (0.5)^{\tilde{k}} = (0.6185)^{m/n}$
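The slides state the optimum without proof. One standard way to obtain it, sketched here under the approximation $p = e^{-kn/m}$ and treating k as a real number, is to rewrite ln ε in terms of p:

```latex
\begin{align*}
\ln \varepsilon &= k \,\ln\!\left(1 - e^{-kn/m}\right),
  \qquad p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\,\ln p \\
\ln \varepsilon &= -\frac{m}{n}\,\ln p \,\ln(1 - p) \\
\frac{d}{dp}\bigl[\ln p \,\ln(1 - p)\bigr]
  &= \frac{\ln(1 - p)}{p} - \frac{\ln p}{1 - p} = 0
  \;\Longleftrightarrow\; p = \tfrac{1}{2} \\
\tilde{k} &= \frac{m}{n}\,\ln 2,
  \qquad \varepsilon = \left(\tfrac{1}{2}\right)^{\tilde{k}}
  = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}
\end{align*}
```

The product ln p · ln(1 − p) is symmetric around p = 1/2 and vanishes at both ends of (0, 1), so the stationary point is the maximum of the product and therefore the minimum of ε. In practice k must be an integer, so the nearest integer to the optimum is used.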


How big should the B vector be?

It depends on the ε value we want: given n, we fix m (values below assume the optimal k).

m      ε       %
n      0.62    62%
2n     0.38    38%
5n     0.09    9%
10n    0.008   0.8%

m = O(n) is generally a good choice.
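The table can be recomputed from the formulas for the optimal k and for ε. This is a small sketch with hypothetical function names; it keeps k real-valued, whereas an implementation would round it to an integer and get slightly different rates.

```python
import math


def optimal_k(m: int, n: int) -> float:
    """Real-valued optimum k = ln 2 * (m / n)."""
    return math.log(2) * m / n


def false_positive_rate(m: int, n: int, k: float) -> float:
    """epsilon = (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


n = 1_000
for ratio in (1, 2, 5, 10):
    m = ratio * n
    k = optimal_k(m, n)
    eps = false_positive_rate(m, n, k)
    print(f"m = {ratio:>2}n  k = {k:5.2f}  eps = {eps:.4f}  ({eps:.1%})")
```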


Bloom Filters vs. hash functions

                hash functions      Bloom Filters
build time      Θ(n + n log n)      Θ(n)
space needed    Θ(n log n)          Θ(m)
search time     O(log n)            O(1)
ε value         1/n                 (1 − p)^k

Hash functions are Bloom Filters with k = 1.


Bloom Filters tricks

Union by OR
1. we have sets S1, S2 and Bloom Filters $B^1$, $B^2$
2. suppose m1 = m2 and the same hash functions
3. just OR the bits: $B^{12}_i = B^1_i \lor B^2_i$

Halving the size
1. suppose $m = 2^\alpha$
2. OR together the two halves of B
3. when hashing, mask out the high-order bit
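Both tricks can be verified on a small filter. In this sketch the helper names, the salted SHA-256 hashing and the 0-based indexes are assumptions; m is a power of two so that reducing an index modulo m/2 is the same as masking its high-order bit.

```python
import hashlib


def positions(item: str, k: int, m: int):
    """k indexes in [0, m); m is assumed to be a power of two."""
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        out.append(int.from_bytes(digest, "big") & (m - 1))
    return out


def build(items, k: int, m: int):
    bits = [0] * m
    for x in items:
        for p in positions(x, k, m):
            bits[p] = 1
    return bits


def contains(bits, item: str, k: int) -> bool:
    return all(bits[p] for p in positions(item, k, len(bits)))


def union(b1, b2):
    """Filter of S1 ∪ S2: bitwise OR of two filters with equal m and hashes."""
    assert len(b1) == len(b2)
    return [x | y for x, y in zip(b1, b2)]


def halve(bits):
    """OR the upper half onto the lower half; lookups then use m/2."""
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]


k, m = 3, 64
b1 = build(["TTA", "TCT"], k, m)
b2 = build(["ATA"], k, m)
b12 = union(b1, b2)
small = halve(b12)

print(b12 == build(["TTA", "TCT", "ATA"], k, m))         # True
print(small == build(["TTA", "TCT", "ATA"], k, m // 2))  # True
```

Halving keeps every inserted element findable but raises the false positive rate, since the same n elements now share half as many bits.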


Summary

- we have a tradeoff between space and false positives
- the ε value is computable (and constant)
- we use the "abstraction" provided by hash functions on $x_i$ ∈ S
- we approximate the characteristic function
- we have an easy-to-code data structure
- we started from the Membership Problem, and we solve this one:

  "Handle massive data sets to support membership queries using a compact data structure"

What else could we want?


Why variants

- extend Bloom Filters to multisets
  - Spectral Bloom Filter, Cohen and Matias [2003]
- compute almost any function
  - Bloomier Filter, Chazelle et al. [2004]
- something else?
  - other authors

These are recent results; let us give a brief overview.


Spectral Bloom Filters (SBF)

We extend Bloom Filters to multisets.

Definition

$M = \langle S, f_x \rangle$ is a multiset, where

- S is a set
- $f_x$ is a function giving the number of occurrences of x in M

Example: M = ⟨{A=2, B=1, C=2}, $f_x$⟩, with |M| = 5, $f_A = f_C = 2$, $f_B = 1$.


Main features

- space usage is slightly larger
- performance is generally better
- insertions are always possible, deletions are not
- can be built incrementally for streaming data
- we query for values $f_x > T$, with T a threshold
- with T = 0 the query is a membership test
- there are tricks for SBFs too


SBF

- the vector B is replaced by a vector of counters $C_1, C_2, \dots, C_m$
- $C_i$ is the sum of the $f_x$ values of every x ∈ S mapping to i
- as always, approximations of $f_x$ are stored in $C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}$
- thus, to estimate $f_x$, we take

$m_x = \min\{C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}\}$

- $m_x$ is the basic estimator, called Minimum Selection (MS)


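The Minimum Selection

[Figure: a row of counters ..., $C_{i-1}$, $C_i$, ..., $C_j$, $C_{j+1}$, ... receiving the contributions $f_x$, $f_y$ and $f_z$; x and y collide on $C_{i-1}$.]

$h_{\dots}(y) = h_{\dots}(x) = i - 1$

- $C_{i-1}$ is not a good approximation of $f_x$ (nor of $f_y$)
- $C_i$ is an exact value of $f_x$
- $C_{j+1}$ is an exact value of $f_z$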


Insertion and Deletion

Insertion is simple: increase each counter by 1

  ...
  for each h in H do              // O(1)
    C[h(x)] = C[h(x)] + 1;
  done
  ...

Deletion is simple: decrease each counter by 1

Search for an element x: compute the Minimum Selection $m_x$
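A counter-based filter with the Minimum Selection estimator can be sketched as below. The class name, the salted SHA-256 hashing and the 0-based indexes are illustrative assumptions; the sketch also assumes that only items that were actually inserted are ever deleted.

```python
import hashlib


class SpectralBloomFilter:
    """Vector of m counters; lookups use Minimum Selection (MS)."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256.
        out = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            out.append(int.from_bytes(digest, "big") % self.m)
        return out

    def insert(self, item: str) -> None:
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item: str) -> None:
        # Only valid for items that were previously inserted.
        for p in self._positions(item):
            self.counters[p] -= 1

    def estimate(self, item: str) -> int:
        """m_x: the minimum of the counters x maps to."""
        return min(self.counters[p] for p in self._positions(item))


# The multiset from the definition: f_A = 2, f_B = 1, f_C = 2.
truth = {"A": 2, "B": 1, "C": 2}
sbf = SpectralBloomFilter(m=50, k=3)
for item, freq in truth.items():
    for _ in range(freq):
        sbf.insert(item)

print(all(sbf.estimate(x) >= f for x, f in truth.items()))  # never underestimates
```

Collisions can only add to a counter, so the estimate is an upper bound on the true frequency; it is exact whenever at least one of the k counters is collision-free.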


On the error of SBF

The error is the same ε as for Bloom Filters.

Theorem

For all x, $f_x \le m_x$. Furthermore, $f_x \neq m_x$ with probability

$E_{SBF} = \varepsilon \approx (1 - p)^k$

Proof.

With no collisions, $m_x = f_x$. With collisions, $m_x \ge f_x$, because a collision can only add to a counter; $m_x < f_x$ cannot happen. The event $f_x \neq m_x$ is "all counters of x have a collision", that is, a "false positive".


Implementing an SBF: challenges

There are mainly two challenges:

1. the vector of counters
   - computational complexity of random access, insertion and deletion
2. performance
   - allow insertion/deletion while keeping $E_{SBF}$ low


Solving Problem 2 with Minimal Increase (MI)

We minimize redundant insertions.

Minimal Increase principle

When inserting an element x, increase only the counters that equal $m_x$. Each lookup returns the value $m_x$.

We get the inequality $E_{SBF} \le \varepsilon$.


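Minimal Increase: example of increase

Insertion is always possible.

[Figure: x maps to three counters holding 1, 3, 2, so $m_x = 1$. After "insert x" only the minimum counter is increased, giving 2, 3, 2 and $m_x = 2$. After a second "insert x" both counters equal to the minimum are increased, giving 3, 3, 3 and $m_x = 3$.]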


Minimal Increase: example of decrease

Deletion may introduce false negatives.

[Figure: x maps to three counters holding 1, 1, 1 ($m_x = 1$); y shares one of them and its other counters hold 0 ($m_y = 0$). "insert y" increases only y's minimum counters and leaves the shared counter at 1. "delete y" then decreases all of y's counters, including the shared one, so that $m_x = 0 = m_y$.]

We lie, saying "x ∉ S".

MI does not allow deletion.
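Both Minimal Increase examples can be replayed in code. In this sketch the class name is hypothetical and the hash positions are fixed by hand through a lookup table instead of being computed, so that the collision between x and y is explicit; `delete` is the naive "decrease every counter" rule, shown only to reproduce the failure.

```python
class MinimalIncreaseSBF:
    """Counters updated with the Minimal Increase (MI) rule."""

    def __init__(self, m: int, positions):
        self.counters = [0] * m
        self.positions = positions  # callable: item -> list of counter indexes

    def estimate(self, item) -> int:
        return min(self.counters[p] for p in self.positions(item))

    def insert(self, item) -> None:
        # Increase only the counters currently equal to the minimum m_x.
        m_x = self.estimate(item)
        for p in set(self.positions(item)):
            if self.counters[p] == m_x:
                self.counters[p] += 1

    def delete(self, item) -> None:
        # Naive deletion: decrease every counter of the item (unsafe with MI).
        for p in set(self.positions(item)):
            self.counters[p] -= 1


# Example of increase: counters 1, 3, 2 become 2, 3, 2 and then 3, 3, 3.
inc = MinimalIncreaseSBF(3, lambda item: [0, 1, 2])
inc.counters = [1, 3, 2]
inc.insert("x")
inc.insert("x")
print(inc.counters)  # [3, 3, 3]

# Example of decrease: x and y share counter 2 (hand-picked positions).
TABLE = {"x": [0, 1, 2], "y": [2, 3, 4]}
sbf = MinimalIncreaseSBF(5, TABLE.__getitem__)
sbf.insert("x")            # counters: [1, 1, 1, 0, 0]
sbf.insert("y")            # only y's zero counters grow: [1, 1, 1, 1, 1]
print(sbf.estimate("x"))   # 1
sbf.delete("y")            # the shared counter drops to 0
print(sbf.estimate("x"))   # 0: a false negative for x
```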


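Solving Problem 2 with Recurring Minimum (RM)

Recurring Minimum: definition

[Figure: x contributes $f_x$ to three counters and z contributes $f_z$ to two; the minimum $m_x$ appears in more than one of x's counters, while the minimum $m_z$ appears in only one of z's.]

- x has a Recurring Minimum (RM)
- z has a Single Minimum (SM)

"An element has an RM iff more than one of its counters holds its MS value."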


We identify Bloom Errors and handle them.

Recurring Minimum: principle

- For an item x with an RM we use $m_x$ as the estimator: $E_{SBF} < \varepsilon$
- For items with a single minimum we use a secondary SBF, with $|SBF_2| \ll |SBF_1|$: $E_{SBF_2} \ll \varepsilon$

The improvements are remarkable.


Recurring Minimum: insertion

Insertion handles potential future errors:

1. increase(SBF1, x)
2. if x has an RM in SBF1, stop
3. look for x in SBF2
   1. if x ∈ SBF2, increase(SBF2, x)
   2. if x ∉ SBF2, increase(SBF2, search(SBF1, x)), i.e. add x to SBF2 with its minimum value from SBF1

[Figure: two "insert x" examples. In one, x is left with a single minimum (SM) in SBF1 and its value is recorded in SBF2, whose counters were 0 before. In the other, x has a recurring minimum (RM) after the insertion and SBF2 is not touched.]


Recurring Minimum: lookup and deletion

Lookup looks, if needed, in both SBFs:

1. if x has an RM in SBF1, return it
2. otherwise let $m_x^2$ be the value of x in SBF2
   1. if $m_x^2 > 0$, return it
   2. otherwise return the minimum value of x in SBF1

Deletion is the reverse of insertion:

1. decrease(SBF1, x)
2. if x has an SM in SBF1, decrease(SBF2, x)

As insertion acts on both SBFs, deletion cannot create false negatives.
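The insertion, lookup and deletion rules of the Recurring Minimum method can be sketched as follows. The class name and the hand-picked position tables are hypothetical, and one detail is an assumption not stated on the slides: deletion skips the secondary filter when x's minimum there is already 0, to avoid negative counters.

```python
class RecurringMinimumSBF:
    """Primary counters c1 plus a smaller secondary vector c2."""

    def __init__(self, m1: int, m2: int, pos1, pos2):
        self.c1 = [0] * m1
        self.c2 = [0] * m2
        self.pos1 = pos1  # callable: item -> indexes into c1
        self.pos2 = pos2  # callable: item -> indexes into c2

    def _values1(self, item):
        return [self.c1[p] for p in self.pos1(item)]

    def _min2(self, item) -> int:
        return min(self.c2[p] for p in self.pos2(item))

    def _has_recurring_minimum(self, item) -> bool:
        values = self._values1(item)
        return values.count(min(values)) > 1

    def insert(self, item) -> None:
        for p in self.pos1(item):
            self.c1[p] += 1
        if self._has_recurring_minimum(item):
            return
        # Single minimum: already in SBF2 -> +1, else add with SBF1's minimum.
        step = 1 if self._min2(item) > 0 else min(self._values1(item))
        for p in self.pos2(item):
            self.c2[p] += step

    def lookup(self, item) -> int:
        values = self._values1(item)
        if values.count(min(values)) > 1:
            return min(values)
        m2 = self._min2(item)
        return m2 if m2 > 0 else min(values)

    def delete(self, item) -> None:
        for p in self.pos1(item):
            self.c1[p] -= 1
        # Assumption: leave SBF2 alone if x's minimum there is already 0.
        if not self._has_recurring_minimum(item) and self._min2(item) > 0:
            for p in self.pos2(item):
                self.c2[p] -= 1


POS1 = {"x": [0, 1, 2], "y": [3, 4, 5]}
POS2 = {"x": [0, 1, 2], "y": [1, 2, 3]}
sbf = RecurringMinimumSBF(6, 4, POS1.__getitem__, POS2.__getitem__)
sbf.c1[:3] = [1, 6, 6]   # state left by earlier (colliding) insertions
sbf.insert("x")          # single minimum: x goes to SBF2 with value 2
sbf.insert("y")          # recurring minimum: SBF2 untouched
print(sbf.c1, sbf.c2)    # [2, 7, 7, 1, 1, 1] [2, 2, 2, 0]
print(sbf.lookup("x"), sbf.lookup("y"))  # 2 1
```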


Methods comparison: MS vs. MI vs. RM

error rates            MI ≺ RM ≺ MS = ε
space overhead         MI ≺ MS ≺ RM
complexity             MS ≺ MI ≺ RM
insertion/deletion     MS = RM ≺ MI


Solving Problem 1 with an integer vector

Each counter fits in one word, for example a 4-byte word.

The m counters take 4m bytes. To get ε < 0.01 (1%) we take m = 10n, so the m counters take 40n bytes.

With $n = 2^{20}$ (slightly more than $10^6$ objects) the counters need 40 MB!

- a vector of integers is too big
- do we need to count up to $2^{32} - 1$?


Solving Problem 1 with a static bit vector

Suppose $\forall i.\; f_i < 10$, thus $C_j < 10 + \alpha$, with α depending on collisions.

Use $\lceil \log_2 10 \rceil = 4$ bits (= 0.5 bytes) for each counter. To get ε < 0.01 (1%) we take m = 10n, so the m counters take 5n bytes.

With $n = 2^{20}$ we have a 5 MB static vector!

This static vector does not allow insertion or deletion.
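The two space figures (40 MB with one 4-byte word per counter, 5 MB with 4-bit counters) follow from the same arithmetic. The helper name `counter_bytes` is hypothetical, and sizes are reported in units of $2^{20}$ bytes.

```python
import math


def counter_bytes(n: int, bits_per_counter: int, counters_per_element: int = 10) -> float:
    """Bytes needed by m = counters_per_element * n counters."""
    m = counters_per_element * n
    return m * bits_per_counter / 8


n = 2 ** 20
word_vector = counter_bytes(n, 32)                              # 4-byte counters
packed_vector = counter_bytes(n, math.ceil(math.log2(10)))      # 4-bit counters

print(word_vector / 2 ** 20, packed_vector / 2 ** 20)  # 40.0 5.0
```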


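Solving Problem 1 with String Array Index

- use only the number of bits (per counter) strictly needed
- use some slack bits
  - fix a value α > 0
  - add αm bits to the vector, one every 1/α items
- each counter $C_i$ uses $\lceil \log_2(C_i) \rceil$ bits
- each counter counts up to $2^{\lceil \log_2(C_i) \rceil}$ + ..
- the C vector takes

$\left(\sum_{i=1}^{m} \lceil \log_2 C_i \rceil\right) + \alpha m = N$ bits

[Figure: the counters $C_1, C_2, \dots, C_m$ stored back to back, counter $C_i$ occupying $\lceil \log_2 C_i \rceil$ bits, with one slack bit inserted after every 1/α counters (after $C_{1/\alpha}$, $C_{2/\alpha}$, ...).]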


The String Array Index: main idea

- first level: pointers to subsequences of the SBF
  - a Coarse Offset Vector pointing to groups of (log N) items
  - there are m / log N such pointers
- second level: may be
  - another Coarse Offset Vector of pointers to subsequences
  - a simple vector of offsets


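The String Array Index: graphic

[Figure: the bit string S indexed by a Coarse Offset Vector; each group is indexed in turn either by a second-level Coarse Offset Vector (C.O.V.) or directly by Offset Vectors.]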


The String Array Index: performance

Two levels of pointers to sub-sequences.

Let $N = \sum_{i=1}^{m} \left(\lceil \log_2(C_i) \rceil + \alpha_i\right)$.

Theorem

An SAI of size o(N) + O(m) bits can be built in O(m) time, supporting access to sub-sequences in O(1) time.

Theorem

An SBF of size N + o(N) + O(m) bits can be built in O(N) time, supporting lookup in O(1) time. Furthermore, each update takes O(1) amortized time.


SBF tricks

Merging by addition
1. we have two sets S1, S2 and two SBFs $C^1$, $C^2$
2. suppose m1 = m2 and the same hash functions
3. just sum the counters: $C^{12}_i = C^1_i + C^2_i$
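Merging by addition can be checked against a filter built directly from the combined multiset. The position table below is hand-picked for illustration (k = 2, m = 5) and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical hash values: item -> the two counters it maps to.
POSITIONS = {"A": [0, 3], "B": [1, 3], "C": [2, 4]}


def build_counters(multiset, m: int):
    """Add f_x to every counter that x maps to."""
    counters = [0] * m
    for item, freq in multiset.items():
        for p in POSITIONS[item]:
            counters[p] += freq
    return counters


def merge(c1, c2):
    """Counter-wise sum of two SBFs with equal m and equal hash functions."""
    if len(c1) != len(c2):
        raise ValueError("the two SBFs must have the same number of counters")
    return [a + b for a, b in zip(c1, c2)]


def estimate(counters, item) -> int:
    return min(counters[p] for p in POSITIONS[item])


m1 = Counter({"A": 2, "B": 1})
m2 = Counter({"A": 1, "C": 2})
c12 = merge(build_counters(m1, 5), build_counters(m2, 5))

print(c12)                                  # [3, 1, 2, 4, 2]
print(c12 == build_counters(m1 + m2, 5))    # True
print(estimate(c12, "A"))                   # 3
```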


Bloomier Bloom Filters (BBF)

We can compute any function f using a BBF:

- with some constraints on f
- with the same tradeoff inherited from Bloom Filters

"We associate values with a subset of the domain elements."


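Which function

$f : D = \{0, \dots, N-1\} \to R = \{\bot, 0, \dots, 2^r - 1\}$

- the values are computed on a subset S ⊆ D
- ⊥ ∈ R

[Figure: f maps S ⊆ D into f(S) ⊆ R error free; for the rest of D the answer is correct with probability arbitrarily close to 1.]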


Main features

- query is O(1)
- space requirement is O(nr)
- can be generalized to handle dynamic updates
  - the function can be updated
  - the space is unchanged
- we query values of f
- we may change f(x) for x ∈ S, but S is immutable


Main idea

A false positive in a BBF: "returning a result when the key is not in the map".

We give a simple idea of a BBF:

- the Bloom Filter cascade
- it can be formally generalized


A near-optimal and simple BBF

- the possible values are {0, 1}
  - $A_0$ is a BF containing the keys mapping to 0
  - $B_0$ is a BF containing the keys mapping to 1
- we will build many pairs $(A_i, B_i)$ (here is the cascade)
- we make a cross search
- we search as deep as we need

What may happen when searching?


We start looking in $(A_0, B_0)$:

- if the key is in neither
  - it is not in the map (surely)
- if it is in $A_0$ but not in $B_0$
  - it does not map to 1 (surely)
  - it does map to 0 (probably)
- if it is in $A_0$ and in $B_0$
  - which one lies? (false positive)
  - we have to go recursively into $(A_1, B_1)$

$A_1$ contains the keys mapping to 0 that are false positives in $B_0$.
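The cascade for values in {0, 1} can be sketched as below. This is an illustration under several assumptions: the names `Bloom`, `build_cascade` and `lookup` are hypothetical, each level uses its own salt for hashing, and the cascade is cut at `max_levels`, with any keys still ambiguous at that depth kept in an exact dictionary.

```python
import hashlib


class Bloom:
    """Small Bloom Filter with a per-level salt (assumed hashing scheme)."""

    def __init__(self, keys, level: int, tag: str, bits_per_key: int = 8, k: int = 3):
        self.m = max(8, bits_per_key * len(keys))
        self.k = k
        self.salt = f"{tag}{level}"
        self.bits = bytearray(self.m)
        for key in keys:
            for p in self._positions(key):
                self.bits[p] = 1

    def _positions(self, key):
        for i in range(self.k):
            d = hashlib.sha256(f"{self.salt}:{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(d, "big") % self.m

    def __contains__(self, key) -> bool:
        return all(self.bits[p] for p in self._positions(key))


def build_cascade(mapping, max_levels: int = 16):
    """Build pairs (A_i, B_i); level i+1 stores the false positives of level i."""
    zeros = [key for key, v in mapping.items() if v == 0]
    ones = [key for key, v in mapping.items() if v == 1]
    levels = []
    while (zeros or ones) and len(levels) < max_levels:
        a = Bloom(zeros, len(levels), "A")
        b = Bloom(ones, len(levels), "B")
        levels.append((a, b))
        # Keys that also appear in the opposite filter stay ambiguous.
        zeros, ones = [x for x in zeros if x in b], [x for x in ones if x in a]
    leftovers = {key: 0 for key in zeros}
    leftovers.update({key: 1 for key in ones})
    return levels, leftovers


def lookup(cascade, key):
    """Return 0, 1, or None (the slides' ⊥) for keys judged not in the map."""
    levels, leftovers = cascade
    for a, b in levels:
        in_a, in_b = key in a, key in b
        if in_a and not in_b:
            return 0
        if in_b and not in_a:
            return 1
        if not in_a and not in_b:
            return None
    return leftovers.get(key)


mapping = {f"key{i}": i % 2 for i in range(50)}
cascade = build_cascade(mapping)
print(all(lookup(cascade, key) == v for key, v in mapping.items()))  # True
```

Keys of the map are always answered correctly, because a Bloom Filter has no false negatives and every ambiguous key is pushed one level down. A key outside the map may still receive 0 or 1, which is the false positive described above.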


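A BF cascade

[Figure: the pair $(A_0, B_0)$; the false positives of each filter feed the next pair $(A_1, B_1)$, whose false positives feed the following pair, and so on.]

- $|A_{i+1}| \ll |A_i|$
- the average search is O(1)
- the first pairs are generally enough
- the total space is independent of n
- the first pair occupies most of the space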


The general idea

- results are binary-coded
  - v ∈ R is coded with $\beta_v \in \{0, 1\}^q$
- for each bit of $\beta_v$
  - we use the simple BBF
- what we get is
  - space slightly larger than the space for 2q BFs
  - lookup is Θ(q)
  - build is O(n log n)
  - $E_{BBF}$ is proportional to $2^{-q}$


- the authors use a table T of coded values
- T has m locations
- we have, as always, k hash functions
- we use a masking value M to reduce $E_{BBF}$

$\beta_{f(x)} = M \oplus \bigoplus_{i=1}^{k} T[h_i(x)]$

and if x ∉ S

$P[\mathrm{lookup}(x, T) = \bot] \ge 1 - \frac{k}{2^q}$


Other "special" Bloom Filters

- Counting Bloom Filters: Broder et al. [1998]
- Compressed Bloom Filters: M. Mitzenmacher [2002]
- Attenuated Bloom Filters: Rhea, Kubiatowicz [2002]
- Compact Approximator of Lattice Functions: Boldi, Vigna [2004]


Where they are really used

- routing
  - probabilistic location and routing
  - shortest-path distance information
- proxy
  - Web proxy cache in Squid
  - distributed caching
- peer-to-peer
  - summarizing the contents
- spell checking
  - the original B. Bloom idea
- ...


References: foundations

- B. Bloom. "Space/time tradeoffs in hash coding with allowable errors". CACM, 13(7): 422-426, 1970.
- Saar Cohen, Yossi Matias. "Spectral bloom filters". ACM SIGMOD '03, 2003.
- B. Chazelle, J. Kilian, R. Rubinfeld, A. Tal. "The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables". Proceedings of 15th SODA, 30-39, 2004.
- Bose, Guo, Kranakis, Maheshwari, Morin, Morrison, Smid, Tang. "On the false-positive rate of Bloom Filters". School of Computer Science, Carleton University, 2004.


References: extras

- M. Mitzenmacher. "Compressed Bloom Filters". In Proceedings of 20th ACM SIGACT-SIGOPS, 144-150, 2002.
- P. Boldi, S. Vigna. "Compact Approximation of Lattice Functions with Applications to Large-Alphabet Text Search". Dipartimento di Scienze dell'Informazione, Università di Milano, 2004.
- A. Broder, M. Mitzenmacher. "Network Applications of Bloom Filters: A Survey". In Proceedings of 40th Allerton Conference, 636-646, 2002.
