advanced algorithms for massive datasets

Post on 22-Jan-2016

42 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Advanced Algorithms for Massive Datasets. The power of “ failing ”. 2. TTT. Not perfectly true but. Opt k = 5.45. m / n = 8. We do have an explicit formula for the optimal k. Other advantage: no key storage. Crawling. - PowerPoint PPT Presentation

TRANSCRIPT

Advanced Algorithmsfor Massive Datasets

The power of “failing”

TTT 2

Not perfectly true but...

0

0,01

0,02

0,03

0,04

0,05

0,06

0,07

0,08

0,09

0,1

0 1 2 3 4 5 6 7 8 9 10

Fa

lse

po

siti

ve

rate

Hash functions

m/n = 8Opt k = 5.45...

We do have an

explicit formula

for the optimal k

Other advantage: no key storage

Crawling

What data structures should we use to keep

track of the visited URLs of a crawler?

URLs are long

Check should be very fast

No care about small errors (≈ page not crawled)

Bloom Filter

over crawled URLs

Anti-virus detection

D is a dictionary of virus-checksum of some given length z. For each position i, check…

Brute-force check: O( |D| * |F| ) time Trie check: O( z * |F| ) time Better Solution ?

Build a BF on D.

Check T[i,i+z-1] є D, if BF answers YES

then “warn the user” or explicitly scan D

FVji i+z

O(k*|F|)

or even better...

Upper bounds

Upper bounds

Recurring minimum forimproving the estimate

+ 2 SBF

top related