stanford university ://courses.corelab.ntua.gr/.../course/section/712/ch04-streams2.… · more...

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org


Post on 15-Aug-2020


Page 1: Stanford University ://courses.corelab.ntua.gr/.../course/section/712/ch04-streams2.… · More algorithms for streams: (1)Filtering a data stream: Bloom filters Select elements with


More algorithms for streams:
(1) Filtering a data stream: Bloom filters
    Select elements with property x from stream
(2) Counting distinct elements: Flajolet-Martin
    Number of distinct elements in the last k elements of the stream
(3) Estimating frequency moments: AMS method
    Estimate std. dev. of last k elements
(4) Counting frequent itemsets: Exponentially Decaying Windows

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2


Given a set of keys S that we want to filter
Create a bit array B of n bits, initially all 0s
Choose a hash function h with range [0,n)
Hash each member s ∈ S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1
Hash each element a of the stream and output only those that hash to a bit that was set to 1
  Output a if B[h(a)] == 1
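The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the hash function `h` (built here on SHA-256), the helper names `build_bit_array` and `filter_stream`, and the example keys are all hypothetical and not part of the slides.

```python
import hashlib


def h(key: str, n: int) -> int:
    """Hypothetical hash function with range [0, n), built on SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def build_bit_array(S, n: int):
    """Create B with n bits, all 0, then set B[h(s)] = 1 for every s in S."""
    B = [0] * n
    for s in S:
        B[h(s, n)] = 1
    return B


def filter_stream(stream, B):
    """Yield only the stream elements a with B[h(a)] == 1."""
    n = len(B)
    for a in stream:
        if B[h(a, n)] == 1:
            yield a


# Made-up keys, for illustration only.
S = {"alice@example.com", "bob@example.com"}
B = build_bit_array(S, n=64)
stream = ["alice@example.com", "mallory@example.com", "bob@example.com"]
passed = list(filter_stream(stream, B))
```

Every key in S is guaranteed to pass. A key outside S passes only if it happens to hash to a bit that some member of S already set.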


Creates false positives but no false negatives
  If the item is in S we surely output it; if not, we may still output it

[Figure: an item is hashed by hash func h into bit array B (0010001011000). If it hashes to a bucket set to 1, output the item since it may be in S: the item hashes to a bucket that at least one of the items in S hashed to. If it hashes to a bucket set to 0, drop the item: it is surely not in S.]


We have m darts, n targets
What is the probability that a target gets at least one dart?

Probability some target X is not hit by a dart: $1 - 1/n$
Probability at least one of the m darts hits target X: $1 - (1 - 1/n)^{n(m/n)}$
$(1 - 1/n)^n$ equals $1/e$ as $n \to \infty$
Equivalent: $1 - e^{-m/n}$


Fraction of 1s in the array B = probability of false positive = $1 - e^{-m/n}$

Example: $10^9$ darts, $8 \cdot 10^9$ targets
  Fraction of 1s in B = $1 - e^{-1/8} = 0.1175$
  Compare with our earlier estimate: 1/8 = 0.125
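The arithmetic in this example is easy to check numerically. The sketch below evaluates the approximation 1 - e^(-m/n) for the numbers on the slide and, on a smaller instance with the same m/n ratio, compares it with the exact dart-throwing expression 1 - (1 - 1/n)^m. The function names are hypothetical.

```python
import math


def fraction_of_ones_approx(m: int, n: int) -> float:
    """Approximate fraction of 1s after throwing m darts at n targets."""
    return 1.0 - math.exp(-m / n)


def fraction_of_ones_exact(m: int, n: int) -> float:
    """Exact probability that a given target is hit by at least one of m darts."""
    return 1.0 - (1.0 - 1.0 / n) ** m


# Numbers from the slide: 10^9 darts, 8 * 10^9 targets.
approx = fraction_of_ones_approx(10**9, 8 * 10**9)
print(f"approximation: {approx:.4f}")

# Smaller instance with the same ratio m/n = 1/8, to compare both formulas.
small_exact = fraction_of_ones_exact(100, 800)
small_approx = fraction_of_ones_approx(100, 800)
print(f"exact: {small_exact:.4f}  approximation: {small_approx:.4f}")
```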


Consider: |S| = m, |B| = n
Use k independent hash functions h_1, …, h_k
Initialization:
  Set B to all 0s
  Hash each element s ∈ S using each hash function h_i, set B[h_i(s)] = 1 (for each i = 1, …, k)
  (note: we have a single array B!)

Run-time:
  When a stream element with key x arrives
    If B[h_i(x)] = 1 for all i = 1, …, k then declare that x is in S
      That is, x hashes to a bucket set to 1 for every hash function h_i(x)
    Otherwise discard the element x
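A minimal Python sketch of the initialization and run-time steps follows. It is an illustration, not a reference implementation: the class name `BloomFilter`, its method names, and the trick of deriving the k hash functions by salting SHA-256 with the index i are assumptions made here (the slides only require k independent hash functions).

```python
import hashlib


class BloomFilter:
    """Bloom filter with a single bit array B of n bits and k hash functions."""

    def __init__(self, n: int, k: int):
        self.n = n
        self.k = k
        self.B = [0] * n  # one shared array for all k hash functions

    def _positions(self, key: str):
        # Assumption: h_i(key) is simulated by hashing the pair (i, key).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key: str) -> None:
        """Initialization step: set B[h_i(key)] = 1 for each i."""
        for pos in self._positions(key):
            self.B[pos] = 1

    def might_contain(self, key: str) -> bool:
        """Run-time step: declare key in S only if all k bits are 1."""
        return all(self.B[pos] == 1 for pos in self._positions(key))


bf = BloomFilter(n=1024, k=3)
for s in ["apple", "banana", "cherry"]:  # made-up set S
    bf.add(s)
```

`bf.might_contain("apple")` is always true after the insertions above. For a key that was never added the answer is usually false, but can be a false positive.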


[Figure: false positive prob. (y-axis, 0.02 to 0.2) plotted against the number of hash functions, k (x-axis, 0 to 20)]
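The shape of such a curve can be reproduced numerically. The sketch below assumes the standard Bloom filter estimate (1 - e^(-km/n))^k for the false positive probability with k hash functions, and reuses m = 10^9 and n = 8·10^9 from the earlier example. Neither the formula nor these parameter choices survive in this transcript, so treat both as assumptions.

```python
import math


def false_positive_prob(k: int, m: int, n: int) -> float:
    """Assumed standard estimate: (fraction of 1s after k*m darts) ** k."""
    return (1.0 - math.exp(-k * m / n)) ** k


m, n = 10**9, 8 * 10**9
probs = {k: false_positive_prob(k, m, n) for k in range(1, 21)}
best_k = min(probs, key=probs.get)

for k in (1, 2, best_k, 20):
    print(f"k = {k:2d}  false positive prob. = {probs[k]:.4f}")
```

Under these assumptions the probability first falls as k grows and then rises again, because each extra hash function also sets more bits in B.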


Each element x is placed in bucket k with probability $1/2^k$

E[#elements before bucket k is non-empty] = $2^k$

R = maximum index of non-empty bucket. R is the only information we need to maintain.

Estimated number of distinct elements = $2^R$
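A rough Python sketch of this estimator, assuming the bucket index of an element is the number of trailing zero bits in its hash value. The 32-bit hash derived from SHA-256 and the function names are hypothetical choices made for this illustration.

```python
import hashlib


def h(key: str) -> int:
    """Hypothetical 32-bit hash of a stream element."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def trailing_zeros(v: int, width: int = 32) -> int:
    """Number of trailing zero bits in v (width if v == 0)."""
    if v == 0:
        return width
    return (v & -v).bit_length() - 1


def flajolet_martin(stream) -> int:
    """Maintain only R, the largest bucket index seen, and return 2^R."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(h(a)))
    return 2 ** R


# Made-up stream with 1000 distinct elements, each repeated three times.
stream = [f"user{i}" for i in range(1000)] * 3
estimate = flajolet_martin(stream)
```

Repeating an element never changes R, so the estimate depends only on the set of distinct elements. A single hash function yields a coarse estimate that is always a power of two.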


Very very rough and heuristic intuition why Flajolet-Martin works:
  h(a) hashes a with equal prob. to any of N values
  Then h(a) is a sequence of $\log_2 N$ bits, where a $2^{-r}$ fraction of all a's have a tail of r zeros
    About 50% of a's hash to ***0
    About 25% of a's hash to **00
    So, if we saw the longest tail of r=2 (i.e., item hash ending *100) then we have probably seen about 4 distinct items so far

So, we need to hash about $2^r$ items before we see one with a zero-suffix of length r


$(1 - 2^{-r})^m$
  $1 - 2^{-r}$: prob. that a given h(a) ends in fewer than r zeros
  $(1 - 2^{-r})^m$: prob. that all m hash values end in fewer than r zeros

$(1 - 2^{-r})^m = (1 - 2^{-r})^{2^r (m 2^{-r})} \approx e^{-m 2^{-r}}$

$(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 1$ as $m / 2^r \to 0$

$(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 0$ as $m / 2^r \to \infty$
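A small numeric check of this approximation and its two limits, with r fixed at 10 and m taken far below, equal to, and far above 2^r. The function names and the chosen values of m and r are illustrative assumptions.

```python
import math


def prob_no_tail(m: int, r: int) -> float:
    """Exact probability that none of m hash values ends in r or more zeros."""
    return (1.0 - 2.0 ** -r) ** m


def prob_no_tail_approx(m: int, r: int) -> float:
    """The approximation e^(-m * 2^(-r))."""
    return math.exp(-m * 2.0 ** -r)


r = 10
for m in (2**2, 2**10, 2**14):
    print(f"m = {m:6d}  exact = {prob_no_tail(m, r):.6f}  "
          f"approx = {prob_no_tail_approx(m, r):.6f}")
```

For m much smaller than 2^r the probability of not seeing a tail of length r is close to 1, and for m much larger than 2^r it is close to 0, which is why 2^R tends to land near m.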


k-th moment: $\sum_{i \in A} (m_i)^k$

0th moment = number of distinct elements
  The problem just considered

1st moment = count of the numbers of elements = length of the stream
  Easy to compute

2nd moment = surprise number S = a measure of how uneven the distribution is
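For a stream small enough to count exactly, the three moments can be computed directly from the per-element counts m_i. The sketch below does this for a short made-up stream. The function name `frequency_moment` is hypothetical, and exact counting like this is precisely what a streaming algorithm cannot afford at scale.

```python
from collections import Counter


def frequency_moment(stream, k: int) -> int:
    """Exact k-th moment: sum over distinct elements i of (m_i)^k."""
    counts = Counter(stream)
    return sum(m_i ** k for m_i in counts.values())


# Illustrative stream: a occurs 5 times, b 4 times, c and d 3 times each.
stream = list("abcbdacdabdcaab")

distinct = frequency_moment(stream, 0)   # number of distinct elements
length = frequency_moment(stream, 1)     # length of the stream
surprise = frequency_moment(stream, 2)   # surprise number S
print(distinct, length, surprise)
```

A perfectly even stream of the same length over the same number of elements would have a smaller 2nd moment, and a stream dominated by one element a larger one.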


[Alon, Matias, and Szegedy]


[Figure: a stream of a's and b's, with the running count 1, 2, 3, …, m_a marked under the occurrences of a. Group times by the value seen:
  time t when the last i is seen (c_t = 1)
  time t when the penultimate i is seen (c_t = 2)
  time t when the first i is seen (c_t = m_i)]

m_i … total count of item i in the stream
(we are assuming stream has length n)


(1) The variables X have n as a factor, so keep n separately; just hold the count in X
(2) Suppose we can only store k counts. We must throw some Xs out as time goes on:
  Objective: Each starting time t is selected with probability k/n
  Solution: (fixed-size sampling!)
    Choose the first k times for k variables
    When the nth element arrives (n > k), choose it with probability k/n
    If you choose it, throw one of the previously stored variables X out, with equal probability
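The fixed-size sampling scheme above can be sketched as follows. One loud assumption: the text of this transcript does not spell out how the stored counts are turned into an estimate, so the sketch uses the standard AMS second-moment estimate n·(2·c − 1) per variable, averaged over the k variables. The function name and the `[element, count]` representation of a variable X are also hypothetical.

```python
import random


def ams_second_moment(stream, k: int, rng=None) -> float:
    """Estimate the 2nd moment while storing at most k variables X."""
    rng = rng or random.Random()
    variables = []  # each X is [element, count since X's starting time]
    n = 0           # stream length so far, kept separately from the counts
    for a in stream:
        n += 1
        # Every stored variable that tracks this element sees one more copy.
        for X in variables:
            if X[0] == a:
                X[1] += 1
        if n <= k:
            # Choose the first k times for the k variables.
            variables.append([a, 1])
        elif rng.random() < k / n:
            # Chosen with probability k/n: evict a stored X uniformly at random.
            variables[rng.randrange(k)] = [a, 1]
    if not variables:
        return 0.0
    # Assumed estimator: average of n * (2 * count - 1) over stored variables.
    return sum(n * (2 * c - 1) for _, c in variables) / len(variables)


stream = list("abcbdacdabdcaab")  # made-up stream
exact_when_all_kept = ams_second_moment(stream, k=len(stream))
rough = ams_second_moment(stream, k=3, rng=random.Random(0))
```

When k is at least the stream length, every starting time is kept and the average equals the exact 2nd moment. With smaller k the result is a randomized estimate.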


New Problem: Given a stream, which items appear more than s times in the window?
Possible solution: Think of the stream of baskets as one binary stream per item
  1 = item present; 0 = not present
  Use DGIM to estimate counts of 1s for all items

[Figure: example binary stream for one item, with a window of length N marked: 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0]


Count (some) itemsets in an E.D.W.
  What are currently “hot” itemsets?
  Problem: Too many itemsets to keep counts of all of them in memory

When a basket B comes in:
  Multiply all counts by (1-c)
  For uncounted items in B, create new count
  Add 1 to count of any item in B and to any itemset contained in B that is already being counted
  Drop counts < ½
  Initiate new counts (next slide)


Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to arrival of basket B
  Intuitively: If all subsets of S are being counted this means they are “frequent/hot” and thus S has a potential to be “hot”

Example:
  Start counting S={i, j} iff both i and j were counted prior to seeing B
  Start counting S={i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B
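The per-basket update, including the rule for starting new itemset counts, can be sketched in Python as below. Several details are assumptions made for this illustration and are not stated on the slides: a newly started itemset count begins at 1, only the immediate subsets (one item smaller) are checked, as in the example, and a hypothetical `max_size` parameter caps the itemset size to keep the enumeration small.

```python
from itertools import combinations


def process_basket(counts: dict, basket, c: float, max_size: int = 3) -> dict:
    """Update decaying counts (keyed by frozenset) for one arriving basket."""
    basket = frozenset(basket)
    counted_before = set(counts)  # itemsets counted prior to this basket

    # Multiply all counts by (1 - c).
    for s in counts:
        counts[s] *= (1 - c)

    # Add 1 for every item in the basket, creating counts for uncounted items.
    for item in basket:
        key = frozenset([item])
        counts[key] = counts.get(key, 0.0) + 1.0

    # Add 1 to every already-counted larger itemset contained in the basket.
    for s in counted_before:
        if len(s) > 1 and s <= basket:
            counts[s] += 1.0

    # Drop counts below 1/2.
    for s in [s for s, v in counts.items() if v < 0.5]:
        del counts[s]

    # Start new counts: all immediate subsets were counted before this basket.
    for size in range(2, max_size + 1):
        for combo in combinations(sorted(basket), size):
            s = frozenset(combo)
            if s in counts:
                continue
            subsets = combinations(combo, size - 1)
            if all(frozenset(sub) in counted_before for sub in subsets):
                counts[s] = 1.0  # assumption: initial count of 1
    return counts


counts = {}
process_basket(counts, {"i", "j"}, c=0.1)
pair_after_first = frozenset({"i", "j"}) in counts
process_basket(counts, {"i", "j"}, c=0.1)
pair_after_second = frozenset({"i", "j"}) in counts
```

The pair {i, j} is not counted after the first basket, because neither item had a count before it arrived. It starts being counted with the second basket.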
