Transcript
Page 1: CS60021: Scalable Data Mining Streaming Algorithms

CS60021: Scalable Data Mining

Streaming Algorithms

Sourangshu Bhattacharya

Page 2: CS60021: Scalable Data Mining Streaming Algorithms

Count distinct

Page 3: CS60021: Scalable Data Mining Streaming Algorithms

Streaming problem: distinct count

β€’ Universe is π‘ˆ, number of distinct elements = m, stream size is 𝑛– Example: π‘ˆ = all IP addresses

– IPs can repeat– Want to estimate the number of distinct elements in the

stream

3

10.1.21.10, 10.93.28,1,…..,98.0.3.1,…..10.93.28.1…..

Page 4: CS60021: Scalable Data Mining Streaming Algorithms

Other applications

β€’ Universe = set of all k-grams, stream is generated by document corpus– need number of distinct k-grams seen in corpus

β€’ Universe = telephone call records, stream generated by tuples (caller, callee)– need number of phones that made > 0 calls

4

Page 5: CS60021: Scalable Data Mining Streaming Algorithms

Solutions

5

β€’ Seen 𝑛 elements from stream with elements from π‘ˆ.

β€’ NaΓ―ve solution: 𝑂(𝑛 π‘™π‘œπ‘”|π‘ˆ|) space– Store all elements, sort and count distinct– Store a hashmap, insert if not present

β€’ Bit array: O(|U|) space:– Bits initialized to 1 only if element seen in stream

β€’ Can we do this in less space ?

Page 6: CS60021: Scalable Data Mining Streaming Algorithms

Approximations

β€’ πœ–, 𝛿 βˆ’approximations– Algorithm will use random hash functions– Will return an answer !𝑛 such that

– This will happen with probability 1 βˆ’ 𝛿 over the randomness of the algorithm

6

1 βˆ’ πœ– 𝑛 ≀ !𝑛 ≀ 1 + πœ– 𝑛

Page 7: CS60021: Scalable Data Mining Streaming Algorithms

First effort

β€’ Stream length: 𝑛, distinct elements: π‘šβ€’ Proposed algo: Given space 𝑠, sample 𝑠 items from the

stream– Find the number of distinct elements in this set: #π‘š

– return m = #π‘šΓ— !"

β€’ Not a constant factor approximation– 1,1,1,1,…..1,2,3,4,….,n-1

π‘š βˆ’ 𝑛 + 1

Page 8: CS60021: Scalable Data Mining Streaming Algorithms

Linear Counting

β€’ Bit array 𝐡 of size π‘š, initialized to all zeroβ€’ Hash function β„Ž: π‘ˆ β†’ [π‘š]β€’ When seeing item π‘₯ , set 𝐡 β„Ž π‘₯ = 1

8

Page 9: CS60021: Scalable Data Mining Streaming Algorithms

Linear Counting

β€’ Bit array 𝐡 of size π‘š, initialized to all zeroβ€’ Hash function β„Ž: π‘ˆ β†’ [π‘š]β€’ When seeing item π‘₯ , set 𝐡 β„Ž π‘₯ = 1

β€’ 𝑧# = fraction of zero entries

β€’ Return estimate βˆ’m log($!#)

9

Page 10: CS60021: Scalable Data Mining Streaming Algorithms

Linear Counting Analysis

β€’ Pr[ position remaining 0 ] = 1 βˆ’ )*

+β‰ˆ 𝑒,

!"

β€’ Expected number of positions at zero: E z- = π‘šπ‘’,+/*

β€’ Using tail inequalities we can show this is concentratedβ€’ Typically useful only for π‘š = Θ(𝑛), often useful in

practice

10

Page 11: CS60021: Scalable Data Mining Streaming Algorithms

Flajolet Martin Sketch

Page 12: CS60021: Scalable Data Mining Streaming Algorithms

Flajolet Martin Sketch

β€’ Components– β€œrandom” hash function β„Ž: π‘ˆ β†’ 2β„“ for some large ℓ– β„Ž(π‘₯) is a β„“ βˆ’length bit string– initially assume it is completely random, can relax

β€’ π‘§π‘’π‘Ÿπ‘œ 𝑣 = position of rightmost 1 in bit representation of 𝑣= max{ 𝑖 , 2' 𝑑𝑖𝑣𝑖𝑑𝑒𝑠 𝑣 }

– π‘§π‘’π‘Ÿπ‘œπ‘  10110 = 1, π‘§π‘’π‘Ÿπ‘œπ‘  110101000 = 3

12

Page 13: CS60021: Scalable Data Mining Streaming Algorithms

Flajolet Martin Sketch

Initialize:– Choose a β€œrandom” hash function β„Ž: π‘ˆ β†’ 2β„“

– 𝑧 ← 0Process(x)

– if π‘§π‘’π‘Ÿπ‘œπ‘  β„Ž π‘₯ > 𝑧, z ← π‘§π‘’π‘Ÿπ‘œπ‘ (β„Ž π‘₯ )Estimate:– return 2()*/,

13

Page 14: CS60021: Scalable Data Mining Streaming Algorithms

Example

14

h(.)

0110101

1011010

1000100

1111010

Page 15: CS60021: Scalable Data Mining Streaming Algorithms

Space usage

β€’ We need β„“ β‰₯ 𝐢 log 𝑛 for some 𝐢 β‰₯ 3, say– by birthday paradox analysis, no collisions with high prob

β€’ Sketch : 𝑧 , needs to have only 𝑂(log log 𝑛) bits

β€’ Total space usage = 𝑂(log 𝑛 + log log 𝑛)

15

Page 16: CS60021: Scalable Data Mining Streaming Algorithms

Intuition

β€’ Assume hash values are uniformly distributedβ€’ The probability that a uniform bit-string

– is divisible by 2 is Β½ – is divisible by 4 is ΒΌ – ….

– is divisible by 2! is "#!β€’ We don’t expect any of them to be divisible by 2$%&" ' ("

16

Page 17: CS60021: Scalable Data Mining Streaming Algorithms

Formalizing intuition

β€’ 𝑆 = set of elements that appeared in stream

β€’ For any π‘Ÿ ∈ β„“ , 𝑗 ∈ [𝑛], 𝑋-. = indicator of zeros(β„Ž 𝑗) β‰₯ π‘Ÿ

β€’ π‘Œ- = number of 𝑗 ∈ π‘ˆ such that zeros(β„Ž 𝑗) β‰₯ π‘Ÿπ‘Œ) =2

*∈,

𝑋)*

β€’ Let οΏ½Μ‚οΏ½ be final value of 𝑧 after algo has seen all data

17

Page 18: CS60021: Scalable Data Mining Streaming Algorithms

Proof of FM

β€’ π‘Œ4 > 0 ⟷ οΏ½Μ‚οΏ½ β‰₯ π‘Ÿ , equivalently, π‘Œ4 = 0 ⟷ οΏ½Μ‚οΏ½ < π‘Ÿ

β€’ 𝐸 π‘Œ4 = βˆ‘5∈7 𝐸 𝑋45

β€’ 𝐸 π‘Œ4 = +8#

β€’ π‘£π‘Žπ‘Ÿ π‘Œ4 = βˆ‘5∈7 π‘£π‘Žπ‘Ÿ 𝑋45 ≀ βˆ‘5∈7 𝐸 𝑋458

18

𝑋-. = P1 π‘€π‘–π‘‘β„Ž π‘π‘Ÿπ‘œπ‘12-

0 𝑒𝑙𝑠𝑒

Page 19: CS60021: Scalable Data Mining Streaming Algorithms

Proof of FM

β€’ π‘£π‘Žπ‘Ÿ π‘Œ4 ≀ βˆ‘5∈7 𝐸 𝑋458 ≀ 𝑛/24

19

Pr π‘Œ- > 0 = Pr π‘Œ- β‰₯ 1 ≀𝐸 π‘Œ-1

=𝑛2-

Pr π‘Œ- = 0 ≀ Pr π‘Œ- βˆ’ 𝐸 π‘Œ- β‰₯ 𝐸 π‘Œ- β‰€π‘£π‘Žπ‘Ÿ π‘Œ-𝐸 π‘Œ- , ≀

2-

𝑛

Page 20: CS60021: Scalable Data Mining Streaming Algorithms

Upper bound

Returned estimate M𝑛 = 2:Μ‚;)/8

π‘Ž = smallest integer with 2<;)/8 β‰₯ 4𝑛

Pr M𝑛 β‰₯ 4𝑛 = Pr οΏ½Μ‚οΏ½ β‰₯ π‘Ž = Pr π‘Œ< > 0 ≀𝑛2<

≀24

20

Page 21: CS60021: Scalable Data Mining Streaming Algorithms

Lower bound

Returned estimate Z𝑛 = 2(Μ‚)*/,

𝑏 = largest integer with 20)*/, ≀ 𝑛/4

Pr Z𝑛 ≀𝑛4

= Pr οΏ½Μ‚οΏ½ ≀ 𝑏 = Pr π‘Œ0)* = 0 ≀20)*

𝑛≀

24

21

Page 22: CS60021: Scalable Data Mining Streaming Algorithms

Understanding the bound

β€’ By union bound, with prob 1 βˆ’ 88,

β€’ Can get somewhat better constantsβ€’ Need only 2-wise independent hash functions,

since we only used variances

22

𝑛4≀ Z𝑛 ≀ 4𝑛

Page 23: CS60021: Scalable Data Mining Streaming Algorithms

Improving the probabilities

β€’ To improve the probabilities, a common trick: median of estimates

β€’ Create #𝑧*, #𝑧,,…., ]𝑧1 in parallel– return median

β€’ Expect at most ,2π‘˜ of them to exceed 4𝑛

β€’ But if median exceeds 4𝑛 , then 1,

of them does Γ  using Chernoff bound this prob is exp(βˆ’Ξ© π‘˜ )

23

Page 24: CS60021: Scalable Data Mining Streaming Algorithms

Improving the probabilitiesβ€’ To improve the probabilities, a common trick: median of

estimates

β€’ Create !𝑧), !𝑧8,…., $𝑧= in parallel– return median

β€’ Using Chernoff bound, can show that median will lie in +>, 4𝑛 with probability 1 βˆ’ exp(βˆ’Ξ© π‘˜ ).

β€’ Given error prob 𝛿, choose π‘˜ = O(log )?)

24

Page 25: CS60021: Scalable Data Mining Streaming Algorithms

Summary

β€’ Streaming model– useful abstraction– Estimating basic statistics also nontrivial

β€’ Estimating number of distinct elements– Linear counting– Flajolet Martin

25

Page 26: CS60021: Scalable Data Mining Streaming Algorithms

k-minimum value Sketch

Page 27: CS60021: Scalable Data Mining Streaming Algorithms

k-MV sketch

β€’ Developed in an effort to get better accuracy– Flajolet Martin only give multiplicative accuracy

β€’ Additional capabilities for estimating cardinalities of union and intersection of streams– If 𝑆* and 𝑆, are two streams, can compute their union sketch

from individual sketches of 𝑆* and 𝑆,

27

[kMV sketch slides courtesy Cohen-Wang]

Page 28: CS60021: Scalable Data Mining Streaming Algorithms

Intuition

β€’ Suppose β„Ž: π‘ˆ β†’ [0,1] is random hash function such that β„Ž π‘₯ ∼ π‘ˆ 0,1 for all π‘₯ ∈ π‘ˆ

β€’ Maintain min-hash value 𝑦– initialize 𝑦 ← 1– For each item π‘₯', 𝑦 ← min(𝑦, β„Ž π‘₯' )

β€’ Expectation of minimum is 𝐸[min@β„Ž(π‘₯@)] =

)+;)

28

Page 29: CS60021: Scalable Data Mining Streaming Algorithms

Why is expectation of min = 89:8

?

β€’ Imagine a circle instead of [0, 1]β€’ Choose 𝑛 + 1 points uniformly at randomβ€’ 𝑛 + 1 intervals are formed

β€’ Expected length of each interval is )+;)

β€’ Think of the first point as the place to cut the circle!

29

Page 30: CS60021: Scalable Data Mining Streaming Algorithms

k-minimum value sketch

Initialize:– 𝑦*, . . 𝑦1 ← 1,…1– Uniform random hash functions β„Ž*, … , β„Ž1, β„Ž': π‘ˆ β†’ [0,1]

Process π‘₯ :– For all 𝑗 ∈ π‘˜ , 𝑦. ← min(𝑦., β„Ž. π‘₯' )

Estimate:

– return median-of-means(*3*, … , *3!)

30

Page 31: CS60021: Scalable Data Mining Streaming Algorithms

Median-of-means

β€’ Given (πœ–, 𝛿) , choose π‘˜ = 45" log(

*6)

β€’ Group 𝑑*, … 𝑑1 into log(*6) groups of size 45" each

β€’ Find mean 𝑑' for each group: 𝑍*, … 𝑍789(#$)

β€’ Return Z𝑛 =median of 𝑍*, … 𝑍789(#$)

31

Page 32: CS60021: Scalable Data Mining Streaming Algorithms

Example

32

h1 h2 h3 h4.45 .19 .10 .92

.35 .51 .71 .20

.21 .07 .93 .18

.14 .70 .50 .25

Page 33: CS60021: Scalable Data Mining Streaming Algorithms

Complexity

β€’ Total space required = 𝑂(π‘˜ log 𝑛) = 𝑂( )

A$log 𝑛 log()

?))

– can be improved– don’t need floating points, can use β„Ž: π‘ˆ β†’ 2β„“ as before– can do with k-wise universal hash functions

β€’ Update time per item = 𝑂(π‘˜)– However, can show that most items will not result in updates

33

Page 34: CS60021: Scalable Data Mining Streaming Algorithms

Theoretical Guarantees

With probability 1 βˆ’ 𝛿, returns ;𝑛 satisfies

Proof is simple application of expectation and Chernoff bound.

Reference: Lecture notes by Amit Chakrabarti.

34

1 βˆ’ πœ– 𝑛 ≀ Z𝑛 ≀ 1 + πœ– 𝑛

Page 35: CS60021: Scalable Data Mining Streaming Algorithms

Merging

β€’ For two stream 𝑆% and 𝑆& use same set of hash functions– Stream 𝑆@ has sketch (𝑦)@ , … , 𝑦=@ )

β€’ For each j ∈ π‘˜ , find the combined sketch as:– yF = min(𝑦5), 𝑦58)

β€’ Gives estimate of 𝑆% βˆͺ 𝑆&35

Page 36: CS60021: Scalable Data Mining Streaming Algorithms

References:

β€’ Primary reference for this lectureβ€’ Lecture notes by Amit Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/data-

streams-lecnotes.pdf


Top Related