CS60021: Scalable Data Mining
Streaming Algorithms
Sourangshu Bhattacharya
Count distinct
Streaming problem: distinct count
β’ Universe is π, number of distinct elements = m, stream size is πβ Example: π = all IP addresses
β IPs can repeatβ Want to estimate the number of distinct elements in the
stream
3
10.1.21.10, 10.93.28,1,β¦..,98.0.3.1,β¦..10.93.28.1β¦..
Other applications
β’ Universe = set of all k-grams, stream is generated by document corpusβ need number of distinct k-grams seen in corpus
β’ Universe = telephone call records, stream generated by tuples (caller, callee)β need number of phones that made > 0 calls
4
Solutions
5
β’ Seen π elements from stream with elements from π.
β’ NaΓ―ve solution: π(π πππ|π|) spaceβ Store all elements, sort and count distinctβ Store a hashmap, insert if not present
β’ Bit array: O(|U|) space:β Bits initialized to 1 only if element seen in stream
β’ Can we do this in less space ?
Approximations
β’ π, πΏ βapproximationsβ Algorithm will use random hash functionsβ Will return an answer !π such that
β This will happen with probability 1 β πΏ over the randomness of the algorithm
6
1 β π π β€ !π β€ 1 + π π
First effort
β’ Stream length: π, distinct elements: πβ’ Proposed algo: Given space π , sample π items from the
streamβ Find the number of distinct elements in this set: #π
β return m = #πΓ !"
β’ Not a constant factor approximationβ 1,1,1,1,β¦..1,2,3,4,β¦.,n-1
π β π + 1
Linear Counting
β’ Bit array π΅ of size π, initialized to all zeroβ’ Hash function β: π β [π]β’ When seeing item π₯ , set π΅ β π₯ = 1
8
Linear Counting
β’ Bit array π΅ of size π, initialized to all zeroβ’ Hash function β: π β [π]β’ When seeing item π₯ , set π΅ β π₯ = 1
β’ π§# = fraction of zero entries
β’ Return estimate βm log($!#)
9
Linear Counting Analysis
β’ Pr[ position remaining 0 ] = 1 β )*
+β π,
!"
β’ Expected number of positions at zero: E z- = ππ,+/*
β’ Using tail inequalities we can show this is concentratedβ’ Typically useful only for π = Ξ(π), often useful in
practice
10
Flajolet Martin Sketch
Flajolet Martin Sketch
β’ Componentsβ βrandomβ hash function β: π β 2β for some large ββ β(π₯) is a β βlength bit stringβ initially assume it is completely random, can relax
β’ π§πππ π£ = position of rightmost 1 in bit representation of π£= max{ π , 2' πππ£ππππ π£ }
β π§ππππ 10110 = 1, π§ππππ 110101000 = 3
12
Flajolet Martin Sketch
Initialize:β Choose a βrandomβ hash function β: π β 2β
β π§ β 0Process(x)
β if π§ππππ β π₯ > π§, z β π§ππππ (β π₯ )Estimate:β return 2()*/,
13
Example
14
h(.)
0110101
1011010
1000100
1111010
Space usage
β’ We need β β₯ πΆ log π for some πΆ β₯ 3, sayβ by birthday paradox analysis, no collisions with high prob
β’ Sketch : π§ , needs to have only π(log log π) bits
β’ Total space usage = π(log π + log log π)
15
Intuition
β’ Assume hash values are uniformly distributedβ’ The probability that a uniform bit-string
β is divisible by 2 is Β½ β is divisible by 4 is ΒΌ β β¦.
β is divisible by 2! is "#!β’ We donβt expect any of them to be divisible by 2$%&" ' ("
16
Formalizing intuition
β’ π = set of elements that appeared in stream
β’ For any π β β , π β [π], π-. = indicator of zeros(β π) β₯ π
β’ π- = number of π β π such that zeros(β π) β₯ ππ) =2
*β,
π)*
β’ Let οΏ½ΜοΏ½ be final value of π§ after algo has seen all data
17
Proof of FM
β’ π4 > 0 β· οΏ½ΜοΏ½ β₯ π , equivalently, π4 = 0 β· οΏ½ΜοΏ½ < π
β’ πΈ π4 = β5β7 πΈ π45
β’ πΈ π4 = +8#
β’ π£ππ π4 = β5β7 π£ππ π45 β€ β5β7 πΈ π458
18
π-. = P1 π€ππ‘β ππππ12-
0 πππ π
Proof of FM
β’ π£ππ π4 β€ β5β7 πΈ π458 β€ π/24
19
Pr π- > 0 = Pr π- β₯ 1 β€πΈ π-1
=π2-
Pr π- = 0 β€ Pr π- β πΈ π- β₯ πΈ π- β€π£ππ π-πΈ π- , β€
2-
π
Upper bound
Returned estimate Mπ = 2:Μ;)/8
π = smallest integer with 2<;)/8 β₯ 4π
Pr Mπ β₯ 4π = Pr οΏ½ΜοΏ½ β₯ π = Pr π< > 0 β€π2<
β€24
20
Lower bound
Returned estimate Zπ = 2(Μ)*/,
π = largest integer with 20)*/, β€ π/4
Pr Zπ β€π4
= Pr οΏ½ΜοΏ½ β€ π = Pr π0)* = 0 β€20)*
πβ€
24
21
Understanding the bound
β’ By union bound, with prob 1 β 88,
β’ Can get somewhat better constantsβ’ Need only 2-wise independent hash functions,
since we only used variances
22
π4β€ Zπ β€ 4π
Improving the probabilities
β’ To improve the probabilities, a common trick: median of estimates
β’ Create #π§*, #π§,,β¦., ]π§1 in parallelβ return median
β’ Expect at most ,2π of them to exceed 4π
β’ But if median exceeds 4π , then 1,
of them does Γ using Chernoff bound this prob is exp(βΞ© π )
23
Improving the probabilitiesβ’ To improve the probabilities, a common trick: median of
estimates
β’ Create !π§), !π§8,β¦., $π§= in parallelβ return median
β’ Using Chernoff bound, can show that median will lie in +>, 4π with probability 1 β exp(βΞ© π ).
β’ Given error prob πΏ, choose π = O(log )?)
24
Summary
β’ Streaming modelβ useful abstractionβ Estimating basic statistics also nontrivial
β’ Estimating number of distinct elementsβ Linear countingβ Flajolet Martin
25
k-minimum value Sketch
k-MV sketch
β’ Developed in an effort to get better accuracyβ Flajolet Martin only give multiplicative accuracy
β’ Additional capabilities for estimating cardinalities of union and intersection of streamsβ If π* and π, are two streams, can compute their union sketch
from individual sketches of π* and π,
27
[kMV sketch slides courtesy Cohen-Wang]
Intuition
β’ Suppose β: π β [0,1] is random hash function such that β π₯ βΌ π 0,1 for all π₯ β π
β’ Maintain min-hash value π¦β initialize π¦ β 1β For each item π₯', π¦ β min(π¦, β π₯' )
β’ Expectation of minimum is πΈ[min@β(π₯@)] =
)+;)
28
Why is expectation of min = 89:8
?
β’ Imagine a circle instead of [0, 1]β’ Choose π + 1 points uniformly at randomβ’ π + 1 intervals are formed
β’ Expected length of each interval is )+;)
β’ Think of the first point as the place to cut the circle!
29
k-minimum value sketch
Initialize:β π¦*, . . π¦1 β 1,β¦1β Uniform random hash functions β*, β¦ , β1, β': π β [0,1]
Process π₯ :β For all π β π , π¦. β min(π¦., β. π₯' )
Estimate:
β return median-of-means(*3*, β¦ , *3!)
30
Median-of-means
β’ Given (π, πΏ) , choose π = 45" log(
*6)
β’ Group π‘*, β¦ π‘1 into log(*6) groups of size 45" each
β’ Find mean π‘' for each group: π*, β¦ π789(#$)
β’ Return Zπ =median of π*, β¦ π789(#$)
31
Example
32
h1 h2 h3 h4.45 .19 .10 .92
.35 .51 .71 .20
.21 .07 .93 .18
.14 .70 .50 .25
Complexity
β’ Total space required = π(π log π) = π( )
A$log π log()
?))
β can be improvedβ donβt need floating points, can use β: π β 2β as beforeβ can do with k-wise universal hash functions
β’ Update time per item = π(π)β However, can show that most items will not result in updates
33
Theoretical Guarantees
With probability 1 β πΏ, returns ;π satisfies
Proof is simple application of expectation and Chernoff bound.
Reference: Lecture notes by Amit Chakrabarti.
34
1 β π π β€ Zπ β€ 1 + π π
Merging
β’ For two stream π% and π& use same set of hash functionsβ Stream π@ has sketch (π¦)@ , β¦ , π¦=@ )
β’ For each j β π , find the combined sketch as:β yF = min(π¦5), π¦58)
β’ Gives estimate of π% βͺ π&35
References:
β’ Primary reference for this lectureβ’ Lecture notes by Amit Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/data-
streams-lecnotes.pdf