bloom filters, count sketches and adaptive sketchesas143/comp640_fall16/slides/29th_class.pdf ·...
TRANSCRIPT
![Page 1: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/1.jpg)
Bloom Filters, Count Sketches and Adaptive Sketches
Rice University
Anshumali Shrivastava
29th August 2016
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 1 / 22
![Page 2: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/2.jpg)
Basics: Universal HashingBasic tool for shuffling and sampling from any set of objectsO = {1, 2, ..., n}.
h : O → {1, 2, ..., m}Pr(h(x) = h(y)) ≤ 1
m iff x 6= y .
Some implementations
Pick a random number a and b, a large enough prime, returnh(x) = ax + b mod p mod m
Fastest Trick: Choose m = 2M to be power of 2, choose a randomodd integer return h(x) = ax >> (32−M)
Problems:
Given a set O, randomly assign it to m bins.
Randomly sample 1/m fraction of the data.
Activity: Suppose m >> n How to sample one element randomlyfrom O
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 2 / 22
![Page 3: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/3.jpg)
Bloom Filters Set Up
A common Task: How to know whether some event occurred (before) ornot without storing the event information? The number of possible eventsare huge. The following list is from Wikipedia
Akamai web servers use Bloom filters to prevent ”one-hit-wonders”from being stored in its disk caches. One-hit-wonders are web objectsrequested by users just once.
Google BigTable, Apache HBase and Apache Cassandra, andPostgresql use Bloom filters to reduce the disk lookups fornon-existent rows or columns. Avoiding costly disk lookupsconsiderably increases the performance of a database query operation.
The Google Chrome web browser used to use a Bloom filter toidentify malicious URLs.
The Squid Web Proxy Cache uses Bloom filters for cache digests
Bitcoin uses Bloom filters to speed up wallet synchronization.
many more.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 3 / 22
![Page 4: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/4.jpg)
The Bloom Filter Algorithm and Analysis
A Dynamic Data Structure of m bit arrays B
Pick k universal hash function hi : O → {1, 2, ..., m}i ∈ {1, 2, ..., k}.Insert oj : Set all the bits B(hi (oj)) = 1. ∀ i ∈ {1, 2, ..., k}Query oj : If B(hi (oj)) = 1 ∀ i ∈ {1, 2, ..., k} RETURN True ELSEfalse
Properties
If an item is present, the algorithm is always correct. No falsenegative.
If an item is not present, the algorithm may return true with smallprobability.
Cannot delete items easily.
Analysis On-Board
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 4 / 22
![Page 5: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/5.jpg)
Generalized Bloom Filters: Count-Min Sketch
On a network, a lot of events keep happening. Cannot afford to storeevent information.
Bloom Filters: Keep track of whether an given event has alreadyhappened or not.
Count Min Sketches (or Count Sketches): Keep track of thefrequency of the frequent events (heavy hitters).
Instead of bits keep CountersUsually, to avoid collisions among different hashes, they are hashed intodifferent arrays. (Hence we get Matrix)
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 5 / 22
![Page 6: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/6.jpg)
The Classical (Non-Adaptive) Approximate Counting:
Setting:
We are given a huge number of items (co-variate) i ∈ I to track overtime t ∈ {1, 2, ...,T}. T can be large as well.
We only see increments (i , t, v), the increment v to item i at time t.
Goal: In limited space (hopefully O(log |I | × T )), we want to
Point Queries: Estimate the counts (increments) of item i at time t.
Range Queries: Estimate the counts (increments) of item i duringthe given range [t1, t2].
Classical Sketching: Count-Sketch, Count-Min Sketch (CMS), LossyCounting, etc.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 6 / 22
![Page 7: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/7.jpg)
Idea: Power Law Everywhere in Practice
Frequ
ency
Events Example: We want to cache answers to frequent queries on a server.All queries are just too much to keep track of.
How to identify very frequent queries? (Note, we cannot counteverything.)
We dont even know which ones are frequent, we only see somequeries within a given time set.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 7 / 22
![Page 8: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/8.jpg)
Counting Heavy Hitters on Data Streams
Real Problem: How to identify significant event (frequent) withouthaving to count all of them. (sub-linear)
Classical Formalism (Turnstile Model)
Assume we have a very long vector v (Dim D), we cannot materialize.
We only see increments to its coordinates. E.g. co-ordinate i isincremented by 10 at time t.
Goal: Find s heaviest coordinate, using space k << D
Seems Hopeless !
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 8 / 22
![Page 9: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/9.jpg)
Counting Heavy Hitters on Data Streams
Real Problem: How to identify significant event (frequent) withouthaving to count all of them. (sub-linear)
Classical Formalism (Turnstile Model)
Assume we have a very long vector v (Dim D), we cannot materialize.
We only see increments to its coordinates. E.g. co-ordinate i isincremented by 10 at time t.
Goal: Find s heaviest coordinate, using space k << D
Seems Hopeless !
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 8 / 22
![Page 10: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/10.jpg)
Counting Heavy Hitters on Data Streams
Real Problem: How to identify significant event (frequent) withouthaving to count all of them. (sub-linear)
Classical Formalism (Turnstile Model)
Assume we have a very long vector v (Dim D), we cannot materialize.
We only see increments to its coordinates. E.g. co-ordinate i isincremented by 10 at time t.
Goal: Find s heaviest coordinate, using space k << D
Seems Hopeless !
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 8 / 22
![Page 11: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/11.jpg)
Uncertainty is the Refuge of Hope.—Henri Frederic Amiel (1821-81)
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 9 / 22
![Page 12: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/12.jpg)
Basic Idea behind Sketching.
Randomly assign items to a small number of counters.
It works! AMS 85, Moody 89, Charikar 99, MuthuKrishnana 02, etc.
If no collisions, counts exact.
i
H(i)
Use Random Hash Function
Handling Time:
Treat each pair (i , t) (item, time) as different item.
Hash pairs (i , t), instead of just items.
Time only increases the number of items to |I | × T .
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 10 / 22
![Page 13: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/13.jpg)
What happens during Collision ?
+
+
+
The Good
We typically care about heavy hitters.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 11 / 22
![Page 14: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/14.jpg)
What happens during Collision ?
+
+
+
The Good
The Irrelevant
We typically care about heavy hitters.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 11 / 22
![Page 15: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/15.jpg)
What happens during Collision ?
+
+
+
The Good
The Irrelevant
The Unlucky
We typically care about heavy hitters.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 11 / 22
![Page 16: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/16.jpg)
Maximizing Luck : Count-Min Sketch (CMS)
Idea:
We always overestimate, if unlucky, by a lot.Repeat independently d times and take minimum of all overestimates.Unless unlucky all d times, it will work. (d = log 1
δ , w = 1ε )
Theoretical Guarantee
c ≤ c ≤ c + εMT with probability 1− δ, where MT is sum of allcounts in the stream.Space O(log |I | × T )
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 12 / 22
![Page 17: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/17.jpg)
New Requirement: Time Adaptability
In Practice:
Recent trends are more important.
A burst in the number of clicks in the past few minutes moreinformative than similar burst last month.
Expectation: Time Adaptive Counting.
Classical sketches do not take temporal effect into consideration.
Smart Tradeoff: Given the same space, trade errors of recent countswith that of older ones.
Like our memory, forget slowly.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 13 / 22
![Page 18: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/18.jpg)
Existing Solution: Hokusai1 t = T (𝑨𝑻)
t = T-1 (𝑨𝑻−𝟏 )
t = T-2 (𝑨𝑻−𝟐 )
t = T-3 (𝑨𝑻−𝟑 )
t = T-4 (𝑨𝑻−𝟒 )
t = T-5 (𝑨𝑻−𝟓 )
t = T-6 (𝑨𝑻−𝟔 )
Idea: Disproportionate allocation over time.
Accuracy of CMS dependent on memory allocated.
More space for recent sketches and less for older.
Keep a CMS sketch for every time. Shrink sketch size on fly.Clever Idea: Exploit Rollover.
1Matusevych, Smola and Ahmad 2012Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 14 / 22
![Page 19: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/19.jpg)
Existing Solution: Hokusai1 t = T (𝑨𝑻)
t = T-1 (𝑨𝑻−𝟏 )
t = T-2 (𝑨𝑻−𝟐 )
t = T-3 (𝑨𝑻−𝟑 )
t = T-4 (𝑨𝑻−𝟒 )
t = T-5 (𝑨𝑻−𝟓 )
t = T-6 (𝑨𝑻−𝟔 )
Idea: Disproportionate allocation over time.
Accuracy of CMS dependent on memory allocated.
More space for recent sketches and less for older.
Keep a CMS sketch for every time. Shrink sketch size on fly.Clever Idea: Exploit Rollover.
1Matusevych, Smola and Ahmad 2012Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 14 / 22
![Page 20: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/20.jpg)
Existing Solution: Hokusai1 t = T (𝑨𝑻)
t = T-1 (𝑨𝑻−𝟏 )
t = T-2 (𝑨𝑻−𝟐 )
t = T-3 (𝑨𝑻−𝟑 )
t = T-4 (𝑨𝑻−𝟒 )
t = T-5 (𝑨𝑻−𝟓 )
t = T-6 (𝑨𝑻−𝟔 )
Idea: Disproportionate allocation over time.
Accuracy of CMS dependent on memory allocated.
More space for recent sketches and less for older.
Keep a CMS sketch for every time. Shrink sketch size on fly.Clever Idea: Exploit Rollover.
1Matusevych, Smola and Ahmad 2012Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 14 / 22
![Page 21: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/21.jpg)
Existing Solution: Hokusai1
t = T (𝑨𝑻)
t = T-1 (𝑨𝑻−𝟏 )
t = T-2 (𝑨𝑻−𝟐 )
t = T-3 (𝑨𝑻−𝟑 )
t = T-4 (𝑨𝑻−𝟒 )
t = T-5 (𝑨𝑻−𝟓 )
t = T-6 (𝑨𝑻−𝟔 )
Idea: Disproportionate allocation over time.
Accuracy of CMS dependent on memory allocated.
More space for recent sketches and less for older.
Keep a CMS sketch for every time. Shrink sketch size on fly.Clever Idea: Exploit Rollover.
1Matusevych, Smola and Ahmad 2012Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 14 / 22
![Page 22: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/22.jpg)
Problems with Hokusai
Issues:
Discontinuity. If time t is empty, we still have to shrink sketch size forolder times.
O(T) memory. One for each t.
Shrinking overhead. Shrink log t sketches for every transition from tto t + 1.
No flexibility.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 15 / 22
![Page 23: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/23.jpg)
Detour: Dolby Noise Reduction (1960s)
High Level View
In digital recording, the music signal compete with tape hiss(background noise).
if Signal to Noise (SNR) ratio is high, the recording is noise free.
While recording the frequencies in the music is artificially inflated(Pre-Emphasis).
During playback a reverse transformation is applied which cancelspre-emphasis. (De-Emphasis)
Overall effect of noise is minimized.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 16 / 22
![Page 24: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/24.jpg)
Proposal: (Adaptive)Ada-Sketches
Analogy with Dolby Noise Reduction:
Sketches preserves heavier counts more accurately.
Artificially inflate recent counts (Pre-emphasis).
Inflated counts will be preserved with less error.
Deflate by exact same amount during estimation. (De-emphasis)
Data
Stream
Multiply
by 𝒇(𝒕) SKETCH
Insert Query RESULT
Pre-emphasis De-emphasis
Divide by
apt 𝒇(𝒕)
Proposal
Let f (t) be any (pre-defined) monotonically increasing sequence.(f (t) can be chosen wisely)
Multiply the count of (i , t) with f (t) and then add to the sketch.
While querying (i,t), get the estimate and divide by f (t)
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 17 / 22
![Page 25: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/25.jpg)
Proposal: (Adaptive)Ada-Sketches
Analogy with Dolby Noise Reduction:
Sketches preserves heavier counts more accurately.
Artificially inflate recent counts (Pre-emphasis).
Inflated counts will be preserved with less error.
Deflate by exact same amount during estimation. (De-emphasis)
Data
Stream
Multiply
by 𝒇(𝒕) SKETCH
Insert Query RESULT
Pre-emphasis De-emphasis
Divide by
apt 𝒇(𝒕)
Proposal
Let f (t) be any (pre-defined) monotonically increasing sequence.(f (t) can be chosen wisely)
Multiply the count of (i , t) with f (t) and then add to the sketch.
While querying (i,t), get the estimate and divide by f (t)
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 17 / 22
![Page 26: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/26.jpg)
Why it works ?
Observation
If no collision then exact.
During collision, errors or recent counts decrease due to greaterde-emphasis.
+
+
Vanilla
Error
+
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 18 / 22
![Page 27: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/27.jpg)
Why it works ?
Observation
If no collision then exact.
During collision, errors or recent counts decrease due to greaterde-emphasis.
+
+
Adaptive
Pre-Emphasis
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 18 / 22
![Page 28: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/28.jpg)
Why it works ?
Observation
If no collision then exact.
During collision, errors or recent counts decrease due to greaterde-emphasis.
+
+
Adaptive
Error
Pre-Emphasis
De-Emphasis
+ +
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 18 / 22
![Page 29: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/29.jpg)
Advantages
No Discontinuity. If time t is empty, no addition, no extra collisions,no extra errors.
O(log |I | × T ) memory just like CMS.
No shrinking overhead. Minimum modification to CMS. (StrictGeneralization)
Provable Time Adaptive Guarantees
Theorem
For w = d eε e and d = log 1δ , given any (i , t) we have
cti ≤ cti ≤ cti + εβt√MT
2
with probability 1− δ. Here βt =
√∑Tt′=0(f (t
′))2
f (t) is the time adaptive factormonotonically decreasing with t.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 19 / 22
![Page 30: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/30.jpg)
More..
Works with any Sketching Algorithm
Adaptive Count Sketches, Adaptive Lossy Counting etc.
Provable Time Adaptive Guarantees for all of them.
Flexibility in Choice of f (t)
Any monotonic f (t) works. Can be tailored
Upper bound dependent on βt =
√∑Tt′=0(f (t
′))2
f (t) .
Fine control over the error distributions.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 20 / 22
![Page 31: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/31.jpg)
More..
Works with any Sketching Algorithm
Adaptive Count Sketches, Adaptive Lossy Counting etc.
Provable Time Adaptive Guarantees for all of them.
Flexibility in Choice of f (t)
Any monotonic f (t) works. Can be tailored
Upper bound dependent on βt =
√∑Tt′=0(f (t
′))2
f (t) .
Fine control over the error distributions.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 20 / 22
![Page 32: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/32.jpg)
Experiments: Accuracy for a given Memory
500 1000 1500 20000
500
1000
1500
2000
Time
Abs
olut
e E
rror
w = 218
AOL
CMSAda−CMS (exp)Ada−CMS (lin)Hokusai
400 600 800 10000
200
400
600
800
1000
Time
Abs
olut
e E
rror
w = 218Criteo
CMSAda−CMS (exp)Ada−CMS (lin)Hokusai
Time500 1000 1500 2000S
tand
ard
Dev
iatio
n of
Err
ors
0
50
100
150
200 w = 218
AOL
CMSAda-CMS (exp)Ada-CMS (lin)Hokusai
Time400 600 800 1000S
tand
ard
Dev
iatio
n of
Err
ors
0
50
100
150
200w = 218
Criteo
CMSAda-CMS (exp)Ada-CMS (lin)Hokusai
Figure: Mean and Standard deviation of errors for w = 218.
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 21 / 22
![Page 33: Bloom Filters, Count Sketches and Adaptive Sketchesas143/COMP640_Fall16/slides/29th_class.pdf · Counting Heavy Hitters on Data Streams Real Problem: How to identify signi cant event](https://reader034.vdocuments.net/reader034/viewer/2022052021/603546c2945f6a6aa1014292/html5/thumbnails/33.jpg)
Scalability: Throughput
Table: Time in sec to summarize AOL dataset
220 222 225 227 230
CMS 44.62 44.80 48.40 50.81 52.67
Hoku 68.46 94.07 360.23 1206.71 9244.17
ACMS (lin) 44.57 44.62 49.95 52.21 52.87
ACMS (exp) 68.32 73.96 76.23 82.73 76.82
Table: Time in sec to summarize Criteo Dataset
220 222 225 227 230
CMS 40.79 42.29 45.81 45.92 46.17
Hoku 55.19 90.32 335.04 1134.07 8522.12
ACMS (lin) 39.07 42.00 44.54 45.32 46.24
ACMS (exp) 69.21 69.31 71.23 72.01 72.85
Anshumali Shrivastava (COMP 640) Sketching 29th August 2016 22 / 22