Streaming Data Mining
Edo Liberty¹, Jelani Nelson²
¹Yahoo! Research, [email protected]
²University, [email protected]
Edo Liberty, Jelani Nelson: Streaming Data Mining 1 / 111
The need for Streaming Data Mining
A standard interface between data and mining algorithms
The need for Streaming Data Mining
We have a lot of data... Examples: videos, images, email messages, webpages, chats, click data, search queries, shopping history, user browsing patterns, GPS trails, financial transactions, stock exchange data, electricity consumption, traffic records, seismology, astronomy, physics, medical imaging, chemistry, computational biology, weather measurements, maps, telephony data, SMSs, audio tracks and songs, applications, gaming scores, user ratings, question-and-answer forums, legal documentation, medical records, network traffic records, satellite measurements, digital microscopy, cellular records...
The need for Streaming Data Mining
Distributed storage and file systems (HadoopFS, GFS, Cassandra, XIV)
The need for Streaming Data Mining
Distributed computation (MapReduce, Hadoop, Message passing)
The need for Streaming Data Mining
Distributed computation (web search, HBase/BigTable)
The need for Streaming Data Mining
“Study Projects Nearly 45-Fold Annual Data Growth by 2020” EMC Press Release.
Figure: The Economist: Data, data everywhere
The need for Streaming Data Mining
IDC 2011 Digital Universe Study.
More exact model
Trivial tasks: count items, sum values, sample, find min/max.
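Each of these trivial tasks needs only constant state maintained in a single pass. A minimal sketch (the function name and the choice of single-slot reservoir sampling are illustrative, not from the slides):

```python
import random

def stream_stats(stream):
    """One pass, O(1) state: count, sum, min, max, and one uniform
    random sample (single-slot reservoir sampling)."""
    count, total = 0, 0
    lo, hi, sample = None, None, None
    for x in stream:
        count += 1
        total += x
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
        # replacing the sample with probability 1/count keeps it uniform
        if random.randrange(count) == 0:
            sample = x
    return count, total, lo, hi, sample
```

Note that none of these require storing the stream, only a fixed number of running values.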
Communication complexity
Impossible tasks: finding the median, alerting on a new item, finding the most frequent item.
Streaming Data Mining
When things are possible and not trivial:
1. Most tasks/query types require different sketches
2. Algorithms are usually randomized
3. Results are, as a whole, approximate
But:
1. An approximate result is acceptable → significant speedup (one pass)
2. The data cannot be stored → streaming is the only option
Streaming Data Mining
1. Items (words, IP addresses, events, clicks, ...):
   item frequencies, distinct elements, moment estimation
2. Vectors (text documents, images, example features, ...):
   dimensionality reduction, k-means, linear regression
3. Matrices (text corpora, user preferences, social graphs, ...):
   efficiently approximating the covariance matrix, sparsification by sampling
Streaming Data Mining
1. Items: item frequencies, distinct elements, moment estimation
2. Vectors: dimensionality reduction, k-means, linear regression
3. Matrices: efficiently approximating the covariance matrix, sparsification by sampling
Item frequencies
Computing f(i) for all i is easy in O(n) space.
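The easy O(n)-space solution is just one counter per distinct item; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def exact_frequencies(stream):
    """Exact f(i) for every item i: one pass, but O(n) space
    (one counter per distinct item seen)."""
    f = Counter()
    for item in stream:
        f[item] += 1
    return f

f = exact_frequencies("abracadabra")
# f['a'] == 5, f['b'] == 2, f['r'] == 2
```

The rest of this section is about doing (approximately) the same job in far less than O(n) space.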
Sampling
We sample with probability p and estimate f′(i) = (1/p)·A(i).
Sampling
For f′(i) = f(i) ± εN it suffices that p ≥ c·log(n/δ)/(Nε²).
Sampling
The space requirement is therefore O(log(n/δ)/ε²).
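Putting the three sampling slides together as code, under stated assumptions (the function name is illustrative, and c is the unspecified constant from the analysis; 4.0 below is an arbitrary placeholder):

```python
import math
import random
from collections import Counter

def sampled_frequencies(stream, N, n, eps, delta, c=4.0):
    """Keep each item independently with probability p and estimate
    f'(i) = A(i)/p, where A(i) counts sampled occurrences of i.
    Taking p >= c*log(n/delta)/(N*eps^2) gives f'(i) = f(i) +- eps*N
    with probability >= 1 - delta (c is an unspecified constant)."""
    p = min(1.0, c * math.log(n / delta) / (N * eps * eps))
    A = Counter()
    for item in stream:
        if random.random() < p:
            A[item] += 1
    return {i: A[i] / p for i in A}
```

The expected sample size is pN = O(log(n/δ)/ε²), matching the stated space bound.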
Count min sketches
This time we keep only 2/ε counters in an array A
Count min sketches
Items are counted in buckets according to a hash function h : [n]→ [2/ε].
Count min sketches
Obviously we have that f′(i) ≥ f(i).
Count min sketches
But also Pr[f′(i) ≤ f(i) + εN] ≥ 1/2.
Count min sketches
This reduces the space requirement to O(log(n/δ)/ε).
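The preceding slides describe the standard Count-Min sketch; a compact sketch of it in code, with the caveat that Python's built-in `hash` with per-row salts stands in for the pairwise-independent hash family the analysis assumes:

```python
import math
import random

class CountMin:
    """Count-Min sketch: O(log(n/delta)) rows of ceil(2/eps) counters.
    f'(i) = min over rows; guarantees f'(i) >= f(i) always, and
    f'(i) <= f(i) + eps*N with probability >= 1 - delta."""
    def __init__(self, eps, delta, n, seed=0):
        self.w = math.ceil(2 / eps)
        self.d = max(1, math.ceil(math.log2(n / delta)))
        rng = random.Random(seed)
        # one salt per row; a stand-in for a random hash family
        self.salts = [rng.random() for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _h(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._h(r, item)] += count

    def query(self, item):
        # each row only over-counts, so the minimum is the best estimate
        return min(self.table[r][self._h(r, item)] for r in range(self.d))
```

Each row satisfies the εN over-count bound with probability ≥ 1/2; taking the minimum over log(n/δ) independent rows drives the failure probability down to δ.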
Lossy Counting
We show how to reduce the space requirement to 1/ε.
Misra, Gries. Finding repeated elements, 1982.
Demaine, López-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient computation of frequent and top-k elements in data streams, 2006 (Space Saving).
Lossy Counting
We keep at most 1/ε different items (in this case 2)
Lossy Counting
If we have more than 1/ε items we reduce all counters.
Lossy Counting
And repeat...
Lossy Counting
Until the stream is consumed
Lossy Counting
We have f(i) ≥ f′(i) ≥ f(i) − εN.
Lossy Counting
This is because we can decrement 1/ε counters at most εN times!
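The algorithm sketched in the last few slides (keep at most 1/ε counters, reduce them all when a new item would exceed the budget) is the Misra-Gries procedure from the references above; a minimal sketch, with the function name my own:

```python
import math

def lossy_counting(stream, eps):
    """Misra-Gries frequent items: keep at most k = ceil(1/eps)
    counters; guarantees f(i) >= f'(i) >= f(i) - eps*N, where f'(i)
    is the counter value (0 for items not retained)."""
    k = math.ceil(1 / eps)
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # budget exceeded: decrement every counter,
            # dropping the ones that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Each decrement step removes roughly 1/ε counts from a total of N, which is exactly the "at most εN times" argument above.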
Error Relative to the Tail
Figure: Distribution of the top 20 and top 1000 most frequent words in a messaging text corpus. The heavy-tail distribution is common to many data sources.
Therefore, εN = ε·∑_{i=1}^{n} f(i) might not be tight enough. We now see how to reduce the approximation guarantee to ε·∑_{i=k+1}^{n} f(i).
Count Max Sketches
Distributing items with a hash function h : [n] → [4k] to 4k lossy counters. Beware: this algorithm does not exist in the literature; it is described only for didactic reasons.
Count Max Sketches
Then n_i = ∑_{j : h(j)=h(i)} f(j) ≤ (1/k)·∑_{i=k+1}^{n} f(i) with probability at least 1/2.
Count Max Sketches
So, f′(i) > f(i) − ε·∑_{i=k+1}^{n} f(i) with probability at least 1/2.
Count Max Sketches
f′(i) is taken as the maximum value over log(n/δ) such structures.
Count Max Sketches
This gives a total size of O(log(n/δ)/min(ε, 1/k))
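Since the slides themselves stress that "Count Max" is a didactic construction rather than a published algorithm, the following is only one hedged reading of it as code: hash items into 4k independent lossy-counting summaries, repeat log(n/δ) times, and report the maximum estimate (all names, and Python's built-in `hash` as a stand-in for a random hash family, are my assumptions):

```python
import math
import random

def count_max(stream_items, eps, k, n, delta, seed=0):
    """Didactic 'Count Max' sketch: t = O(log(n/delta)) repetitions,
    each hashing items into 4k Misra-Gries summaries of 1/eps counters.
    Since lossy counting only under-counts, the max over repetitions
    is the tightest estimate."""
    t = max(1, math.ceil(math.log2(n / delta)))
    buckets = 4 * k
    cap = math.ceil(1 / eps)
    rng = random.Random(seed)
    salts = [rng.random() for _ in range(t)]
    # one Misra-Gries counter table per (repetition, bucket)
    tables = [[{} for _ in range(buckets)] for _ in range(t)]
    for item in stream_items:
        for r in range(t):
            c = tables[r][hash((salts[r], item)) % buckets]
            if item in c:
                c[item] += 1
            elif len(c) < cap:
                c[item] = 1
            else:
                for key in list(c):
                    c[key] -= 1
                    if c[key] == 0:
                        del c[key]

    def estimate(item):
        return max(tables[r][hash((salts[r], item)) % buckets].get(item, 0)
                   for r in range(t))
    return estimate
```
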
Count Sketches
Reduces the error to ε·[∑_{i=k+1}^{n} f²(i)]^{1/2}.
Charikar, Chen, Farach-Colton. Finding frequent items in data streams. 2002
Count Sketches
While the space increases slightly to O(log(n/δ)/min(ε², 1/k)).
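The Count sketch of Charikar, Chen and Farach-Colton adds a random ±1 sign per item and takes a median over rows; a compact sketch under the same caveat as before (salted built-in `hash` standing in for the pairwise-independent hash and sign families, and the row width set to 1/ε² for simplicity):

```python
import math
import random

class CountSketch:
    """Count sketch: each row hashes items to buckets and to a random
    sign; the row estimate is sign(i) * bucket value, and the final
    estimate is the median over O(log(n/delta)) rows."""
    def __init__(self, eps, delta, n, seed=0):
        self.w = max(1, math.ceil(1 / eps ** 2))
        self.d = max(1, math.ceil(math.log2(n / delta)))
        rng = random.Random(seed)
        self.salts = [(rng.random(), rng.random()) for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, r, item):
        return hash((self.salts[r][0], item)) % self.w

    def _sign(self, r, item):
        return 1 if hash((self.salts[r][1], item)) % 2 == 0 else -1

    def update(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._bucket(r, item)] += self._sign(r, item) * count

    def query(self, item):
        # signs make collisions cancel in expectation, so each row is
        # unbiased; the median over rows concentrates the estimate
        ests = sorted(self._sign(r, item) * self.table[r][self._bucket(r, item)]
                      for r in range(self.d))
        return ests[len(ests) // 2]
```

The signed updates are what turn the error from an ℓ1 bound on the tail into the ℓ2 bound above.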
Item frequency estimation
Table: Recap of the six algorithms presented. All quantities are given in big-O notation.

Algorithm          | Space                 | Update   | Query    | Approximation               | Pr[fail]
-------------------|-----------------------|----------|----------|-----------------------------|---------
Naive              | n                     | 1        | 1        | 0                           | 0
Sampling           | log(n/δ)/ε²           | 1        | 1        | ε·∑_{i=1}^n f(i)            | δ
Count Min Sketches | log(n/δ)/ε            | log(n/δ) | log(n/δ) | ε·∑_{i=1}^n f(i)            | δ
Lossy Counting     | 1/ε                   | 1        | 1        | ε·∑_{i=1}^n f(i)            | 0
Count Max Sketches | log(n/δ)/min(ε,1/k)   | log(n/δ) | log(n/δ) | ε·∑_{i=k+1}^n f(i)          | δ
Count Sketches     | log(n/δ)/min(ε²,1/k)  | log(n/δ) | log(n/δ) | ε·[∑_{i=k+1}^n f²(i)]^{1/2} | δ

See also: Berinde, Indyk, Cormode, Strauss. Space-optimal heavy hitters with strong error bounds, PODS 2009.
Survey at: Gibbons, Matias. External memory algorithms, 1999.
Streaming Data Mining
1. Items: item frequencies, distinct elements, moment estimation
2. Vectors: dimensionality reduction, k-means, linear regression
3. Matrices: efficiently approximating the covariance matrix, sparsification by sampling
Distinct elements
server
to: cs.princeton.edu
from: 18.9.22.69
packet
Addresses seen:
18.9.22.69
Distinct elements
server
to: cs.princeton.edu
from: 69.172.200.24
packet
Addresses seen:
18.9.22.69
69.172.200.24
Distinct elements
server
to: cs.princeton.edu
from: 18.9.22.69
packet
Addresses seen:
18.9.22.69
69.172.200.24
Distinct elements
server
to: cs.princeton.edu
from: 106.10.165.51
packet
Addresses seen:
18.9.22.69
69.172.200.24
106.10.165.51
Distinct elements
server
to: cs.princeton.edu
from: . . .
packet
Addresses seen:
18.9.22.69
69.172.200.24
106.10.165.51
. . .
Goal: Count the number of distinct IP addresses that contacted the server.
Edo Liberty , Jelani Nelson : Streaming Data Mining 49 / 111
![Page 61: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/61.jpg)
Distinct elements
server
to: cs.princeton.edu
from: . . .
packet————————————————————————————————————
Addresses seen:
18.9.22.69
69.172.200.24
106.10.165.51
. . .
Goal: Count number of dis-tinct IP addresses that con-tacted server.
Edo Liberty , Jelani Nelson : Streaming Data Mining 49 / 111
Distinct elements

Two obvious solutions:

Store a bitvector x ∈ {0, 1}^(2^128) (xi = 1 if we've seen address i)

Store a hash table: O(N) words of memory

In general: a sequence of N integers, each in {1, . . . , n}. We can use either O(N log n) bits of memory or n bits.

But we can do better!
Distinct elements

KMV algorithm [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan, RANDOM 2002]
also see [Beyer, Gemulla, Haas, Reinwald, Sismanis, Commun. ACM 52(10), 2009]
(the first small-space algorithm published is due to [Flajolet, Martin, FOCS 1983])

Guarantee: Let F0 be the number of distinct elements. The algorithm outputs a value F̂0 such that, with probability at least 2/3, |F̂0 − F0| ≤ εF0.

KMV algorithm
1 Pick a random hash function h : [n] → [0, 1]
2 Maintain the k = Θ(1/ε²) smallest distinct hash values seen in the stream, X1 < X2 < . . . < Xk
3 If fewer than k distinct hash values were seen by the end of the stream, output the number of distinct hash values seen; else output k/Xk
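The three steps above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the function name `kmv_estimate` is ours, and the random hash h : [n] → [0, 1] is simulated by lazily assigning each distinct item an independent uniform value.

```python
import random

def kmv_estimate(stream, k, seed=0):
    """Toy KMV sketch: keep the k smallest distinct hash values.

    The random hash h is simulated by assigning each distinct item an
    independent uniform value in [0, 1) the first time it appears."""
    rng = random.Random(seed)
    hashes = {}   # simulated h: item -> uniform hash value
    kmv = []      # sorted list of the <= k smallest distinct hash values
    for item in stream:
        v = hashes.setdefault(item, rng.random())
        if v in kmv:            # already one of the stored minima
            continue
        if len(kmv) < k:        # still collecting the first k values
            kmv.append(v)
            kmv.sort()
        elif v < kmv[-1]:       # evict the current kth smallest
            kmv[-1] = v
            kmv.sort()
    if len(kmv) < k:            # fewer than k distinct hash values seen
        return len(kmv)
    return k / kmv[-1]          # k / X_k
```

When k exceeds the number of distinct items the estimate is exact, since the sketch then simply counts distinct hash values.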
Distinct elements — KMV algorithm example

k = 3; we keep the k minimum (hash) values (i.e. the "kmv")

Stream: 5 1 5 2 7 1 3 8 4 6

Hash values as the items arrive:
h(5) = 0.00239167
h(1) = 0.973811
h(2) = 0.0929362
h(7) = 0.425028
h(3) = 0.770643
h(8) = 0.223476
h(4) = 0.204447
h(6) = 0.88464

The three smallest distinct hash values at the end of the stream are X1 = 0.00239167, X2 = 0.0929362, X3 = 0.204447.

Output k/Xk = 3/0.204447 = 14.6737. Note: the true answer is 8.
Distinct elements — Why does KMV work?

Question: Suppose we pick t random numbers X1, . . . , Xt in the range [0, 1]. What do we expect the kth smallest Xi to be on average?

Answer: k/(t + 1)

[Figure: t = 3 points on the segment [0, 1] split it into four gaps, each of expected length 1/4.]

We expect the t random numbers to be evenly spaced when sorted from smallest to largest, so the kth smallest is expected to be k/(t + 1).

For us t = F0, so if things go exactly according to expectation, then k/Xk = F0 + 1. Of course, things don't always go exactly according to expectation.
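This expectation is easy to check empirically. The helper below (our own, for illustration) averages the kth smallest of t uniform draws over many trials:

```python
import random

def mean_kth_smallest(t, k, trials=20000, seed=0):
    """Empirical mean of the kth smallest of t uniform draws on [0, 1].

    Should be close to k / (t + 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        draws = sorted(rng.random() for _ in range(t))
        total += draws[k - 1]   # kth smallest (1-indexed)
    return total / trials
```

For t = 3 and k = 2 the average hovers around 2/4 = 0.5, matching k/(t + 1).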
Distinct elements — KMV analysis

Assume F0 > k. Define good events:

Event E1: fewer than k elements hash below k/(F0(1 + ε))
Event E2: at least k elements hash below k/(F0(1 − ε))

As long as both E1 and E2 happen, (1 − ε)F0 ≤ F̂0 ≤ (1 + ε)F0, as we want.

What's Pr[¬E1]?
Let Yi be the indicator random variable for the event h(ith item) ≤ k/(F0(1 + ε)), and let Y = Σ_{i=1}^{F0} Yi be the number of items hashing below the threshold.
EY = Σ_{i=1}^{F0} EYi = k/(1 + ε)
Var[Y] = Σ_{i=1}^{F0} Var[Yi] ≤ k/(1 + ε)
Chebyshev: Pr(¬E1) = Pr(Y ≥ k) ≤ Var[Y]/(k − EY)² ≤ (1 + ε)/(ε²k)
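To get a feel for how conservative the Chebyshev bound is, one can simulate the failure event directly. This is a small illustrative experiment (function name and parameter values are ours):

```python
import random

def prob_not_E1(F0, k, eps, trials=5000, seed=0):
    """Empirical probability that at least k of F0 uniform hash values
    fall below the threshold k / (F0 * (1 + eps)), i.e. that E1 fails."""
    rng = random.Random(seed)
    threshold = k / (F0 * (1 + eps))
    failures = 0
    for _ in range(trials):
        below = sum(rng.random() < threshold for _ in range(F0))
        if below >= k:
            failures += 1
    return failures / trials
```

With F0 = 1000, k = 100, ε = 0.5, the Chebyshev bound gives (1 + ε)/(ε²k) = 0.06, while the empirical failure rate is far smaller.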
Distinct elements

See algorithms with even better space performance in:

[Durand, Flajolet, 2003] (with implementation!)
[Flajolet, Fusy, Gandouet, Meunier, Disc. Math. and Theor. Comp. Sci., 2007] (with implementation!)
[Kane, N., Woodruff, 2010]
Streaming Data Mining

1 Items
Item frequencies
Distinct elements
Moment estimation

2 Vectors
Dimensionality reduction
k-means
Linear Regression

3 Matrices
Efficiently approximating the covariance matrix
Sparsification by sampling
Moment estimation

Problem: Compute a value F̂p which, with probability 2/3, lies in the interval [(1 − ε)Fp, (1 + ε)Fp], where

Fp = ‖f‖_p^p = Σ_{i=1}^{n} |f(i)|^p

p = 0: Distinct elements
p = ∞: Most frequent item (actually "F∞^(1/∞)", or ‖f‖∞)
larger p: Closer approximation to F∞

Unfortunately, it is known that poly(n) space is required for p > 2.
[Bar-Yossef, Jayram, Kumar, Sivakumar, JCSS 68(4), 2004]
[Chakrabarti, Khot, Sun, CCC 2003]
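For reference, the moments being estimated are easy to compute exactly when memory is no concern. The helper below (ours, for illustration) uses O(#distinct items) space, which is exactly what the streaming estimators avoid:

```python
from collections import Counter

def moment(stream, p):
    """Exact p-th frequency moment F_p = sum_i |f(i)|^p."""
    freqs = Counter(stream)                 # f(i) for each distinct item i
    return sum(abs(c) ** p for c in freqs.values())
```

For p = 0 each nonzero frequency contributes |f(i)|⁰ = 1, so the result is the number of distinct elements, consistent with the earlier section.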
Moment estimation — p = 2

A look at p = 2

p = 2 is as close as we can get to ‖f‖∞ while not taking polynomial space ("is there an outlier?")

Linear sketches give a way to estimate dot products (read: cosine similarity)

Suppose x, y are unit vectors and (‖z‖₂²)~ denotes a sketch estimate lying in (1 ± ε)‖z‖₂².

Recall ‖x − y‖₂² = ‖x‖₂² + ‖y‖₂² − 2⟨x, y⟩,

so ½ · (‖x‖₂² + ‖y‖₂² − (‖x − y‖₂²)~) = ⟨x, y⟩ ± 2ε

(for unit vectors ‖x − y‖₂² ≤ 4, so an ε-relative error on it becomes at most a 2ε additive error on ⟨x, y⟩).
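This identity works with any linear L2 sketch. As a stand-in (not the sketch from these slides), the toy below uses a random ±1 projection; by linearity, the sketch of x − y is just the difference of the sketches, so ⟨x, y⟩ can be recovered from sketched squared norms alone:

```python
import math
import random

def sketch(v, m, seed=42):
    """Linear sketch of v: m random +/-1 projections, scaled by 1/sqrt(m).

    A simple stand-in for any linear L2 sketch; the same seed must be
    used for every vector so that sketching is a fixed linear map."""
    rng = random.Random(seed)
    rows = [[rng.choice((-1.0, 1.0)) for _ in v] for _ in range(m)]
    return [sum(r_i * v_i for r_i, v_i in zip(row, v)) / math.sqrt(m)
            for row in rows]

def sq_norm(s):
    return sum(a * a for a in s)

def dot_estimate(sx, sy):
    """Estimate <x, y> via 1/2 (||x||^2 + ||y||^2 - ||x - y||^2)."""
    s_diff = [a - b for a, b in zip(sx, sy)]  # sketch of x - y, by linearity
    return 0.5 * (sq_norm(sx) + sq_norm(sy) - sq_norm(s_diff))
```

For two unit vectors with ⟨x, y⟩ = 1/√2, a sketch of a couple thousand rows already lands close to the true dot product.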
Moment estimation — p = 2

TZ sketch [Thorup, Zhang, SODA 2004]
also see [Charikar, Chen, Farach-Colton, ICALP 2002]
(the first published small-space algorithm is due to [Alon, Matias, Szegedy, STOC 1996])

TZ sketch
Initialize counters A₁, . . . , A_k to 0 for k = Θ(1/ε²)
Pick random hash functions h : [n] → [k] and σ : [n] → {−1, +1}
Upon update f_i ← f_i + v, add σ(i) · v to A_{h(i)}
Output Σ_{j=1}^k A_j²

Just O(1/ε²) words of space and constant update time!

Edo Liberty , Jelani Nelson : Streaming Data Mining 59 / 111
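The update and output rules above translate directly to code. This is a minimal sketch of the idea: the hash functions are modeled as lazily filled random tables (a production implementation would use fast tabulation hashing instead), and the class and parameter names are my own.

```python
import random

class TZSketch:
    """Estimate the second moment ||f||_2^2 of a frequency vector
    under a stream of updates f_i <- f_i + v (v may be negative)."""

    def __init__(self, eps, seed=0):
        self.k = int(3 / eps ** 2)  # k = Theta(1/eps^2) counters
        self.A = [0.0] * self.k
        self.rng = random.Random(seed)
        self.h = {}      # h : [n] -> [k]
        self.sigma = {}  # sigma : [n] -> {-1, +1}

    def _tables(self, i):
        # lazily draw h(i) and sigma(i) the first time item i appears
        if i not in self.h:
            self.h[i] = self.rng.randrange(self.k)
            self.sigma[i] = self.rng.choice((-1, 1))
        return self.h[i], self.sigma[i]

    def update(self, i, v=1):
        hi, si = self._tables(i)
        self.A[hi] += si * v  # add sigma(i) * v to A_{h(i)}

    def estimate(self):
        return sum(a * a for a in self.A)  # output sum_j A_j^2
```

After `sk = TZSketch(0.1)` and a stream of `sk.update(i, v)` calls, `sk.estimate()` is within (1 ± 0.1)‖f‖₂² with constant probability, using k = 300 counters regardless of the stream length.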
Moment estimation — TZ sketch

n = 10, k = 5

h values: h(1) = 4, h(2) = 1, h(3) = 3, h(4) = 1, h(5) = 3, h(6) = 4, h(7) = 4, h(8) = 4, h(9) = 2, h(10) = 4

σ values: σ(1) = +1, σ(2) = +1, σ(3) = +1, σ(4) = −1, σ(5) = +1, σ(6) = +1, σ(7) = −1, σ(8) = +1, σ(9) = +1, σ(10) = +1

Stream of updates; each update f_i ← f_i + v adds σ(i) · v to A_{h(i)}:

f₁ ← f₁ + 1:  A₄ ← +1,  so A = (0, 0, 0, 1, 0)
f₅ ← f₅ − 4:  A₃ ← −4,  so A = (0, 0, −4, 1, 0)
f₁ ← f₁ − 5:  A₄ ← −4,  so A = (0, 0, −4, −4, 0)
f₇ ← f₇ + 2:  A₄ ← −6,  so A = (0, 0, −4, −6, 0)

Final frequency vector: f = (−4, 0, 0, 0, −4, 0, 2, 0, 0, 0)

Output: 0² + 0² + (−4)² + (−6)² + 0² = 52
(note ‖f‖₂² = (−4)² + (−4)² + 2² = 36)

Edo Liberty , Jelani Nelson : Streaming Data Mining 60 / 111
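The worked example above can be replayed in a few lines, with the h and σ tables fixed to the values on the slide:

```python
# h and sigma tables exactly as on the slide (items are 1-indexed)
h = {1: 4, 2: 1, 3: 3, 4: 1, 5: 3, 6: 4, 7: 4, 8: 4, 9: 2, 10: 4}
sigma = {1: +1, 2: +1, 3: +1, 4: -1, 5: +1, 6: +1, 7: -1, 8: +1, 9: +1, 10: +1}
stream = [(1, +1), (5, -4), (1, -5), (7, +2)]  # updates f_i <- f_i + v

n, k = 10, 5
A = [0] * k
f = [0] * (n + 1)  # 1-indexed true frequency vector, kept only for comparison
for i, v in stream:
    A[h[i] - 1] += sigma[i] * v  # add sigma(i) * v to A_{h(i)}
    f[i] += v

estimate = sum(a * a for a in A)  # 0^2 + 0^2 + (-4)^2 + (-6)^2 + 0^2 = 52
true_f2 = sum(x * x for x in f)   # (-4)^2 + (-4)^2 + 2^2 = 36
print(A, estimate, true_f2)
```

With only k = 5 counters and several colliding items (1, 7 both hash to bucket 4), the estimate 52 is a rough but same-order approximation of the true value 36; the analysis on the next slides shows how k = Θ(1/ε²) makes this accurate.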
Moment estimation — TZ sketch analysis

Why does it work?

Let δ_{i,r} be an indicator random variable for the event h(i) = r

A_r = Σ_{i=1}^n δ_{i,r} σ(i) f(i)

⇒ A_r² = Σ_{i=1}^n δ_{i,r} f(i)² + Σ_{i≠j} δ_{i,r} δ_{j,r} σ(i) σ(j) f(i) f(j)

⇒ E A_r² = Σ_{i=1}^n (E δ_{i,r}) f(i)² + Σ_{i≠j} (E δ_{i,r} δ_{j,r}) (E σ(i)) (E σ(j)) f(i) f(j)

⇒ E A_r² = ‖f‖₂² / k    (since E δ_{i,r} = 1/k and the cross terms vanish: E σ(i) = 0)

⇒ E Σ_{r=1}^k A_r² = Σ_{r=1}^k E A_r² = ‖f‖₂²

Edo Liberty , Jelani Nelson : Streaming Data Mining 61 / 111
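The conclusion E Σᵣ Aᵣ² = ‖f‖₂² can be checked by Monte Carlo: draw many independent sketches of a fixed f (the vector below is an arbitrary choice of mine) and compare the empirical mean to ‖f‖₂².

```python
import random

rng = random.Random(42)
f = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
true_f2 = sum(x * x for x in f)  # ||f||_2^2 = 207

def sketch_estimate(k):
    # one independent draw of the sketch: fresh random h and sigma
    A = [0.0] * k
    for fi in f:
        A[rng.randrange(k)] += rng.choice((-1, 1)) * fi
    return sum(a * a for a in A)

trials = 20000
mean = sum(sketch_estimate(5) for _ in range(trials)) / trials
print(true_f2, round(mean, 1))  # the empirical mean is close to 207
```

Even with only k = 5 counters each individual estimate is noisy, yet the average over many draws sits on top of ‖f‖₂², exactly as the unbiasedness calculation predicts.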
Moment estimation — TZ sketch analysis

The estimator is unbiased . . . what about the variance?

Var(‖A‖₂²) = E(‖A‖₂² − E‖A‖₂²)² = E‖A‖₂⁴ − ‖f‖₂⁴

After some calculations I’ll omit:

Var(‖A‖₂²) = Σ_{r=1}^k Σ_{i≠j} (E δ_{i,r} δ_{j,r}) f_i² f_j² ≤ Σ_{r=1}^k (1/k²) (‖f‖₂²)² = ‖f‖₂⁴ / k

Chebyshev: Pr(|‖A‖₂² − E‖A‖₂²| > ε‖f‖₂²) < Var(‖A‖₂²) / (ε²‖f‖₂⁴)

So, we can set k = 3/ε² to get error probability 1/3

Edo Liberty , Jelani Nelson : Streaming Data Mining 62 / 111
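The k = Θ(1/ε²) trade-off can also be seen empirically: quadrupling k (i.e. halving ε) should roughly halve the typical relative error, since the standard deviation scales like 1/√k. A quick simulation (the vector f and trial counts are illustrative):

```python
import random

rng = random.Random(7)
f = [3, -1, 4, 1, -5, 9, 2, -6]
f2 = sum(x * x for x in f)  # ||f||_2^2 = 173

def sketch_estimate(k):
    # one independent draw of the sketch: fresh random h and sigma
    A = [0.0] * k
    for fi in f:
        A[rng.randrange(k)] += rng.choice((-1, 1)) * fi
    return sum(a * a for a in A)

def mean_rel_err(k, trials=20000):
    return sum(abs(sketch_estimate(k) - f2) / f2 for _ in range(trials)) / trials

err_coarse = mean_rel_err(12)  # eps = 0.5  -> k = 3/eps^2 = 12
err_fine = mean_rel_err(48)    # eps = 0.25 -> k = 3/eps^2 = 48
print(round(err_coarse, 3), round(err_fine, 3))
```

The finer sketch is consistently more accurate, matching the Var(‖A‖₂²) ≤ ‖f‖₂⁴/k bound on the slide.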
Streaming Data Mining

1 Items
    Item frequencies
    Distinct elements
    Moment estimation

2 Vectors
    Dimensionality reduction
    k-means
    Linear Regression

3 Matrices
    Efficiently approximating the covariance matrix
    Sparsification by sampling

Edo Liberty , Jelani Nelson : Streaming Data Mining 63 / 111
Dimensionality reduction

Many problems, as one would expect, become computationally harder as the dimensionality of the underlying input data grows

Nearest neighbor search (exponential preprocessing time to get sublinear query time)

Geometric optimization problems: closest pair, diameter, minimum spanning tree, . . .

Linear algebra problems: regression, low-rank approximation

Clustering

How can we reduce dimensionality in such a way that we can still (approximately) solve the problems above?

Edo Liberty , Jelani Nelson : Streaming Data Mining 64 / 111
Dimensionality reduction

Good news when the underlying distance metric is ‖ · ‖₂

Theorem (Johnson–Lindenstrauss (JL) lemma, 1984)
For every 0 < ε ≤ 1/2 and set of points x₁, . . . , x_N ∈ ℝⁿ, there exists a matrix A ∈ ℝ^{m×n} for m = O(ε⁻² log N) such that
∀ i ≠ j, (1 − ε)‖xᵢ − xⱼ‖₂ ≤ ‖Axᵢ − Axⱼ‖₂ ≤ (1 + ε)‖xᵢ − xⱼ‖₂

Bad news when the underlying distance metric is not ‖ · ‖₂
Work of [Naor, Johnson, SODA 2009] shows that ‖ · ‖₂ (or metric spaces “close” to it) are the only spaces where we could hope to embed into O(log N) dimensions with any constant distortion guarantee.

Edo Liberty , Jelani Nelson : Streaming Data Mining 65 / 111
Dimensionality reduction

How do you prove the JL lemma?

Theorem (Johnson–Lindenstrauss (JL) lemma, 1984)
For every 0 < ε ≤ 1/2 and set of points x₁, . . . , x_N ∈ ℝⁿ, there exists a matrix A ∈ ℝ^{m×n} for m = O(ε⁻² log N) such that
∀ i ≠ j, (1 − ε)‖xᵢ − xⱼ‖₂ ≤ ‖Axᵢ − Axⱼ‖₂ ≤ (1 + ε)‖xᵢ − xⱼ‖₂

Use the Distributional JL (DJL) lemma

Theorem
For every 0 < ε, δ ≤ 1/2 there exists a distribution D_{ε,δ} over ℝ^{m×n} for m = O(ε⁻² log(1/δ)) such that for any x ∈ ℝⁿ with ‖x‖₂ = 1
Pr_{A∼D_{ε,δ}} ( |‖Ax‖₂² − 1| > ε ) < δ

Proof of JL via DJL: Set δ = 1/N², so that each (xᵢ − xⱼ)/‖xᵢ − xⱼ‖₂ is preserved with probability 1 − 1/N². Union bound over all (N choose 2) pairs i < j.

Edo Liberty , Jelani Nelson : Streaming Data Mining 66 / 111
Dimensionality reduction

Can choose A to have random Gaussian entries [Indyk, Motwani, STOC 1998]
also see [Dasgupta, Gupta, Rand. Struct. Alg. 22(1), 2003]

[figure: y = (1/√m) · A · x, where each entry of A has density function p(x) = (1/√(2π)) e^{−x²/2}]

Can choose m = (4 + o(1)) ε⁻² ln N

Edo Liberty , Jelani Nelson : Streaming Data Mining 67 / 111
Dimensionality reduction

Can choose A to have random Gaussian entries

Sketch of why it works:

Recall, we want Pr(|‖Ax‖₂² − 1| > ε) < δ for all x with unit norm.

We have ‖Ax‖₂² − 1 = xᵀAᵀAx − 1

xᵀAᵀAx = (1/m) · Σ_{r=1}^m Σ_{i,j} g_{r,i} g_{r,j} xᵢ xⱼ. Let g = (g_{1,1}, . . . , g_{1,n}, g_{2,1}, . . .) be the vector of rows of A concatenated.

This is (1/m) · gᵀBg where B is a block-diagonal matrix with m blocks, each equaling xxᵀ. Orthogonal change of basis!

gᵀBg = gᵀQᵀΛQg = (Qg)ᵀΛ(Qg) = g′ᵀΛg′ = Σ_{i=1}^m g′ᵢ², where g′ = Qg is again a vector of i.i.d. Gaussians by rotation invariance. Thus we just want to show (1/m) · Σ_{i=1}^m (g′ᵢ² − 1) is small with high probability. Can use statistics about the chi-squared distribution.

Edo Liberty , Jelani Nelson : Streaming Data Mining 68 / 111
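The last step leans on chi-squared concentration: for g′₁, . . . , g′_m i.i.d. standard Gaussians, (1/m) Σ g′ᵢ² has mean 1 and standard deviation √(2/m). A quick empirical look (m and the sample count are arbitrary choices):

```python
import random
import statistics

rng = random.Random(1)
m = 400
# 200 independent samples of (1/m) * sum_{i=1}^m g_i^2
samples = [sum(rng.gauss(0, 1) ** 2 for _ in range(m)) / m for _ in range(200)]
print(round(statistics.mean(samples), 3))   # close to 1
print(round(statistics.stdev(samples), 3))  # close to sqrt(2/m) ~ 0.071
```

As m grows the fluctuation shrinks like 1/√m, which is exactly why taking m = O(ε⁻² log(1/δ)) rows drives the deviation below ε with probability 1 − δ.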
Dimensionality reduction

Another perhaps useful fact . . .

[Klartag, Mendelson, J. Funct. Anal. 225(1), 2005]: Don't need m = Ω((log N)/ε²) all the time; rather, one can take m = O(γ₂²(X)/ε²) where X = {x_i − x_j}_{i≠j}. Here γ₂(X) = E sup_{x∈X} ⟨g, x⟩ for a Gaussian vector g.

Bottom line: Given your data, you can estimate γ₂ by picking a few independent Gaussian vectors and taking the empirical mean of sup_{x∈X} ⟨g, x⟩. Might improve dimensionality reduction on your data.

Caveat: Takes Ω(N²) time to estimate γ₂ (X is the set of pairwise differences), so it probably only makes sense when n ≪ N (note: Gram-Schmidt is O(N²n), so assuming n < N is expensive).
Edo Liberty , Jelani Nelson : Streaming Data Mining 69 / 111
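The "bottom line" above can be read as code (a toy plain-Python sketch; the random point set and the number of Gaussian draws are made up, and the quantity computed is the Gaussian-width expression E sup_{x∈X} ⟨g, x⟩, which stands in for γ₂ up to constants). Note the O(N²) loop over pairwise differences, exactly the caveat on the slide:

```python
import random

def estimate_gaussian_width(points, trials, rng):
    """Empirical mean over Gaussian g of sup over pairwise differences
    x_i - x_j of <g, x_i - x_j>. The inner double loop is the Omega(N^2) cost."""
    n = len(points[0])
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = 0.0
        for i, xi in enumerate(points):
            for j, xj in enumerate(points):
                if i == j:
                    continue
                ip = sum(gk * (a - b) for gk, a, b in zip(g, xi, xj))
                best = max(best, ip)
        total += best
    return total / trials

rng = random.Random(1)
pts = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(20)]
gamma2_est = estimate_gaussian_width(pts, trials=25, rng=rng)
```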
Dimensionality reduction

Can also choose A to have random sign entries [Achlioptas, PODS 2001]

y = (1/√m) · A · x, where A is an m × n matrix of independent uniform ±1 entries, e.g.

+ − + + + + + − + +
+ + − − + − + − − +
− − − + + + − + − +
+ + + + + + − − − −
+ − + − − − − − − −

(compare with the Gaussian construction, whose entries have density function p(x) = (1/√(2π)) e^{−x²/2})

Can choose m = (4 + o(1)) ε⁻² ln N
Edo Liberty , Jelani Nelson : Streaming Data Mining 70 / 111
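A minimal plain-Python sketch of the sign-matrix embedding applied to a whole point set (the sizes N, n, m below are arbitrary; m is oversized relative to the (4 + o(1))ε⁻² ln N bound to make the check robust). One shared matrix preserves all pairwise distances up to a modest factor:

```python
import math
import random

rng = random.Random(3)
N, n, m = 10, 40, 400  # m generously above 4 * eps^-2 * ln N for eps ~ 0.5

# One shared sign matrix (1/sqrt(m)) * {+-1}^(m x n) for all N points.
A = [[rng.choice((-1.0, 1.0)) / math.sqrt(m) for _ in range(n)] for _ in range(m)]
pts = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]

def embed(x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

emb = [embed(p) for p in pts]

# Every pairwise squared distance should be preserved up to a modest factor.
ratios = [dist2(emb[i], emb[j]) / dist2(pts[i], pts[j])
          for i in range(N) for j in range(i + 1, N)]
```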
Dimensionality reduction

Downside of the last two constructions: the preprocessing (performing the embedding) is a dense matrix-vector multiplication: O(mn) time

Achlioptas actually showed that we can take a random sign matrix with 2/3rds of its entries zero (sparser, so faster)

Then there was the Fast Johnson-Lindenstrauss Transform* (FJLT) [Ailon, Chazelle, SIAM J. Comput. 39(1), 2009]:

y = (1/√m) · S · H · D · x

where S is a random-sampling matrix (each of its m rows has a single 1 in a random position), H is the Hadamard (or Fourier) matrix, and D is a diagonal matrix of random ±1 signs applied to x.

* The actual Ailon-Chazelle construction does something a tad better than the sampling matrix (see their paper).
Edo Liberty , Jelani Nelson : Streaming Data Mining 71 / 111
Dimensionality reduction — FJLT

y = (1/√m) S H D x

Recall H is the matrix with H_{i,j} = (−1)^(⟨i, j⟩ mod 2), where ⟨i, j⟩ is the inner product of the binary representations of i and j. Actually we just need any matrix whose entries are bounded in magnitude by 1 and that is orthogonal when normalized by 1/√n.

Why does this work?

Randomly sampling entries of x has the correct ℓ₂² expectation but poor variance (e.g. when x has exactly one non-zero coordinate)

Can show ‖HDx‖∞ ≤ √(log(nN)/n) ≈ √(log(N)/n) w.p. 1 − 1/N²

Conditioned on the above item, apply the Chernoff bound to say sampling by S works with high probability as long as we have enough samples.

Unfortunately this requires m = Θ(ε⁻² log² N) samples (an extra log N factor). Can fix by finishing off with a slow matrix (e.g. Gaussian or sign) for an additional O(mε⁻² log N) time. Total time is O(n log n + ε⁻⁴ log³ N).
Edo Liberty , Jelani Nelson : Streaming Data Mining 72 / 111
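The SHD pipeline can be sketched with a plain-Python fast Walsh-Hadamard transform (a toy illustration with arbitrary sizes, not the Ailon-Chazelle code; here sampling is with replacement, H is the unnormalized ±1 Hadamard matrix, and the final scaling is chosen so that E‖y‖₂² = ‖x‖₂²):

```python
import math
import random

def fwht(a):
    """Fast Walsh-Hadamard transform of a length-2^k list in O(n log n)
    (unnormalized: the implicit matrix has +-1 entries and H^T H = n I)."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                u, v = a[j], a[j + h]
                a[j], a[j + h] = u + v, u - v
        h *= 2
    return a

rng = random.Random(4)
n, m = 64, 2048  # n must be a power of two for the Hadamard transform

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
norm = math.sqrt(sum(v * v for v in x))
x = [v / norm for v in x]  # unit norm, as in the analysis

d = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # D: random diagonal signs
h = fwht([di * xi for di, xi in zip(d, x)])      # HDx: mass is spread out
rows = [rng.randrange(n) for _ in range(m)]      # S: m uniform row samples
y = [h[r] / math.sqrt(m) for r in rows]          # (1/sqrt(m)) S H D x

est = sum(v * v for v in y)  # E[est] = ||x||_2^2 = 1
```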
Dimensionality reduction — FJLT

Improvements to the original FJLT since Ailon and Chazelle's work:

[Ailon, Liberty, SODA 2008]: m = O(ε⁻² log N) in O(n log n + m^{2+γ}) time

[Ailon, Liberty, SODA 2011], [Krahmer, Ward, SIAM J. Math. Anal. 43(3), 2011]: m = O(ε⁻² log N log⁴ n) in O(n log n) time (faster for huge N). The construction is actually what we just saw ((1/√m)SHD), but with a much different-looking analysis (doesn't use DJL).

It's conceivable that the (1/√m)SHD construction actually achieves the optimal m = O(ε⁻² log N), but we just don't know how to prove it yet. So you can try smaller m, but buyer beware.
Edo Liberty , Jelani Nelson : Streaming Data Mining 73 / 111
Dimensionality reduction

One downside to FJLT: it kills sparsity

It works by randomly spreading mass around to decrease the variance of sampling. It takes O(n log n) time, but dense matrices (Gaussian or sign) take only O(m · ‖x‖₀).

Who cares about sparsity?
Edo Liberty , Jelani Nelson : Streaming Data Mining 74 / 111
Dimensionality reduction

Who cares about sparsity? You do.

Document as bag of words: x_i = number of occurrences of word i. Compare documents using cosine similarity. d = lexicon size; most documents aren't dictionaries.

Network traffic: x_{i,j} = #bytes sent from i to j. d = 2⁶⁴ (2²⁵⁶ in IPv6); most servers don't talk to each other.

User ratings: x_i is a user's score for movie i on Netflix. d = #movies; most people haven't rated all movies.

Streaming: x receives a stream of updates of the form "add v to x_i". Maintaining Sx requires calculating v · Se_i.
. . .
Edo Liberty , Jelani Nelson : Streaming Data Mining 75 / 111
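The streaming bullet above can be made concrete: maintaining Sx under an update "add v to x_i" just means adding v · Se_i, i.e. v times the i-th column of S. A plain-Python sketch (a small dense sign matrix stands in for S; the sizes and update count are arbitrary):

```python
import math
import random

rng = random.Random(5)
n, m = 100, 20

# A fixed sketching matrix S (here: random signs scaled by 1/sqrt(m)).
S = [[rng.choice((-1.0, 1.0)) / math.sqrt(m) for _ in range(n)] for _ in range(m)]

x = [0.0] * n  # the (implicit) stream vector, kept only to verify the sketch
y = [0.0] * m  # the maintained sketch Sx

def update(i, v):
    """Process stream update 'add v to x_i': y += v * S e_i (the i-th column of S)."""
    x[i] += v
    for r in range(m):
        y[r] += v * S[r][i]

for _ in range(1000):
    update(rng.randrange(n), rng.uniform(-1.0, 1.0))

# The maintained sketch matches S @ x computed from scratch.
direct = [sum(S[r][i] * x[i] for i in range(n)) for r in range(m)]
err = max(abs(a - b) for a, b in zip(y, direct))
```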
Dimensionality reduction — SparseJL

One way to embed sparse vectors faster: use sparse matrices [Kane, N., SODA 2012], building on work of [Weinberger, Dasgupta, Langford, Smola, Attenberg, ICML 2009], [Dasgupta, Kumar, Sarlos, STOC 2010]

Embedding time O(m + s‖x‖₀) = O(m + εm‖x‖₀)

(Figure: an m-row matrix with exactly s non-zero entries per column; in one variant they land anywhere in the column, in the other one entry lands in each of s blocks of m/s rows. Each black cell is ±1/√s at random.)

m = O(ε⁻² log N), s = O(ε⁻¹ log N)
Edo Liberty , Jelani Nelson : Streaming Data Mining 76 / 111
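A sketch of the sparse construction in plain Python (the per-column variant with exactly s non-zero locations; all sizes below are arbitrary choices for illustration). Zero coordinates of x cost nothing, which is the O(m + s‖x‖₀) embedding time:

```python
import math
import random

rng = random.Random(6)
n, m, s = 1000, 200, 20

# For each column i, pick exactly s distinct random rows and a sign for each;
# every non-zero entry of A equals +-1/sqrt(s).
cols = []
for _ in range(n):
    rows = rng.sample(range(m), s)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(s)]
    cols.append(list(zip(rows, signs)))

def embed(x):
    y = [0.0] * m
    for i, xi in enumerate(x):
        if xi == 0.0:  # sparse input: zero coordinates are skipped
            continue
        for r, sgn in cols[i]:
            y[r] += sgn * xi / math.sqrt(s)
    return y

# A sparse unit vector: only 50 of the 1000 coordinates are non-zero.
x = [0.0] * n
for i in rng.sample(range(n), 50):
    x[i] = rng.gauss(0.0, 1.0)
norm = math.sqrt(sum(v * v for v in x))
x = [v / norm for v in x]

y = embed(x)
est = sum(v * v for v in y)  # E[est] = ||x||_2^2 = 1
```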
Dimensionality reduction — SparseJL

Why does it work?

Let's look at the construction where each column has non-zeroes in exactly s random locations. Let δ_{r,i} be 1 if A_{r,i} ≠ 0, and let σ_{r,i} be a random sign so that A_{r,i} = δ_{r,i}σ_{r,i}. Then,

(Ax)_r = (1/√s) ∑_{i=1}^n δ_{r,i}σ_{r,i}x_i

(Ax)_r² = (1/s) ∑_{i=1}^n δ_{r,i}x_i² + (1/s) ∑_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j}x_ix_j

‖Ax‖₂² = (1/s) ∑_{r=1}^m ∑_{i=1}^n δ_{r,i}x_i² + (1/s) ∑_{r=1}^m ∑_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j}x_ix_j

= (1/s) ∑_{i=1}^n x_i² · (∑_{r=1}^m δ_{r,i}) + (1/s) ∑_{r=1}^m ∑_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j}x_ix_j

= ‖x‖₂² + (1/s) ∑_{r=1}^m ∑_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j}x_ix_j (since each column has ∑_{r=1}^m δ_{r,i} = s)

The error term above is some quadratic form σᵀBσ. Can argue that it's small with high probability using known tail bounds for quadratic forms (the "Hanson-Wright inequality").
Edo Liberty , Jelani Nelson : Streaming Data Mining 77 / 111
Streaming Data Mining
1 Items
   Item frequencies
   Distinct elements
   Moment estimation
2 Vectors
   Dimensionality reduction
   k-means
   Linear Regression
3 Matrices
   Efficiently approximating the covariance matrix
   Sparsification by sampling
Edo Liberty , Jelani Nelson : Streaming Data Mining 78 / 111
![Page 183: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/183.jpg)
k-means clustering
Input: x₁, …, x_N ∈ ℝⁿ, integer k > 1
Output: a partition of the input points into k clusters C₁, …, C_k

Goal: minimize

∑_{r=1}^k ∑_{i∈C_r} ‖x_i − μ_r‖₂²,

where μ_r is the centroid of C_r, i.e. μ_r = (1/|C_r|) ∑_{i∈C_r} x_i.
Edo Liberty , Jelani Nelson : Streaming Data Mining 79 / 111
![Page 185: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/185.jpg)
k-means clustering
Let's look at the objective function:

Minimize ∑_{r=1}^k ∑_{i∈C_r} ‖x_i − (1/|C_r|) ∑_{j∈C_r} x_j‖₂²  (∗)

Let's focus on a particular r with |C_r| > 1.

View each term in the inner sum as ⟨x_i − (1/|C_r|) ∑_{j∈C_r} x_j , x_i − (1/|C_r|) ∑_{j∈C_r} x_j⟩.

For each i ∈ C_r, the coefficient of ‖x_i‖₂² in the inner sum is 1 − 2/|C_r| + |C_r| · (1/|C_r|²) = 1 − 1/|C_r|.

For each i < j ∈ C_r, the coefficient of ⟨x_i, x_j⟩ is −2/|C_r|.

Noting that ‖x_i − x_j‖₂² = ‖x_i‖₂² + ‖x_j‖₂² − 2⟨x_i, x_j⟩, (∗) equals

∑_{r=1}^k (1/|C_r|) · ( ∑_{i<j∈C_r} ‖x_i − x_j‖₂² )

⇒ JL with O(ε⁻² log N) rows preserves optimality of k-means up to 1 ± ε
Edo Liberty , Jelani Nelson : Streaming Data Mining 80 / 111
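The identity between the centroid form and the pairwise form of the objective can be verified directly. A small numerical check (illustrative code, not from the tutorial; the data and the clustering are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))  # 20 points in R^5
clusters = [list(range(0, 7)), list(range(7, 12)), list(range(12, 20))]

# Centroid form: sum of squared distances to each cluster's mean.
obj_centroids = sum(np.sum((X[C] - X[C].mean(axis=0)) ** 2) for C in clusters)

# Pairwise form: (1/|C_r|) times the sum of squared pairwise distances in C_r.
obj_pairs = sum(
    sum(np.sum((X[i] - X[j]) ** 2) for i, j in combinations(C, 2)) / len(C)
    for C in clusters
)

print(abs(obj_centroids - obj_pairs))  # ~0 up to floating point
```

Since the pairwise form only involves distances between input points, a JL map that preserves all pairwise distances preserves the cost of every clustering.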
![Page 193: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/193.jpg)
k-means clustering
Also see [Boutsidis, Zouzias, Mahoney, Drineas, arXiv abs/1110.2897], which shows that O(k/ε²) dimensions preserve the optimal solution up to 2 ± ε
For papers on how to actually do fast k-means clustering with provable guarantees, see
[Guha, Meyerson, Mishra, Motwani, O’Callaghan, IEEE Trans. Knowl. Data Eng. 15(3), 2003]
[Har-Peled, Mazumdar, STOC 2004]
[Ostrovsky, Rabani, Schulman, Swamy, FOCS 2006]
[Arthur, Vassilvitskii, SODA 2007]
[Aggarwal, Deshpande, Kannan, APPROX 2009]
[Ailon, Jaiswal, Monteleoni, NIPS 2009]
[Jaiswal, Garg, RANDOM 2012]
. . .
Edo Liberty , Jelani Nelson : Streaming Data Mining 81 / 111
![Page 195: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/195.jpg)
Streaming Data Mining
1 Items
   Item frequencies
   Distinct elements
   Moment estimation
2 Vectors
   Dimensionality reduction
   k-means
   Linear Regression
3 Matrices
   Efficiently approximating the covariance matrix
   Sparsification by sampling
Edo Liberty , Jelani Nelson : Streaming Data Mining 82 / 111
![Page 196: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/196.jpg)
Linear regression
We want to find a linear function that best matches the data, with a quadratic penalty:

min_x ‖Sx − b‖₂  (∗)

If the rows of S are S_i, we want our linear function to have f(S_i) = b_i for all i.

The JL connection: ‖Sx − b‖₂² = ‖Sx − b_∥ − b_⊥‖₂² = ‖Sx − b_∥‖₂² + ‖b_⊥‖₂², where b_∥ is in the column span of S, and b_⊥ is in the orthogonal complement. So b_∥ = Sy_∥ for some y_∥, and Sx − b_∥ = S(x − y_∥). Thus, if we apply some JL matrix A which satisfies ‖ASz‖₂ ≈ ‖Sz‖₂ for all z, we preserve (∗). Can get away with m = O(d/ε²) [Arora, Hazan, Kale, RANDOM 2006].
Edo Liberty , Jelani Nelson : Streaming Data Mining 83 / 111
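A minimal sketch-and-solve illustration of the argument above, assuming a dense Gaussian matrix as the JL map A (an illustrative choice; the sizes N, d, m are arbitrary, not the tutorial's m = O(d/ε²) analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 5000, 10, 400  # rows, features, sketch size (arbitrary illustrative sizes)

S = rng.standard_normal((N, d))
b = S @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

# Exact least-squares solution.
x_star, *_ = np.linalg.lstsq(S, b, rcond=None)

# Sketch-and-solve: compress the rows with a Gaussian JL matrix A, then solve.
A = rng.standard_normal((m, N)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(A @ S, A @ b, rcond=None)

err_star = np.linalg.norm(S @ x_star - b)
err_sketch = np.linalg.norm(S @ x_sketch - b)
print(err_sketch / err_star)  # close to 1
```

The sketched problem has only m rows instead of N, yet its solution has near-optimal residual on the original problem.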
![Page 198: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/198.jpg)
Linear regression
For more on fast linear regression, low rank approximation, and other numerical linear algebra problems:
[Sarlos, FOCS 2006]
[Clarkson, Woodruff, STOC 2009]
[Mahoney, Randomized Algorithms for Matrices and Data, 2011]
[Halko, Martinsson, Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, 2009]
[Boutsidis, Drineas, Magdon-Ismail, arXiv abs/1202.3505]
[Clarkson, Woodruff, arXiv abs/1207.6365]
. . .
Last paper listed can do linear regression on d × n matrices in O(nnz(S) + poly(d/ε)) time!
(nnz(S) is the number of non-zero entries in S)
Edo Liberty , Jelani Nelson : Streaming Data Mining 84 / 111
![Page 199: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/199.jpg)
Streaming Data Mining
1 Items
   Item frequencies
   Distinct elements
   Moment estimation
2 Vectors
   Dimensionality reduction
   k-means
   Linear Regression
3 Matrices
   Efficiently approximating the covariance matrix
   Sparsification by sampling
Edo Liberty , Jelani Nelson : Streaming Data Mining 85 / 111
![Page 200: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/200.jpg)
Streaming Matrices
A stream of vectors can be viewed as a very large matrix A.
Edo Liberty , Jelani Nelson : Streaming Data Mining 86 / 111
![Page 201: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/201.jpg)
Streaming Matrices
What do we usually want to get from large matrices?
Low rank approximation
Singular Value Decomposition
Principal Component Analysis
Latent Dirichlet allocation
Latent Semantic Indexing
...
For the above, it is sufficient to compute AAᵀ.
Edo Liberty , Jelani Nelson : Streaming Data Mining 87 / 111
![Page 202: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/202.jpg)
Streaming Matrices
Computing AAᵀ is trivial from the stream of columns A_i:

AAᵀ = ∑_{i=1}^n A_i A_iᵀ

In words, AAᵀ is the sum of outer products of the columns of A.

Naïve solution
Time O(nd²) and space O(d²).

What if d = 10,000? Or 100,000?
Order of d² time or space is out of the question...
Edo Liberty , Jelani Nelson : Streaming Data Mining 88 / 111
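The naïve streaming computation is just a running sum of outer products. A tiny sketch (illustrative code; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 50
A = rng.standard_normal((d, n))

# Stream over the columns, accumulating the sum of outer products.
# Each update costs d*d work; total space is d*d regardless of n.
cov = np.zeros((d, d))
for i in range(n):
    col = A[:, i]
    cov += np.outer(col, col)

print(np.allclose(cov, A @ A.T))  # True
```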
![Page 205: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/205.jpg)
Matrix Column Sampling
We want to create a "sketch" B which approximates A:

1 B is as small as possible
2 Computing B is as efficient as possible
3 AAᵀ ≈ BBᵀ

More accurately: ‖AAᵀ − BBᵀ‖ ≤ ε‖A‖_F²

Amazingly enough, we can get this by sampling columns from A!
[Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations, 1998]
[Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels, 2002]
[Petros Drineas and Ravi Kannan. Pass efficient algorithms for approximating large matrices. 2003]
[Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis, 2007]
[Roberto Imbuzeiro Oliveira. Sums of random hermitian matrices and an inequality by Rudelson, 2010]
[Christos Boutsidis, Petros Drineas. and Malik Magdon-Ismail. Near optimal column-based matrix reconstruction, 2011]
Edo Liberty , Jelani Nelson : Streaming Data Mining 89 / 111
![Page 206: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/206.jpg)
Matrix Column Sampling
Let B contain ℓ independently chosen columns from A:

B_j = A_i / √(ℓ p_i)  with probability  p_i = ‖A_i‖₂² / ‖A‖_F²

To see why this makes sense, let us compute the expectation of BBᵀ:

E[B_j B_jᵀ] = ∑_{i=1}^n p_i · (A_i/√(ℓ p_i)) (A_iᵀ/√(ℓ p_i)) = (1/ℓ) ∑_{i=1}^n A_i A_iᵀ = (1/ℓ) AAᵀ

So,

E[BBᵀ] = ∑_{j=1}^ℓ E[B_j B_jᵀ] = AAᵀ

Note that BBᵀ = ∑_{j=1}^ℓ B_j B_jᵀ is a sum of independent random variables...
Edo Liberty , Jelani Nelson : Streaming Data Mining 90 / 111
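A quick numerical illustration of the scheme above (illustrative code; the matrix sizes and ℓ are arbitrary choices): sample ℓ columns with probability proportional to squared norm, rescale by 1/√(ℓ p_i), and check that BBᵀ is close to AAᵀ.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, ell = 3, 40, 200
A = rng.standard_normal((d, n))

col_norms2 = np.sum(A * A, axis=0)
p = col_norms2 / col_norms2.sum()  # p_i = ||A_i||_2^2 / ||A||_F^2

# Draw ell columns i.i.d. with probabilities p, rescaled by 1/sqrt(ell * p_i).
idx = rng.choice(n, size=ell, p=p)
B = A[:, idx] / np.sqrt(ell * p[idx])

# Relative spectral error, measured against ||A||_F^2 as in the slides.
err = np.linalg.norm(A @ A.T - B @ B.T, 2) / np.linalg.norm(A, 'fro') ** 2
print(err)  # small for large ell
```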
![Page 207: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/207.jpg)
Matrix Column Sampling
Figuring out the minimal value for ℓ is non-trivial. But it is enough to require

ℓ = c · r log(r) / ε²

where r = ‖A‖_F² / ‖A‖₂² is the numeric rank of A:

1 ≤ r ≤ Rank(A) ≤ d
r ≈ const for "interesting" matrices.

Matrix Column Sampling
We can compute B in time O(dn) (same as reading the matrix). B requires at most O(d/ε²) space, and:

‖AAᵀ − BBᵀ‖ ≤ ε‖A‖_F²
Edo Liberty , Jelani Nelson : Streaming Data Mining 91 / 111
![Page 209: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/209.jpg)
Matrix Column Sampling
A streaming procedure for matrix column sampling.

Input: ε ∈ (0, 1], A ∈ ℝ^{d×n}
  ℓ ← ⌈c/ε²⌉
  B ← all-zeros matrix ∈ ℝ^{d×ℓ}
  T ← 0
  for i ∈ [n] do
    t ← ‖A_i‖²
    T ← T + t
    for j ∈ [ℓ] do
      w.p. t/T:  B_j ← A_i / √(ℓ t/T)
Return: B

The dependency on 1/ε² is problematic for small ε.
Edo Liberty , Jelani Nelson : Streaming Data Mining 92 / 111
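The pseudocode above can be transcribed almost line for line (a sketch assuming the constant c is folded into ℓ; the correctness analysis of this one-pass reservoir-style variant is the tutorial's claim, not re-derived here):

```python
import numpy as np

def sample_columns_stream(A, ell, rng):
    """One pass over the columns of A. Each of the ell sketch slots
    independently replaces its content with column i with probability
    t/T, where t = ||A_i||^2 and T is the running total of squared
    column norms -- mirroring the slide's pseudocode."""
    d, n = A.shape
    B = np.zeros((d, ell))
    T = 0.0
    for i in range(n):
        t = float(np.sum(A[:, i] ** 2))
        T += t
        for j in range(ell):
            if rng.random() < t / T:
                B[:, j] = A[:, i] / np.sqrt(ell * t / T)
    return B

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 30))
B = sample_columns_stream(A, ell=100, rng=rng)
err = np.linalg.norm(A @ A.T - B @ B.T, 2) / np.linalg.norm(A, 'fro') ** 2
print(err)
```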
![Page 210: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/210.jpg)
Matrix Sketching
The space requirement can be reduced to O(d/ε)
Edo Liberty , Jelani Nelson : Streaming Data Mining 93 / 111
![Page 211: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/211.jpg)
Matrix Sketching
Columns are added until the sketch is ‘full’
Edo Liberty , Jelani Nelson : Streaming Data Mining 94 / 111
![Page 212: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/212.jpg)
Matrix Sketching
Then, we compute the SVD of the sketch and rotate it
Edo Liberty , Jelani Nelson : Streaming Data Mining 95 / 111
![Page 213: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/213.jpg)
Matrix Sketching
B = USVᵀ is rotated to B_new = US (note that BBᵀ = B_new B_newᵀ)
Edo Liberty , Jelani Nelson : Streaming Data Mining 96 / 111
![Page 214: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/214.jpg)
Matrix Sketching
Let ∆ = ‖B_{1/ε}‖₂², the squared norm of the last column of the rotated sketch
Edo Liberty , Jelani Nelson : Streaming Data Mining 97 / 111
![Page 215: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/215.jpg)
Matrix Sketching
Shrink all the columns such that their squared ℓ₂ norm is reduced by ∆
Edo Liberty , Jelani Nelson : Streaming Data Mining 98 / 111
![Page 216: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/216.jpg)
Matrix Sketching
Start aggregating columns again...
Edo Liberty , Jelani Nelson : Streaming Data Mining 99 / 111
![Page 217: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/217.jpg)
Matrix Sketching
Input: ε ∈ (0, 1], A ∈ ℝ^{n×m}
  ℓ ← ⌈1/ε⌉
  B ← all-zeros matrix ∈ ℝ^{ℓ×m}
  for i ∈ [n] do
    B_ℓ ← A_i
    [U, Σ, V] ← SVD(B)
    ∆ ← Σ²_{cℓ,cℓ}
    Σ ← √(max(Σ² − I_ℓ ∆, 0))
    B ← ΣVᵀ
Return: B
[Liberty, 2012]

Matrix Sketching
We can compute B in O(dn/ε) time and O(d/ε) space, and:

‖AAᵀ − BBᵀ‖ ≤ ε‖A‖_F²
Edo Liberty , Jelani Nelson : Streaming Data Mining 100 / 111
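A compact implementation of the sketching loop above, in the row-streaming convention of the pseudocode (A ∈ ℝ^{n×m}, B ∈ ℝ^{ℓ×m}). As an assumption, this variant shrinks by ∆ = Σ²_{ℓ,ℓ}, the smallest squared singular value, which zeroes the last row of B and frees it for the next update; for it, ‖AᵀA − BᵀB‖₂ ≤ ‖A‖_F²/ℓ.

```python
import numpy as np

def frequent_directions(A, ell):
    """Frequent-Directions-style sketch: stream the rows of A into an
    ell-row sketch B, shrinking singular values after each insertion."""
    n, m = A.shape
    B = np.zeros((ell, m))
    for i in range(n):
        B[-1] = A[i]                         # place the new row in the free slot
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2                   # squared smallest singular value
        s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        B = s[:, None] * Vt                  # rotate and shrink; last row is now 0
    return B

rng = np.random.default_rng(6)
A = rng.standard_normal((200, 20))
ell = 10
B = frequent_directions(A, ell)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
print(err <= np.linalg.norm(A, 'fro') ** 2 / ell)  # True
```

(A practical implementation batches updates so the SVD is computed once every ℓ rows rather than every row.)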
![Page 218: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/218.jpg)
Matrix Sketching
Example approximation of a synthetic matrix of size 1, 000× 10, 000
Edo Liberty , Jelani Nelson : Streaming Data Mining 101 / 111
![Page 219: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/219.jpg)
Streaming Data Mining
1 Items
   Item frequencies
   Distinct elements
   Moment estimation
2 Vectors
   Dimensionality reduction
   k-means
   Linear Regression
3 Matrices
   Efficiently approximating the covariance matrix
   Sparsification by sampling
Edo Liberty , Jelani Nelson : Streaming Data Mining 102 / 111
![Page 220: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/220.jpg)
Sparsification by sampling
Sometimes, we only get one matrix entry at a time...
Edo Liberty , Jelani Nelson : Streaming Data Mining 103 / 111
![Page 221: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/221.jpg)
Sparsification by sampling
Recommender system example: user i rated item j with A_{i,j} stars.
Edo Liberty , Jelani Nelson : Streaming Data Mining 104 / 111
![Page 222: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/222.jpg)
Sparsification by sampling
Note that these matrices are usually very sparse!
Edo Liberty , Jelani Nelson : Streaming Data Mining 105 / 111
![Page 223: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/223.jpg)
Sparsification by sampling
If we sample entries we can only improve sparsity.
Edo Liberty , Jelani Nelson : Streaming Data Mining 106 / 111
![Page 224: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/224.jpg)
Sparsification by sampling
How close is B to the original matrix A? That is, how small is ‖A − B‖₂?
Edo Liberty , Jelani Nelson : Streaming Data Mining 107 / 111
![Page 225: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/225.jpg)
Sparsification by sampling
Generic sampling algorithm:

  B ← all-zeros matrix
  for (i, j, A_{i,j}) do
    w.p. p_{i,j}:  B_{i,j} ← A_{i,j}/p_{i,j}
  Return: B

For any choice of p_{i,j} we have E[A − B] = 0, or:

E[B] = A

What about Var[A − B]? In other words, when is ‖A − B‖₂ small?
Edo Liberty , Jelani Nelson : Streaming Data Mining 108 / 111
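The generic loop vectorizes naturally. A sketch (illustrative code; the matrix, its density, and ε are arbitrary choices, and the probabilities used are the [Arora, Hazan, Kale] ones, p_{i,j} = min(|A_{i,j}|√n/ε, 1), discussed on the next slide):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
A = rng.uniform(-1.0, 1.0, size=(n, n))
A[rng.random((n, n)) < 0.7] = 0.0          # make the input sparse-ish

eps = 0.5
# Keep large entries with certainty, small entries rarely.
p = np.minimum(np.abs(A) * np.sqrt(n) / eps, 1.0)

keep = rng.random((n, n)) < p
B = np.zeros_like(A)
B[keep] = A[keep] / p[keep]                # rescale so that E[B] = A

print(np.count_nonzero(B), np.count_nonzero(A))  # sampling only removes entries
```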
![Page 227: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/227.jpg)
Sparsification by sampling
Assume w.l.o.g. that |A_{i,j}| ≤ 1; we can get:

‖A − B‖₂ ≤ ε

[Fast Computation of Low Rank Matrix Approximations, Achlioptas, McSherry]
  p_{i,j} = O(1) · max(A²_{i,j}/n, |A_{i,j}|/√n)
  |B| = O(n + (n/ε²) ∑_{i,j} A²_{i,j})

[A Fast Random Sampling Algorithm for Sparsifying Matrices, Arora, Hazan, Kale]
  p_{i,j} = min(|A_{i,j}| √n / ε, 1)
  |B| = O((√n/ε) ∑_{i,j} |A_{i,j}|)

The latter has a very simple proof; see the paper for more details.

Edo Liberty, Jelani Nelson: Streaming Data Mining 109 / 111
![Page 228: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/228.jpg)
Streaming Data Mining
1 Items (words, IP-addresses, events, clicks, ...):
   Item frequencies
   Distinct elements
   Moment estimation
2 Vectors (text documents, images, example features, ...)
   Dimensionality reduction
   k-means
   Linear Regression
3 Matrices (text corpora, user preferences, social graphs, ...)
   Efficiently approximating the covariance matrix
   Sparsification by sampling
Edo Liberty , Jelani Nelson : Streaming Data Mining 110 / 111
![Page 229: Streaming Data Mining - Edo Liberty's homepage · Edo Liberty , Jelani Nelson : Streaming Data Mining 24 / 111. Lossy Counting We keep at most 1="di erent items (in this case 2) Edo](https://reader033.vdocuments.net/reader033/viewer/2022052103/603d2cb44e80ee56981a17d7/html5/thumbnails/229.jpg)
Thank you!
We would like to thank: Nir Ailon, Moses S. Charikar, Sudipto Guha, Elad Hazan, Yishay Mansour, Muthu Muthukrishnan, David P. Woodruff, Petros Drineas, Piotr Indyk, Graham Cormode, Andrew McGregor, and Dimitris Achlioptas for pointing out important algorithms and references.
Edo Liberty , Jelani Nelson : Streaming Data Mining 111 / 111