Data reduction for weighted and outlier-resistant clustering
Leonard J. Schulman (Caltech)
joint with
Dan Feldman (MIT)
Talk outline
• Clustering-type problems:
– k-median
– weighted k-median
– k-median with m outliers (small m)
– k-median with penalty (clustering with many outliers)
– k-line median
• Unifying framework: tame loss functions
• Core-sets, a.k.a. ε-approximations
• Common existence proof and algorithm
Voronoi regions have spherical boundaries
k-Median with penalty
k-Median with penalty: good for outliers
2-median clustering of a data set:
Same data set plus an outlier:
Now cluster with h-robust loss function:
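A minimal numeric illustration of this effect (the 1-D data set, the candidate centers, and the cap h = 2 are all invented for the example): under the ordinary k-median loss it pays to spend one of the two centers on the outlier, while under the capped loss the cluster structure wins.

```python
# Compare ordinary and h-robust (capped) k-median costs on a tiny
# 1-D data set with one far outlier.
def cost(points, centers, loss):
    # each point pays the (possibly capped) distance to its nearest center
    return sum(loss(min(abs(p - c) for c in centers)) for p in points)

data = [0, 0.5, 1, 9, 9.5, 10, 1000]   # two clusters plus one outlier
good = (0, 10)                         # centers matching the clusters
bad = (5, 1000)                        # one center wasted chasing the outlier

plain = lambda d: d                    # ordinary k-median loss
robust = lambda d: min(2.0, d)         # h-robust loss with cap h = 2

# Under the plain loss the outlier dominates, so "bad" looks better;
# under the capped loss the clusters dominate, so "good" wins.
plain_prefers_bad = cost(data, bad, plain) < cost(data, good, plain)
robust_prefers_good = cost(data, good, robust) < cost(data, bad, robust)
```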
Related work and our results
Why are all these problems in the same paper?
In each case the objective function is a suitably tame “loss function”.
The loss in representing a point p by a center c is:
k-median: D(p) = dist(p,c)
Weighted k-median: D(p) = w · dist(p,c)
Robust k-median: D(p) = min{h, dist(p,c)}
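The three losses above can be sketched as follows. This is an illustrative implementation, with the usual convention (assumed here) that a query consisting of several centers charges each point for its best center, and that in the weighted variant each center carries its own multiplicative weight:

```python
import math

def dist(p, c):
    """Euclidean distance between points given as tuples."""
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def loss_kmedian(p, centers):
    """k-median: distance to the nearest center."""
    return min(dist(p, c) for c in centers)

def loss_weighted(p, weighted_centers):
    """Weighted k-median: each center carries a multiplicative weight w."""
    return min(w * dist(p, c) for c, w in weighted_centers)

def loss_robust(p, centers, h):
    """Robust k-median: the distance is capped at h."""
    return min(h, loss_kmedian(p, centers))
```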
What qualifies as a “tame” loss function?
Log-Log Lipschitz (LgLgLp) condition on the loss function
Many examples of LgLgLp loss functions:
Robust M-estimators in Statistics
figure: Z. Zhang
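Roughly, the LgLgLp condition says the loss has bounded slope on a log-log plot: D(b·x) ≤ b^ρ · D(x) for all b ≥ 1 and some constant ρ (a paraphrase; the exact form in the paper may differ). A quick numerical sanity check for the capped loss min{h, x}, which should satisfy this with ρ = 1:

```python
import math
import random

def robust_loss(x, h=1.0):
    """The capped loss min{h, x}."""
    return min(h, x)

def loglog_slope(f, x, b):
    """Slope of f between x and b*x on log-log axes (requires b > 1)."""
    return (math.log(f(b * x)) - math.log(f(x))) / math.log(b)

# Sample many (x, b) pairs and record the largest log-log slope seen.
random.seed(0)
worst = max(loglog_slope(robust_loss, 10 ** random.uniform(-3, 3),
                         10 ** random.uniform(0.1, 2))
            for _ in range(10000))
# For min{h, x} the slope never exceeds 1, consistent with rho = 1.
```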
Classic Data Reduction
Same notion for LgLgLp loss functions
k-clustering core-set for loss D
Weighted-k-clustering core-set for loss D
Handling arbitrary-weight centers is the “hard part”
Our main technical result
1. For every LgLgLp loss function D on a metric space, and for every set P of n points, there is a weighted-(D,k)-core-set S of size
|S| = O(log² n)
(In more detail: |S| = (d·k^O(k)/ε²)·log² n in R^d. For finite metrics, d = log n.)
2. S can be computed in time O(n)
Sensitivity [Langberg and S, SODA’11]
The sensitivity of a point p ∈ P determines how important it is to include p in a core-set:

s(p) = max_C  D_W(p,C) / Σ_{q∈P} D_W(q,C)

Why this works:
If s(p) is small, then p has many "surrogates" in the data, and we can take any one of them for the core-set.
If s(p) is large, then there is some query C for which p alone contributes a significant fraction of the loss, so we need to include p in any core-set.
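The definition maximizes over all queries C; as a brute-force sketch one can maximize over a finite set of candidate queries instead (the candidate set, the 1-D data, and the k-median loss here are illustrative assumptions):

```python
def kmedian_loss(p, C):
    # loss of a single 1-D point against a tuple of centers C
    return min(abs(p - c) for c in C)

def sensitivity(p, points, loss, queries):
    """Brute-force s(p): maximize, over candidate queries C, the share
    of the total loss that p alone contributes."""
    best = 0.0
    for C in queries:
        total = sum(loss(q, C) for q in points)
        if total > 0:
            best = max(best, loss(p, C) / total)
    return best

points = [0.0, 0.1, 0.2, 9.0]            # a tight cluster plus a far point
queries = [(0.0,), (9.0,), (0.1, 9.0)]   # a few candidate center sets

s_far = sensitivity(9.0, points, kmedian_loss, queries)
s_near = sensitivity(0.1, points, kmedian_loss, queries)
# the far point dominates the loss for the query C = (0.0,), so its
# sensitivity is larger than that of a well-surrogated cluster point
```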
Total sensitivity
The total sensitivity T(P) is the sum of the sensitivities of all the points:

T(P) = Σ_{p∈P} s(p)

The total sensitivity of the problem is the maximum of T(P) over all input sets P.

Total sensitivity ~ n: cannot have small core-sets.
Total sensitivity constant or polylog: there may exist small core-sets.
Small total sensitivity ⇒ small core-set
Small total sensitivity ⇒ small core-set
The main thing we need to do in order to produce a small core-set for weighted-k-median:
For each p ∈ P, compute a good upper bound on s(p) in amortized O(1) time per point.
(The upper bounds should be good enough that their sum Σ_p s(p) remains small.)
Recursive-Robust-Median(P,k)
• Input:
– A set P of n points in a metric space
– An integer k ≥ 1
• Output:
– A subset Q ⊆ P of Ω(n/k^k) points

We prove that any two points in Q can serve as each other's surrogates w.r.t. any query. Hence each point p ∈ Q has sensitivity s(p) = O(1/|Q|).

Outer loop: Call Recursive-Robust-Median(P,k), then set P := P − Q. Repeat until P is empty.

Total sensitivity bound: T ≤ (# calls to Recursive-Robust-Median) ≤ k^k log n.
Algorithm for computing sensitivities
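The outer loop's bookkeeping can be sketched as follows. Recursive-Robust-Median itself is replaced here by a trivial hypothetical stand-in (returning half the remaining points), purely to show how the per-batch bound O(1)/|Q| (constant taken as 1) makes the total sensitivity at most the number of calls:

```python
def outer_loop(points, k, robust_median):
    """Repeatedly extract a batch Q and assign each of its points the
    sensitivity upper bound 1/|Q|, then remove Q and repeat."""
    P = list(points)
    bounds = {}
    calls = 0
    while P:
        Q = robust_median(P, k)
        calls += 1
        for p in Q:
            bounds[p] = 1.0 / len(Q)
        removed = set(Q)
        P = [p for p in P if p not in removed]
    return bounds, calls

def stub_robust_median(P, k):
    # Hypothetical stand-in for Recursive-Robust-Median: return a
    # constant fraction (here half) of the remaining points.
    return P[: max(1, len(P) // 2)]

bounds, calls = outer_loop(range(16), 2, stub_robust_median)
total = sum(bounds.values())   # each call contributes 1, so total == calls
```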
The algorithm to find the Ω(n)-size set Q:
Recursive-Robust-Median: illustration
Recursive-Robust-Median: illustration
A detail
Actually it’s more complicated than described because we can’t afford to look for a (1+)-approximation, or even a 2-approximation, to the best k-median of any b·n points (b constant).
Instead look for a bicriteria approximation: a 2-approximation of the best k-median of any b·n/2 points. Linear time algorithm from [F,Langberg STOC’11].
High-level intuition for the correctness of Recursive-Robust-Median
Consider any p in the “output” set Q.
If for all queries C, D(p,C) is small, then p has low sensitivity.
If there is a query C for which D(p,C) is large, then in that query all points of Q are assigned to the same center c ∈ C, and are closer to each other than to c; so they are surrogates.
Thank you
Appendices
Many examples of LgLgLp loss functions:
Robust M-estimators in Statistics
…
M-estimator
Huber
"fair"
Cauchy
Geman-McClure
Welsch
Tukey
Andrews