![Page 1: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/1.jpg)
A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure
Presented by Ho Wai Shing
![Page 2: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/2.jpg)
Overview
Introduction
Current Clustering Techniques
OPTICS
Discussions
![Page 3: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/3.jpg)
Introduction: What is clustering?
Given: a dataset with N points in a d-dimensional space
Task: find a natural partitioning of the points into a number (k) of closely related groups (clusters) and noise
![Page 4: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/4.jpg)
![Page 5: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/5.jpg)
Introduction: Example Application
To find similar electronic parts from their design blueprints:
use a Fourier transform to transform the contours of parts into coefficients
do clustering on the coefficients (discussed later in the talk)
![Page 6: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/6.jpg)
Introduction: What are the main concerns?
Efficiency
Effectiveness
Scalability
Interactivity
![Page 7: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/7.jpg)
Current Techniques
can be classified into groups
hierarchical vs partitioning: bottom-up merging vs flat partitioning
centroid-based vs density-based: 1 or more representative points for a cluster
![Page 8: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/8.jpg)
Several Clustering Algorithms
k-means
BIRCH
DBSCAN
CURE / C2P
OPTICS
![Page 9: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/9.jpg)
k-means
partitions the space into k clusters
each cluster is represented by the mean of the points belonging to that cluster
iteratively refines the k representative points until reaching a local minimum of the total distance within clusters
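
The refinement loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's code: it assumes Euclidean distance, initial centers drawn at random from the data, and a stop as soon as the centers no longer move. The names `kmeans`, `centers` and `clusters` are illustrative.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # assumption: start from k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:       # nothing moved: a local minimum
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```

Every pass over the assignment step touches all points, which is where the repeated database scans come from.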
![Page 10: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/10.jpg)
k-means
the clusters must be convex, and should have similar sizes (may not be the case in real data)
needs to scan the database many times (slow)
easily disturbed by outliers (every point counts in calculating the mean)
![Page 11: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/11.jpg)
an example dataset
![Page 12: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/12.jpg)
BIRCH
uses a CF-Tree to summarize the data points so that everything fits in memory
points are merged into a leaf entry of the CF-Tree if they are similar
points are stored in an “extension” if no similar leaf can be found
builds clusters over those leaves instead of the original data points
![Page 13: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/13.jpg)
an example dataset
entries in CF-Tree Leaves (contains N, LS, SS)
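
Why a leaf entry only needs (N, LS, SS) can be seen in a small sketch. It assumes the usual reading of the triple: N is the number of points, LS their per-dimension linear sum, and SS the sum of their squared norms. The helper names (`cf_of`, `cf_merge`, `cf_centroid`, `cf_radius`) are illustrative and not taken from the BIRCH paper.

```python
import math

def cf_of(point):
    """Clustering feature (N, LS, SS) of a single point."""
    return (1, tuple(point), sum(x * x for x in point))

def cf_merge(a, b):
    """CFs are additive: merging two entries adds them component-wise."""
    n = a[0] + b[0]
    ls = tuple(x + y for x, y in zip(a[1], b[1]))
    ss = a[2] + b[2]
    return (n, ls, ss)

def cf_centroid(cf):
    """Centroid of the summarized points, recovered as LS / N."""
    n, ls, _ = cf
    return tuple(x / n for x in ls)

def cf_radius(cf):
    """Root-mean-square distance of the summarized points to their centroid."""
    n, ls, ss = cf
    centroid_sq = sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(ss / n - centroid_sq, 0.0))

# Absorb three points into one leaf entry without keeping the points.
entry = cf_of((1.0, 2.0))
for p in [(3.0, 2.0), (2.0, 5.0)]:
    entry = cf_merge(entry, cf_of(p))

print(entry)               # (3, (6.0, 9.0), 47.0)
print(cf_centroid(entry))  # (2.0, 3.0)
print(cf_radius(entry))
```

Because the centroid and radius can be recovered from the triple alone, a similarity test against a leaf never has to revisit the original points.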
![Page 14: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/14.jpg)
an example dataset
… …
entries in CF-Tree Leaves (contains N, LS, SS)
![Page 15: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/15.jpg)
BIRCH
basically hierarchical
one of the fastest algorithms available
scans the data only once
can remove some outliers
can be used as a pre-clustering step
the result depends on the insertion order
![Page 16: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/16.jpg)
DBSCAN
the first density-based algorithm without a grid
clusters = collections of density-connected points
definitions:
directly density-reachable
density-reachable
density-connected
![Page 17: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/17.jpg)
r is directly density-reachable from q
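
As a rough sketch of this definition: in the DBSCAN paper, r is directly density-reachable from q when r lies in the Eps-neighbourhood of q and that neighbourhood holds at least MinPts points. The code below assumes Euclidean distance and counts q as a member of its own neighbourhood; the function names are illustrative.

```python
import math

def eps_neighbourhood(points, q, eps):
    """All points within distance eps of q (q itself included)."""
    return [p for p in points if math.dist(p, q) <= eps]

def directly_density_reachable(points, r, q, eps, min_pts):
    """True if r is in q's Eps-neighbourhood and q is a core point."""
    neigh = eps_neighbourhood(points, q, eps)
    return len(neigh) >= min_pts and r in neigh

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0)]
core, border = (1.0, 0.0), (2.0, 0.0)

# The relation is not symmetric: the border point is reachable from the
# core point, but not the other way round.
print(directly_density_reachable(points, border, core, eps=1.5, min_pts=3))  # True
print(directly_density_reachable(points, core, border, eps=1.5, min_pts=3))  # False
```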
![Page 18: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/18.jpg)
DBSCAN
start with an arbitrary point and perform a k-NN (k-nearest-neighbor) search for that point
if it is dense, grow that point into a cluster
pick another point and repeat until all the points are exhausted
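
The growth procedure can be sketched roughly as follows. This is a brute-force illustration, not the authors' implementation. It assumes Euclidean distance and tests density with an Eps-range query (a point is dense when at least MinPts points, itself included, lie within Eps), which is a substitution for the k-NN search named on the slide. The names `dbscan`, `region` and `NOISE` are illustrative.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    def region(i):
        # Brute-force range query; an index would replace this scan.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = region(i)
        if len(neigh) < min_pts:
            labels[i] = NOISE            # may become a border point later
            continue
        labels[i] = cluster              # i is dense: grow a cluster from it
        seeds = list(neigh)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neigh_j = region(j)
            if len(neigh_j) >= min_pts:  # only dense points keep growing
                seeds.extend(neigh_j)
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```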
![Page 19: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/19.jpg)
DBSCAN
good:
can find arbitrarily shaped clusters
intuitive definition of clusters
reasonable complexity if an index is available for k-NN search
bad:
difficult to determine the input parameters Eps and MinPts
suffers from the “chaining effect”
![Page 20: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/20.jpg)
The Chaining Problem
chain
![Page 21: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/21.jpg)
CURE
instead of using all points when calculating the distance between a point and a cluster, as DBSCAN does, CURE uses a set of representative points within each cluster
this can reduce the chaining effect
![Page 22: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/22.jpg)
CURE
diagram: the clusters no longer chain (the actual points are close, but the representatives are not)
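
One way to see why representatives can be far apart while the actual points are close is the sketch below. It follows the general idea from the CURE paper (pick a few well-scattered points per cluster, then shrink them toward the centroid), but the details are assumptions made for illustration: farthest-point selection, distinct points within a cluster, and the parameter names `c` (number of representatives) and `alpha` (shrink fraction).

```python
import math

def representatives(cluster, c=4, alpha=0.5):
    """Pick up to c well-scattered points and shrink them toward the centroid."""
    centroid = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    # Start with the point farthest from the centroid, then repeatedly
    # add the point farthest from everything chosen so far.
    chosen = [max(cluster, key=lambda p: math.dist(p, centroid))]
    while len(chosen) < min(c, len(cluster)):
        chosen.append(max(
            (p for p in cluster if p not in chosen),
            key=lambda p: min(math.dist(p, q) for q in chosen),
        ))
    # Move each chosen point a fraction alpha of the way to the centroid.
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centroid))
            for p in chosen]

def cluster_distance(reps_a, reps_b):
    """Distance between clusters = closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)

a = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0), (5.0, 0.0), (5.0, 2.0)]
reps_a, reps_b = representatives(a), representatives(b)

closest_points = min(math.dist(p, q) for p in a for q in b)
print(closest_points)                    # 1.0
print(cluster_distance(reps_a, reps_b))  # 2.0
```

Shrinking pulls the representatives away from the cluster boundary, so a thin bridge of nearby boundary points is less likely to trigger a merge.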
![Page 23: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/23.jpg)
C2P
much better than CURE in terms of efficiency (O(n^2 lg n) to O(n lg n + m^2 lg m))
accomplished by an O(n lg n) pre-clustering phase:
add links between all the points and their nearest neighbours
condense each connected graph into a single point
repeat until m points remain
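
The pre-clustering loop can be sketched as below. This is a simplified illustration and not the published algorithm. It finds nearest neighbours by brute force (O(n^2) per pass, where the O(n lg n) bound relies on a spatial index), it assumes each connected component is condensed to its centroid, and it may overshoot and return fewer than m points because every pass at least halves the point count. It also assumes m >= 1. All names are illustrative.

```python
import math

def condense_once(points):
    """Link each point to its nearest neighbour, then replace every
    connected component by its centroid (assumed condensed point)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        nn = min((j for j in range(n) if j != i),
                 key=lambda j: math.dist(points[i], points[j]))
        parent[find(i)] = find(nn)       # add the link i -- nn

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return [tuple(sum(xs) / len(g) for xs in zip(*g))
            for g in groups.values()]

def pre_cluster(points, m):
    """Repeat the condensing pass until at most m points remain."""
    while len(points) > m:
        points = condense_once(points)
    return points

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
condensed = pre_cluster(points, m=2)
print(condensed)
```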
![Page 24: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/24.jpg)
OPTICS
Ordering Points To Identify the Clustering Structure
in SIGMOD’99 by Ankerst et al.
a generalization of DBSCAN + a visualization technique
![Page 25: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/25.jpg)
OPTICS
Motivation:
input parameters (e.g., Eps) are difficult to determine
one global parameter setting may not fit all the clusters
it’s good to give users flexibility in selecting clusters
![Page 26: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/26.jpg)
OPTICS definitions
core-distance of a point p: the distance between the point p and its MinPts-th neighbour
reachability-distance of a point p w.r.t. another point o: the distance between o and p, with a lower bound of core-dist(o)
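
The two definitions can be written down as a short sketch. Assumptions, none of which appear on the slide: Euclidean distance, a point counts as its own first neighbour, and both distances are treated as undefined (represented here by infinity) when the MinPts-th neighbour lies farther than Eps. Function names are illustrative.

```python
import math

UNDEFINED = math.inf

def core_distance(points, p, eps, min_pts):
    """Distance from p to its MinPts-th neighbour, if that is within eps."""
    dists = sorted(math.dist(p, q) for q in points)  # includes p at distance 0
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return UNDEFINED
    return dists[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """dist(o, p), but never smaller than core-dist(o)."""
    cd = core_distance(points, o, eps, min_pts)
    if cd == UNDEFINED:
        return UNDEFINED
    return max(cd, math.dist(o, p))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
o = (0.0, 0.0)
print(core_distance(points, o, eps=3.0, min_pts=3))                      # 2.0
print(reachability_distance(points, (1.0, 0.0), o, eps=3.0, min_pts=3))  # 2.0
print(core_distance(points, (5.0, 0.0), eps=3.0, min_pts=3))             # inf
```

The lower bound smooths the values inside a dense region: every point closer to o than core-dist(o) gets the same reachability-distance.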
![Page 27: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/27.jpg)
![Page 28: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/28.jpg)
OPTICS
starts from an arbitrary point and sorts the points according to their reachability-distance
this ordering can be used to produce density-based clusters for any 0 < Eps < Eps_input
the reachability plot is a good visualization tool for analyzing clusters
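
A brute-force sketch of how the ordering and the reachability values come about is given below. It is an illustration under assumptions rather than the paper's pseudocode: Euclidean distance, a linear scan in place of a priority queue and a spatial index, a point counting as its own first neighbour, and infinity standing in for an undefined reachability. The name `optics` and its return shape are illustrative.

```python
import math

def optics(points, eps, min_pts):
    """Return (order, reach): visiting order of point indices and the
    reachability-distance recorded for each index."""
    n = len(points)
    reach = [math.inf] * n
    done = [False] * n
    order = []

    def neighbours(i):
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]

    def core_dist(i, neigh):
        if len(neigh) < min_pts:
            return math.inf
        return sorted(math.dist(points[i], points[j]) for j in neigh)[min_pts - 1]

    for start in range(n):
        if done[start]:
            continue
        seeds = {start}
        while seeds:
            # Always continue with the point that is currently easiest to reach.
            i = min(seeds, key=lambda j: (reach[j], j))
            seeds.remove(i)
            done[i] = True
            order.append(i)
            neigh = neighbours(i)
            cd = core_dist(i, neigh)
            if cd == math.inf:
                continue                 # not a core point: nothing to expand
            for j in neigh:
                if not done[j]:
                    new_reach = max(cd, math.dist(points[i], points[j]))
                    reach[j] = min(reach[j], new_reach)
                    seeds.add(j)
    return order, reach

# Two dense 1-d groups separated by a gap.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
order, reach = optics(points, eps=20.0, min_pts=2)
print(order)                      # [0, 1, 2, 3, 4, 5]
print([reach[i] for i in order])  # [inf, 1.0, 1.0, 8.0, 1.0, 1.0]
```

Plotting the second list as bars in visiting order gives the reachability plot: valleys are clusters, and the spike of 8.0 marks the jump from one group to the other.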
![Page 29: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/29.jpg)
OPTICS reachability plot
![Page 30: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/30.jpg)
OPTICS 16-d reachability plot
![Page 31: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/31.jpg)
Discussions
all the methods described above fail in high-dimensional cases
“The curse of dimensionality”:
distances between all the points are nearly the same
grids are usually not dense (O(2^d) grid cells vs O(n) points), so clusters tend to be divided by grids
no efficient indices for k-NN search
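
The claim that distances become nearly the same can be checked with a small experiment. The sketch assumes uniformly random points in the unit cube and uses the relative contrast (max - min) / min of the distances from one query point as an illustrative measure; neither choice comes from the talk.

```python
import math
import random

def relative_contrast(n, d, seed=0):
    """(max - min) / min over the distances from one random query point
    to n uniformly random points in the d-dimensional unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = [math.dist(query, [rng.random() for _ in range(d)])
             for _ in range(n)]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(n=200, d=2)
high = relative_contrast(n=200, d=500)
print(low)   # large: the farthest point is many times farther than the nearest
print(high)  # small: nearest and farthest are almost equally far away
```

When the contrast is this small, a nearest-neighbour query or an Eps threshold can barely tell neighbours from non-neighbours.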
![Page 32: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/32.jpg)
Discussions
key observation:
not all dimensions are meaningful in clustering
some clusters may exist under one subset of dimensions while others exist under another subset
leads to:
feature selection
subspace clustering / projected clustering
![Page 33: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/33.jpg)
References
M Ankerst, M M Breunig, H-P Kriegel and J Sander, OPTICS: Ordering Points To Identify the Clustering Structure, SIGMOD’99
T Zhang, R Ramakrishnan and M Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, SIGMOD’96
S Guha, R Rastogi and K Shim, CURE: An Efficient Clustering Algorithm for Large Databases, SIGMOD’98
C C Aggarwal and P S Yu, Finding Generalized Projected Clusters in High Dimensional Spaces, SIGMOD’00
R Agrawal, J Gehrke, D Gunopulos and P Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, SIGMOD’98
M Ester, H-P Kriegel, J Sander and X Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD’96
A Nanopoulos, Y Theodoridis and Y Manolopoulos, C2P: Clustering based on Closest Pairs, VLDB’01
![Page 34: A Summary to Current Clustering Methods and OPTICS: Ordering Points To Identify the Clustering Structure Presented by Ho Wai Shing](https://reader036.vdocuments.net/reader036/viewer/2022081514/5697bfbe1a28abf838ca2b0d/html5/thumbnails/34.jpg)
Questions?