Clustering
An overview of clustering algorithms
Dènis de Keijzer
GIA 2004
Overview
Algorithms
  GRAVIclust
  AUTOCLUST
  AUTOCLUST+
  3D Boundary-based Clustering
  SNN
Gravity based spatial clustering
GRAVIclust
  Initialisation Phase
    calculate the initial cluster centres
  Optimisation Phase
    improve the position of the cluster centres so as to achieve a solution which minimizes the distance function
$$\sum_{i=1}^{k} \sum_{p \in C_i} d(p, L_i)$$
    where C_i is the i-th cluster and L_i its centre
GRAVIclust: Initialisation Phase
Input:
  set of points P
  matrix of distances between all pairs of points
    assumption: the actual access path distance exists in GIS maps, e.g. http://www.transinfo.qld.gov.au
    very versatile: footpath, road map, rail map
  # of required clusters k
GRAVIclust: Initialisation Phase
Step 1: calculate first initial centre
  the point with the largest number of points within radius r
  remove first initial centre & all points within radius r from further consideration
Step 2: repeat Step 1 until k initial centres have been chosen
Step 3: create initial clusters by assigning all points to the closest cluster centre
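The three initialisation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of a plain nested-list distance matrix, and the tie-breaking rule (lowest index wins) are all assumptions made for the example.

```python
def initial_centres(dist, k, r):
    """Steps 1-2: pick up to k initial centres from a symmetric distance matrix.

    Repeatedly chooses the remaining point with the most remaining points
    within radius r, then drops that point and everything within r of it.
    Returns fewer than k centres if the points run out first.
    """
    remaining = set(range(len(dist)))
    centres = []
    while len(centres) < k and remaining:
        # max() keeps the first maximum, so ties go to the lowest index (assumption).
        best = max(sorted(remaining),
                   key=lambda p: sum(1 for q in remaining
                                     if q != p and dist[p][q] <= r))
        centres.append(best)
        remaining -= {q for q in remaining if dist[best][q] <= r}
    return centres


def assign_to_closest(dist, centres):
    """Step 3: for each point, the position in `centres` of its closest centre."""
    return [min(range(len(centres)), key=lambda c: dist[p][centres[c]])
            for p in range(len(dist))]


# Six points on a line; distances are absolute differences.
xs = [0, 1, 2, 10, 11, 20]
dist = [[abs(a - b) for b in xs] for a in xs]
centres = initial_centres(dist, k=2, r=1.5)
print(centres)                           # [1, 3]
print(assign_to_closest(dist, centres))  # [0, 0, 0, 1, 1, 1]
```

Because the input is only a distance matrix, the same sketch works unchanged with path distances taken from a road or rail network.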
GRAVIclust: radius calculation
Radius r
  calculated based on the area of the region considered for clustering
  static radius
    based on the assumption that all clusters are of the same size
  dynamic radius
    recalculated after each initial cluster centre is chosen
$$r = \sqrt{\frac{A}{\pi \cdot \#\,\text{required clusters}}}$$
$$A = \text{area of minimum bounding rectangle}$$
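A small worked example of the static radius, reading the formula as r = sqrt(A / (pi * k)), i.e. k equally sized circular clusters that together cover the area A. Two assumptions are made here: that reading of the slide's formula, and the use of the axis-aligned bounding box as the minimum bounding rectangle. The function name is hypothetical.

```python
import math


def static_radius(points, k):
    """r = sqrt(A / (pi * k)), with A the area of the axis-aligned bounding box.

    Assumption: the 'minimum bounding rectangle' is taken to be axis-aligned.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return math.sqrt(area / (math.pi * k))


# A 4 x 3 region split into 3 clusters: pi * r^2 = 12 / 3 = 4.
r = static_radius([(0, 0), (4, 0), (4, 3), (0, 3)], k=3)
print(round(r, 4))  # 1.1284
```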
GRAVIclust: Static vs. Dynamic
Static
  reduced computation
  # points within a radius r has to be calculated only once
  not suitable for problems where the points are separated by large empty areas
Dynamic
  increases computation time
  ensures the radius is adjusted as the points are removed
Differs only when distribution is non-uniform
GRAVIclust: Optimisation Phase
Step 1: for each cluster, calculate new centre
  based on the point closest to the cluster's centre of gravity
Step 2: re-assign points to new cluster centres
Step 3: recalculate distance function
  never greater than previous
Step 4: repeat Steps 1 to 3 until the value of the distance function equals the previous value
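The optimisation loop might look like the sketch below. It assumes that point coordinates are available for the centre of gravity (taken here as the coordinate mean), that points are distinct so no cluster becomes empty, and that the loop stops as soon as the distance function fails to decrease. The last rule is a safeguard added for the example; the slides state that the value never increases. All names are hypothetical.

```python
import math


def assign(dist, centres):
    """Label each point with the position in `centres` of its closest centre."""
    return [min(range(len(centres)), key=lambda c: dist[p][centres[c]])
            for p in range(len(dist))]


def total_distance(dist, centres, labels):
    """Distance function: sum of d(p, centre of p's cluster) over all points."""
    return sum(dist[p][centres[labels[p]]] for p in range(len(labels)))


def optimise(points, dist, centres):
    labels = assign(dist, centres)
    best = total_distance(dist, centres, labels)
    while True:
        new_centres = []
        for c in range(len(centres)):
            members = [p for p, lab in enumerate(labels) if lab == c]
            # Centre of gravity of the cluster (coordinate mean, an assumption).
            gx = sum(points[p][0] for p in members) / len(members)
            gy = sum(points[p][1] for p in members) / len(members)
            # Step 1: the member closest to the centre of gravity is the new centre.
            new_centres.append(min(
                members,
                key=lambda p: math.hypot(points[p][0] - gx, points[p][1] - gy)))
        new_labels = assign(dist, new_centres)                  # Step 2
        value = total_distance(dist, new_centres, new_labels)   # Step 3
        if value >= best:                                       # Step 4
            return centres, labels, best
        centres, labels, best = new_centres, new_labels, value


xs = [0, 1, 2, 10, 11, 12]
points = [(x, 0) for x in xs]
dist = [[abs(a - b) for b in xs] for a in xs]
print(optimise(points, dist, [0, 3]))  # ([1, 4], [0, 0, 0, 1, 1, 1], 4)
```

In the example the distance function drops from 6 to 4 in the first pass and is unchanged in the second, which ends the loop.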
GRAVIclust
Deterministic
Can handle obstacles
Monotonic convergence of the distance function to a stable point
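AUTOCLUST
Definitions
$$\mathrm{LocalMean}(p_i) = \sum_{e \in N(p_i)} \frac{|e|}{\|N(p_i)\|} = \sum_{j=1}^{d(p_i)} \frac{|e_j|}{d(p_i)}$$
$$\mathrm{LocalStDev}(p_i) = \sqrt{\sum_{j=1}^{d(p_i)} \frac{\left(\mathrm{LocalMean}(p_i) - |e_j|\right)^2}{d(p_i)}}$$
$$\mathrm{MeanStDev}(P) = \sum_{i=1}^{n} \frac{\mathrm{LocalStDev}(p_i)}{n}$$
$$\mathrm{RelativeStDev}(p_i) = \frac{\mathrm{LocalStDev}(p_i)}{\mathrm{MeanStDev}(P)}$$
  where N(p_i) is the set of edges incident to p_i and d(p_i) is their number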
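AUTOCLUST
Definitions II
$$\mathrm{ShortEdges}(p_i) = \{\, e_j \mid |e_j| < \mathrm{LocalMean}(p_i) - \mathrm{MeanStDev}(P) \,\}$$
$$\mathrm{LongEdges}(p_i) = \{\, e_j \mid |e_j| > \mathrm{LocalMean}(p_i) + \mathrm{MeanStDev}(P) \,\}$$
$$\mathrm{OtherEdges}(p_i) = N(p_i) - \left(\mathrm{ShortEdges}(p_i) \cup \mathrm{LongEdges}(p_i)\right)$$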
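To make the definitions concrete, the sketch below computes the local statistics on a small hand-made graph and classifies the edges of one point. It assumes the edges are already given as index pairs (AUTOCLUST itself takes them from the Delaunay Diagram), that LocalMean is the mean incident edge length, that LocalStDev is the population standard deviation, and that short and long edges lie more than MeanStDev(P) below or above LocalMean. Function names are hypothetical.

```python
import math
from collections import defaultdict


def edge_statistics(points, edges):
    """Return (local_mean, local_st_dev, mean_st_dev) for an undirected edge list."""
    incident = defaultdict(list)  # point index -> lengths of its incident edges
    for i, j in edges:
        length = math.dist(points[i], points[j])
        incident[i].append(length)
        incident[j].append(length)
    local_mean = {p: sum(ls) / len(ls) for p, ls in incident.items()}
    local_st_dev = {
        p: math.sqrt(sum((local_mean[p] - l) ** 2 for l in ls) / len(ls))
        for p, ls in incident.items()
    }
    # Averages over points that have at least one edge (all points, in a Delaunay graph).
    mean_st_dev = sum(local_st_dev.values()) / len(local_st_dev)
    return local_mean, local_st_dev, mean_st_dev


def classify_edges(points, edges, p):
    """Split the edges incident to p into (short, long, other)."""
    local_mean, _, mean_st_dev = edge_statistics(points, edges)
    short, long_, other = [], [], []
    for i, j in edges:
        if p not in (i, j):
            continue
        length = math.dist(points[i], points[j])
        if length < local_mean[p] - mean_st_dev:
            short.append((i, j))
        elif length > local_mean[p] + mean_st_dev:
            long_.append((i, j))
        else:
            other.append((i, j))
    return short, long_, other


# Three close points plus one far away, joined to point 1 by a long edge.
points = [(0, 0), (1, 0), (0, 1), (10, 0)]
edges = [(0, 1), (0, 2), (1, 2), (1, 3)]
print(classify_edges(points, edges, 1))
# ([(0, 1), (1, 2)], [(1, 3)], [])
```

Point 1 sits on a cluster boundary: its long edge pulls LocalMean up, so the two edges into the cluster fall below the lower threshold and are classified as short.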
AUTOCLUST
Phase 1: finding boundaries
Phase 2: restoring and re-attaching
Phase 3: detecting second-order inconsistency
AUTOCLUST: Phase 1
Finding boundaries
  Calculate
    Delaunay Diagram
    for each point p_i
      ShortEdges(p_i)
      LongEdges(p_i)
      OtherEdges(p_i)
  Remove ShortEdges(p_i) and LongEdges(p_i)
AUTOCLUST: Phase 2
Restoring and re-attaching
  for each point p_i where ShortEdges(p_i) ≠ ∅
    Determine a candidate connected component C for p_i
      If there are 2 edges e_j = (p_i, p_j) and e_k = (p_i, p_k) in ShortEdges(p_i) with CC[p_j] ≠ CC[p_k], then
        Compute, for each edge e = (p_i, p_j) ∈ ShortEdges(p_i), the size ||CC[p_j]|| and let
$$M = \max_{e = (p_i, p_j) \in \mathrm{ShortEdges}(p_i)} \|CC[p_j]\|$$
        Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, we let C be the one with the shortest edge to p_i)
      Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(p_i) connect p_i to
    If the edges in OtherEdges(p_i) connect to a connected component different from C, remove them. Note that all edges in OtherEdges(p_i) are removed, and only in this case will p_i swap connected components
    Add all edges e ∈ ShortEdges(p_i) that connect to C
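The choice of the candidate component C can be sketched on its own. The example assumes the short edges of p_i are given as (neighbour, length) pairs and that a dictionary `cc` maps each point to its current component label. Both are representations chosen for illustration, and the function name is hypothetical.

```python
from collections import Counter


def candidate_component(short_edges, cc):
    """Pick the candidate component C for a point with non-empty ShortEdges.

    short_edges: list of (neighbour, edge_length) pairs for the point.
    cc: dict mapping every point to its connected-component label.
    """
    sizes = Counter(cc.values())              # label -> number of points
    labels = {cc[q] for q, _ in short_edges}  # components reached by short edges
    if len(labels) == 1:
        # All short edges lead to the same component.
        return next(iter(labels))
    m = max(sizes[label] for label in labels)
    largest = [label for label in labels if sizes[label] == m]
    # Tie between equally large components: take the one with the shortest edge.
    return min(largest,
               key=lambda label: min(length for q, length in short_edges
                                     if cc[q] == label))


cc = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c", 7: "c", 8: "c"}
print(candidate_component([(1, 2.0), (4, 0.5), (6, 1.0)], cc))  # c
print(candidate_component([(4, 0.5), (5, 0.7)], cc))            # b
```

In the first call components "a" and "c" both have three points, so the shorter edge (length 1.0, into "c") decides.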
AUTOCLUST: Phase 3
Detecting second-order inconsistency
  compute the LocalMean for 2-neighbourhoods
  remove all edges in N_{2,G}(p_i) that are long edges
$$\mathrm{LocalMean}_{2,G}(p_i) = \sum_{e \in N_{2,G}(p_i)} \frac{|e|}{\|N_{2,G}(p_i)\|}$$
$$|e| > \mathrm{LocalMean}_{2,G}(p_i) + \mathrm{MeanStDev}(P)$$
AUTOCLUST
No user-supplied arguments
  eliminates expensive human-based exploration time for finding best-fit arguments
Robust to noise, outliers, bridges and type of distribution
Able to detect clusters with arbitrary shapes, different sizes and different densities
Can handle multiple bridges
O(n log n)
AUTOCLUST+
Construct Delaunay Diagram
Calculate MeanStDev(P)
For all edges e, remove e if it intersects some obstacles
Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
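The obstacle step amounts to a segment-intersection filter over the Delaunay edges. The sketch below assumes obstacles are modelled as line segments and counts only proper crossings (an edge that merely touches an obstacle endpoint is kept). Both choices are illustrative assumptions, as are the function names.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left, -1 right, 0 collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p, q, a, b):
    """True if segment pq properly crosses segment ab (touching is ignored)."""
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)


def remove_blocked_edges(points, edges, obstacles):
    """Keep only the edges that cross none of the obstacle segments."""
    return [(i, j) for i, j in edges
            if not any(segments_cross(points[i], points[j], a, b)
                       for a, b in obstacles)]


points = [(0, 0), (2, 0), (0, 2)]
edges = [(0, 1), (0, 2), (1, 2)]
wall = [((1, -1), (1, 0.5))]  # a vertical wall cutting the bottom edge
print(remove_blocked_edges(points, edges, wall))  # [(0, 2), (1, 2)]
```

The remaining edge list is what the three AUTOCLUST phases would then be applied to.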
3D Boundary-based Clustering
Benefits from 3D Clustering
  more accurate spatial analysis
  distinguish
    positive clusters: clusters in higher dimensions but not in lower dimensions
    negative clusters: clusters in lower dimensions but not in higher dimensions
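3D Boundary-based Clustering
Based on AUTOCLUST
Uses Delaunay Tetrahedrizations
Definitions:
  e_j is a potential inter-cluster edge if:
$$|e_j| > \mathrm{LocalMean}(p_i) + l \cdot \mathrm{LocalStDev}(p_i)$$
$$l = m \cdot \mathrm{RelativeStDev}(p_i)^{-1} = m \cdot \frac{\mathrm{MeanStDev}(P)}{\mathrm{LocalStDev}(p_i)}$$
$$\mathrm{LocalMean}(p_i) - m \cdot \mathrm{MeanStDev}(P) \le AI(p_i) \le \mathrm{LocalMean}(p_i) + m \cdot \mathrm{MeanStDev}(P)$$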
3D Boundary-based Clustering
Phase I
  For all p_i ∈ P, classify each edge e_j incident to p_i into one of three groups
    ShortEdges(p_i) when the length of e_j is less than the range in AI(p_i)
    LongEdges(p_i) when the length of e_j is greater than the range in AI(p_i)
    OtherEdges(p_i) when the length of e_j is within AI(p_i)
  For all p_i ∈ P, remove all edges in ShortEdges(p_i) and LongEdges(p_i)
3D Boundary-based Clustering
Phase II
  Recover ShortEdges(p_i) incident to border points using connected component analysis
Phase III
  Remove exceptionally long edges in local regions
$$|e_j| > \mathrm{LocalMean}_{2,G}(p_i) + m \cdot \mathrm{MeanStDev}(P)$$
Shared Nearest Neighbour
Clustering in higher dimensions
  Distances or similarities between points become more uniform, making clustering more difficult
  Also, similarity between points can be misleading
    i.e. a point can be more similar to a point that “actually” belongs to a different cluster
Solution
  Shared nearest neighbour approach to similarity
SNN: An alternative definition of similarity
Euclidean distance
  most common distance metric used
  while useful in low dimensions, it doesn’t work well in high dimensions

      A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
P1     3   0   0   0   0   0   0   0   0    0
P2     0   0   0   0   0   0   0   0   0    4
P3     3   2   4   0   1   2   3   1   2    0
P4     0   2   4   0   1   2   3   1   2    4
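The slide does not spell out what the table shows. One plausible reading, offered here as an assumption, is that Euclidean distance cannot tell the pair P1, P2 (nothing in common) from the pair P3, P4 (seven shared attribute values). The check below uses the four rows exactly as given; the helper names are hypothetical.

```python
import math

P1 = [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
P3 = [3, 2, 4, 0, 1, 2, 3, 1, 2, 0]
P4 = [0, 2, 4, 0, 1, 2, 3, 1, 2, 4]


def euclidean(p, q):
    """Plain Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def shared_nonzero(p, q):
    """Number of attributes that are non-zero in both vectors."""
    return sum(1 for a, b in zip(p, q) if a and b)


print(euclidean(P1, P2), shared_nonzero(P1, P2))  # 5.0 0
print(euclidean(P3, P4), shared_nonzero(P3, P4))  # 5.0 7
```

Both pairs differ only in A1 and A10, so their distance is the same even though P3 and P4 agree on every other attribute.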
SNN: An alternative definition of similarity
Define similarity in terms of their shared nearest neighbours
  the similarity of the points is “confirmed” by their common shared nearest neighbours
$$\mathrm{similarity}(p, q) = \mathrm{size}\left(NN(p) \cap NN(q)\right)$$
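The similarity measure can be computed directly from nearest-neighbour lists. In this sketch NN(p) is taken to be the k nearest neighbours of p under a given distance matrix, excluding p itself. The value of k, the matrix representation and the function names are assumptions for illustration.

```python
def knn(dist, p, k):
    """Set of the k nearest neighbours of p, excluding p itself."""
    others = [q for q in range(len(dist)) if q != p]
    return set(sorted(others, key=lambda q: dist[p][q])[:k])


def snn_similarity(dist, p, q, k):
    """similarity(p, q) = size(NN(p) intersect NN(q))."""
    return len(knn(dist, p, k) & knn(dist, q, k))


# Five points on a line; the last one is far from the rest.
xs = [0, 1, 2, 3, 10]
dist = [[abs(a - b) for b in xs] for a in xs]
print(snn_similarity(dist, 0, 3, k=2))  # 2
print(snn_similarity(dist, 0, 1, k=2))  # 1
```

Points 0 and 3 are not each other's nearest neighbours, yet they score highest because they share both of theirs (points 1 and 2).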
SNN: An alternative definition of density
SNN similarity, with the k-nearest neighbour approach
  if the k-th nearest neighbour of a point, with respect to SNN similarity, is close, then we say that there is a high density at this point
  since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space
SNN: Algorithm
Compute the similarity matrix
  corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points
Sparsify the similarity matrix by keeping only the k most similar neighbours
  corresponds to keeping only the k strongest links of the similarity graph
Construct the shared nearest neighbour graph from the sparsified similarity matrix
Find the SNN density of each point
Find the core points
Form clusters from the core points
Discard all noise points
Assign all non-noise, non-core points to clusters
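The steps can be strung together in a compact sketch. Several details are not given on the slides and are assumptions here: two points are linked only if each is in the other's k-nearest-neighbour list, SNN density is the number of links with similarity at least `eps`, core points have density at least `min_pts`, and a non-core point with no strong link to a core point is noise. The function and parameter names are hypothetical.

```python
def snn_cluster(dist, k, eps, min_pts):
    """Return one cluster label per point, with -1 marking noise."""
    n = len(dist)
    # Similarity matrix, sparsified: keep each point's k nearest neighbours.
    nn = [set(sorted((q for q in range(n) if q != p),
                     key=lambda q: dist[p][q])[:k])
          for p in range(n)]

    # SNN graph: mutual neighbours, weighted by the number of shared neighbours.
    def sim(p, q):
        return len(nn[p] & nn[q]) if p in nn[q] and q in nn[p] else 0

    strong = [[q for q in range(n) if q != p and sim(p, q) >= eps]
              for p in range(n)]
    # SNN density = number of strong links; core points reach min_pts.
    core = [len(strong[p]) >= min_pts for p in range(n)]

    # Clusters = connected components of core points over strong links.
    labels = [-1] * n
    next_label = 0
    for p in range(n):
        if core[p] and labels[p] == -1:
            labels[p] = next_label
            stack = [p]
            while stack:
                u = stack.pop()
                for v in strong[u]:
                    if core[v] and labels[v] == -1:
                        labels[v] = next_label
                        stack.append(v)
            next_label += 1

    # Non-core points join their most similar core point; the rest stay noise.
    for p in range(n):
        if not core[p]:
            cores = [q for q in strong[p] if core[q]]
            if cores:
                labels[p] = labels[max(cores, key=lambda q: sim(p, q))]
    return labels


# Two tight groups on a line plus one isolated point.
xs = [0, 1, 2, 3, 20, 21, 22, 23, 100]
dist = [[abs(a - b) for b in xs] for a in xs]
print(snn_cluster(dist, k=3, eps=2, min_pts=2))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is not an input: it falls out of the connected components of the core points.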
Shared Nearest Neighbour
Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers
Handles data of high dimensionality and varying densities
Automatically detects the # of clusters