k-means and dbscan
DESCRIPTION
k-Means and DBSCAN. Gyozo Gidofalvi Uppsala Database Laboratory. Announcements. Updated material for assignment 2 on the lab course home page. Posted sign-up sheets for labs and examinations for assignment 2 outside P1321. Posted office hours . k-Means. Input M (set of points) - PowerPoint PPT PresentationTRANSCRIPT
k-Means and DBSCAN
Gyozo GidofalviUppsala Database Laboratory
23-04-22 Gyozo Gidofalvi 2
Announcements
Updated material for assignment 2 on the lab course home page.
Posted sign-up sheets for labs and examinations for assignment 2 outside P1321.
Posted office hours
23-04-22 Gyozo Gidofalvi 3
k-Means Input
• M (set of points)• k (number of clusters)
Output• µ1, …, µk (cluster centroids)
k-Means clusters the M point into K clustersby minimizing the squared error function
clusters Si; i=1, …, k. µi is the centroid of all xjSi.
2
1
k
i Sxij
ij
µx
23-04-22 Gyozo Gidofalvi 4
k-Means algorithm
select (m1 … mK) randomly from M % initial centroidsdo
(µ1 … µK) = (m1 … mK)all clusters Ci = {}for each point p in M% compute cluster membership of p[i] = argminj(dist(µj, p))% assign p to the corresponding cluster:Ci = Ci {p}endfor each cluster Ci % recompute the centroidsmi = avg(p in Ci)
while exists mi µi % convergence criterion
23-04-22 Gyozo Gidofalvi 5
K-Means on three clusters
23-04-22 Gyozo Gidofalvi 6
I’m feeling Unlucky
Bad initial points
23-04-22 Gyozo Gidofalvi 7
kmeans in practice
How to choose initial centroids• select randomly among the data points• generate completely randomly
How to choose k• study the data• run k-Means for different k
measure squared error for each kRun kmeans many times!
• Get many choices of initial points
23-04-22 Gyozo Gidofalvi 8
k-Means iteration step in AmosQL
Calculate point-to-centroid distances: calp2c_distance(…)
select p, c, dfrom Vector of Number p, Vector of Number c, Number dwhere p in bag({iota(1,10)}) and c in bag({iota(1,10)}) and d = euclid(p,c);
Assign each point to the closest centroid: calc_cluster_assignment(…)
groupby((p2c_distances1(…)), #’argminv’);
Recalculate centroids: calc_clust_means(…) groupby(calc_cluster_assignment1(…), #’col_means’);
23-04-22 Gyozo Gidofalvi 9
Transitive closure
tclose is a second order function to explore graphs where the edges are expressed by a transition function fno
tclose(Function fno, Object o)->Bag of Object fno(o) produces the children of o tclose applies the transition function fno(o), then fno(fno(o)), then fno(fno(fno(o))), etc until fno returns no new results
23-04-22 Gyozo Gidofalvi 10
Iterate until convergence with tclose in AmosQL
create function bagidiv2(Bag of Number b) ->Bag of Bag of Number as (select floor(n/2) from Number n where n in b);
create function vecchild_idiv2(Vector of Number vb) ->Bag of Vector of Number as sort(bagidiv2(in(vb)));
create function vecconverge_tclose(Bag of Number ib) ->Bag of Vector of Number/* tclose function iterating the bagchild_idiv2 function
until convergence */ as select ov from Vector of Number ov where ov in tclose(#'vecchild_idiv2', sort(ib));
23-04-22 Gyozo Gidofalvi 11
What about this?!
Non-spherical clusters
Noise
23-04-22 Gyozo Gidofalvi 12
k-Means pros and cons
+ -Easy Fast
Works only for ”well-shaped”
clustersScalable? Sensitive to
outliersSensitive to noise
Must know k a priori
23-04-22 Gyozo Gidofalvi 13
Questions
Euclidean distance results in spherical clusters• What cluster shape does the Manhattan distance give?• Think of other distance measures too. What cluster
shapes will those yield? Assuming that the K-means algorithm converges
in I iterations, with N points and X features for each point • give an approximation of the complexity of the
algorithm expressed in K, I, N, and X. Can the K-means algorithm be parallelized?
• How?
23-04-22 Gyozo Gidofalvi 14
DBSCAN
Density Based Spatial Clustering of Applications with Noise
Basic idea:• If an object p is density connected to q,
then p and q belong to the same cluster• If an object is not density connected to
any other object it is considered noise
23-04-22 Gyozo Gidofalvi 15
-neigborhood• The -neigborhood of an object p is the set of
objects within distance of p core object
An object q is a core object iffthere are at least MinPts objects in q’s -neighbourhood
directly density reachable (ddr)An object p is ddr from q iff q is a core object and p is inside the neighbourhood of q
Definitions
p
q
23-04-22 Gyozo Gidofalvi 16
density reachable (dr)An object p is dr from q iff
there exists a chain of objects q1 … qn s.t.- q1 is ddr from q, - q2 is ddr from q1, - q3 is ddr from … and p is ddr from qn
density connected (dc)p is dc to r iff
- exist an object q such that p is dr from q - and r is dr from q
Reachability and Connectivity
pqq1q2
q
pr
23-04-22 Gyozo Gidofalvi 17
Recall…
Basic idea:• If an object p is density connected to q,
then p and q belong to the same cluster• If an object is not density connected to
any other object it is considered noise
23-04-22 Gyozo Gidofalvi 18
DBSCANi = 1dotake a point p from Mfind the set of points P which are density connected to pif P = {}
M = M \ {p}else
Ci=Pi=i+1M = M \ P
endwhile M {}
HOW?
p
23-04-22 Gyozo Gidofalvi 19
Fining density connected componnets
If r is dc to p there exists q, s.t. both p and r are dr from q. i.e., there exists a ddr-chain from q to both r and p and q is a core object.
Recall: tclose is a second order function to explore graphs where the edges are expressed by a transition function fno.
fno = ddr
23-04-22 Gyozo Gidofalvi 20
Fining dc components in AmosQL
Assuming q is a core object and the a ddr function with the following signature is defined: ddr(Integer q)->Bag of Integer p
Then:create function dc(Integer q)->Bag of Integeras select p from Integer p where p in tclose(#’ddr’, q);
23-04-22 Gyozo Gidofalvi 21
DBSCAN pros and cons
+ -Clusters of arbitrary
shapeRobust to
noise
Requires connected regions of sufficiently
high density
Does not need an a
priori kDeterministic
Data sets with varying densities are problematic
Scalable?
23-04-22 Gyozo Gidofalvi 22
Questions
Why is the dc criterion useful to define a cluster, instead of dr or ddr?
For which points are density reachable symmetric?i.e. for which p, q: dr(p, q) and dr(q, p)?
Express using only core objects and ddr, which objects will belong to a cluster