k-means and dbscan

k-Means and DBSCAN Gyozo Gidofalvi Uppsala Database Laboratory


Post on 22-Feb-2016

Page 1: k-Means and DBSCAN

k-Means and DBSCAN

Gyozo Gidofalvi, Uppsala Database Laboratory

Page 2: k-Means and DBSCAN

23-04-22 Gyozo Gidofalvi 2

Announcements

Updated material for assignment 2 on the lab course home page.

Posted sign-up sheets for labs and examinations for assignment 2 outside P1321.

Posted office hours

Page 3: k-Means and DBSCAN


k-Means

Input
• M (set of points)
• k (number of clusters)

Output
• µ1, …, µk (cluster centroids)

k-Means clusters the points in M into k clusters Si, i = 1, …, k, by minimizing the squared error function

$$\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where µi is the centroid of all xj ∈ Si.

Page 4: k-Means and DBSCAN


k-Means algorithm

select (m1 … mK) randomly from M      % initial centroids
do
    (µ1 … µK) = (m1 … mK)
    all clusters Ci = {}
    for each point p in M
        % compute cluster membership of p
        i = argminj(dist(µj, p))
        % assign p to the corresponding cluster:
        Ci = Ci ∪ {p}
    end
    for each cluster Ci               % recompute the centroids
        mi = avg(p in Ci)
while exists mi ≠ µi                  % convergence criterion
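The loop above can be sketched in plain Python. This is a minimal illustration, not the course implementation: the names `dist2` and `kmeans`, the `seed` and `max_iter` parameters, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for the example.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list of tuple points."""
    rng = random.Random(seed)
    m = rng.sample(points, k)                  # initial centroids drawn from M
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                  # max_iter is a safety cap (assumption)
        mu = m
        clusters = [[] for _ in range(k)]
        for p in points:
            # cluster membership of p: index of the nearest centroid
            i = min(range(k), key=lambda j: dist2(mu[j], p))
            clusters[i].append(p)
        # recompute the centroids; an empty cluster keeps its old centroid (assumption)
        m = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else mu[i]
            for i, c in enumerate(clusters)
        ]
        if m == mu:                            # convergence: no centroid moved
            break
    return m, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Comparing squared distances gives the same argmin as comparing distances, so the square root is skipped.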

Page 5: k-Means and DBSCAN


K-Means on three clusters

Page 6: k-Means and DBSCAN


I’m feeling Unlucky

Bad initial points

Page 7: k-Means and DBSCAN


k-Means in practice

How to choose initial centroids
• select randomly among the data points
• generate completely randomly

How to choose k
• study the data
• run k-Means for different k
• measure squared error for each k

Run k-Means many times!
• Get many choices of initial points
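The advice above (several restarts, several values of k, compare the squared error) could look roughly like this in Python. The helper names `lloyd`, `squared_error` and `best_of_restarts`, the number of restarts, and the toy data are assumptions chosen for illustration only.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centroids, max_iter=100):
    """Run k-means iterations from the given initial centroids."""
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: dist2(centroids[i], p))
            groups[j].append(p)
        new = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids

def squared_error(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(dist2(c, p) for c in centroids) for p in points)

def best_of_restarts(points, k, restarts=10, seed=0):
    """Try several random initialisations and keep the lowest-error result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        cents = lloyd(points, rng.sample(points, k))   # init among the data points
        err = squared_error(points, cents)
        if best is None or err < best[0]:
            best = (err, cents)
    return best

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in (1, 2, 3):
    err, _ = best_of_restarts(points, k)
    print(k, round(err, 3))
```

The squared error can only shrink as k grows, so the value itself does not pick k; one usually looks for the k after which the improvement flattens out.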

Page 8: k-Means and DBSCAN


k-Means iteration step in AmosQL

Calculate point-to-centroid distances: calp2c_distance(…)

    select p, c, d
    from Vector of Number p, Vector of Number c, Number d
    where p in bag({iota(1,10)})
      and c in bag({iota(1,10)})
      and d = euclid(p,c);

Assign each point to the closest centroid: calc_cluster_assignment(…)

    groupby((p2c_distances1(…)), #'argminv');

Recalculate centroids: calc_clust_means(…)

    groupby(calc_cluster_assignment1(…), #'col_means');
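To make the three query stages concrete, here is a rough Python analogue of one iteration step: a distance table, a group-by with argmin, and a group-by with column means. It is not a translation of the AmosQL functions; the Python names and the use of point indices as group keys are assumptions for this sketch.

```python
from collections import defaultdict
from math import dist

def p2c_distances(points, centroids):
    """All (point index, centroid, distance) triples, like the select over p, c, d."""
    return [(i, c, dist(p, c)) for i, p in enumerate(points) for c in centroids]

def cluster_assignment(triples):
    """Group by point and keep the centroid with the minimal distance."""
    best = {}
    for i, c, d in triples:
        if i not in best or d < best[i][1]:
            best[i] = (c, d)
    return {i: c for i, (c, _) in best.items()}

def cluster_means(points, assignment):
    """Group by centroid and average the member points column by column."""
    groups = defaultdict(list)
    for i, c in assignment.items():
        groups[c].append(points[i])
    return [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups.values()]

def iteration_step(points, centroids):
    """One k-means step: distances, then assignment, then new centroids."""
    return cluster_means(points, cluster_assignment(p2c_distances(points, centroids)))

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(iteration_step(points, [(1.0, 1.0), (9.0, 1.0)]))
```

In this sketch a centroid that attracts no points simply disappears from the result, because a group-by only produces groups that have members.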

Page 9: k-Means and DBSCAN


Transitive closure

tclose is a second-order function to explore graphs where the edges are expressed by a transition function fno.

tclose(Function fno, Object o) -> Bag of Object

• fno(o) produces the children of o
• tclose applies the transition function fno(o), then fno(fno(o)), then fno(fno(fno(o))), etc., until fno returns no new results
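The behaviour described above can be imitated in Python with a breadth-first traversal. This is only a sketch of the idea, not the AmosQL implementation; in particular, including the start object o in the result is an assumption made here, and the `edges` graph is made-up example data.

```python
from collections import deque

def tclose(fno, o):
    """Objects reached from o by repeatedly applying fno, in breadth-first order."""
    seen = {o}
    order = [o]                 # assumption: the start object is part of the result
    queue = deque([o])
    while queue:
        current = queue.popleft()
        for child in fno(current):
            if child not in seen:       # stop expanding once nothing new appears
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Hypothetical example graph: node -> list of children.
edges = {1: [2, 3], 2: [4], 3: [4], 4: [1]}

def children(node):
    return edges.get(node, [])

print(tclose(children, 1))
```

The `seen` set is what makes the traversal terminate on cyclic graphs such as the one in the example.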

Page 10: k-Means and DBSCAN


Iterate until convergence with tclose in AmosQL

create function bagidiv2(Bag of Number b) -> Bag of Bag of Number
  as (select floor(n/2)
      from Number n
      where n in b);

create function vecchild_idiv2(Vector of Number vb) -> Bag of Vector of Number
  as sort(bagidiv2(in(vb)));

create function vecconverge_tclose(Bag of Number ib) -> Bag of Vector of Number
  /* tclose function iterating the vecchild_idiv2 function until convergence */
  as select ov
     from Vector of Number ov
     where ov in tclose(#'vecchild_idiv2', sort(ib));
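A small Python analogue shows what "iterate until convergence" means for this halving example. The function names are hypothetical, tuples stand in for AmosQL vectors, and returning the start vector as the first result is an assumption of this sketch.

```python
def child_idiv2(vec):
    """Halve every element (floor division) and return the sorted result."""
    return tuple(sorted(n // 2 for n in vec))

def converge(start):
    """Apply child_idiv2 repeatedly until it produces a vector already seen."""
    current = tuple(sorted(start))
    seen = [current]            # assumption: the start vector is part of the result
    while True:
        nxt = child_idiv2(current)
        if nxt in seen:         # no new result: the iteration has converged
            return seen
        seen.append(nxt)
        current = nxt

for vec in converge([10, 7, 3]):
    print(vec)
```

The k-means loop has the same shape: the "vector" is the list of centroids, the transition function is one iteration step, and convergence is reached when the step returns centroids that were already produced.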

Page 11: k-Means and DBSCAN


What about this?!

Non-spherical clusters

Noise

Page 12: k-Means and DBSCAN


k-Means pros and cons

+
• Easy
• Fast
• Scalable?

-
• Works only for "well-shaped" clusters
• Sensitive to outliers
• Sensitive to noise
• Must know k a priori

Page 13: k-Means and DBSCAN


Questions

Euclidean distance results in spherical clusters
• What cluster shape does the Manhattan distance give?
• Think of other distance measures too. What cluster shapes will those yield?

Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point
• give an approximation of the complexity of the algorithm expressed in K, I, N, and X.

Can the K-means algorithm be parallelized?
• How?

Page 14: k-Means and DBSCAN


DBSCAN

Density Based Spatial Clustering of Applications with Noise

Basic idea:
• If an object p is density connected to q, then p and q belong to the same cluster
• If an object is not density connected to any other object it is considered noise

Page 15: k-Means and DBSCAN


Definitions

ε-neighborhood
• The ε-neighborhood of an object p is the set of objects within distance ε of p

core object
• An object q is a core object iff there are at least MinPts objects in q's ε-neighborhood

directly density reachable (ddr)
• An object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q
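The three definitions on this slide translate almost word for word into Python. The sketch below makes two assumptions that the slide leaves open: "within distance ε" is read as less than or equal to ε, and an object counts as a member of its own neighborhood. Function names and sample points are illustrative only.

```python
from math import dist

def eps_neighborhood(points, p, eps):
    """Indices of all objects within distance eps of points[p] (p itself included)."""
    return [i for i, x in enumerate(points) if dist(points[p], x) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object iff its eps-neighborhood holds at least min_pts objects."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def ddr(points, p, q, eps, min_pts):
    """True iff p is directly density reachable from q."""
    return is_core(points, q, eps, min_pts) and p in eps_neighborhood(points, q, eps)

points = [(0, 0), (0, 1), (1, 0), (5, 5), (0, 2.2)]
eps, min_pts = 1.5, 3

print(eps_neighborhood(points, 1, eps))      # neighbors of (0, 1)
print(is_core(points, 1, eps, min_pts))      # (0, 1) has enough neighbors
print(ddr(points, 4, 1, eps, min_pts))       # border point is ddr from a core object
print(ddr(points, 1, 4, eps, min_pts))       # but not the other way around
```

The last two calls illustrate that ddr is not symmetric: a border point can be reached from a core object, but nothing is directly density reachable from a non-core object.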

Page 16: k-Means and DBSCAN


Reachability and Connectivity

density reachable (dr)
An object p is dr from q iff there exists a chain of objects q1 … qn s.t.
- q1 is ddr from q,
- q2 is ddr from q1,
- q3 is ddr from …
- and p is ddr from qn

density connected (dc)
p is dc to r iff
- there exists an object q such that p is dr from q
- and r is dr from q
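Density reachability and density connectivity can be sketched in Python by following ddr chains until nothing new is found. As before, the neighborhood test uses less than or equal to ε and counts an object as its own neighbor; these conventions, the function names, and the sample points on a line are assumptions for illustration.

```python
from math import dist

def ddr_from(points, q, eps, min_pts):
    """Indices directly density reachable from q (empty unless q is a core object)."""
    nb = [i for i, x in enumerate(points) if dist(points[q], x) <= eps]
    return nb if len(nb) >= min_pts else []

def dr_from(points, q, eps, min_pts):
    """Indices density reachable from q: follow ddr chains until nothing new appears."""
    reached, frontier = set(), [q]
    while frontier:
        current = frontier.pop()
        for p in ddr_from(points, current, eps, min_pts):
            if p not in reached:
                reached.add(p)
                frontier.append(p)
    return reached

def dc(points, p, r, eps, min_pts):
    """True iff some object q exists from which both p and r are density reachable."""
    return any({p, r} <= dr_from(points, q, eps, min_pts) for q in range(len(points)))

# Five points on a line plus one far away; the end points 0 and 4 are border points.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
eps, min_pts = 1, 3

print(sorted(dr_from(points, 2, eps, min_pts)))   # everything on the line
print(sorted(dr_from(points, 0, eps, min_pts)))   # nothing: 0 is not a core object
print(dc(points, 0, 4, eps, min_pts))             # both borders hang off the same cores
print(dc(points, 0, 5, eps, min_pts))             # the far point is connected to nothing
```

The two border points are not density reachable from each other, yet they are density connected through the core objects between them.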

Page 17: k-Means and DBSCAN


Recall…

Basic idea:
• If an object p is density connected to q, then p and q belong to the same cluster
• If an object is not density connected to any other object it is considered noise

Page 18: k-Means and DBSCAN


DBSCAN

i = 1
do
    take a point p from M
    find the set of points P which are density connected to p    % HOW?
    if P = {}
        M = M \ {p}
    else
        Ci = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}

Page 19: k-Means and DBSCAN


Finding density connected components

If r is dc to p, there exists q s.t. both p and r are dr from q, i.e., there exists a ddr-chain from q to both r and p, and q is a core object.

Recall: tclose is a second order function to explore graphs where the edges are expressed by a transition function fno.

fno = ddr

Page 20: k-Means and DBSCAN


Finding dc components in AmosQL

Assuming q is a core object and that a ddr function with the following signature is defined:

    ddr(Integer q) -> Bag of Integer p

Then:

    create function dc(Integer q) -> Bag of Integer
      as select p
         from Integer p
         where p in tclose(#'ddr', q);
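Putting the pieces together, a DBSCAN sketch in Python can take the transitive closure of ddr from each unassigned core object and treat whatever is left over as noise. This is a simplified illustration rather than the lab solution: the Python `tclose` is a stand-in for the AmosQL function, clusters are started only from core objects, a border point shared by two clusters goes to the first one found, and the neighborhood conventions (less than or equal to ε, an object is its own neighbor) are assumptions.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return (clusters, noise) as lists of point indices."""

    def ddr(q):
        # objects directly density reachable from q; empty unless q is a core object
        nb = [i for i, x in enumerate(points) if dist(points[q], x) <= eps]
        return nb if len(nb) >= min_pts else []

    def tclose(fno, o):
        # apply fno repeatedly until no new objects turn up
        seen, stack = {o}, [o]
        while stack:
            for child in fno(stack.pop()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    remaining = set(range(len(points)))
    clusters = []
    for q in range(len(points)):
        if q in remaining and ddr(q):            # q is an unassigned core object
            cluster = tclose(ddr, q) & remaining
            clusters.append(sorted(cluster))
            remaining -= cluster
    return clusters, sorted(remaining)           # leftovers are noise

points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
          (10, 0), (11, 0), (12, 0), (20, 0)]
clusters, noise = dbscan(points, eps=1, min_pts=3)
print(clusters)
print(noise)
```

Recomputing every neighborhood by a linear scan makes this quadratic in the number of points; a spatial index is the usual remedy, which is one way to read the "Scalable?" question.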

Page 21: k-Means and DBSCAN


DBSCAN pros and cons

+
• Clusters of arbitrary shape
• Robust to noise
• Does not need an a priori k
• Deterministic
• Scalable?

-
• Requires connected regions of sufficiently high density
• Data sets with varying densities are problematic

Page 22: k-Means and DBSCAN


Questions

Why is the dc criterion useful to define a cluster, instead of dr or ddr?

For which points is density reachability symmetric, i.e., for which p, q do both dr(p, q) and dr(q, p) hold?

Express, using only core objects and ddr, which objects will belong to a cluster.