Page 1: Efficient and Effective Clustering Methods for Spatial ...cis.csuohio.edu/~sschung/CIS660/ClaransClustering.pdfEfficient and Effective Clustering Methods for Spatial Data Mining (1994)

Efficient and Effective Clustering Methods for Spatial Data Mining

Raymond T. Ng, Jiawei Han

Overview

• Spatial Data Mining

• Clustering techniques

• CLARANS

• Spatial and Non-Spatial dominant CLARANS

• Observations

• Summary

Spatial Data Mining

• Identifying interesting relationships and characteristics that may exist implicitly in spatial databases

• Different from relational databases
  • Spatial objects store both spatial and non-spatial attributes
  • Queries (e.g. "All Walmart stores within 10 miles of UH")
  • Spatial joins, which work on spatial indexes (R-tree)
  • Huge sizes (terabytes)

• GIS is a classic example

Partitioning Methods

Given K, the number of partitions to create, a partitioning method constructs an initial partition into K clusters. It then iteratively refines the quality of these clusters so as to maximize intra-cluster similarity and inter-cluster dissimilarity.

[Quality of Clustering]: Average dissimilarity of objects from their cluster centers (medoids)
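
The quality measure above can be made concrete with a small sketch. It assumes 2-D points and Euclidean distance as the dissimilarity; the function name `average_dissimilarity` and the sample data are illustrative and do not come from the slides.

```python
import math

def average_dissimilarity(points, medoids):
    """Mean distance from each object to its nearest medoid (lower is better)."""
    total = sum(min(math.dist(p, m) for m in medoids) for p in points)
    return total / len(points)

# Two well-separated groups of three points each (made-up data).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good = [(1, 1), (8, 8)]   # one medoid per group
poor = [(1, 1), (1, 2)]   # both medoids in the same group

print(average_dissimilarity(points, good))
print(average_dissimilarity(points, poor))
```

A choice of medoids that puts one representative in each group yields a much lower average dissimilarity than one that leaves a group without a nearby medoid.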

Selected algorithms:

1. K-medoids

2. PAM

3. CLARA

4. CLARANS

K-Medoids

• Partition-based clustering (K partitions)

• Effective. Why?
  • Resistant to outliers
  • Does not depend on the order in which data points are examined

• The cluster center is part of the data set, unlike k-means, where the cluster center is the center of gravity (mean) of the cluster

• Experiments show that large data sets are handled efficiently

[Figure: two scatter plots on axes from 0 to 10, one panel labelled K-means and one labelled K-medoids]
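
To illustrate the outlier resistance claimed for K-medoids, here is a minimal sketch that compares the mean (the k-means center) with the medoid of a single cluster. The helper names `centroid` and `medoid` and the data are assumptions made for this example, and Euclidean distance is assumed.

```python
import math

def centroid(points):
    """Center of gravity: usually not a member of the data set."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def medoid(points):
    """Member of the data set with the smallest total distance to the others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

cluster = [(1, 1), (2, 1), (1, 2), (2, 2)]
with_outlier = cluster + [(10, 10)]   # one far-away point

print(centroid(with_outlier))  # pulled toward the outlier
print(medoid(with_outlier))    # stays inside the original group
```

The mean is dragged out of the group by the single outlier, while the medoid remains one of the original four points.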

PAM (Partitioning Around Medoids)

• [Goal]: Find K representative objects of the data set. Each of the K objects is called a medoid: the most centrally located object within its cluster.

PAM (2)

• Start with K data points designated as medoids. Form a cluster around each medoid by assigning every data point to its closest medoid:

$O_j$ belongs to the cluster of $O_i$ if $d(O_j, O_i) = \min_{O_e} d(O_j, O_e)$, where $O_e$ ranges over the current medoids

• Iteratively replace $O_i$ with $O_h$ if the quality of the clustering improves.

• A swapping cost, $C_{ijh}$, is associated with replacing a selected object $O_i$ with a non-selected object $O_h$.
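
The assignment rule and the effect of a swap can be sketched as follows. This is a simplified illustration rather than the slides' formulation: instead of the per-object terms, `swap_cost` computes the total change in cost (the sum of those terms over all objects) by comparing the clustering cost before and after the swap. All function names and the data are assumptions, and Euclidean distance is assumed.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def assign(points, medoids):
    """O_j joins the cluster of the medoid O_i that minimizes d(O_j, O_i)."""
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return clusters

def swap_cost(points, medoids, o_i, o_h):
    """Total cost change when medoid o_i is replaced by non-medoid o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids = [(1, 1), (1, 2)]

delta = swap_cost(points, medoids, o_i=(1, 2), o_h=(8, 8))
print(delta)                                   # negative: the swap helps
print(assign(points, [(1, 1), (8, 8)]))
```

A negative value means the swap improves the clustering, so PAM would accept it.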

PAM (3)

• $O(k(n-k)^2)$ for each iteration

• Good for small data sets (n = 100, k = 5)
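
As a rough check of this bound, assume (as in the usual analysis of PAM) that each of the $k(n-k)$ candidate swaps is evaluated against the $n-k$ non-selected objects. For the small data set quoted above this gives:

```latex
\[
\underbrace{k(n-k)}_{\text{candidate swaps}} \times
\underbrace{(n-k)}_{\text{objects examined per swap}}
= k(n-k)^2,
\qquad
n = 100,\ k = 5:\quad 5 \cdot 95^2 = 45\,125 .
\]
```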

CLARA (Clustering LARge Applications)

• Improvement over PAM

• Finds medoids in a sample drawn from the data set

• [Idea]: If the sample is sufficiently random, the medoids of the sample approximate the medoids of the data set

• [Heuristics]: 5 samples of size 40 + 2k give satisfactory results

• Works well for large data sets (n = 1000, k = 10)
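
The sampling idea can be sketched as below. This is an illustration under stated assumptions, not the reference implementation: `pam` is a deliberately small best-improving-swap search standing in for PAM, Euclidean distance is assumed, and the two-blob data set is synthetic. Only the "5 samples of size 40 + 2k" heuristic is taken from the slide.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Simplified PAM stand-in: repeatedly apply the best improving swap."""
    medoids = rng.sample(points, k)
    improved = True
    while improved:
        improved = False
        best, best_cost = medoids, total_cost(points, medoids)
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best_cost:
                    best, best_cost, improved = trial, cost, True
        medoids = best
    return medoids

def clara(points, k, num_samples=5, seed=0):
    rng = random.Random(seed)
    sample_size = min(len(points), 40 + 2 * k)
    best, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam(sample, k, rng)           # cluster the sample only
        cost = total_cost(points, medoids)      # judge it on the full data set
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost

# Synthetic data: two Gaussian blobs of 150 points each.
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(150)]
data += [(gen.gauss(10, 1), gen.gauss(10, 1)) for _ in range(150)]

medoids, cost = clara(data, k=2)
print(medoids, cost / len(data))
```

Each sample is clustered on its own, but the resulting medoids are always scored against the whole data set, which is what lets the best of the five samples be chosen.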

CLARANS (Clustering Large Applications based on RANdomized Search)

• A graph abstraction, $G_{n,k}$

• Each vertex is a set of k medoids

• Two vertices $S_1$ and $S_2$ are neighbors if $|S_1 \cap S_2| = k - 1$

• Each node has k(n-k) neighbors

• The cost of each node is the total dissimilarity of objects to their medoids

• PAM searches the whole graph

• CLARA searches a subgraph

[Figure: graph whose nodes are medoid sets such as {Oa1, ..., Oak}, {Ob1, ..., Obk}, {Oc1, ..., Ock}, {Od1, ..., Odk} and {Om1, ..., Omk}, with labels S1 and S2]
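
The neighbor relation in the graph abstraction can be checked with a short sketch. Objects are represented by hypothetical integer ids, and `neighbors` is an illustrative helper, not something defined in the slides.

```python
def neighbors(node, objects):
    """All medoid sets that differ from `node` in exactly one object."""
    node = frozenset(node)
    outside = [o for o in objects if o not in node]
    return {(node - {old}) | {new} for old in node for new in outside}

objects = range(6)   # n = 6 object ids (made up)
node = {0, 1}        # k = 2 medoids
nbrs = neighbors(node, objects)

print(len(nbrs))     # k * (n - k)
print(all(len(frozenset(node) & s) == 1 for s in nbrs))
```

Every neighbor shares k - 1 medoids with the starting node, and there are k(n-k) of them, matching the counts stated above.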

CLARANS (2)

Experimental values

• numLocal = 2

• maxNeighbors = max(1.25% of k(n-k), 250)
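
The two parameters control a randomized search that can be sketched roughly as follows. This is a minimal sketch based on the general description of CLARANS (random restarts, random neighbor sampling, a cap on consecutive failed tries), not a transcription of the authors' pseudocode. Euclidean distance, distinct objects, rounding the 1.25% figure up, and the synthetic data are all assumptions of the example.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbors=None, seed=0):
    n = len(points)
    if not 0 < k < n:
        raise ValueError("need 0 < k < number of objects")
    if max_neighbors is None:
        # Slide heuristic; rounding up is an assumption of this sketch.
        max_neighbors = max(math.ceil(0.0125 * k * (n - k)), 250)
    rng = random.Random(seed)
    best, best_cost = None, math.inf
    for _ in range(num_local):                   # numLocal restarts
        current = rng.sample(points, k)          # random starting node
        current_cost = total_cost(points, current)
        tries = 0
        while tries < max_neighbors:
            # Random neighbor: replace one medoid with one non-medoid.
            i = rng.randrange(k)
            candidate = rng.choice(points)
            if candidate in current:
                continue
            neighbor = current[:i] + [candidate] + current[i + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:
                current, current_cost, tries = neighbor, cost, 0
            else:
                tries += 1
        # `current` is treated as a local minimum; keep the best one seen.
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost

# Synthetic data: two Gaussian blobs of 30 points each.
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(30)]
data += [(gen.gauss(10, 1), gen.gauss(10, 1)) for _ in range(30)]

medoids, cost = clarans(data, k=2)
print(medoids, cost)
```

Unlike CLARA, no fixed sample is drawn up front: each step looks at one randomly chosen neighbor of the current node, so the search is not confined to a subset of the data.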

CLARANS (3)

• Outperforms PAM and CLARA in terms of running time and quality of clustering

• $O(n^2)$ for each iteration

[Figures: CLARANS vs PAM; CLARANS vs CLARA]

Generalization

• Useful for mining non-spatial attributes

• The process of merging tuples based on a concept hierarchy

• DBLearn: takes an SQL query, a generalization hierarchy and a threshold as input

[Figure: initial relation and generalized relation for Sphere(color, diameter)]
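
The merging step can be illustrated with a toy version of the Sphere(color, diameter) relation. The concept hierarchy, the diameter cut-off, and the tuples below are invented for the example, and the sketch performs a single generalization pass; it does not model DBLearn's threshold or its SQL interface.

```python
from collections import Counter

# Hypothetical concept hierarchy for the color attribute.
COLOR_PARENT = {"crimson": "red", "scarlet": "red", "navy": "blue", "sky": "blue"}

def diameter_class(diameter):
    """Hypothetical hierarchy for diameter: a single cut-off at 10."""
    return "small" if diameter < 10 else "large"

def generalize(relation):
    """Replace each value by its parent concept, then merge identical tuples."""
    merged = Counter()
    for color, diameter in relation:
        merged[(COLOR_PARENT.get(color, color), diameter_class(diameter))] += 1
    return merged

initial = [("crimson", 3), ("scarlet", 4), ("navy", 25), ("sky", 30), ("navy", 2)]
generalized = generalize(initial)
for row, count in generalized.items():
    print(row, count)
```

Five initial tuples collapse into three generalized tuples, each carrying a count of how many original tuples it covers.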

Silhouette

Silhouette of object Oj
  • Determines how much Oj belongs to its cluster
  • Between -1 and 1
  • 1 indicates a high degree of membership

Silhouette width of a cluster
  • Average silhouette of all objects in the cluster

Silhouette coefficient
  • Average of the silhouette widths of the k clusters

Silhouette width    Interpretation
0.71 – 1            Strong cluster
0.51 – 0.7          Reasonable cluster
0.26 – 0.5          Weak or artificial cluster
≤ 0.25              No cluster found
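
The three quantities can be computed as in the sketch below. It assumes the standard silhouette definition, s = (b - a) / max(a, b), where a is the average distance from the object to the rest of its own cluster and b is the smallest average distance to any other cluster. The slides do not spell out this formula, so treat it as an assumption. Euclidean distance and at least two clusters are also assumed, and the function names and data are illustrative.

```python
import math

def silhouette(j, own, others):
    """Silhouette of the j-th object of cluster `own`."""
    obj = own[j]
    rest = own[:j] + own[j + 1:]
    if not rest:
        return 0.0                       # convention for singleton clusters
    a = sum(math.dist(obj, p) for p in rest) / len(rest)
    b = min(sum(math.dist(obj, p) for p in c) / len(c) for c in others)
    return (b - a) / max(a, b)

def silhouette_width(index, clusters):
    """Average silhouette of all objects in one cluster."""
    own = clusters[index]
    others = clusters[:index] + clusters[index + 1:]
    return sum(silhouette(j, own, others) for j in range(len(own))) / len(own)

def silhouette_coefficient(clusters):
    """Average of the silhouette widths of the k clusters."""
    widths = [silhouette_width(i, clusters) for i in range(len(clusters))]
    return sum(widths) / len(widths)

clusters = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
coefficient = silhouette_coefficient(clusters)
print(coefficient)
```

For these two tight, well-separated groups the coefficient falls in the "strong cluster" band of the table.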

SD and NSD approach

• SD: Spatial Dominant

• NSD: Non-Spatial Dominant

• Clustering for spatial attributes, generalization for non-spatial attributes

• Dominance is decided by which is carried out first (clustering or generalization)

• The second phase works on the tuples produced by the first phase

SD(CLARANS)

[Flowchart: specify the learning request in the form of an SQL query → SQL retrieves tuples (Oi, Oj, Oh, ...) from the data → CLARANS on spatial attributes → Knat clusters → for every cluster, collect non-spatial components and apply DBLearn]

• Finds non-spatial generalizations from spatial clustering

• The value of Knat is determined through heuristics using the silhouette coefficients

• The clustering phase can be treated as finding a spatial generalization hierarchy
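
The hand-off between the two phases of SD(CLARANS) can be sketched as below: once the spatial clustering has produced medoids, the non-spatial components of the tuples are collected per cluster so that a generalization step can be run on each group. The tuple layout, the attribute values, and the function name are hypothetical. The medoids are simply assumed to be the output of the clustering phase, and Euclidean distance is assumed.

```python
import math
from collections import defaultdict

def collect_non_spatial(tuples, medoids):
    """Group the non-spatial part of each tuple by its nearest medoid."""
    groups = defaultdict(list)
    for location, attributes in tuples:
        nearest = min(medoids, key=lambda m: math.dist(location, m))
        groups[nearest].append(attributes)
    return dict(groups)

# Hypothetical tuples: ((x, y), (house type, price band)).
tuples = [
    ((1, 1), ("condo", "low")),
    ((1, 2), ("condo", "low")),
    ((8, 8), ("mansion", "high")),
    ((9, 8), ("mansion", "high")),
]
medoids = [(1, 1), (8, 8)]   # assumed result of the spatial clustering phase

groups = collect_non_spatial(tuples, medoids)
for medoid, attributes in groups.items():
    print(medoid, attributes)
```

Each group would then be passed to the generalization step (DBLearn in the slides) to describe the cluster in terms of its non-spatial attributes.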

NSD(CLARANS)

• Finds spatial clusters from non-spatial generalizations

• Clusters may overlap

Observations

• In all previous methods, the quality of mining depends on the SQL query

• CLARANS assumes that the entire data set fits in memory, which is not always the case for large data sets

• The quality of results cannot be guaranteed when N is very large, because the search is randomized

Observations (2)

• Other clustering algorithms proposed for Spatial Data Mining
  • Hierarchical: BIRCH
  • Density based: DBSCAN, GDBSCAN, DBRS
  • Grid based: STING

Summary

• A seminal paper on the use of clustering for spatial data mining

• CLARANS is an effective clustering technique for large data sets

• SD(CLARANS) and NSD(CLARANS) are effective spatial data mining algorithms

References

• Primary
  • Raymond T. Ng, Jiawei Han. Efficient and Effective Clustering Methods for Spatial Data Mining (1994)

• Secondary
  • Raymond T. Ng, Jiawei Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining
  • Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. Clustering for Mining in Large Spatial Databases
  • Ralf Hartmut Güting. An Introduction to Spatial Database Systems