Clustering Methods

Page 1: Clustering Methods

Clustering Analysis (of Spatial Data and using Peano Count Trees) (P-tree technology is patented by NDSU)

Notes: 1. Over 100 slides – not going to go through each in detail.

Page 2: Clustering Methods

Clustering Methods

A Categorization of Major Clustering Methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods

Page 3: Clustering Methods

Clustering Methods based on Partitioning

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion:
– k-means (MacQueen '67): each cluster is represented by the center (mean) of the cluster.
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one object in the cluster (roughly the middle, or median-like, object).

Page 4: Clustering Methods

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps (it assumes the partitioning criterion is: maximize intra-cluster similarity and minimize inter-cluster similarity; of course, a heuristic is used, so the method isn't really an optimization). A small code sketch follows the steps.

1. Partition the objects into k nonempty subsets (or pick k initial means).

2. Compute the mean (center, or centroid) of each cluster of the current partition (if one started with k means, this step is already done). The centroid is roughly the point that minimizes the sum of dissimilarities, or the sum of squared errors, to the cluster's points. Assign each object to the cluster with the most similar (closest) center.

3. Go back to Step 2.

4. Stop when the new set of means doesn't change (or some other stopping condition holds).
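A minimal Python sketch of these four steps (illustrative only, not the vertical/P-tree formulation discussed later in these slides; the random initialization and the NumPy array representation are assumptions of the sketch):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # step 1: pick k initial means
    for _ in range(max_iter):
        # step 2: assign each object to the most similar (closest) center
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each cluster's mean (keep the old mean if a cluster empties)
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):                   # step 4: means no longer change
            break
        means = new_means                                   # step 3: go back to step 2
    return labels, means

For example, kmeans(np.array([[1., 1.], [1.2, 0.9], [8., 8.], [8.1, 7.9]]), k=2) separates the two obvious groups.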

Page 5: Clustering Methods

k-Means

(Figure: k-means iterations on a small 2-D point set, Steps 1–4, plotted on 0–10 axes.)

Page 6: Clustering Methods

The K-Means Clustering Method

Strength:
– Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally k, t << n.

Weakness:
– Applicable only when a mean is defined (e.g., a vector space).
– Need to specify k, the number of clusters, in advance.
– It is sensitive to noisy data and outliers, since a small number of such data can substantially influence the mean value.

Page 7: Clustering Methods

The K-Medoids Clustering Method

Find representative objects, called medoids (each must be an actual object in the cluster, whereas the mean seldom is).

PAM (Partitioning Around Medoids, 1987):
– starts from an initial set of medoids
– iteratively replaces one of the medoids by a non-medoid
– if the swap improves the aggregate similarity measure, retain it; do this over all medoid–non-medoid pairs
– PAM works for small data sets, but does not scale to large data sets.

CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990): sub-samples the data, then applies PAM.

CLARANS (Clustering Large Applications based on RANdom Search) (Ng & Han, 1994): randomizes the sampling.

Page 8: Clustering Methods

PAM (Partitioning Around Medoids) (1987)

Use real objects to represent the clusters:
– Select k representative objects arbitrarily.
– For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCi,h.
– For each pair of i and h:
  • If TCi,h < 0, i is replaced by h.
  • Then assign each non-selected object to the most similar representative object.
– Repeat steps 2–3 until there is no change.
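A rough Python sketch of the swap loop above, for the small data sets PAM is meant for (the helper names, the random initial medoids, and the use of a precomputed dissimilarity matrix D are assumptions of the sketch):

import numpy as np

def total_cost(D, medoids):
    # each object contributes its dissimilarity to the nearest representative
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))    # select k representatives arbitrarily
    improved = True
    while improved:                                          # repeat until there is no change
        improved = False
        for i in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                tc = total_cost(D, candidate) - total_cost(D, medoids)   # TCi,h
                if tc < 0:                                   # keep the swap if it improves things
                    medoids = candidate
                    improved = True
    labels = D[:, medoids].argmin(axis=1)                    # assign to most similar representative
    return medoids, labels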

Page 9: Clustering Methods

CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw, 1990) draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.

Strength: deals with larger data sets than PAM.

Weakness:
– Efficiency depends on the sample size.
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

Page 10: Clustering Methods

CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)

CLARANS draws sample of neighbors dynamically The clustering process can be presented as searching a graph

where every node is a potential solution, that is, a set of k medoids If the local optimum is found, CLARANS starts with new

randomly selected node in search for a new local optimum (Genetic-Algorithm-like)

Finally the best local optimum is chosen after some stopping condition.

It is more efficient and scalable than both PAM and CLARA

Page 11: Clustering Methods

Distance-based partitioning has drawbacks

Simple and fast: O(N). The number of clusters, K, has to be chosen arbitrarily before it is known how many clusters are appropriate. Produces round-shaped clusters, not arbitrary shapes (Chameleon data set below).

Sensitive to the selection of the initial partition; may converge to a local minimum of the criterion function if the initial partition is not well chosen.

(Figure: the correct result vs. the k-means result on the Chameleon data set.)

Page 12: Clustering Methods

Distance-based partitioning (Cont.)

If we start with A, B, and C as the initial centroids around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}}, shown by ellipses.

Whereas the correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means (rectangular clusters).

Page 13: Clustering Methods

A Vertical Data Approach

Partition the data set using rectangle P-trees (a gridding). These P-trees can be viewed as a grouping (partition) of the data.

Prune out outliers by disregarding sparse values:
Input: total number of objects (N), percentage of outliers (t). Output: grid P-trees after pruning.
(1) Choose the grid P-tree with the smallest root count (Pgc).
(2) outliers := outliers OR Pgc.
(3) If (|outliers| / N <= t), then remove Pgc and repeat (1)–(2).

Find clusters using the PAM method, treating each grid P-tree as an object.
– Note: when we have a P-tree mask for each cluster, the mean is just the vector sum of the basic P-trees ANDed with the cluster P-tree, divided by the root count of the cluster P-tree.

Page 14: Clustering Methods

Distance Function

Data Matrix (n objects × p variables):

  x11 … x1f … x1p
  …
  xi1 … xif … xip
  …
  xn1 … xnf … xnp

Dissimilarity Matrix (n objects × n objects, lower triangular):

  0
  d(2,1)  0
  d(3,1)  d(3,2)  0
  …
  d(n,1)  d(n,2)  …  d(n,n-1)  0

Euclidean distance between objects i and j:

  d(i,j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + … + (xip - xjp)^2 )
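A small Python sketch of the distance and the matrix above (illustrative only; the function names are not from the slides):

import numpy as np

def euclidean(xi, xj):
    # d(i,j) = sqrt( sum over f of (x_if - x_jf)^2 )
    return np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def dissimilarity_matrix(X):
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            D[i, j] = D[j, i] = euclidean(X[i], X[j])   # d(i,j) = d(j,i); the diagonal stays 0
    return D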

Page 15: Clustering Methods

AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990). Uses the single-link method (the distance between two sets is the minimum pairwise distance). Merges the nodes that are most similar. Eventually all nodes belong to the same cluster.

(Figure: three snapshots of AGNES agglomerating a small 2-D point set on 0–10 axes.)

Page 16: Clustering Methods

DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990).

Inverse order of AGNES: initially all objects are in one cluster, which is then split according to some criterion (e.g., maximize some aggregate measure of pairwise dissimilarity again).

Eventually each node forms a cluster on its own.

(Figure: three snapshots of DIANA dividing a small 2-D point set on 0–10 axes.)

Page 17: Clustering Methods

Contrasting Clustering Techniques

Partitioning algorithms: Partition a dataset to k clusters, e.g., k=3

Hierarchical alg: Create hierarchical decomposition of ever-finer partitions.

e.g., top down (divisively).

bottom up (agglomerative)

Page 18: Clustering Methods

Hierarchical Clustering

(Figure: agglomerative (bottom-up) clustering of objects a, b, c, d, e. Step 0: singletons; Step 1: {a,b}; Step 2: {d,e}; Step 3: {c,d,e}; Step 4: {a,b,c,d,e}.)

Page 19: Clustering Methods

Hierarchical Clustering (top down)

In either case, one gets a nice dendrogram in which any maximal anti-chain (no two nodes linked) is a clustering (partition).

(Figure: divisive (top-down) clustering of a, b, c, d, e, running the agglomerative steps in reverse: Step 0: {a,b,c,d,e}; Step 1: {a,b} and {c,d,e}; Step 2: {c} and {d,e}; Steps 3–4: singletons.)

Page 20: Clustering Methods

Hierarchical Clustering (Cont.)

Recall that any maximal anti-chain (maximal set of nodes in which no two are chained) is a clustering (a dendrogram offers many).

Page 21: Clustering Methods

Hierarchical Clustering (Cont.)

But the "horizontal" anti-chains are the clusterings resulting from the top-down (or bottom-up) method(s).

Page 22: Clustering Methods

Hierarchical Clustering (Cont.)

Most hierarchical clustering algorithms are variants of single-link, complete-link, or average-link clustering. Of these, single-link and complete-link are the most popular. (A small sketch of the three inter-cluster distances follows below.)

– In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn one from each cluster.

– In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between pairs of patterns drawn one from each cluster.

– In the average-link algorithm, the distance between two clusters is the average of all pairwise distances between pairs of patterns drawn one from each cluster (which is the same as the distance between the means in the vector-space case, and easier to calculate).
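A small Python sketch of the three inter-cluster distances, given a pairwise distance matrix D and clusters as index lists (illustrative only):

import numpy as np

def single_link(D, c1, c2):
    return D[np.ix_(c1, c2)].min()    # minimum pairwise distance

def complete_link(D, c1, c2):
    return D[np.ix_(c1, c2)].max()    # maximum pairwise distance

def average_link(D, c1, c2):
    return D[np.ix_(c1, c2)].mean()   # average of all pairwise distances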

Page 23: Clustering Methods

Distance Between Clusters

Single Link: smallest distance between any pair of points from two clusters

Complete Link: largest distance between any pair of points from two clusters

Page 24: Clustering Methods

Distance between Clusters (Cont.)

Average Link: average distance between points from two clusters

• Centroid: distance between centroids of the two clusters

Page 25: Clustering Methods

Single Link vs. Complete Link (Cont.)

(Figure: one data set where single link works but complete link does not, and two where complete link works but single link does not.)

Page 26: Clustering Methods

Single Link vs. Complete Link (Cont.)

Single link works; complete link doesn't.
(Figure: two chain-like clusters, points labeled 1 and 2.)

Page 27: Clustering Methods

Single Link vs. Complete Link (Cont.)

Single link doesn't work; complete link does.
(Figure: cluster 1, noise, cluster 2.)

Page 28: Clustering Methods

Hierarchical vs. Partitional

Hierarchical algorithms are more versatile than partitional algorithms. – For example, the single-link clustering algorithm works well on data

sets containing non-isotropic (non-roundish) clusters including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters.

On the other hand, the time and space complexities of the partitional algorithms are typically lower than those of the hierarchical algorithms.

Page 29: Clustering Methods

More on Hierarchical Clustering Methods Major weakness of agglomerative clustering methods

– do not scale well: time complexity of at least O(n2), where n is the number of total objects

– can never undo what was done previously (greedy algorithm)

Integration of hierarchical with distance-based clustering– BIRCH (1996): uses Clustering Feature tree (CF-tree) and

incrementally adjusts the quality of sub-clusters

– CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

– CHAMELEON (1999): hierarchical clustering using dynamic modeling

Page 30: Clustering Methods

Density-Based Clustering Methods

Clustering based on density (local cluster criterion), such as density-connected points

Major features:– Discover clusters of arbitrary shape– Handle noise– One scan– Need density parameters as termination condition

Several interesting studies:– DBSCAN: Ester, et al. (KDD’96)– OPTICS: Ankerst, et al (SIGMOD’99).– DENCLUE: Hinneburg & D. Keim (KDD’98)– CLIQUE: Agrawal, et al. (SIGMOD’98)

Page 31: Clustering Methods

Density-Based Clustering: Background

Two parameters:
– ε (Eps): maximum radius of the neighbourhood.
– MinPts: minimum number of points in an ε-neighbourhood of that point.

Nε(p): {q belongs to D | dist(p,q) <= ε}

Directly (density) reachable: a point p is directly density-reachable from a point q wrt. ε, MinPts if
– 1) p belongs to Nε(q), and
– 2) q is a core point: |Nε(q)| >= MinPts.

(Figure: points p and q, with MinPts = 5 and ε = 1 cm.)

Page 32: Clustering Methods

Density-Based Clustering: Background (II)

Density-reachable:
– A point p is density-reachable from a point q wrt. ε, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi.
– For a core point q, q is density-reachable from q.
• Density reachability is reflexive and transitive, but not symmetric, since only core objects can be density-reachable to each other.

Density-connected:
– A point p is density-connected to a point q wrt. ε, MinPts if there is a point o such that both p and q are density-reachable from o wrt. ε, MinPts.
– Density reachability is not symmetric; density connectivity inherits the reflexivity and transitivity and provides the symmetry. Thus density connectivity is an equivalence relation and therefore gives a partition (clustering).

(Figure: p density-reachable from q via p1; p and q density-connected through o.)

Page 33: Clustering Methods

DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as an equivalence class of density-connected points.
– This gives the transitive property for the density-connectivity binary relation, and therefore it is an equivalence relation whose components form a partition (clustering) according to the duality.

Discovers clusters of arbitrary shape in spatial databases with noise.

(Figure: core, border, and outlier points, with ε = 1 cm and MinPts = 3.)

Page 34: Clustering Methods

DBSCAN: The Algorithm
– Arbitrarily select a point p.
– Retrieve all points density-reachable from p wrt. ε, MinPts.
– If p is a core point, a cluster is formed (note: it doesn't matter which of the core points within a cluster you start at, since density reachability is symmetric on core points).
– If p is a border point or an outlier, no points are density-reachable from p, and DBSCAN visits the next point of the database. Keep track of such points; if they don't get "scooped up" by a later core point, then they are outliers.
– Continue the process until all of the points have been processed.
– What about a simpler version of DBSCAN (sketched below):
  • Define core points and core neighborhoods the same way.
  • Define an (undirected graph) edge between two points if they cohabit a core neighborhood.
  • The connectivity-component partition is the clustering.
– Other related methods? How does vertical technology help here? Gridding?
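A brute-force Python sketch of the "simpler version" in the last bullets (illustrative only; the O(n²) distance computation and the parameter names eps and min_pts are assumptions of the sketch):

import numpy as np

def simple_density_clustering(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nbrs = [np.where(D[i] <= eps)[0] for i in range(n)]      # eps-neighborhoods (including self)
    core = [i for i in range(n) if len(nbrs[i]) >= min_pts]  # core points

    # undirected edges: all pairs of points that cohabit some core point's neighborhood
    adj = [set() for _ in range(n)]
    for c in core:
        for p in nbrs[c]:
            adj[p].update(nbrs[c])

    labels, cluster = [-1] * n, 0                            # -1 = not touched by any core neighborhood
    for start in core:                                       # grow each connectivity component
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            p = stack.pop()
            if labels[p] != -1:
                continue
            labels[p] = cluster
            stack.extend(q for q in adj[p] if labels[q] == -1)
        cluster += 1
    return labels                                            # points still labeled -1 are outliers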

Page 35: Clustering Methods

OPTICS

Ordering Points To Identify Clustering Structure– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)– http://portal.acm.org/citation.cfm?id=304187

– Addresses the shortcoming of DBSCAN, namely choosing parameters.

– Develops a special order of the database wrt its density-based clustering structure

– This cluster-ordering contains info equivalent to the density-based clusterings corresponding to a broad range of parameter settings

– Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure

Page 36: Clustering Methods

OPTICS

Does this order resemble the Total Variation order?

Page 37: Clustering Methods

(Figure: the OPTICS reachability plot: reachability-distance plotted against the cluster order of the objects; some values are undefined.)

Page 38: Clustering Methods

DENCLUE: using density functions

DENsity-based CLUstEring by Hinneburg & Keim (KDD'98).

Major features:
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
– Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45 – claimed by the authors ???)
– But needs a large number of parameters

Page 39: Clustering Methods

DENCLUE: Technical Essence

Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure.

Influence function: describes the impact of a data point within its neighborhood.
– F(x,y) measures the influence that y has on x.
– A very good influence function is the Gaussian, F(x,y) = e^(-d²(x,y) / (2σ²)).
– Others include functions similar to the squashing functions used in neural networks.
– One can think of the influence function as a measure of the contribution to the density at x made by y.

The overall density of the data space can be calculated as the sum of the influence functions of all data points. Clusters can then be determined mathematically by identifying density attractors; density attractors are local maxima of the overall density function.

Page 40: Clustering Methods

DENCLUE(D, σ, ξc, ξ)

1. Grid the data set (use r = σ, the std. dev.).
2. Find the (highly) populated cells (use a threshold = ξc) (shown in blue).
3. Identify populated cells (+ nonempty cells).
4. Find the density-attractor points, C*, using hill climbing:
   1. Randomly pick a point, pi.
   2. Compute the local density (use r = 4σ).
   3. Pick another point, pi+1, close to pi, and compute the local density at pi+1.
   4. If LocDen(pi) < LocDen(pi+1), climb.
   5. Put all points within distance σ/2 of the path pi, pi+1, …, C* into a "density-attractor cluster" called C*.
5. Connect the density-attractor clusters, using a threshold, ξ, on the local densities of the attractors.

A sketch of the hill-climbing step appears below.

A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Multimedia Databases with Noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining. AAAI Press, 1998. & KDD 99 Workshop.
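The hill climb of step 4 can be sketched as follows. This is a rough illustration only: it replaces the slide's discrete ring-based step with a Gaussian-weighted (mean-shift style) update, and sigma, the tolerance, and the iteration cap are assumed values rather than part of the original algorithm:

import numpy as np

def density(x, X, sigma):
    # f(x) = sum over y of exp( -d^2(x,y) / (2 sigma^2) ), the Gaussian influence from the previous slide
    d2 = np.sum((X - x) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def hill_climb(x0, X, sigma, tol=1e-4, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d2 = np.sum((X - x) ** 2, axis=1)
        w = np.exp(-d2 / (2 * sigma ** 2))
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()       # density-weighted mean: one climb step
        if density(x_new, X, sigma) - density(x, X, sigma) < tol:
            break                                            # no further climb: x is near a density attractor
        x = x_new
    return x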

Page 41: Clustering Methods

Comparison: DENCLUE Vs DBSCAN

Page 42: Clustering Methods
Page 43: Clustering Methods

BIRCH (1996)

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96).
http://portal.acm.org/citation.cfm?id=235968.233324

Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering– Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression

of the data that tries to preserve the inherent clustering structure of the data) – Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Scales linearly: finds a good clustering with a single scan and improves quality with a few additional scans

Weakness: handles only numeric data, and sensitive to the order of the data record.

Page 44: Clustering Methods

ABSTRACT

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset.

Prior work does not adequately address the problem of large datasets and minimization of I/O costs.

This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases.

BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints).

BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans.

BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.

We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments.

BIRCH

Page 45: Clustering Methods

Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: number of data points
LS: the linear sum of the N data points, Σ(i=1..N) Xi
SS: the square sum of the N data points, Σ(i=1..N) Xi²

Branching factor = max # children.
Threshold = max diameter of a leaf cluster.

Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190)). (A small code check of this example appears below.)
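A small Python check of the CF triple and its additivity (illustrative only; cf and cf_merge are made-up helper names, not BIRCH's API):

import numpy as np

def cf(points):
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)       # N, LS, SS (per dimension)

def cf_merge(cf1, cf2):
    # CF vectors are additive, which is what lets BIRCH maintain them incrementally
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

N, LS, SS = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
# N = 5, LS = [16. 30.], SS = [54. 190.], matching the example above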

Page 46: Clustering Methods

Birch

Branching factor B = 6, threshold L = 7.

(Figure: a CF tree; the root and non-leaf nodes hold entries CF1 … CF6, each pointing to a child, and the leaf nodes hold CF entries chained with prev/next pointers.)

Iteratively put points into the closest leaf until the threshold is exceeded, then split the leaf. Internal nodes summarize their subtrees and get split when the threshold is exceeded. Once the in-memory CF tree is built, use another method to cluster the leaves together.

Page 47: Clustering Methods

CURE (Clustering Using REpresentatives )

CURE: proposed by Guha, Rastogi & Shim, 1998 http://portal.acm.org/citation.cfm?id=276312

– Stops the creation of a cluster hierarchy if a level consists of k clusters

– Uses multiple representative points to evaluate the distance between clusters

– adjusts well to arbitrarily shaped clusters (not necessarily distance-based)

– avoids single-link effect

Page 48: Clustering Methods

Drawbacks of Distance-Based Method

Drawbacks of square-error based clustering method – Consider only one point as representative of a cluster– Good only for convex shaped, similar size and density,

and if k can be reasonably estimated

Page 49: Clustering Methods

Cure: The Algorithm

– Very much a hybrid method (involves pieces from many others).

– Draw random sample s.

– Partition sample to p partitions with size s/p

– Partially cluster partitions into s/pq clusters

– Eliminate outliers

• By random sampling

• If a cluster grows too slowly, eliminate it.

– Cluster partial clusters.

– Label the data on disk

Page 50: Clustering Methods

ABSTRACT

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers.

We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size.

CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction.

Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers.

To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters.

Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms.

Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

Cure

Page 51: Clustering Methods

Data Partitioning and Clustering

– s = 50, p = 2, s/p = 25, s/pq = 5

(Figure: the random sample is split into two partitions, and each partition is partially clustered into five partial clusters.)

Page 52: Clustering Methods

Cure: Shrinking Representative Points

Shrink the multiple representative points towards the gravity center by a fraction α.

Multiple representatives capture the shape of the cluster.

(Figure: a cluster's representative points before and after shrinking.)

Page 53: Clustering Methods

Clustering Categorical Data: ROCK
http://portal.acm.org/citation.cfm?id=351745

ROCK: Robust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE'99).
– Agglomerative hierarchical
– Uses links to measure similarity/proximity
– Not distance based
– Computational complexity: O(n² + n·mm·ma + n²·log n)

Basic ideas:
– Similarity function and neighbors:

  Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|

  Let T1 = {1,2,3}, T2 = {3,4,5}. Then

  Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
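A one-line Python check of the similarity just defined (illustrative only):

def sim(t1, t2):
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)       # |T1 ∩ T2| / |T1 ∪ T2|

print(sim({1, 2, 3}, {3, 4, 5}))             # |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2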

Page 54: Clustering Methods

ROCK

ABSTRACT

Clustering, in data mining, is useful to discover distribution patterns in the underlying data.

Clustering algorithms usually employ a distance-metric-based (e.g., Euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions.

In this paper, we study clustering algorithms for data with boolean and categorical attributes.

We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points.

We develop a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters.

Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge.

In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques.

For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties.

Page 55: Clustering Methods

Rock: Algorithm

Links: the number of common neighbors of the two points.

Algorithm:
– Draw a random sample
– Cluster with links
– Label the data on disk

Example: given the neighbor sets
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5},
link({1,2,3}, {1,2,4}) = 3.

Page 56: Clustering Methods

CHAMELEON

CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han and V. Kumar’99

http://portal.acm.org/citation.cfm?id=621303

Measures the similarity based on a dynamic model– Two clusters are merged only if the interconnectivity and

closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters

A two phase algorithm– 1. Use a graph partitioning algorithm: cluster objects into a large

number of relatively small sub-clusters– 2. Use an agglomerative hierarchical clustering algorithm: find the

genuine clusters by repeatedly combining these sub-clusters

Page 57: Clustering Methods

CHAMELEONABSTRACT

Many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model.

By basing its selections on both interconnectivity and closeness, the Chameleon algorithm yields accurate results for these highly variable clusters.

Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged.

Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores the information about the aggregate interconnectivity of items in two clusters.

Another set of schemes (the Rock algorithm, group averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters.

By considering either interconnectivity or closeness only, these algorithms can select and merge the wrong pair of clusters.

Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters.

Chameleon finds the clusters in the data set by using a two-phase algorithm.

During the first phase, Chameleon uses a graph-partitioning algorithm to cluster the data items into several relatively small subclusters.

During the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these sub-clusters.

Page 58: Clustering Methods

Overall Framework of CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

Page 59: Clustering Methods

Grid-Based Clustering Method

Uses a multi-resolution grid data structure. Several interesting methods:

– STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
– WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
  • A multi-resolution clustering approach using a wavelet method
– CLIQUE: Agrawal, et al. (SIGMOD'98)

Page 60: Clustering Methods

Vertical gridding

We can observe that almost all methods discussed so far suffer from the curse of cardinality (for very large cardinality data sets, the algorithms are too slow to finish in an average lifetime!) and/or the curse of dimensionality (points are all at roughly the same distance).

The work-arounds employed to address the curses:

– Sampling: throw out most of the points in a way that what remains is of low enough cardinality for the algorithm to finish, and in such a way that the remaining "sample" contains all the information of the original data set (therein is the problem: that is impossible to do in general).

– Gridding: agglomerate all points in a grid cell and treat them as one point (smooth the data set to this gridding level). The problem with gridding, often, is that information is lost and the data structure that holds the grid-cell information is very complex. With vertical methods (e.g., P-trees), all the information can be retained and griddings can be constructed very efficiently on demand; horizontal data structures can't do this.

– Subspace restrictions (e.g., Principal Components, subspace clustering, …).

– Gradient-based methods (e.g., the gradient tangent vector field of a response surface reduces the calculations to the number of dimensions, not the number of combinations of dimensions).

j-hi gridding: the j high-order bits identify a grid cell and the rest identify points within a particular cell. Thus j-hi cells are not necessarily cubical (unless all attribute bit-widths are the same).

j-lo gridding: the j low-order bits identify points within a particular cell and the rest identify the grid cell. Thus j-lo cells always have a nice uniform (cubical) shape. (A small sketch of both cell-id computations follows.)
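A minimal Python sketch of the two cell-id computations for integer-coded attributes (the function names and the per-point arithmetic are illustrative; the slides realize these griddings with P-trees rather than per-row code):

def j_hi_cell(point, bitwidths, j):
    # keep the j high-order bits of each coordinate: the grid cell id
    return tuple(x >> (b - j) for x, b in zip(point, bitwidths))

def j_lo_cell(point, j):
    # drop the j low-order bits of each coordinate: the grid cell id (always cubical cells)
    return tuple(x >> j for x in point)

p = (0b101, 0b01, 0b110)            # a point of R(A1, A2, A3) with bit-widths 3, 2, 3
print(j_hi_cell(p, (3, 2, 3), 1))   # (1, 0, 1): the high bit of each attribute
print(j_lo_cell(p, 1))              # (2, 0, 3): the low-order bit dropped in each attribute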

Page 61: Clustering Methods

1-hi gridding of the vector space R(A1, A2, A3), in which all bit-widths are the same (= 3), so each grid cell contains 2²·2²·2² = 64 potential points. Grid cells are identified by their Peano id (Pid); internally the cell coordinates are shown (the grid cell id, gci), and the points of a cell are identified by their coordinates within the cell (gcp).

(Figure: the cell with Pid = gci = 001, showing its 64 potential points labeled gcp; the high bit of each of A1, A2, A3 selects the cell.)

Page 62: Clustering Methods

2-hi gridding of the vector space R(A1, A2, A3), in which all bit-widths are the same (= 3), so each grid cell contains 2¹·2¹·2¹ = 8 points.

(Figure: the cell with Pid = 001.001, gci = 00,00,11, showing its 8 points labeled gcp.)

Page 63: Clustering Methods

1-hi gridding of R(A1, A2, A3), bit-widths 3, 2, 3.

(Figure: the cell with Pid = gci = 001, showing its potential points labeled gcp.)

Page 64: Clustering Methods

2-hi gridding of R(A1, A2, A3), bit-widths 3, 2, 3 (each grid cell contains 2¹·2⁰·2¹ = 4 potential points).

(Figure: the cell with Pid = 3.1.3, showing its 4 potential points.)

Page 65: Clustering Methods

HOBBit disks and rings (HOBBit = High Order Bifurcation Bit): a 4-lo grid, where A1, A2, A3 have bit-widths b1+1, b2+1, b3+1. The HOBBit grid centers are points of the form (exactly one per grid cell)

  x = ( x1,b1 … x1,4 1010,  x2,b2 … x2,4 1010,  x3,b3 … x3,4 1010 ),

where the xi,j range over all binary patterns.

HOBBit disk about x of radius 2⁰: H(x,2⁰). Note: we have switched the direction of A3.

(Figure: the HOBBit center and the 8 points of H(x,2⁰), whose coordinates end in 1010 or 1011 in each dimension.)

Page 66: Clustering Methods

H(x,2¹): the HOBBit disk about a HOBBit grid-center point x, of radius 2¹.

(Figure: the corners of H(x,2¹); coordinates end in patterns from 1000 to 1011 in each dimension.)

Page 67: Clustering Methods

The regions of H(x,2¹) are as follows (each figure shows the A1, A2, A3 axes):

Page 68: Clustering Methods

These regions are labeled with the dimensions in which the length is increased (e.g., all three dimensions are increased below).

(Figure: the 123-region.)

Page 69: Clustering Methods

(Figure: the 13-region.)

Page 70: Clustering Methods

(Figure: the 23-region.)

Page 71: Clustering Methods

(Figure: the 12-region.)

Page 72: Clustering Methods

(Figure: the 3-region.)

Page 73: Clustering Methods

(Figure: the 2-region.)

Page 74: Clustering Methods

(Figure: the 1-region.)

Page 75: Clustering Methods

(Figure: H(x,2⁰) = the 123-region of H(x,2⁰).)

Page 76: Clustering Methods

(Figure: H(x,2⁰) together with its 1-, 2-, 3-, 12-, 13-, 23-, and 123-regions.)

Page 77: Clustering Methods

Algorithm (for computing gradients):

1. Select an "outlier threshold", ot (points without neighbors in their ot L∞-disk are outliers; that is, there is no gradient at these outlier points, since the instantaneous rate of response change is zero).

2. Create a j-lo grid with j = ot (see the previous slides, where HOBBit disks are built out from HOBBit centers x = ( x1,b1 … x1,ot+1 1010, …, xn,bn … xn,ot+1 1010 ), the xi,j ranging over all binary patterns).

3. Pick a point x in R. Build out alternating one-sided rings centered at x until a neighbor is found or the radius ot is exceeded (in which case x is declared an outlier). If a neighbor is found at a radius ri < ot·2^j, then ∂f/∂xk(x) is estimated as below.
   1. Note: one can use L∞-HOBBit or ordinary L∞ distance.
   2. Note: "one-sided" means that each successive build-out increases alternately only in the positive direction in all dimensions, then only in the negative direction in all dimensions.
   3. Note: building out HOBBit disks from a HOBBit center automatically gives "one-sided" rings (a built-out ring is defined to be the built-out disk minus the previous built-out disk), as shown in the next few slides.

   ∂f/∂xk(x) ≈ ( RootCount D(x,ri) - RootCount D(x,ri)k ) / Δxk, where D(x,ri)k is D(x,ri-1) expanded in all dimensions except k.

Alternatively, in step 3, actually calculate the mean (or median?) of the new points encountered in D(x,ri) (we have a P-tree mask for the set, so this is trivial) and measure the xk-distance.

NOTE: One might want to go one more ring out to see whether one gets the same or a similar gradient (this seems particularly important when j is odd, since the gradient then points the opposite way).
NOTE: The calculation of Δxk can be done in various ways. Which is best?

Page 78: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)1 about the HOBBit center x = (x1,b1…x1,4 1010, x2,b2…x2,4 1010, x3,b3…x3,4 1010), showing the first new point and the gradient.)

Est ∂f/∂x1(x) = ( RootCount D(x,ri) - RootCount D(x,ri)1 ) / Δx1 = (2-1)/(-1) = -1

Page 79: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)2.)

Est ∂f/∂x2(x) = ( RootCount D(x,ri) - RootCount D(x,ri)2 ) / Δx2 = (2-1)/(-1) = -1

Page 80: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)3.)

Est ∂f/∂x3(x) = ( RootCount D(x,ri) - RootCount D(x,ri)3 ) / Δx3 = (2-1)/(-1) = -1

Page 81: Clustering Methods

(Figure: H(x,2¹) about the HOBBit center x, with the first new point and the estimated gradient.)

Est ∂f/∂x1(x) = (2-1)/(-1) = -1
Est ∂f/∂x2(x) = (2-1)/(-1) = -1
Est ∂f/∂x3(x) = (2-1)/(-1) = -1

Estimated gradient. Check other regions too?

Page 82: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)1; here the first new point lies where the x1 count does not change.)

Est ∂f/∂x1(x) = ( RootCount D(x,ri) - RootCount D(x,ri)1 ) / Δx1 = (2-2)/(-1) = 0

Page 83: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)2.)

Est ∂f/∂x2(x) = (2-1)/(-1) = -1

Page 84: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)3.)

Est ∂f/∂x3(x) = (2-1)/(-1) = -1

Page 85: Clustering Methods

(Figure: H(x,2¹) with the estimated gradient.)

Est ∂f/∂x2(x) = (2-1)/(-1) = -1
Est ∂f/∂x3(x) = (2-1)/(-1) = -1

Estimated gradient. Check other regions too?

Page 86: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)1 about the HOBBit center x, with the first new point.)

Est ∂f/∂x1(x) = (2-1)/(-1) = -1

Page 87: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)2.)

Est ∂f/∂x2(x) = (2-2)/(-1) = 0

Page 88: Clustering Methods

(Figure: H(x,2¹) and H(x,2¹)3.)

Est ∂f/∂x3(x) = (2-2)/(-1) = 0

Page 89: Clustering Methods

(Figure: H(x,2¹) with the estimated gradient. Check other regions too?)

Est ∂f/∂x1(x) = (2-1)/(-1) = -1

Intuitively, this gradient estimation method seems to work.

Next we consider a potential accuracy improvement in which we take the medoid of all new points as the gradient (or, more accurately, as the point to which we climb in any response-surface hill-climbing technique).

Page 90: Clustering Methods

(Figure: H(x,2¹) with the new points marked and their centroid indicated.)

Estimate the gradient arrowhead as being at the medoid of the new point set (or, more correctly, estimate the next hill-climb step).

Note: if the original points are truly part of a strong cluster, the hill climb will be excellent.

Page 91: Clustering Methods

(Figure: H(x,2¹) with the new points marked and their centroid indicated.)

Estimate the gradient arrowhead as being at the medoid of the new point set (or, more correctly, estimate the next hill-climb step).

Note: if the original points are not truly part of a strong cluster, the weak hill climb will indicate that.

Page 92: Clustering Methods

(Figure: H(x,2²) and H(x,2²)1, with the first new point.)

Est ∂f/∂x1(x) = ( RootCount D(x,ri) - RootCount D(x,ri)1 ) / Δx1 = (2-1)/3 = 1/3
Est ∂f/∂x2(x) = (2-1)/3 = 1/3
Est ∂f/∂x3(x) = (2-1)/3 = 1/3

Note that the gradient points in the right direction and is very short (as it should be!)

Page 93: Clustering Methods

(Figure: the regions H(x,0); H(x,1)1, H(x,1)2, H(x,1)3, H(x,1)12, H(x,1)13, H(x,1)23, H(x,1)123; and the surrounding H(x,2) regions.)

To evaluate how well the formula estimates the gradient, it is important to consider all cases of the new point appearing in one of these regions (if one point appears, gradient components are additive, so it suffices to consider one?).

Page 94: Clustering Methods

(Figure: the same regions, with the H(x,2) regions labeled individually: H(x,2)1, H(x,2)2, H(x,2)3, H(x,2)12, H(x,2)13, H(x,2)23, H(x,2)123.)

To evaluate how well the formula estimates the gradient, it is important to consider all cases of the new point appearing in one of these regions (if one point appears, gradient components add).

Page 95: Clustering Methods

(Figure: the regions extended one more ring out, adding H(x,3)1, H(x,3)3, H(x,3)13, and H(x,3)123.)

Page 96: Clustering Methods

H(x,2³): notice that the HOBBit center moves more and more toward the true center as the grid size increases.

Page 97: Clustering Methods

Grid-Based Gradients and Hill Climbing

If we are using gridding to produce the gradient vector field of a response surface, might we always vary xi in the positive direction only? How can that be done most efficiently?

1. j-lo gridding, building out HOBBit rings from HOBBit grid centers (see the previous slides, where this approach was used), or j-lo gridding, building out HOBBit rings from lo-value grid points (ending in j 0-bits): x = ( x1,b1 … x1,j+1 0…0, …, xn,bn … xn,j+1 0…0 ).
2. Ordinary j-lo gridding, building out rings from lo-value ids (ending in j zero bits).
3. Ordinary j-lo gridding, building out rings from true centers.
4. Other? (There are many other possibilities, but we will first explore 2.)

Option 2, using j-lo gridding with j = 3 and lo-value cell identifiers, is shown on the next slide.

Of course, we need not use HOBBit build-out. With ordinary unit-radius build-out, the results are more exact, but the calculations may be more complex???

Page 98: Clustering Methods

HOBBit j-lo rings using lo-value cell ids, x = ( x1,b1 … x1,j+1 0…0, …, xn,bn … xn,j+1 0…0 ).

(Figure: the nested rings H(x,0); H(x,1)1, H(x,1)2, H(x,1)3, H(x,1)12, H(x,1)13, H(x,1)23, H(x,1)123; and the surrounding H(x,2) regions, built out from the lo-value cell id.)

Page 99: Clustering Methods

Ordinary j-lo rings using lo-value cell ids, x = ( x1,b1 … x1,j+1 0…0, …, xn,bn … xn,j+1 0…0 ):

Disk(x,0)
Ring(x,1)
Ring(x,2) = PDisk(x,2) ^ P'Disk(x,1)
Ring(x,3) = PDisk(x,3) ^ P'Disk(x,2)

where PD(x,i) = Px,b ^ … ^ Px,j+1 ^ P'j ^ … ^ P'i+1.

Page 100: Clustering Methods

k-Medoids Clustering Review: Find representative objects (medoids) (actual objects in the cluster - mean seldom is). PAM (Partitioning Around Medoids)

– Select k representative objects arbitrarily

– For each pair of non-selected object h and selected object i, calculate the total swapping cost TCi,h

– For each pair of i and h,

• If TCi,h < 0, i is replaced by h

• Then assign each non-selected object to the most similar representative object

– repeat steps 2-3 until there is no change

CLARA (Clustering LARge Apps) draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output. Strength: deals with larger data sets than PAM. Weakness: Efficiency depends on the sample size. A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS (Clustering Large Apps based on RANdom Search) draws sample of neighbors dynamically. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. If the local optimum is found, CLARANS starts with new randomly selected node in search for a new local optimum (Genetic-Algorithm-like). Finally the best local optimum is chosen after some stopping condition. It is more efficient and scalable than both PAM and CLARA

Page 101: Clustering Methods

A Vertical k-Medoids Clustering Algorithm
Following PAM (to illustrate the main killer idea; but it can apply much more widely).

Select k component P-trees:
1. The goal here is to efficiently get one P-tree mask for each component.
2. E.g., calculate the smallest j such that the j-lo gridding has > k cells.
3. Agglomerate into precisely k components (by ORing the P-trees of cells with the closest means/medoids/corners (single-link, …)).

Where PAM uses "For each pair of non-selected object h and selected object i, calculate total swapping cost TCi,h. For each pair of i and h, if TCi,h < 0, i is replaced by h; then assign each non-selected object to the most similar object.", use instead:

1. Find the medoid of each component Ci: calculate TV(Ci, x) for all x in Ci (create a P-tree of all points with the minimum TV so far; on a smaller TV, reset this P-tree to it), ending up with a P-tree PtMi of "tying medoids" for Ci. Calculate TV(PtMi, x) for all x in PtMi and pick its medoid (if there are still multiple, repeat). This is the Ci-medoid! Alternatively, just pick one pre-medoid.
   Note: this avoids the expense of pairwise swappings, and avoids subsampling as in CLARA and CLARANS.
2. Put each point with its closest medoid (building P-tree component masks as you do this).
3. Repeat 1 & 2 until some stopping condition holds (such as: no change in the medoid set?).
4. Can we cluster at step 4 without a scan (create component P-trees??)

Page 102: Clustering Methods

A Vertical k-Means Clustering Algorithm

As mentioned on the previous slide, a vertical k-means algorithm goes similarly (a small sketch of the mean computation follows):

1. Select k component P-trees (the goal here is to efficiently get one P-tree mask for each component; e.g., calculate the smallest j such that the j-lo gridding has > k cells, and agglomerate into precisely k components by ORing the P-trees of cells with the closest means, …).

2. Calculate the mean of each component Ch by:
   1. ANDing each basic P-tree with PCh.
   2. In dimension Ak, calculating the k-component of the mean as ( Σ(i=bk..0) 2^i · rc(PCh ^ Pk,i) ) / rc(PCh).

3. Put each point with its closest mean (building P-tree component masks as you do this).

4. Repeat 2 and 3 until some stopping condition holds (such as: no change in the mean set?).

5. Can we cluster at step 4 without a scan (create component P-trees??)

Zillions of vertical hybrid clustering algorithms leap to mind (involving partitioning, hierarchical, and density methods)! Pick one!
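A small Python sketch of the bit-slice mean computation in step 2, with plain boolean arrays standing in for the P-trees (so rc(·) is just a popcount); the names and the toy data are illustrative assumptions:

import numpy as np

def component_mean(cluster_mask, bit_slices):
    # bit_slices[0] is the slice for the high-order bit of attribute Ak, bit_slices[-1] for bit 0
    rc_cluster = cluster_mask.sum()                               # rc(PCh)
    high = len(bit_slices) - 1
    total = sum((2 ** (high - pos)) * (cluster_mask & s).sum()    # sum of 2^i * rc(PCh ^ Pk,i)
                for pos, s in enumerate(bit_slices))
    return total / rc_cluster

values = np.array([5, 7, 2, 6])                                   # attribute Ak, 3-bit values
slices = [((values >> i) & 1).astype(bool) for i in (2, 1, 0)]    # bit slices, high bit first
mask = np.array([True, True, False, True])                        # cluster mask for {5, 7, 6}
print(component_mean(mask, slices))                               # (5 + 7 + 6) / 3 = 6.0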

Page 103: Clustering Methods

Finding Density Attractors for a Density Clustering Algorithm

Finding density attractors (and their attractor sets, which, when agglomerated via a density threshold, constitute the generic density clustering algorithm):

1. Pick a point.
2. Build out (HOBBit or ordinary or ?) rings one at a time until the first k neighbors are found.
3. In that ring, compute the medoid point (as in step 1 of the V-k-Medoids algorithm above).
4. If the medoid increases the density, climb to it and go to 2.
5. Else declare it a density attractor and go to 1.

Page 104: Clustering Methods

j-grids – P-tree relationship

1-hi gridding of R(A1, A2, A3). Bit-widths: 3, 2, 3 (dimension cardinalities: 8, 4, 8). Using tree-node-like identifiers: the cell (tree) id (ci) is of the form c0c1…cd, and the point (coordinate) id (pi) is of the form p1,p2,p3.

(Figure: the 8 grid cells ci = 000 … 111, each containing its points labeled by coordinates 00.0.00 through 11.1.11; below, the corresponding tree, with the root fanning out to the 8 cells and each cell fanning out to its points.)

A 1-hi grid yields a P-tree with level-0 (cell-level) fanout of 2³ and level-1 (point-level) fanout of 2⁵. If the leaves are segment-labelled (not coordinates), they read 000.00, 000.01, 000.10, 000.11, 010.00, 010.01, 010.10, 010.11, …

Page 105: Clustering Methods

j-grids – P-tree relationship (Cont.)

One can view a standard P-tree as nested 1-hi griddings, with constant subtrees compressed out. R(A1, A2, A3) with bit-widths 3, 2, 3.

(Figure: the nested gridding of the cube and the corresponding tree; the first level fans out over the 8 high-bit cells, and each node fans out again over the next bits.)

Page 106: Clustering Methods

Gridding categorical data?

The following bioinformatics (yeast genome) data was extracted mostly from the MIPS database (Munich Information center for Protein Sequences).

The left column shows features:
– treat these with a hi-order bit (1 iff the gene participates)
– there may be more levels of hierarchy (e.g., function: some genes actually cause the function when they express in sufficient quantities, while others are transcription factors for those primary genes; primary genes have the hi bit on, transcription factors have the second bit on, …)

The right column shows the number of distinct feature values:
– bitmap these

Feature         Total Values
pathway         80
EC              622
complexes       316
function        259
localization    43
protein class   191
phenotype       181
interactions    6347

Page 107: Clustering Methods

Data Representation

gene-by-feature table.

For a categorical feature, we consider each category as a separate attribute or column by bit-mapping it.

The resulting table has a total of – 8039 distinct feature bit vectors (corresponding to “items” in

MBR) for– 6374 yeast genes (corresponding to transactions in MBR)

Page 108: Clustering Methods

STING: A Statistical Information Grid Approach

Wang, Yang and Muntz (VLDB’97) The spatial area is divided into rectangular cells There are several levels of cells corresponding to different

levels of resolution

Page 109: Clustering Methods

STING: A Statistical Information Grid Approach (2)

– Each cell at a high level is partitioned into a number of smaller cells in the next lower level

– Statistical info of each cell is calculated and stored beforehand and is used to answer queries

– Parameters of higher level cells can be easily calculated from parameters of lower level cell

• count, mean, s, min, max • type of distribution—normal, uniform, etc.

– Use a top-down approach to answer spatial data queries– Start from a pre-selected layer—typically with a small number of

cells– For each cell in the current level compute the confidence interval

Page 110: Clustering Methods

STING: A Statistical Information Grid Approach (3)

– Remove the irrelevant cells from further consideration– When finish examining the current layer, proceed to the next lower

level – Repeat this process until the bottom layer is reached– Advantages:

• Query-independent, easy to parallelize, incremental update• O(K), where K is the number of grid cells at the lowest level

– Disadvantages:• All the cluster boundaries are either horizontal or vertical, and

no diagonal boundary is detected

Page 111: Clustering Methods

WaveCluster (1998)

Sheikholeslami, Chatterjee, and Zhang (VLDB’98) A multi-resolution clustering approach which applies

wavelet transform to the feature space– A wavelet transform is a signal processing technique that

decomposes a signal into different frequency sub-band. Both grid-based and density-based Input parameters:

– # of grid cells for each dimension– the wavelet, and the # of applications of wavelet transform.

Page 112: Clustering Methods

WaveCluster (1998)

How to apply wavelet transform to find clusters– Summaries the data by imposing a multidimensional

grid structure onto data space– These multidimensional spatial data objects are

represented in a n-dimensional feature space– Apply wavelet transform on feature space to find the

dense regions in the feature space– Apply wavelet transform multiple times which result in

clusters at different scales from fine to coarse

Page 113: Clustering Methods

What Is Wavelet (2)?

Page 114: Clustering Methods

Quantization

Page 115: Clustering Methods

Transformation

Page 116: Clustering Methods

WaveCluster (1998) Why is wavelet transformation useful for clustering

– Unsupervised clustering It uses hat-shape filters to emphasize region where points cluster, but

simultaneously to suppress weaker information in their boundary – Effective removal of outliers– Multi-resolution– Cost efficiency

Major features:– Complexity O(N)– Detect arbitrary shaped clusters at different scales– Not sensitive to noise, not sensitive to input order– Only applicable to low dimensional data

Page 117: Clustering Methods

CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).

http://portal.acm.org/citation.cfm?id=276314

Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space

CLIQUE can be considered as both density-based and grid-based– It partitions each dimension into the same number of equal length interval– It partitions an m-dimensional data space into non-overlapping rectangular

units– A unit is dense if the fraction of total data points contained in the unit

exceeds the input model parameter– A cluster is a maximal set of connected dense units within a subspace

Page 118: Clustering Methods

CLIQUE: The Major Steps

Partition the data space and find the number of points that lie inside each cell of the partition.

Identify the subspaces that contain clusters, using the Apriori principle (a small sketch of this step follows).

Identify clusters:
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest

Generate a minimal description for the clusters:
– Determine maximal regions that cover a cluster of connected dense units, for each cluster
– Determine a minimal cover for each cluster
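A rough Python sketch of the dense-unit and Apriori-candidate steps (the grid size xi, the density threshold tau, and the restriction to 1-D and 2-D subspaces are illustrative assumptions of the sketch, not CLIQUE's actual interface):

import numpy as np
from itertools import combinations
from collections import Counter

def dense_units(X, xi=10, tau=0.05):
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    # interval index of every point in every dimension (xi equal-length intervals)
    idx = np.clip(((X - lo) / (hi - lo + 1e-12) * xi).astype(int), 0, xi - 1)

    # 1-D dense units: (dimension, interval) pairs covering more than tau of the points
    dense1 = {(dim, u) for dim in range(d)
              for u, c in Counter(idx[:, dim]).items() if c / n > tau}

    # candidate 2-D units: both 1-D projections must already be dense (Apriori principle)
    dense2 = set()
    for (d1, u1), (d2, u2) in combinations(sorted(dense1), 2):
        if d1 == d2:
            continue
        c = np.sum((idx[:, d1] == u1) & (idx[:, d2] == u2))
        if c / n > tau:
            dense2.add(((d1, u1), (d2, u2)))
    return dense1, dense2

The connected dense units within a subspace (the clusters) can then be found by a flood fill over adjacent units.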

Page 119: Clustering Methods

ABSTRACTData mining applications place special requirements on clustering algorithms including:

the ability to find clusters embedded in subspaces of high dimensional data,

scalability, end-user comprehensibility of the results,

non-presumption of any canonical data distribution, and

insensitivity to the order of input records.

We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality.

It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension.

It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution.

Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets.

CLIQUE

Page 120: Clustering Methods

[Figure: CLIQUE example. Left plot: Salary (10,000) vs. age (20–60); right plot: Vacation (weeks) vs. age (20–60). The dense units found in each 2-D subspace intersect to give a candidate dense region in the (age, Vacation, Salary) subspace; density threshold = 3.]

Page 121: Clustering Methods

Strength and Weakness of CLIQUE

Strength
– It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
– It is insensitive to the order of records in the input and does not presume any canonical data distribution
– It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
Weakness
– The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Page 122: Clustering Methods

Model-Based Clustering Methods

Attempt to optimize the fit between the data and some mathematical model

Statistical and AI approaches
– Conceptual clustering
  • A form of clustering in machine learning
  • Produces a classification scheme for a set of unlabeled objects
  • Finds a characteristic description for each concept (class)
– COBWEB (Fisher'87)
  • A popular and simple method of incremental conceptual learning
  • Creates a hierarchical clustering in the form of a classification tree
  • Each node refers to a concept and contains a probabilistic description of that concept

Page 123: Clustering Methods

COBWEB Clustering Method

A classification tree

Page 124: Clustering Methods

More on Statistical-Based Clustering

Limitations of COBWEB
– The assumption that the attributes are independent of each other is often too strong, because correlations may exist
– Not suitable for clustering large database data: skewed trees and expensive probability distributions
CLASSIT
– an extension of COBWEB for incremental clustering of continuous data
– suffers from similar problems as COBWEB
AutoClass (Cheeseman and Stutz, 1996)
– Uses Bayesian statistical analysis to estimate the number of clusters
– Popular in industry

Page 125: Clustering Methods

Other Model-Based Clustering Methods

Neural network approaches
– Represent each cluster as an exemplar, acting as a "prototype" of the cluster
– New objects are assigned to the cluster whose exemplar is the most similar according to some distance measure
Competitive learning

– Involves a hierarchical architecture of several units (neurons)

– Neurons compete in a “winner-takes-all” fashion for the object currently being presented

Page 126: Clustering Methods

Model-Based Clustering Methods

Page 127: Clustering Methods

Self-organizing feature maps (SOMs)

Clustering is also performed by having several units competing for the current object

The unit whose weight vector is closest to the current object wins

The winner and its neighbors learn by having their weights adjusted

SOMs are believed to resemble processing that can occur in the brain

Useful for visualizing high-dimensional data in 2- or 3-D space
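A minimal SOM training loop is sketched below, assuming a rectangular grid of units and a Gaussian neighborhood; the decay schedules and grid size are illustrative choices, not part of any particular published SOM variant.

```python
# Minimal self-organizing map sketch: the best-matching unit is the one whose
# weight vector is closest to the input; it and its grid neighbors are pulled
# toward the input with a learning rate and radius that shrink over time.
import numpy as np

def train_som(X, rows=10, cols=10, epochs=20, lr0=0.5, radius0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    weights = rng.random((rows, cols, X.shape[1]))
    # grid coordinates of every unit, used for the neighborhood function
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps = epochs * len(X)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(X):
            t = step / n_steps
            lr = lr0 * (1.0 - t)                     # decaying learning rate
            radius = max(radius0 * (1.0 - t), 0.5)   # decaying neighborhood radius
            # best-matching unit: closest weight vector to the input
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood around the winner on the grid
            grid_dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-grid_dist2 / (2 * radius ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights
```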

Page 128: Clustering Methods

Hybrid Clustering

One common approach is to combine the k-means method with hierarchical clustering:
– First partition the dataset into K small clusters, then merge the clusters based on similarity using a hierarchical method (a scikit-learn sketch follows below)
Hybrid clustering therefore combines the partitioning and hierarchical clustering approaches
[Figure: example dataset first partitioned into K = 7 preliminary clusters]
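A hedged sketch of this two-stage idea using scikit-learn (assumed available): k-means produces the K = 7 preliminary clusters, and their centroids are then merged hierarchically into the final clusters.

```python
# Hybrid clustering sketch: over-partition with k-means, then merge the K
# centroids with agglomerative clustering; each point inherits the merged
# label of its k-means centroid.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def hybrid_clustering(X, k_preliminary=7, n_final=3, seed=0):
    km = KMeans(n_clusters=k_preliminary, n_init=10, random_state=seed).fit(X)
    # hierarchical merge of the preliminary cluster centers
    merge = AgglomerativeClustering(n_clusters=n_final, linkage="average")
    center_labels = merge.fit_predict(km.cluster_centers_)
    return center_labels[km.labels_]   # final cluster id for every point
```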

Page 129: Clustering Methods

Problems of Existing Hybrid Clustering

– Must predefine the number of preliminary clusters, K
– Unable to handle noisy data


Page 130: Clustering Methods

Similarity Measures

Similarity is fundamental to the definition of a cluster.

1. Minkowski metric
2. Context metrics:
   1. Mutual neighbor distance
   2. Conceptual similarity measure

Page 131: Clustering Methods

Minkowski Metric
The Minkowski metric is d_p(xi, xj) = ||xi – xj||_p = (Σ_k |xi,k – xj,k|^p)^(1/p)
– It works well when a data set has "compact" or "isolated" clusters
– The drawback is the tendency of the largest-scaled feature to dominate the others
– Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes (a small sketch follows below)
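For concreteness, a small sketch of the Minkowski metric together with the variance normalization mentioned above.

```python
# Minkowski metric plus unit-variance standardization so that no single
# large-scale feature dominates the distance.
import numpy as np

def minkowski(xi, xj, p=2):
    return np.sum(np.abs(np.asarray(xi) - np.asarray(xj)) ** p) ** (1.0 / p)

def standardize(X):
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

# Example: p=2 gives the Euclidean distance, p=1 the Manhattan distance.
X = standardize(np.array([[1.0, 1000.0], [2.0, 3000.0], [1.5, 1500.0]]))
d = minkowski(X[0], X[1], p=2)
```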

Page 132: Clustering Methods

Mutual neighbor distance (MND)
– MND comes from real-life observations and is based on a non-symmetric similarity (e.g., friendship level)
– Two persons A and B group together as close friends if they mutually feel that the other is their closest friend
– If A feels that B is not such a close friend, then even though B may feel that A is his closest friend, the bond of friendship between them is comparatively weak
– The strength of the bond of friendship between two persons is a function of mutual feeling rather than one-way feeling

Page 133: Clustering Methods

Mutual neighbor distance (MND)

MND(xi, xj) = NN(xi, xj) + NN(xj, xi)

where NN(xi, xj) is the neighbor number (rank) of xj with respect to xi (a sketch of this computation follows below)
The MND is not a metric (it does not satisfy the triangle inequality). It does satisfy the first two conditions of a metric:
1. MND(xi, xj) ≥ 0, and MND(xi, xj) = 0 iff xi = xj (positive definite)
2. MND(xi, xj) = MND(xj, xi) (symmetric)
– Note that the symmetry is manufactured by defining MND as the sum of the two non-symmetric NN measures.
In spite of this, MND has been successfully applied in several clustering applications.
This observation supports the viewpoint that a dissimilarity does not need to be a metric.
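A sketch of the MND computation, under the assumption (consistent with the figures on the next slide) that NN(xi, xj) is the rank of xj in xi's nearest-neighbor ordering, with 1 for the closest other point.

```python
# Mutual neighbor distance: MND(xi, xj) = NN(xi, xj) + NN(xj, xi), where
# NN(xi, xj) is the rank of xj among xi's neighbors (0 is xi itself).
import numpy as np

def mnd_matrix(X):
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # rank[i, j] = position of j in i's sorted neighbor list (0 = i itself)
    order = np.argsort(dist, axis=1)
    rank = np.empty_like(order)
    rows = np.arange(n)[:, None]
    rank[rows, order] = np.arange(n)[None, :]
    mnd = rank + rank.T            # NN(xi, xj) + NN(xj, xi)
    np.fill_diagonal(mnd, 0)
    return mnd
```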

Page 134: Clustering Methods

MND Distance (Cont.)

Figure 4:
– NN(A, B) = 1, NN(B, A) = 1, MND(A, B) = 2
– NN(B, C) = 1, NN(C, B) = 2, MND(B, C) = 3
Figure 5:
– NN(A, B) = 1, NN(B, A) = 4, MND(A, B) = 5
– NN(B, C) = 1, NN(C, B) = 2, MND(B, C) = 3

Page 135: Clustering Methods

Conceptual Similarity Measure
In the case of conceptual clustering, the similarity between xi and xj is defined as

s(xi, xj) = f(xi, xj, σ, ε)

where σ is the context (the set of surrounding points) and ε is a set of pre-defined concepts.
The conceptual similarity measure is the most general similarity measure.

Page 136: Clustering Methods

Conceptual Similarity Measure (Cont.)

The Euclidean distance between points A and B is less than that between B and C.

However, B and C can be viewed as “more similar” than A and B because B and C belong to the same concept (ellipse) and A belongs to a different concept (rectangle).

Page 137: Clustering Methods

Problems and Challenges
Some progress has been made in scalable clustering methods:
– Partitioning: k-means, k-medoids, CLARANS
– Hierarchical: BIRCH, CURE
– Density-based: DBSCAN, CLIQUE, OPTICS
– Grid-based: STING, WaveCluster
– Model-based: AutoClass, Denclue, COBWEB
Current clustering techniques do not address all the requirements adequately
Constraint-based clustering analysis: constraints exist in the data space (e.g., bridges and highways) or in user queries
Still, the curse of cardinality stands as THE difficult problem!

Page 138: Clustering Methods

What Is Outlier Discovery?
What are outliers?
– A set of objects that are considerably dissimilar from the remainder of the data
– Example (sports): Michael Jordan, Wayne Gretzky, ...
Problem
– Find the top n outlier points
Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis

Page 139: Clustering Methods

Outlier Discovery: Statistical Approaches

Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests, which depend on (a toy example follows below)
– the data distribution
– the distribution parameters (e.g., mean, variance)
– the number of expected outliers
Drawbacks
– most tests are for a single attribute
– in many cases, the data distribution may not be known
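As a toy illustration of a discordancy test under an assumed normal model, a simple z-score rule; real tests are chosen according to the assumed distribution, its parameters, and the expected number of outliers.

```python
# Flag values more than z standard deviations from the mean, assuming an
# (approximately) normal single attribute.
import numpy as np

def zscore_outliers(values, z=3.0):
    v = np.asarray(values, dtype=float)
    mu, sigma = v.mean(), v.std()
    return np.where(np.abs(v - mu) > z * sigma)[0]   # indices of flagged values
```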

Page 140: Clustering Methods

Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods
– We need multi-dimensional analysis without knowing the data distribution
Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p (usually ~99/100) of the objects in T lie at a distance greater than D from O (basically: too many of the points lie far from O)
A dual formulation is, basically: too few of the points lie close to O, i.e., O is a DB(p, D)-outlier if at most a fraction 1 − p (usually ~1/100) of the objects in T lie at a distance less than D from O
Algorithms for mining distance-based outliers (a naive nested-loop sketch follows below)
– Index-based algorithm
– Nested-loop algorithm
– Cell-based algorithm
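The nested-loop algorithm listed above can be sketched directly from the definition (a naive O(n²) version, not an optimized implementation):

```python
# Naive nested-loop DB(p, D) outlier detection: O is an outlier when fewer
# than (1 - p) * |T| other objects lie within distance D of it.
import numpy as np

def db_outliers_nested_loop(T, p=0.99, D=1.0):
    T = np.asarray(T, dtype=float)
    n = len(T)
    max_close = (1.0 - p) * n          # allowed number of D-neighbors
    outliers = []
    for i in range(n):
        close = 0
        for j in range(n):
            if i != j and np.linalg.norm(T[i] - T[j]) <= D:
                close += 1
                if close > max_close:  # early exit: already too many neighbors
                    break
        if close <= max_close:
            outliers.append(i)
    return outliers
```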

Page 141: Clustering Methods

Outlier Discovery: Deviation-Based Approach

Identifies outliers by examining the main characteristics of objects in a group

Objects that “deviate” from this description are considered outliers

Sequential exception technique
– simulates the way in which humans distinguish unusual objects from among a series of supposedly similar objects
OLAP data cube technique
– uses data cubes to identify regions of anomalies in large multidimensional data

Page 142: Clustering Methods

Outlier Detection

Mining rare events, deviant objects, and exceptions
Anomalies are of interest in many domains:
• Criminal activities in electronic commerce
• Intrusion in networks
• Pest infestation in agriculture
• Detecting bugs in software, etc.
An example in business:
• British Telecom uncovered a multimillion-dollar fraud
[Diagram: outlier detection turns a huge amount of data into unexpected knowledge / anomalies of interest]

Page 143: Clustering Methods

Outlier Analysis (Cont.)
A taxonomy of outlier analysis approaches, with representative work:
– Statistic-based: Distribution (Barnett 1994), Depth (Preparata 1988)
– Distance-based: Knorr and Ng (VLDB 1998), Ramaswamy (MOD 2000), Angiulli (TKDE 2005)
– Density-based: Breunig's LOF (SIGMOD 2000), Papadimitriou's LOCI (ICDE 2002)
– Cluster-based: Jiang's MST (PRL), He's CBLOF (PRL)

Page 144: Clustering Methods

Distance-based Outlier Detection

Distance-based outlier definition, DB(p, D), by Knorr and Ng (1997)
– p: a percentage; D: a distance threshold
– Unifies distribution-based outlier definitions
– The most well-known definition
Outlier detection methods
– Nested-loop method (Knorr and Ng, 1998)
– Cell-structure-based method (Knorr and Ng, 1998)
– Partition-based method (Ramaswamy, 2000)
– These work well for small datasets
– Not efficient for large datasets
Our proposed work aims to improve the efficiency of the DB-outlier detection process

Page 145: Clustering Methods

Outlier Definition
Knorr and Ng's definition
– x is an outlier if |{y ∈ X | d(y, x) ≥ D}| ≥ p·|X|
– d(y, x): distance between y and x
Equivalent description (the complement)
– x is an outlier if |{y ∈ X | d(y, x) < D}| ≤ (1 − p)·|X|
[Figure 1: Knorr's definition, at least p·|X| points lie outside the disk of radius D around x; Figure 2: equivalent definition, at most (1 − p)·|X| points lie inside that disk]

Page 146: Clustering Methods

Our Definition
Definition 1: Disk neighborhood
– Define the disk neighborhood of x with radius r as the set
  DiskNbr(x, r) = {y ∈ X | d(y, x) ≤ r}
  where y is any point in X and d is the distance between y and x
– Points in DiskNbr(x, r) are called the r-neighbors of x
Definition 2: DiskNbr-based outlier
– A point x is a DB(p, D)-outlier iff |DiskNbr(x, D)| ≤ (1 − p)·|X|,
  where |DiskNbr(x, D)| is the number of D-neighbors of x and |X| is the total size of the dataset X
– (1 − p)·|X| is a count threshold derived from the user-defined p

Page 147: Clustering Methods

Proposition 1

If |DiskNbr(x, D/2)| ≥ (1 − p)·|X|, then none of the D/2-neighbors of x are outliers.
Proof: for any q ∈ DiskNbr(x, D/2)
– DiskNbr(q, D) ⊇ DiskNbr(x, D/2)
– |DiskNbr(q, D)| ≥ |DiskNbr(x, D/2)| ≥ (1 − p)·|X|
so q is not an outlier. Mark the D/2-neighbors as non-outliers: a pruning rule.

[Figure 3. D/2-neighbors of x marked as non-outliers]

Page 148: Clustering Methods

Proposition 2

If |DiskNbr(x, 2D)| < (1 − p)·|X|, then all D-neighbors of x are outliers.
Proof: for any q ∈ DiskNbr(x, D)
– DiskNbr(q, D) ⊆ DiskNbr(x, 2D)
– |DiskNbr(q, D)| ≤ |DiskNbr(x, 2D)| < (1 − p)·|X|
so q is an outlier. All points in DiskNbr(x, D) are outliers: outliers are identified a neighborhood at a time, which is faster than point by point.

[Figure 4. All D-neighbors of x are outliers]

Page 149: Clustering Methods

The DBODLP Algorithm

Step 1: Select a point x arbitrarily from X.
Step 2: Search the D-neighbors of x and compute |DiskNbr(x, D)|:
– If |DiskNbr(x, D)| < (1 − p)·|X|:
  • r ← 2D; compute |DiskNbr(x, 2D)|
  • IF |DiskNbr(x, 2D)| < (1 − p)·|X|, insert all D-neighbors into the outlier set; ELSE only x is an outlier
– If |DiskNbr(x, D)| ≥ (1 − p)·|X|:
  • r ← D/2; compute |DiskNbr(x, D/2)|
  • IF |DiskNbr(x, D/2)| ≥ (1 − p)·|X|, mark all D/2-neighbors as non-outliers; ELSE only x is a non-outlier
Step 3: Repeat until all points in the dataset X are examined (an array-based sketch follows after the figure).

[Figure 5. An example: expanding (2D) and shrinking (D/2) disk neighborhoods around a point x]
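A hedged, array-based sketch of the pruning strategy above (the vertical P-tree implementation described next is not reproduced here); Propositions 1 and 2 let whole neighborhoods be labeled at once instead of point by point.

```python
# DBODLP-style sketch: examine unlabeled points; if the 2D- or D/2-neighborhood
# settles the question, label the whole neighborhood in one step.
import numpy as np

def dbodlp_sketch(X, p=0.99, D=1.0):
    X = np.asarray(X, dtype=float)
    n = len(X)
    thresh = (1.0 - p) * n
    label = np.zeros(n, dtype=int)     # 0 = unexamined, 1 = outlier, -1 = non-outlier

    def disk_nbrs(i, r):
        return np.where(np.linalg.norm(X - X[i], axis=1) <= r)[0]

    for i in range(n):
        if label[i] != 0:
            continue
        d_nbrs = disk_nbrs(i, D)
        if len(d_nbrs) < thresh:                       # x itself is an outlier
            if len(disk_nbrs(i, 2 * D)) < thresh:      # Proposition 2
                label[d_nbrs] = 1                      # all D-neighbors are outliers
            else:
                label[i] = 1
        else:                                          # x is not an outlier
            half_nbrs = disk_nbrs(i, D / 2)
            if len(half_nbrs) >= thresh:               # Proposition 1
                label[half_nbrs] = -1                  # all D/2-neighbors are non-outliers
            else:
                label[i] = -1
    return np.where(label == 1)[0]
```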

Page 150: Clustering Methods

The DBODLP Algorithm

The algorithm is implemented based on a vertical storage and index model

Vertical Data Structures

Page 151: Clustering Methods

Vertical DBODLP

DBODLP is implemented based on the P-tree structure
A predicate range tree is used to compute the number of disk neighbors (a plain bit-vector analogue follows below):
– P_DiskNbr(x, D) = P_{y > x−D} AND P_{y ≤ x+D}
– |DiskNbr(x, D)| = COUNT(P_DiskNbr(x, D))
The experiments are done in DataMIME
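As a plain bit-vector analogue of the predicate count above (an illustration only; the patented P-tree structure and DataMIME implementation are not reproduced), the 1-D neighborhood count is the AND of two range-predicate vectors:

```python
# Count the D-neighbors of x for a single attribute as the AND of the
# predicates y > x - D and y <= x + D.
import numpy as np

def disk_nbr_count_1d(values, x, D):
    values = np.asarray(values, dtype=float)
    p_lower = values > x - D           # predicate bit vector: y > x - D
    p_upper = values <= x + D          # predicate bit vector: y <= x + D
    return int(np.count_nonzero(p_lower & p_upper))   # COUNT(P_DiskNbr(x, D))
```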

Page 152: Clustering Methods

Architecture for the DataMIME™ System
(DataMIME™ = P-tree based database and data mining system)
[Architecture diagram. Components: DII (Data Integration Interface) with the Data Integration Language (DIL), which loads YOUR DATA through P-tree feeders and a meta data file into the data repository (a lossless, compressed, distributed, vertically-structured P-tree database); and DMI (Data Mining Interface) with the P-tree (Predicates) Query Language, which supports query and data mining (YOUR DATA MINING).]

Page 153: Clustering Methods

Preliminary experimental analysis

Run time (seconds) for NL, PODM, and PODMP as the data size n grows:

  n        256     1024    4096     16384    65536
  NL       0.06    0.8     20.03    336.5    1657.9
  PODM     0.12    1.15    7.32     65.4     803.2
  PODMP    0.11    1.05    5.3      30.2     267.52

Figure 6. Run Time Comparison

1. Our method shows almost an order-of-magnitude improvement in run time over the nested-loop method
2. All methods produce the same outlier set
3. Our method scales better than the others

[Figure 7. Scalability Comparisons: run time (seconds) vs. data size (256 to 65,536) for NL, PODM, and PODMP]

Page 154: Clustering Methods

Summary
– Cluster analysis groups objects based on their similarity and has wide applications
– Measures of similarity can be computed for various types of data
– Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
– Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
– There are still many open research issues in cluster analysis, such as constraint-based clustering

Page 155: Clustering Methods

References (1)

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB'98.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

Page 156: Clustering Methods

References (2)
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.