3.3 Hierarchical Methods
TRANSCRIPT
Hierarchical Clustering Methods
Hierarchical Clustering
Hierarchical clustering uses a distance matrix as the clustering criterion. It does not require the number of clusters k as an input, but it does need a termination condition.
[Figure: agglomerative vs. divisive clustering on objects a, b, c, d, e. Agglomerative (AGNES) runs from Step 0 to Step 4, merging {a}, {b}, {c}, {d}, {e} into {a,b}, then {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive (DIANA) runs the same steps in reverse, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Continues in a non-descending fashion
- Eventually all nodes belong to the same cluster
[Figure: three 10x10 scatter plots showing AGNES progressively merging nearby points into larger clusters.]
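As an illustration of the single-link merge rule, here is a minimal sketch using SciPy's agglomerative clustering (a stand-in for the AGNES implementation referenced above; the point coordinates are invented for this example):

```python
# Minimal single-link (AGNES-style) agglomerative clustering with SciPy.
# The sample coordinates are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 8.0], [8.0, 9.0], [5.0, 5.0]])

# method="single" repeatedly merges the pair of clusters with the least
# dissimilarity (minimum pointwise distance), i.e., the AGNES merge rule.
Z = linkage(X, method="single")
print(Z)  # each row: (cluster i, cluster j, merge distance, new cluster size)
```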
Dendrogram
[Figure: dendrogram of the hierarchical clustering; cutting the tree at a chosen dissimilarity level yields a clustering of the data.]
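Such a dendrogram can be drawn directly from the linkage matrix Z computed in the previous sketch (SciPy and Matplotlib assumed):

```python
# Plot the dendrogram for the linkage matrix Z from the previous sketch.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

dendrogram(Z)
plt.xlabel("object index")
plt.ylabel("merge dissimilarity")
plt.show()
```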
DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own
[Figure: three 10x10 scatter plots showing DIANA progressively splitting one cluster into smaller ones.]
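A rough sketch of one DIANA-style split step, assuming the classic splinter-group rule (simplified; a full implementation would repeat this on the cluster with the largest diameter until every object stands alone):

```python
# Simplified DIANA-style split: peel a "splinter" group off one cluster.
# Distances are Euclidean; the data are invented for illustration.
import numpy as np

def split_cluster(X):
    n = len(X)
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rest = set(range(n))
    # Seed the splinter group with the object of largest average dissimilarity.
    seed = max(rest, key=lambda i: d[i, list(rest - {i})].mean())
    splinter = {seed}
    rest.remove(seed)
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            if len(rest) <= 1:
                break
            to_rest = d[i, list(rest - {i})].mean()
            to_splinter = d[i, list(splinter)].mean()
            if to_splinter < to_rest:  # i sits closer to the splinter group
                splinter.add(i)
                rest.remove(i)
                moved = True
    return sorted(splinter), sorted(rest)

X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 8.0], [8.0, 9.0], [5.0, 5.0]])
print(split_cluster(X))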
Distance Measures
- Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
  - Nearest-neighbor algorithm
  - Single-linkage algorithm: terminates when distance > threshold
  - Agglomerative hierarchical clustering with this measure corresponds to a minimal spanning tree
- Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
  - Farthest-neighbor clustering algorithm
  - Complete-linkage algorithm: terminates when distance > threshold
- Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|$, where $m_i$ is the mean of cluster $C_i$
- Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$, where $n_i$ is the size of cluster $C_i$
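A small sketch computing all four measures for a pair of point clusters (NumPy, Euclidean distance; the data are invented for illustration):

```python
# Compute the four inter-cluster distance measures for two clusters.
import numpy as np

def cluster_distances(Ci, Cj):
    # Pairwise Euclidean distances between every p in Ci and p' in Cj.
    d = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        "d_min":  d.min(),                                  # single link
        "d_max":  d.max(),                                  # complete link
        "d_mean": np.linalg.norm(Ci.mean(0) - Cj.mean(0)),  # |m_i - m_j|
        "d_avg":  d.mean(),                                 # (1/(n_i n_j)) * sum
    }

Ci = np.array([[1.0, 2.0], [2.0, 2.0]])
Cj = np.array([[8.0, 8.0], [8.0, 9.0]])
print(cluster_distances(Ci, Cj))
```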
Difficulties Faced in Hierarchical Clustering
- Selection of merge/split points: every decision is critical
- Operations cannot be reverted: once a merge or split is performed, it is never undone, so a poor early decision can lead to low-quality clusters
- Scalability: each merge or split decision requires examining many objects or clusters
Recent Hierarchical Clustering Methods
Integration of hierarchical clustering with other techniques:
- BIRCH: uses tree structures and incrementally adjusts the quality of sub-clusters
- CURE: represents a cluster by a fixed number of representative objects
- ROCK: clusters categorical data by neighbor and link analysis
- CHAMELEON: hierarchical clustering using dynamic modeling
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data and is sensitive to the order of the data records
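For experimentation, scikit-learn ships a BIRCH implementation; a minimal usage sketch follows (the parameter values and the synthetic data are arbitrary choices for illustration):

```python
# Minimal BIRCH run with scikit-learn; parameter values are illustrative.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# threshold ~ max sub-cluster radius at the leaves; branching_factor ~ max
# children per CF-tree node; n_clusters drives the Phase-2 global clustering.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels[:10])
```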
Clustering Feature Vector in BIRCH
Clustering Feature: $CF = (N, LS, SS)$
- $N$: number of data points
- $LS$: linear sum, $\sum_{i=1}^{N} X_i$
- $SS$: square sum, $\sum_{i=1}^{N} X_i^2$ (kept per dimension)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
$CF = (5, (16, 30), (54, 190))$
[Figure: the five points plotted on a 10x10 grid.]
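A sketch of computing a CF vector, including the additivity property that lets BIRCH merge sub-clusters by simply adding their CFs (the per-dimension SS convention matches the example above):

```python
# Clustering Feature (CF) vector: (N, LS, SS), with SS kept per dimension.
import numpy as np

def cf(points):
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_merge(a, b):
    # CF additivity: merging two sub-clusters just adds their CF entries.
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

N, LS, SS = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)  # 5 [16. 30.] [ 54. 190.] -- matches the slide's example
print(LS / N)     # sub-cluster centroid, recovered from the CF alone
```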
CF-Tree in BIRCH
Clustering feature:
- A summary of the statistics for a given sub-cluster
- Registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering:
- A nonleaf node in the tree has descendants or “children”
- The nonleaf nodes store the sums of the CFs of their children
A CF tree has two parameters:
- Branching factor: specifies the maximum number of children per node
- Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 through CF6, each pointing to a child; a nonleaf node holds entries CF1 through CF5 with their children; the leaf nodes hold CF entries and are chained together with prev/next pointers.]
CURE (Clustering Using Representatives)
- Supports non-spherical clusters
- A fixed number of representative points is used per cluster
- Selects well-scattered objects and shrinks them by moving them toward the cluster center
- Merges the clusters with the closest pair of representatives
For large databases:
1. Draw a random sample S
2. Partition the sample
3. Partially cluster each partition
4. Eliminate outliers and slowly growing clusters
5. Continue clustering the partial clusters
6. Label the clusters by their representatives
CURE: Shrinking Representative Points
- Shrink the multiple representative points toward the gravity center by a fraction α
- Multiple representatives capture the shape of the cluster
[Figure: two x-y plots of an elongated cluster, before and after shrinking the representative points toward the center.]
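A sketch of CURE's representative-point step, assuming Euclidean distance (the farthest-point heuristic and the α value below are standard choices for illustration, not taken verbatim from the slide):

```python
# CURE-style representatives: pick c well-scattered points, then shrink
# them toward the cluster centroid by a fraction alpha.
import numpy as np

def cure_representatives(X, c=4, alpha=0.5):
    centroid = X.mean(axis=0)
    # Farthest-point heuristic to obtain well-scattered points.
    reps = [X[np.argmax(np.linalg.norm(X - centroid, axis=1))]]
    while len(reps) < min(c, len(X)):
        d = np.min([np.linalg.norm(X - r, axis=1) for r in reps], axis=0)
        reps.append(X[np.argmax(d)])  # farthest from all chosen reps
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)  # shrink toward the center

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2)) * np.array([3.0, 1.0])  # elongated cluster
print(cure_representatives(X))
```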
Clustering Categorical Data: the ROCK Algorithm
ROCK: RObust Clustering using linKs
Major ideas:
- Use links to measure similarity/proximity
- Not distance-based
- A link counts the number of common neighbours
Similarity between two transactions $T_1$ and $T_2$ (Jaccard coefficient):
$\mathrm{Sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
Algorithm: sampling-based clustering
1. Draw a random sample
2. Cluster with links
3. Label the data on disk
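A sketch of ROCK's two core quantities, Jaccard similarity and link counts (the neighbour threshold theta and the basket data below are hypothetical values for illustration):

```python
# ROCK building blocks: Jaccard similarity between transactions, and
# link(A, B) = number of common neighbours. theta is illustrative.
from itertools import combinations

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta=0.3):
    n = len(transactions)
    # Two transactions are neighbours if their similarity >= theta.
    nbrs = [{j for j in range(n) if j != i
             and jaccard(transactions[i], transactions[j]) >= theta}
            for i in range(n)]
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i, j in combinations(range(n), 2)}

baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}]
print(links(baskets))
```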
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
- Measures the similarity based on a dynamic model
- Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high
A two-phase algorithm:
1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
[Figure: processing pipeline: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.]
Chameleon computes similarity based on Relative Interconnectivity and Relative Closeness.
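A sketch of the first step of that pipeline, building the sparse k-nearest-neighbour graph, using scikit-learn (k is an illustrative choice; the graph-partitioning and merge phases are beyond this snippet):

```python
# CHAMELEON phase-1 input: a sparse k-nearest-neighbour graph of the data.
# k is an illustrative choice; partitioning/merging are not shown here.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))

# Each object keeps edges only to its k nearest neighbours (sparse graph).
G = kneighbors_graph(X, n_neighbors=5, mode="distance")
print(G.shape, G.nnz)  # 100x100 sparse matrix with 500 stored edges
```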