3.3 Hierarchical Methods
TRANSCRIPT
Hierarchical Clustering Methods
Hierarchical Clustering
Hierarchical clustering uses a distance matrix as the clustering criterion. It does not require the number of clusters k as an input, but it does need a termination condition.
[Figure: agglomerative vs. divisive clustering on objects a, b, c, d, e. Agglomerative (AGNES) runs from Step 0 to Step 4, merging {a}, {b}, {c}, {d}, {e} into {a,b}, then {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive (DIANA) runs the same steps in reverse, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Continues in a non-descending fashion
- Eventually all nodes belong to the same cluster
[Figure: three 10x10 scatter plots showing AGNES progressively merging nearby points into larger clusters.]
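As an illustration of the single-link merge rule, here is a minimal sketch using SciPy's agglomerative clustering (a stand-in for the AGNES implementation referenced above; the point coordinates are invented for this example):

```python
# Minimal single-link (AGNES-style) agglomerative clustering with SciPy.
# The sample coordinates are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 8.0], [8.0, 9.0], [5.0, 5.0]])

# method="single" repeatedly merges the pair of clusters with the least
# dissimilarity (minimum pointwise distance), i.e., the AGNES merge rule.
Z = linkage(X, method="single")
print(Z)  # each row: (cluster i, cluster j, merge distance, new cluster size)
```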
Dendrogram
[Figure: dendrogram of the hierarchical clustering; cutting the tree at a chosen dissimilarity level yields a clustering of the data.]
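Such a dendrogram can be drawn directly from the linkage matrix Z computed in the previous sketch (SciPy and Matplotlib assumed):

```python
# Plot the dendrogram for the linkage matrix Z from the previous sketch.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

dendrogram(Z)
plt.xlabel("object index")
plt.ylabel("merge dissimilarity")
plt.show()
```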
DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own
[Figure: three 10x10 scatter plots showing DIANA progressively splitting one cluster into smaller ones.]
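A rough sketch of one DIANA-style split step, assuming the classic splinter-group rule (simplified; a full implementation would repeat this on the cluster with the largest diameter until every object stands alone):

```python
# Simplified DIANA-style split: peel a "splinter" group off one cluster.
# Distances are Euclidean; the data are invented for illustration.
import numpy as np

def split_cluster(X):
    n = len(X)
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rest = set(range(n))
    # Seed the splinter group with the object of largest average dissimilarity.
    seed = max(rest, key=lambda i: d[i, list(rest - {i})].mean())
    splinter = {seed}
    rest.remove(seed)
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            if len(rest) <= 1:
                break
            to_rest = d[i, list(rest - {i})].mean()
            to_splinter = d[i, list(splinter)].mean()
            if to_splinter < to_rest:  # i sits closer to the splinter group
                splinter.add(i)
                rest.remove(i)
                moved = True
    return sorted(splinter), sorted(rest)

X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 8.0], [8.0, 9.0], [5.0, 5.0]])
print(split_cluster(X))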
Distance Measures
- Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
  - Nearest-neighbor algorithm
  - Single-linkage algorithm: terminates when distance > threshold
  - Agglomerative hierarchical clustering with this measure corresponds to a minimal spanning tree
- Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
  - Farthest-neighbor clustering algorithm
  - Complete-linkage algorithm: terminates when distance > threshold
- Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|$, where $m_i$ is the mean of cluster $C_i$
- Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$, where $n_i$ is the size of cluster $C_i$
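A small sketch computing all four measures for a pair of point clusters (NumPy, Euclidean distance; the data are invented for illustration):

```python
# Compute the four inter-cluster distance measures for two clusters.
import numpy as np

def cluster_distances(Ci, Cj):
    # Pairwise Euclidean distances between every p in Ci and p' in Cj.
    d = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        "d_min":  d.min(),                                  # single link
        "d_max":  d.max(),                                  # complete link
        "d_mean": np.linalg.norm(Ci.mean(0) - Cj.mean(0)),  # |m_i - m_j|
        "d_avg":  d.mean(),                                 # (1/(n_i n_j)) * sum
    }

Ci = np.array([[1.0, 2.0], [2.0, 2.0]])
Cj = np.array([[8.0, 8.0], [8.0, 9.0]])
print(cluster_distances(Ci, Cj))
```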
Difficulties Faced in Hierarchical Clustering
- Selection of merge/split points: every decision is critical
- Operations cannot be reverted: once a merge or split is performed, it is never undone, so a poor early decision can lead to low-quality clusters
- Scalability: each merge or split decision requires examining many objects or clusters
Recent Hierarchical Clustering Methods
Integration of hierarchical clustering with other techniques:
- BIRCH: uses tree structures and incrementally adjusts the quality of sub-clusters
- CURE: represents a cluster by a fixed number of representative objects
- ROCK: clusters categorical data by neighbor and link analysis
- CHAMELEON: hierarchical clustering using dynamic modeling
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data and is sensitive to the order of the data records
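For experimentation, scikit-learn ships a BIRCH implementation; a minimal usage sketch follows (the parameter values and the synthetic data are arbitrary choices for illustration):

```python
# Minimal BIRCH run with scikit-learn; parameter values are illustrative.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# threshold ~ max sub-cluster radius at the leaves; branching_factor ~ max
# children per CF-tree node; n_clusters drives the Phase-2 global clustering.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels[:10])
```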
Clustering Feature Vector in BIRCH
Clustering Feature: $CF = (N, LS, SS)$
- $N$: number of data points
- $LS$: linear sum, $\sum_{i=1}^{N} X_i$
- $SS$: square sum, $\sum_{i=1}^{N} X_i^2$ (kept per dimension)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
$CF = (5, (16, 30), (54, 190))$
[Figure: the five points plotted on a 10x10 grid.]
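A sketch of computing a CF vector, including the additivity property that lets BIRCH merge sub-clusters by simply adding their CFs (the per-dimension SS convention matches the example above):

```python
# Clustering Feature (CF) vector: (N, LS, SS), with SS kept per dimension.
import numpy as np

def cf(points):
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_merge(a, b):
    # CF additivity: merging two sub-clusters just adds their CF entries.
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

N, LS, SS = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)  # 5 [16. 30.] [ 54. 190.] -- matches the slide's example
print(LS / N)     # sub-cluster centroid, recovered from the CF alone
```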
CF-Tree in BIRCH
Clustering feature:
- A summary of the statistics for a given sub-cluster
- Registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering:
- A nonleaf node in the tree has descendants or “children”
- The nonleaf nodes store the sums of the CFs of their children
A CF tree has two parameters:
- Branching factor: specifies the maximum number of children per node
- Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 through CF6, each pointing to a child; a nonleaf node holds entries CF1 through CF5 with their children; the leaf nodes hold CF entries and are chained together with prev/next pointers.]
CURE (Clustering Using Representatives)
- Supports non-spherical clusters
- A fixed number of representative points is used per cluster
- Selects well-scattered objects and shrinks them by moving them toward the cluster center
- Merges the clusters with the closest pair of representatives
For large databases:
1. Draw a random sample S
2. Partition the sample
3. Partially cluster each partition
4. Eliminate outliers and slowly growing clusters
5. Continue clustering the partial clusters
6. Label the clusters by their representatives
CURE: Shrinking Representative Points
- Shrink the multiple representative points toward the gravity center by a fraction α
- Multiple representatives capture the shape of the cluster
[Figure: two x-y plots of an elongated cluster, before and after shrinking the representative points toward the center.]
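A sketch of CURE's representative-point step, assuming Euclidean distance (the farthest-point heuristic and the α value below are standard choices for illustration, not taken verbatim from the slide):

```python
# CURE-style representatives: pick c well-scattered points, then shrink
# them toward the cluster centroid by a fraction alpha.
import numpy as np

def cure_representatives(X, c=4, alpha=0.5):
    centroid = X.mean(axis=0)
    # Farthest-point heuristic to obtain well-scattered points.
    reps = [X[np.argmax(np.linalg.norm(X - centroid, axis=1))]]
    while len(reps) < min(c, len(X)):
        d = np.min([np.linalg.norm(X - r, axis=1) for r in reps], axis=0)
        reps.append(X[np.argmax(d)])  # farthest from all chosen reps
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)  # shrink toward the center

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2)) * np.array([3.0, 1.0])  # elongated cluster
print(cure_representatives(X))
```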
Clustering Categorical Data: the ROCK Algorithm
ROCK: RObust Clustering using linKs
Major ideas:
- Use links to measure similarity/proximity
- Not distance-based
- A link counts the number of common neighbours
Similarity between two transactions $T_1$ and $T_2$ (Jaccard coefficient):
$\mathrm{Sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
Algorithm: sampling-based clustering
1. Draw a random sample
2. Cluster with links
3. Label the data on disk
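A sketch of ROCK's two core quantities, Jaccard similarity and link counts (the neighbour threshold theta and the basket data below are hypothetical values for illustration):

```python
# ROCK building blocks: Jaccard similarity between transactions, and
# link(A, B) = number of common neighbours. theta is illustrative.
from itertools import combinations

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta=0.3):
    n = len(transactions)
    # Two transactions are neighbours if their similarity >= theta.
    nbrs = [{j for j in range(n) if j != i
             and jaccard(transactions[i], transactions[j]) >= theta}
            for i in range(n)]
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i, j in combinations(range(n), 2)}

baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}]
print(links(baskets))
```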
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
- Measures the similarity based on a dynamic model
- Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high
A two-phase algorithm:
1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
[Figure: processing pipeline: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.]
Chameleon computes similarity based on Relative Interconnectivity and Relative Closeness.
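A sketch of the first step of that pipeline, building the sparse k-nearest-neighbour graph, using scikit-learn (k is an illustrative choice; the graph-partitioning and merge phases are beyond this snippet):

```python
# CHAMELEON phase-1 input: a sparse k-nearest-neighbour graph of the data.
# k is an illustrative choice; partitioning/merging are not shown here.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))

# Each object keeps edges only to its k nearest neighbours (sparse graph).
G = kneighbors_graph(X, n_neighbors=5, mode="distance")
print(G.shape, G.nnz)  # 100x100 sparse matrix with 500 stored edges
```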