3.3 Hierarchical Methods


Page 1: 3.3 Hierarchical Methods

Clustering: Hierarchical Methods

Page 2: 3.3 Hierarchical Methods

Hierarchical clustering uses a distance matrix as the clustering criterion.

This method does not require the number of clusters k as an input, but it does need a termination condition.

[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds Step 0 through Step 4: a and b merge, d and e merge, c joins {d, e}, and finally everything merges into the single cluster {a, b, c, d, e}. Divisive clustering (DIANA) runs the same steps in reverse, Step 4 back to Step 0.]

Page 3: 3.3 Hierarchical Methods

AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990) and implemented in statistical analysis packages. It uses the single-link method and the dissimilarity matrix: at each step it merges the nodes that have the least dissimilarity, and it continues in a non-descending fashion until eventually all nodes belong to the same cluster.

[Figure: three scatter plots on 0-10 axes showing AGNES progressively merging nearby points into larger clusters.]
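As a concrete illustration, here is a minimal AGNES-style sketch using SciPy's hierarchical clustering; the points and the cut threshold are made up, not from the slides:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8], [9, 2]])
dist = pdist(points)                 # condensed dissimilarity matrix
Z = linkage(dist, method='single')   # single link: merge least-dissimilar nodes first
# Termination condition: cut the hierarchy at a chosen dissimilarity threshold
labels = fcluster(Z, t=2.5, criterion='distance')
print(labels)                        # cluster label for each point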

Page 4: 3.3 Hierarchical Methods

Dendrogram: a tree diagram that records the sequence of merges (or splits); cutting it at a chosen dissimilarity level yields a clustering.
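A dendrogram can be drawn directly from a linkage matrix; a minimal sketch continuing from the Z computed in the AGNES example above (matplotlib assumed available):

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

dendrogram(Z)                      # Z: linkage matrix from the AGNES sketch above
plt.xlabel('object index')
plt.ylabel('dissimilarity at merge')
plt.show()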

Page 5: 3.3 Hierarchical Methods

DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S-PLUS. It works in the inverse order of AGNES: it starts with all objects in a single cluster and repeatedly splits, until eventually each node forms a cluster on its own.

[Figure: three scatter plots on 0-10 axes showing DIANA progressively splitting one cluster into smaller ones.]
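The slides do not spell out the split procedure; below is a rough sketch of one split step, following the classic splinter-group heuristic usually attributed to DIANA (the function and variable names are mine):

import numpy as np

def diana_split(D, members):
    """One DIANA split: D is the full dissimilarity matrix,
    members a list of at least two point indices in the cluster to split."""
    rest = list(members)
    # Seed the splinter group with the object of largest average dissimilarity
    avg = [D[i, [j for j in rest if j != i]].mean() for i in rest]
    splinter = [rest.pop(int(np.argmax(avg)))]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            # Move i if it is closer (on average) to the splinter group
            if D[i, splinter].mean() < D[i, [j for j in rest if j != i]].mean():
                rest.remove(i)
                splinter.append(i)
                moved = True
    return splinter, rest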

Page 6: 3.3 Hierarchical Methods

Distance Measures

Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$

Nearest-neighbor algorithm

Single-linkage algorithm: terminates when the distance exceeds a threshold

Agglomerative hierarchical clustering with $d_{\min}$ corresponds to building a minimal spanning tree

Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$

Farthest-neighbor clustering algorithm

Complete-linkage algorithm: terminates when the distance exceeds a threshold

Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|$, where $m_i$ and $m_j$ are the means of $C_i$ and $C_j$

Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
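A minimal sketch computing all four measures for two small, made-up 2-D clusters; SciPy's cdist supplies the full matrix of pairwise distances $|p - p'|$:

import numpy as np
from scipy.spatial.distance import cdist

Ci = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
Cj = np.array([[6.0, 7.0], [7.0, 6.0]])

D = cdist(Ci, Cj)                                  # all pairwise |p - p'|
d_min = D.min()                                    # single link
d_max = D.max()                                    # complete link
d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))  # distance between means
d_avg = D.mean()                                   # average over all n_i * n_j pairs
print(d_min, d_max, d_mean, d_avg)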

Page 7: 3.3 Hierarchical Methods

Difficulties faced in hierarchical clustering:

Selection of merge/split points

A merge or split, once performed, cannot be reverted

Scalability

Page 8: 3.3 Hierarchical Methods

Recent Hierarchical Clustering Methods

Integration of hierarchical clustering with other techniques:

BIRCH: uses tree structures and incrementally adjusts the quality of sub-clusters

CURE: represents a cluster by a fixed number of representative objects

ROCK: clusters categorical data by neighbor and link analysis

CHAMELEON: hierarchical clustering using dynamic modeling

Page 9: 3.3 Hierarchical Methods

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:

Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)

Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree

Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans

Weakness: handles only numeric data, and is sensitive to the order of the data records
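scikit-learn ships a BIRCH implementation whose parameters mirror the two phases; a minimal sketch (the threshold, branching factor, and cluster count here are illustrative assumptions, and the data is synthetic):

import numpy as np
from sklearn.cluster import Birch

X = np.random.RandomState(0).rand(1000, 2) * 10   # synthetic numeric data

# Phase 1: a single scan of X builds the in-memory CF tree.
# Phase 2: n_clusters=3 runs a global clustering over the CF-tree leaves.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), 'leaf sub-clusters')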

Page 10: 3.3 Hierarchical Methods

Clustering Feature Vector in BIRCH

Clustering Feature: CF = (N, LS, SS)

N: number of data points

LS: linear sum of the N points, $\sum_{i=1}^{N} X_i$

SS: square sum of the N points, $\sum_{i=1}^{N} X_i^2$

[Figure: the five example points plotted on 0-10 axes.]

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190)).
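A quick check of that CF vector in a few lines of Python:

import numpy as np

X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])
N = len(X)
LS = X.sum(axis=0)          # linear sum per dimension
SS = (X ** 2).sum(axis=0)   # square sum per dimension
print(N, LS.tolist(), SS.tolist())   # 5 [16, 30] [54, 190]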

Page 11: 3.3 Hierarchical Methods

CF-Tree in BIRCH

A clustering feature is a summary of the statistics for a given subcluster: it registers the crucial measurements for computing clusters and uses storage efficiently.

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A nonleaf node in the tree has descendants ("children") and stores the sums of the CFs of its children.

A CF tree has two parameters:

Branching factor: the maximum number of children per node

Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
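The reason nonleaf nodes can simply store sums of their children's CFs is that CFs are additive: merging two sub-clusters just adds the components. A tiny sketch, reusing the five example points split into groups of three and two:

import numpy as np

def merge_cf(cf1, cf2):
    """CF additivity: the CF of a merged sub-cluster is the
    component-wise sum of the two CFs."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

a = (3, np.array([9, 15]), np.array([29, 77]))    # CF of (3,4), (2,6), (4,5)
b = (2, np.array([7, 15]), np.array([25, 113]))   # CF of (4,7), (3,8)
print(merge_cf(a, b))   # recovers (5, (16, 30), (54, 190))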

Page 12: 3.3 Hierarchical Methods

CF Tree

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and the nonleaf nodes hold entries (CF_i, child_i) pointing to their children; the leaf nodes hold CF entries and are chained together with prev/next pointers.]

Page 13: 3.3 Hierarchical Methods

CURE (Clustering Using Representatives)

Supports non-spherical clusters. A fixed number of representative points is used per cluster:

Select well-scattered objects and shrink them by moving them towards the cluster center

Merge the clusters with the closest pair of representatives

For large databases:

Draw a random sample S

Partition the sample

Partially cluster each partition

Eliminate outliers and slow-growing clusters

Continue clustering the partial clusters

Label clusters by their representatives

Page 14: 3.3 Hierarchical Methods

CURE: Shrinking Representative Points

Shrink the multiple representative points towards the gravity center by a fraction α.

Multiple representatives capture the shape of the cluster.

[Figure: two scatter plots (before and after) showing the representative points moved towards the cluster center.]
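A minimal sketch of that shrinking step; the points and the value of α below are made up for illustration:

import numpy as np

def shrink_representatives(reps, alpha=0.3):
    """Move each representative a fraction alpha of the way
    towards the gravity center of the representatives."""
    center = reps.mean(axis=0)
    return reps + alpha * (center - reps)

reps = np.array([[1.0, 1.0], [5.0, 1.0], [3.0, 4.0]])
print(shrink_representatives(reps, alpha=0.3))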

Page 15: 3.3 Hierarchical Methods

Clustering Categorical Data: the ROCK Algorithm

ROCK: RObust Clustering using linKs

Major ideas:

Use links to measure similarity/proximity, not distance

The link between two points is their number of common neighbours

Algorithm (sampling-based clustering):

Draw a random sample

Cluster with links

Label the data on disk

For transactions (sets of items), two points are neighbours if their similarity exceeds a threshold, and similarity is measured with the Jaccard coefficient:

$\mathrm{Sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
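A small sketch of ROCK's building blocks: Jaccard similarity, a neighbour relation with an assumed threshold theta, and links as counts of common neighbours (all data here is made up):

def jaccard(t1, t2):
    """Jaccard similarity of two transactions (item sets)."""
    return len(t1 & t2) / len(t1 | t2)

transactions = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'a', 'c', 'd'}, {'x', 'y'}]
theta = 0.3   # neighbour threshold (assumed)
n = len(transactions)

# Two points are neighbours if their similarity is at least theta
neighbours = [
    {j for j in range(n)
     if j != i and jaccard(transactions[i], transactions[j]) >= theta}
    for i in range(n)
]

def link(i, j):
    """Number of common neighbours of points i and j."""
    return len(neighbours[i] & neighbours[j])

print(link(0, 1))   # transactions 0 and 1 share neighbour 2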

Page 16: 3.3 Hierarchical Methods

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling

Measures the similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between them are high.

A two-phase algorithm:

1. Use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters

2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters

Page 17: 3.3 Hierarchical Methods

Overall Framework of CHAMELEON

[Figure: overall pipeline - Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.]

Page 18: 3.3 Hierarchical Methods

CHAMELEON computes similarity based on Relative Interconnectivity (RI) and Relative Closeness (RC); two clusters are merged only when both are high relative to the clusters' internal structure.
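For reference, the two measures as given in the CHAMELEON paper (Karypis, Han, and Kumar, 1999); these formulas are recalled from the paper rather than taken from the slides, so treat the notation as indicative:

$$\mathrm{RI}(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}$$

$$\mathrm{RC}(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}$$

where $EC_{\{C_i, C_j\}}$ is the set of edges connecting the two clusters, $EC_{C_i}$ is the edge cut that bisects $C_i$, and $\bar{S}$ denotes the average weight of the edges in the given cut.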