TRANSCRIPT
Cluster analysis
POSTECH, Department of Industrial Engineering
Probability and Statistics Laboratory, Jaehyun Lee
POSTECH IE PASTA CLUSTER ANALYSIS
Definition
Cluster analysis is a technique used for combining observations into groups or clusters such that
Each group or cluster is homogeneous or compact with respect to certain characteristics
Each group should be different from other groups with respect to the same characteristics
Example: A marketing manager is interested in identifying similar cities that can be used for test marketing. The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues.
Objective of cluster analysis
The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables.
Overview of cluster analysis
Step 1: n objects measured on p variables
Step 2: Transform to an n x n similarity (distance) matrix
Step 3: Cluster formation (hierarchical or nonhierarchical clusters)
Step 4: Cluster profile
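As a sketch, these four steps map directly onto SciPy's hierarchical-clustering utilities. The six-subject income/education data used in the later slides serves here as the hypothetical n x p input, and `scipy` is an assumed dependency:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Step 1: n objects measured on p variables (income, education)
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Step 2: transform to an n x n similarity (distance) matrix
D = squareform(pdist(X, metric="euclidean"))

# Step 3: cluster formation (here: hierarchical, average linkage)
Z = linkage(pdist(X), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

# Step 4: cluster profile, e.g. the centroid of each cluster
for c in np.unique(labels):
    print(c, X[labels == c].mean(axis=0))
```

With three clusters requested, the subject pairs (S1, S2), (S3, S4), and (S5, S6) end up grouped together.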
Key problems
Measure of similarity: Fundamental to the use of any clustering technique is the computation of a measure of similarity or distance between the respective objects.
Distance-type measures – Euclidean distance for standardized data, Mahalanobis distance
Matching-type measures – Association coefficients, correlation coefficients
A procedure for forming the clusters
Hierarchical clustering – Centroid method, Single-linkage method, Complete-linkage method, Average-linkage method, Ward’s method
Nonhierarchical clustering – k-means clustering
Similarity Measure – Distance type
Minkowski metric

d_{ij} = \left( \sum_{k=1}^{p} |X_{ik} - X_{jk}|^{r} \right)^{1/r}

If r = 2, this is the Euclidean distance; if r = 1, the absolute (city-block) distance.
Consider the example below, where the similarity matrix holds squared Euclidean distances.
Data
Subject Income Education
S1 5 5
S2 6 6
S3 15 14
S4 16 15
S5 25 20
S6 30 19
Similarity matrix (squared Euclidean distances)

      S1      S2      S3      S4      S5      S6
S1    0.00    2.00  181.00  221.00  625.00  821.00
S2    2.00    0.00  145.00  181.00  557.00  745.00
S3  181.00  145.00    0.00    2.00  136.00  250.00
S4  221.00  181.00    2.00    0.00  106.00  212.00
S5  625.00  557.00  136.00  106.00    0.00   26.00
S6  821.00  745.00  250.00  212.00   26.00    0.00
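The matrix above can be reproduced with a few lines of NumPy; each entry is the squared Euclidean distance between two subjects:

```python
import numpy as np

# Income/education data for subjects S1..S6 (from the table above)
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Squared Euclidean distance between every pair of subjects
diff = X[:, None, :] - X[None, :, :]
D2 = (diff ** 2).sum(axis=-1)

print(D2[0, 1])  # S1 vs S2 -> 2.0
print(D2[0, 3])  # S1 vs S4 -> 221.0
print(D2[4, 5])  # S5 vs S6 -> 26.0
```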
Person  Weight in Pounds  Height in Feet
A       160               5.5
B       163               6.2
C       163               6.0

       Height in Feet  Height in Inches
dAB    3.08            8.92
dAC    5.02            7.81
dBC    2.01            3.12
Similarity Measure – Distance type
Euclidean distance for standardized data
To make the measure scale invariant, each squared difference in the Euclidean distance is weighted by 1/s_i^2, the reciprocal of the sample variance of variable i.

Mahalanobis distance
Here x is a p x 1 vector and S is the p x p covariance matrix. The Mahalanobis distance is designed to take into account the correlation among the variables and is also scale invariant.
Similarity matrix (squared Euclidean distances for standardized data)

       S1     S2     S3     S4     S5     S6
S1   0.00  0.035   3.00   3.68   9.55  11.09
S2  0.035   0.00   2.38   3.00   8.45   9.94
S3   3.00   2.38   0.00  0.035   1.89   2.87
S4   3.68   3.00  0.035   0.00   1.43   2.36
S5   9.55   8.45   1.89   1.43   0.00   0.28
S6  11.09   9.94   2.87   2.36   0.28   0.00
MD_{ij} = (x_i - x_j)' S^{-1} (x_i - x_j)
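A sketch of both measures in NumPy. The variance weights use the sample variance (ddof = 1), which reproduces entries of the standardized matrix such as 3.00 for S1 vs S3; the covariance matrix S is estimated from the six subjects themselves, an assumption of this sketch:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Squared Euclidean distance for standardized data: each squared
# difference is weighted by 1/s_i^2 (sample variance of variable i)
w = 1.0 / X.var(axis=0, ddof=1)
diff = X[:, None, :] - X[None, :, :]
D2_std = (w * diff ** 2).sum(axis=-1)
print(round(D2_std[0, 2], 2))  # S1 vs S3 -> 3.0

# Mahalanobis distance additionally accounts for correlation between variables
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_sq(xi, xj):
    d = xi - xj
    return float(d @ S_inv @ d)  # (x_i - x_j)' S^-1 (x_i - x_j)
```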
Similarity Measure – Matching type
Association coefficients
This type of measure is used to represent similarity for binary variables.

Similarity coefficients

        Attribute
Person  1  2  3  4  5  6
A       0  1  1  0  1  1
B       1  0  1  0  0  1

               Person A
               +    -
Person B  +    2    1    3
          -    2    1    3
               4    2    6
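The slide does not fix one particular association coefficient; the simple matching coefficient (the proportion of attributes on which the two profiles agree) is a common choice, and it can be computed from the same 2 x 2 counts:

```python
import numpy as np

# Binary attribute profiles for persons A and B (from the table above)
A = np.array([0, 1, 1, 0, 1, 1])
B = np.array([1, 0, 1, 0, 0, 1])

both_pos = int(((A == 1) & (B == 1)).sum())  # attributes where both are +
both_neg = int(((A == 0) & (B == 0)).sum())  # attributes where both are -
mismatch = int((A != B).sum())               # attributes where they differ

# Simple matching coefficient: fraction of attributes on which A and B agree
smc = (both_pos + both_neg) / len(A)
print(both_pos, both_neg, mismatch, smc)  # 2 1 3 0.5
```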
Similarity Measure – Matching type
Correlation coefficient
The Pearson product-moment correlation coefficient is used as a measure of similarity.
dAB = 1, dAC = 0.82
Person X1 X2 X3 X4
A 1 3 2 2
B 4 10 7 7
C 1 2 2 2
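The two similarity values quoted above can be verified directly (B is an exact linear transform of A, B = 3A + 1, hence the perfect correlation):

```python
import numpy as np

# Attribute profiles from the table above
A = np.array([1, 3, 2, 2], dtype=float)
B = np.array([4, 10, 7, 7], dtype=float)
C = np.array([1, 2, 2, 2], dtype=float)

r_AB = np.corrcoef(A, B)[0, 1]  # B = 3A + 1, so r = 1
r_AC = np.corrcoef(A, C)[0, 1]  # approximately 0.82
print(round(r_AB, 2), round(r_AC, 2))
```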
Hierarchical clustering
Centroid method
Each group is replaced by an average subject, which is the centroid of that group.
Data for five clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3               15.0    14.0
3        S4               16.0    15.0
4        S5               25.0    20.0
5        S6               30.0    19.0
Similarity matrix

          S1 & S2     S3      S4      S5      S6
S1 & S2      0.00  162.50  200.50  590.50  782.50
S3         162.50    0.00    2.00  136.00  250.00
S4         200.50    2.00    0.00  106.00  212.00
S5         590.50  136.00  106.00    0.00   26.00
S6         782.50  250.00  212.00   26.00    0.00
Data for four clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3 & S4          15.5    14.5
3        S5               25.0    20.0
4        S6               30.0    19.0
Similarity matrix

          S1 & S2  S3 & S4      S5      S6
S1 & S2      0.00   181.00  590.50  782.50
S3 & S4    181.00     0.00  120.50  230.50
S5         590.50   120.50    0.00   26.00
S6         782.50   230.50   26.00    0.00
Hierarchical clustering
Data for three clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3 & S4          15.5    14.5
3        S5 & S6          27.5    19.5
Similarity matrix
S1 & S2 S3 & S4 S5 & S6
S1 & S2 0.00 181.00 680.00
S3 & S4 181.00 0.00 169.00
S5 & S6 680.00 169.00 0.00
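The merge decision at this stage can be checked numerically: the centroid method computes squared Euclidean distances between cluster centroids and merges the closest pair. A sketch:

```python
import numpy as np

# Centroids at the three-cluster stage (income, education)
centroids = {
    "S1&S2": np.array([5.5, 5.5]),
    "S3&S4": np.array([15.5, 14.5]),
    "S5&S6": np.array([27.5, 19.5]),
}

def sq_dist(a, b):
    """Squared Euclidean distance between two centroids."""
    return float(((a - b) ** 2).sum())

names = list(centroids)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], names[j], sq_dist(centroids[names[i]], centroids[names[j]]))
# Smallest distance is 169.0 (S3&S4 vs S5&S6), so those two clusters merge next.
```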
Hierarchical clustering
Single-linkage method
The distance between two clusters is represented by the minimum of the distances between all possible pairs of subjects in the two clusters.

D_{13}^2 = 181 and D_{23}^2 = 145, so D_{(12)3}^2 = min(181, 145) = 145
D_{14}^2 = 221 and D_{24}^2 = 181, so D_{(12)4}^2 = min(221, 181) = 181
Similarity matrix
S1 & S2 S3 S4 S5 S6
S1 & S2 0.00 145.00 181.00 557.00 745.00
S3 145.00 0.00 2.00 136.00 250.00
S4 181.00 2.00 0.00 106.00 212.00
S5 557.00 136.00 106.00 0.00 26.00
S6 745.00 250.00 212.00 26.00 0.00
Hierarchical clustering
Complete-linkage method
The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters.

D_{13}^2 = 181 and D_{23}^2 = 145, so D_{(12)3}^2 = max(181, 145) = 181
D_{15}^2 = 625 and D_{25}^2 = 557, so D_{(12)5}^2 = max(625, 557) = 625
Similarity matrix
S1 & S2 S3 S4 S5 S6
S1 & S2 0.00 181.00 221.00 625.00 821.00
S3 181.00 0.00 2.00 136.00 250.00
S4 221.00 2.00 0.00 106.00 212.00
S5 625.00 136.00 106.00 0.00 26.00
S6 821.00 250.00 212.00 26.00 0.00
Hierarchical clustering
Average-linkage method
The distance between two clusters is obtained by taking the average distance between all pairs of subjects in the two clusters.

D_{(12)3}^2 = (D_{13}^2 + D_{23}^2) / 2 = (181 + 145) / 2 = 163
Similarity matrix
S1 & S2 S3 S4 S5 S6
S1 & S2 0.00 163.00 201.00 591.00 783.00
S3 163.00 0.00 2.00 136.00 250.00
S4 201.00 2.00 0.00 106.00 212.00
S5 591.00 136.00 106.00 0.00 26.00
S6 783.00 250.00 212.00 26.00 0.00
Hierarchical clustering
Ward’s method
It forms clusters by maximizing within-cluster homogeneity, with the within-group sum of squares as the measure of homogeneity. At each step, Ward’s method merges the two clusters whose union gives the smallest increase in the total within-group (within-cluster) sum of squares,

SSE = \sum_{j=1}^{k} \left[ \sum_{i=1}^{n_j} X_{ij}^2 - \frac{1}{n_j} \left( \sum_{i=1}^{n_j} X_{ij} \right)^2 \right]
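Ward's criterion can be checked numerically: for any partition, compute the total within-cluster sum of squares and merge the pair of clusters whose union increases it least. A sketch using the two-cluster partition {S1, S2} and {S3, S4, S5, S6}:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

def sse(cluster):
    """Within-cluster sum of squares over all variables:
    sum_i x_ij^2 - (1/n_j)(sum_i x_ij)^2, summed over the variables."""
    n = len(cluster)
    return float((cluster ** 2).sum() - (cluster.sum(axis=0) ** 2).sum() / n)

# Partition {S1, S2} and {S3, S4, S5, S6}
total = sse(X[:2]) + sse(X[2:])
print(sse(X[:2]), sse(X[2:]), total)  # 1.0 183.0 184.0
```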
Evaluating the cluster solution and determining the number of clusters
Root-mean-square standard deviation (RMSSTD) of the new cluster
RMSSTD is the pooled standard deviation of all the variables forming the cluster.
Pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables
R-squared (RS)
RS is the ratio of SSb to SSt (SSt = SSb + SSw).
RS of CL2 is (701.667 − 184.000) / 701.667 = 0.738, where 184.000 is the total SSw of the two-cluster solution (183 for CL2 plus 1 for S1 & S2).
Within-group sum of squares and degrees of freedom for clusters formed in steps 1–5

                 Within-Group Sum of Squares      Degrees of Freedom
Step  Cluster   Income    Education   Pooled    Income  Education  Pooled   RMSSTD
1     CL5         0.500      0.500      1.000      1        1         2      0.70
2     CL4         0.500      0.500      1.000      1        1         2      0.70
3     CL3        12.500      0.500     13.000      1        1         2      2.55
4     CL2       157.000     26.000    183.000      3        3         6      5.52
5     CL1       498.833    202.833    701.667      5        5        10      8.38
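These statistics can be recomputed directly from the raw income/education data; the sketch below reproduces the pooled CL1 sum of squares, the corresponding RMSSTD, and the RS of the two-cluster solution:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

def pooled_ss(cluster):
    """Pooled within-cluster sum of squares over all variables."""
    return float(((cluster - cluster.mean(axis=0)) ** 2).sum())

# Step 5 (CL1): all six subjects in one cluster
ss_cl1 = pooled_ss(X)            # approximately 701.667
rmsstd = np.sqrt(ss_cl1 / 10)    # pooled SS / pooled df (5 + 5)

# Two-cluster solution {S1, S2}, {S3, S4, S5, S6}: SSw = 1 + 183 = 184
ssw = pooled_ss(X[:2]) + pooled_ss(X[2:])
rs = (ss_cl1 - ssw) / ss_cl1     # RS = SSb / SSt, approximately 0.738
print(round(rmsstd, 2), round(rs, 3))
```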
Evaluating the cluster solution and determining the number of clusters
Semipartial R-squared (SPR)
The loss of homogeneity is the new cluster’s pooled SSw minus the sum of the pooled SSw’s of the clusters joined to obtain it. If the loss of homogeneity is large, the new cluster was obtained by merging two heterogeneous clusters.
SPR is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster, divided by SSt.
SPR of CL2 is (183 − (1 + 13)) / 701.667 = 0.241
Distance between clusters
It is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged, and it is termed the centroid distance (CD).

Data for three clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3 & S4          15.5    14.5
3        S5 & S6          27.5    19.5

CD_{CL2} = \sqrt{(27.5 - 15.5)^2 + (19.5 - 14.5)^2} = 13.00
Evaluating the cluster solution and determining the number of clusters
Summary of the statistics for evaluating cluster solution
Statistic Concept measured Comments
RMSSTD Homogeneity of new clusters Value should be small
SPR Homogeneity of merged clusters Value should be small
RS Homogeneity of new clusters Value should be high
CD Homogeneity of merged clusters Value should be small
Nonhierarchical clustering
The data are divided into k partitions or groups, with each partition representing a cluster.
The number of clusters must be known a priori.
Steps
1. Select k initial cluster centroids or seeds, where k is the number of clusters desired.
2. Assign each observation to the cluster to which it is closest.
3. Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule.
4. Stop if there is no reallocation of data points or if the reassignment satisfies the criteria set by the stopping rule. Otherwise go to Step 2.
Nonhierarchical algorithms differ in
the method used for obtaining initial cluster centroids or seeds
the rule used for reassigning observations
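Steps 1–4 can be sketched as a small Lloyd-style k-means loop on the slides' six-subject data, using the first k observations as seeds and squared Euclidean distances (empty clusters are ignored in this sketch):

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
k = 3

# Step 1: take the first k observations as the initial centroids (seeds)
centroids = X[:k].copy()

while True:
    # Step 2: assign each observation to the closest centroid (squared Euclidean)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    # Step 3: recompute centroids from the current assignment
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 4: stop when no reallocation occurs (centroids unchanged)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)  # [0 1 2 2 2 2]: S1 and S2 stay singletons, S3..S6 form one cluster
```

This matches the assignment reached by Algorithm 1 on the six subjects.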
Nonhierarchical clustering
Algorithm 1 steps
1. Select the first k observations as cluster centers.
2. Compute the centroid of each cluster.
3. Reassign each observation by computing its distance from each cluster centroid.

Initial cluster centroids

             Cluster
Variable     1    2    3
Income       5    6   15
Education    5    6   14

Distance from cluster centroids

Observation    1    2    3   Assigned to cluster
S1             0    2  181   1
S2             2    0  145   2
S3           181  145    0   3
S4           221  181    2   3
S5           625  557  136   3
S6           821  745  250   3

Centroid of the three clusters

             Cluster
Variable     1    2     3
Income       5    6  21.5
Education    5    6  17.0

Reassignment of observations

Observation    1    2       3   Previous  Reassignment
S1             0    2  416.25   1         1
S2             2    0  361.25   2         2
S3           181  145   51.25   3         3
S4           221  181   34.25   3         3
S5           625  557   21.25   3         3
S6           821  745   76.25   3         3
Nonhierarchical clustering
Algorithm 2 steps
1. Select the first k observations as cluster seeds.
2. Seeds are replaced as the remaining observations are processed.
3. Reassign each observation by computing its distance from each cluster centroid.

Distance from cluster centroids

Observation    1    2    3   Assigned to cluster
S1             0  181  625   1
S2             2  145  557   1
S3           181    0  136   2
S4           221    2  106   2
S5           625  136    0   3
S6           821  250   26   3

Seed updates:
1. {1}, {2}, {3}
2. {1}, {2}, {3, 4}
3. {1, 2}, {5}, {3, 4}
4. {1, 2}, {5, 6}, {3, 4}

Centroid of the three clusters

             Cluster
Variable      1     2     3
Income      5.5  15.5  27.5
Education   5.5  14.5  19.5

Reassignment of observations

Observation      1      2      3   Previous  Reassignment
S1             0.5  200.5  716.5   1         1
S2             0.5  162.5  644.5   1         1
S3           162.5    0.5  186.5   2         2
S4           200.5    0.5  152.5   2         2
S5           590.5  120.5    6.5   3         3
S6           782.5  230.5    6.5   3         3
Nonhierarchical clustering
Algorithm 3: selecting the initial seeds
Let Sum(i) be the sum of the values of the variables for observation i, and let Max and Min be the largest and smallest of these sums. Observation i is initially assigned to cluster

C_i = trunc\left[ \frac{(Sum(i) - Min) \cdot k}{(Max - Min) \cdot 1.0001} \right] + 1

Reassignment then minimizes the ESS.

Initial assignment

Subject  Income  Education  Sum(i)  Ci  Assigned to cluster
S1        5       5          10     1   1
S2        6       6          12     1   1
S3       15      14          29     2   2
S4       16      15          31     2   2
S5       25      20          45     3   3
S6       30      19          49     3   3
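A sketch of this initial-assignment rule in NumPy, using C_i = trunc[(Sum(i) − Min)·k / ((Max − Min)·1.0001)] + 1; the constant 1.0001 keeps the maximum sum from mapping into cluster k + 1, and the result reproduces the table's Ci column:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
k = 3

sums = X.sum(axis=1)          # Sum(i) for each subject
lo, hi = sums.min(), sums.max()

# C_i = trunc[(Sum(i) - Min) * k / ((Max - Min) * 1.0001)] + 1
Ci = ((sums - lo) * k / ((hi - lo) * 1.0001)).astype(int) + 1
print(Ci)  # [1 1 2 2 3 3]
```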
Centroid of the three clusters

             Cluster
Variable      1     2     3
Income      5.5  15.5  27.5
Education   5.5  14.5  19.5

Change in ESS (for moving S1 to cluster 3) = 3[(5 − 27.5)² + (5 − 19.5)²]/2 − [(5 − 5.5)² + (5 − 5.5)²]/2, where the first term is the increase in ESS for the gaining cluster and the second the decrease for the losing cluster.

Reassignment of observations (change in ESS for moving to each cluster)

Observation       1        2       3   Previous  Reassignment
S1                -    300.5  1074.5   1         1
S2                -   243.75   966.5   1         1
S3            243.5        -   279.5   2         2
S4           300.75        -   228.5   2         2
S5            882.5    177.5       -   3         3
S6           1170.5    585.5       -   3         3
Which clustering method is best?
Hierarchical methods
Advantage: they do not require a priori knowledge of the number of clusters or of the starting partition.
Disadvantage: once an observation is assigned to a cluster, it cannot be reassigned to another cluster.
Nonhierarchical methods
The cluster centers or the initial partition must be identified before the technique can proceed to cluster observations. Nonhierarchical clustering algorithms are, in general, very sensitive to the initial partition.
The k-means algorithm and other nonhierarchical clustering algorithms perform poorly when random initial partitions are used; their performance is much superior when the results from hierarchical methods are used to form the initial partition.
Hierarchical and nonhierarchical techniques should therefore be viewed as complementary clustering techniques rather than as competing techniques.