TRANSCRIPT
Cluster analysis
POSTECH, Department of Industrial Engineering
Probability and Statistics Laboratory, Jaehyun Lee
POSTECH IE PASTA CLUSTER ANALYSIS
Definition
Cluster analysis is a technique used for combining observations into groups or clusters such that
Each group or cluster is homogeneous or compact with respect to certain characteristics
Each group should be different from other groups with respect to the same characteristics
Example: A marketing manager is interested in identifying similar cities that can be used for test marketing. The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues.
Objective of cluster analysis
The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables.
Overview of cluster analysis
Step 1: n objects measured on p variables
Step 2: Transform to an n x n similarity (distance) matrix
Step 3: Cluster formation (hierarchical or nonhierarchical clusters)
Step 4: Cluster profile
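As a sketch, these four steps map directly onto SciPy's hierarchical-clustering utilities. The six-subject income/education data used in the later slides serves here as the hypothetical n x p input, and `scipy` is an assumed dependency:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Step 1: n objects measured on p variables (income, education)
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Step 2: transform to an n x n similarity (distance) matrix
D = squareform(pdist(X, metric="euclidean"))

# Step 3: cluster formation (here: hierarchical, average linkage)
Z = linkage(pdist(X), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

# Step 4: cluster profile, e.g. the centroid of each cluster
for c in np.unique(labels):
    print(c, X[labels == c].mean(axis=0))
```

With three clusters requested, the subject pairs (S1, S2), (S3, S4), and (S5, S6) end up grouped together.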
Key problems
Measure of similarity: Fundamental to the use of any clustering technique is the computation of a measure of similarity or distance between the respective objects.
Distance-type measures – Euclidean distance for standardized data, Mahalanobis distance
Matching-type measures – Association coefficients, correlation coefficients
A procedure for forming the clusters
Hierarchical clustering – Centroid method, Single-linkage method, Complete-linkage method, Average-linkage method, Ward’s method
Nonhierarchical clustering – k-means clustering
Similarity Measure – Distance type
Minkowski metric

d_{ij} = \left( \sum_{k=1}^{p} |X_{ik} - X_{jk}|^{r} \right)^{1/r}

If r = 2, this is the Euclidean distance; if r = 1, the absolute (city-block) distance.
Consider the example below, where the similarity matrix holds squared Euclidean distances.
Data
Subject Income Education
S1 5 5
S2 6 6
S3 15 14
S4 16 15
S5 25 20
S6 30 19
Similarity matrix (squared Euclidean distances)

      S1      S2      S3      S4      S5      S6
S1    0.00    2.00  181.00  221.00  625.00  821.00
S2    2.00    0.00  145.00  181.00  557.00  745.00
S3  181.00  145.00    0.00    2.00  136.00  250.00
S4  221.00  181.00    2.00    0.00  106.00  212.00
S5  625.00  557.00  136.00  106.00    0.00   26.00
S6  821.00  745.00  250.00  212.00   26.00    0.00
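The matrix above can be reproduced with a few lines of NumPy; each entry is the squared Euclidean distance between two subjects:

```python
import numpy as np

# Income/education data for subjects S1..S6 (from the table above)
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Squared Euclidean distance between every pair of subjects
diff = X[:, None, :] - X[None, :, :]
D2 = (diff ** 2).sum(axis=-1)

print(D2[0, 1])  # S1 vs S2 -> 2.0
print(D2[0, 3])  # S1 vs S4 -> 221.0
print(D2[4, 5])  # S5 vs S6 -> 26.0
```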
Person  Weight in Pounds  Height in Feet
A       160               5.5
B       163               6.2
C       163               6.0

       Height in Feet  Height in Inches
dAB    3.08            8.92
dAC    5.02            7.81
dBC    2.01            3.12
Similarity Measure – Distance type
Euclidean distance for standardized data
To make the measure scale invariant, each squared difference in the Euclidean distance is weighted by 1/s_i^2, the reciprocal of the sample variance of variable i.

Mahalanobis distance
Here x is a p x 1 vector and S is the p x p covariance matrix. The Mahalanobis distance is designed to take into account the correlation among the variables and is also scale invariant.
Similarity matrix (squared Euclidean distances for standardized data)

       S1     S2     S3     S4     S5     S6
S1   0.00  0.035   3.00   3.68   9.55  11.09
S2  0.035   0.00   2.38   3.00   8.45   9.94
S3   3.00   2.38   0.00  0.035   1.89   2.87
S4   3.68   3.00  0.035   0.00   1.43   2.36
S5   9.55   8.45   1.89   1.43   0.00   0.28
S6  11.09   9.94   2.87   2.36   0.28   0.00
MD_{ij} = (x_i - x_j)' S^{-1} (x_i - x_j)
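A sketch of both measures in NumPy. The variance weights use the sample variance (ddof = 1), which reproduces entries of the standardized matrix such as 3.00 for S1 vs S3; the covariance matrix S is estimated from the six subjects themselves, an assumption of this sketch:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Squared Euclidean distance for standardized data: each squared
# difference is weighted by 1/s_i^2 (sample variance of variable i)
w = 1.0 / X.var(axis=0, ddof=1)
diff = X[:, None, :] - X[None, :, :]
D2_std = (w * diff ** 2).sum(axis=-1)
print(round(D2_std[0, 2], 2))  # S1 vs S3 -> 3.0

# Mahalanobis distance additionally accounts for correlation between variables
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_sq(xi, xj):
    d = xi - xj
    return float(d @ S_inv @ d)  # (x_i - x_j)' S^-1 (x_i - x_j)
```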
Similarity Measure – Matching type
Association coefficients
This type of measure is used to represent similarity for binary variables.

Similarity coefficients

        Attribute
Person  1  2  3  4  5  6
A       0  1  1  0  1  1
B       1  0  1  0  0  1

               Person A
               +    -
Person B  +    2    1    3
          -    2    1    3
               4    2    6
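The slide does not fix one particular association coefficient; the simple matching coefficient (the proportion of attributes on which the two profiles agree) is a common choice, and it can be computed from the same 2 x 2 counts:

```python
import numpy as np

# Binary attribute profiles for persons A and B (from the table above)
A = np.array([0, 1, 1, 0, 1, 1])
B = np.array([1, 0, 1, 0, 0, 1])

both_pos = int(((A == 1) & (B == 1)).sum())  # attributes where both are +
both_neg = int(((A == 0) & (B == 0)).sum())  # attributes where both are -
mismatch = int((A != B).sum())               # attributes where they differ

# Simple matching coefficient: fraction of attributes on which A and B agree
smc = (both_pos + both_neg) / len(A)
print(both_pos, both_neg, mismatch, smc)  # 2 1 3 0.5
```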
Similarity Measure – Matching type
Correlation coefficient
The Pearson product-moment correlation coefficient is used as a measure of similarity.
dAB = 1, dAC = 0.82
Person X1 X2 X3 X4
A 1 3 2 2
B 4 10 7 7
C 1 2 2 2
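The two similarity values quoted above can be verified directly (B is an exact linear transform of A, B = 3A + 1, hence the perfect correlation):

```python
import numpy as np

# Attribute profiles from the table above
A = np.array([1, 3, 2, 2], dtype=float)
B = np.array([4, 10, 7, 7], dtype=float)
C = np.array([1, 2, 2, 2], dtype=float)

r_AB = np.corrcoef(A, B)[0, 1]  # B = 3A + 1, so r = 1
r_AC = np.corrcoef(A, C)[0, 1]  # approximately 0.82
print(round(r_AB, 2), round(r_AC, 2))
```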
Hierarchical clustering
Centroid method
Each group is replaced by an average subject, which is the centroid of that group.
Data for five clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3               15.0    14.0
3        S4               16.0    15.0
4        S5               25.0    20.0
5        S6               30.0    19.0
Similarity matrix

          S1 & S2     S3      S4      S5      S6
S1 & S2      0.00  162.50  200.50  590.50  782.50
S3         162.50    0.00    2.00  136.00  250.00
S4         200.50    2.00    0.00  106.00  212.00
S5         590.50  136.00  106.00    0.00   26.00
S6         782.50  250.00  212.00   26.00    0.00
Data for four clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3 & S4          15.5    14.5
3        S5               25.0    20.0
4        S6               30.0    19.0
Similarity matrix

          S1 & S2  S3 & S4      S5      S6
S1 & S2      0.00   181.00  590.50  782.50
S3 & S4    181.00     0.00  120.50  230.50
S5         590.50   120.50    0.00   26.00
S6         782.50   230.50   26.00    0.00
Hierarchical clustering
Data for three clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3 & S4          15.5    14.5
3        S5 & S6          27.5    19.5
Similarity matrix
S1 & S2 S3 & S4 S5 & S6
S1 & S2 0.00 181.00 680.00
S3 & S4 181.00 0.00 169.00
S5 & S6 680.00 169.00 0.00
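The merge decision at this stage can be checked numerically: the centroid method computes squared Euclidean distances between cluster centroids and merges the closest pair. A sketch:

```python
import numpy as np

# Centroids at the three-cluster stage (income, education)
centroids = {
    "S1&S2": np.array([5.5, 5.5]),
    "S3&S4": np.array([15.5, 14.5]),
    "S5&S6": np.array([27.5, 19.5]),
}

def sq_dist(a, b):
    """Squared Euclidean distance between two centroids."""
    return float(((a - b) ** 2).sum())

names = list(centroids)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], names[j], sq_dist(centroids[names[i]], centroids[names[j]]))
# Smallest distance is 169.0 (S3&S4 vs S5&S6), so those two clusters merge next.
```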
Hierarchical clustering
Single-linkage method
The distance between two clusters is represented by the minimum of the distances between all possible pairs of subjects in the two clusters.

D_{13}^2 = 181 and D_{23}^2 = 145, so D_{(12)3}^2 = min(181, 145) = 145
D_{14}^2 = 221 and D_{24}^2 = 181, so D_{(12)4}^2 = min(221, 181) = 181
Similarity matrix
S1 & S2 S3 S4 S5 S6
S1 & S2 0.00 145.00 181.00 557.00 745.00
S3 145.00 0.00 2.00 136.00 250.00
S4 181.00 2.00 0.00 106.00 212.00
S5 557.00 136.00 106.00 0.00 26.00
S6 745.00 250.00 212.00 26.00 0.00
Hierarchical clustering
Complete-linkage method
The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters.

D_{13}^2 = 181 and D_{23}^2 = 145, so D_{(12)3}^2 = max(181, 145) = 181
D_{15}^2 = 625 and D_{25}^2 = 557, so D_{(12)5}^2 = max(625, 557) = 625
Similarity matrix
S1 & S2 S3 S4 S5 S6
S1 & S2 0.00 181.00 221.00 625.00 821.00
S3 181.00 0.00 2.00 136.00 250.00
S4 221.00 2.00 0.00 106.00 212.00
S5 625.00 136.00 106.00 0.00 26.00
S6 821.00 250.00 212.00 26.00 0.00
Hierarchical clustering
Average-linkage method
The distance between two clusters is obtained by taking the average distance between all pairs of subjects in the two clusters.

D_{(12)3}^2 = (D_{13}^2 + D_{23}^2) / 2 = (181 + 145) / 2 = 163
Similarity matrix
S1 & S2 S3 S4 S5 S6
S1 & S2 0.00 163.00 201.00 591.00 783.00
S3 163.00 0.00 2.00 136.00 250.00
S4 201.00 2.00 0.00 106.00 212.00
S5 591.00 136.00 106.00 0.00 26.00
S6 783.00 250.00 212.00 26.00 0.00
Hierarchical clustering
Ward’s method
It forms clusters by maximizing within-cluster homogeneity, with the within-group sum of squares as the measure of homogeneity. At each step, Ward’s method merges the two clusters whose union gives the smallest increase in the total within-group (within-cluster) sum of squares,

SSE = \sum_{j=1}^{k} \left[ \sum_{i=1}^{n_j} X_{ij}^2 - \frac{1}{n_j} \left( \sum_{i=1}^{n_j} X_{ij} \right)^2 \right]
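Ward's criterion can be checked numerically: for any partition, compute the total within-cluster sum of squares and merge the pair of clusters whose union increases it least. A sketch using the two-cluster partition {S1, S2} and {S3, S4, S5, S6}:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

def sse(cluster):
    """Within-cluster sum of squares over all variables:
    sum_i x_ij^2 - (1/n_j)(sum_i x_ij)^2, summed over the variables."""
    n = len(cluster)
    return float((cluster ** 2).sum() - (cluster.sum(axis=0) ** 2).sum() / n)

# Partition {S1, S2} and {S3, S4, S5, S6}
total = sse(X[:2]) + sse(X[2:])
print(sse(X[:2]), sse(X[2:]), total)  # 1.0 183.0 184.0
```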
Evaluating the cluster solution and determining the number of clusters
Root-mean-square standard deviation (RMSSTD) of the new cluster
RMSSTD is the pooled standard deviation of all the variables forming the cluster.
Pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables
R-squared (RS)
RS is the ratio of SSb to SSt (SSt = SSb + SSw).
RS of CL2 is (701.667 − 184.000) / 701.667 = 0.738, where 184.000 is the total SSw of the two-cluster solution (183 for CL2 plus 1 for S1 & S2).
Within-group sum of squares and degrees of freedom for clusters formed in steps 1–5

                 Within-Group Sum of Squares      Degrees of Freedom
Step  Cluster   Income    Education   Pooled    Income  Education  Pooled   RMSSTD
1     CL5         0.500      0.500      1.000      1        1         2      0.70
2     CL4         0.500      0.500      1.000      1        1         2      0.70
3     CL3        12.500      0.500     13.000      1        1         2      2.55
4     CL2       157.000     26.000    183.000      3        3         6      5.52
5     CL1       498.833    202.833    701.667      5        5        10      8.38
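These statistics can be recomputed directly from the raw income/education data; the sketch below reproduces the pooled CL1 sum of squares, the corresponding RMSSTD, and the RS of the two-cluster solution:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

def pooled_ss(cluster):
    """Pooled within-cluster sum of squares over all variables."""
    return float(((cluster - cluster.mean(axis=0)) ** 2).sum())

# Step 5 (CL1): all six subjects in one cluster
ss_cl1 = pooled_ss(X)            # approximately 701.667
rmsstd = np.sqrt(ss_cl1 / 10)    # pooled SS / pooled df (5 + 5)

# Two-cluster solution {S1, S2}, {S3, S4, S5, S6}: SSw = 1 + 183 = 184
ssw = pooled_ss(X[:2]) + pooled_ss(X[2:])
rs = (ss_cl1 - ssw) / ss_cl1     # RS = SSb / SSt, approximately 0.738
print(round(rmsstd, 2), round(rs, 3))
```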
Evaluating the cluster solution and determining the number of clusters
Semipartial R-squared (SPR)
The loss of homogeneity is the new cluster’s pooled SSw minus the sum of the pooled SSw’s of the clusters joined to obtain it. If the loss of homogeneity is large, the new cluster was obtained by merging two heterogeneous clusters.
SPR is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster, divided by SSt.
SPR of CL2 is (183 − (1 + 13)) / 701.667 = 0.241
Distance between clusters
It is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged, and it is termed the centroid distance (CD).

Data for three clusters

Cluster  Cluster members  Income  Education
1        S1 & S2          5.5     5.5
2        S3 & S4          15.5    14.5
3        S5 & S6          27.5    19.5

CD_{CL2} = \sqrt{(27.5 - 15.5)^2 + (19.5 - 14.5)^2} = 13.00
Evaluating the cluster solution and determining the number of clusters
Summary of the statistics for evaluating cluster solution
Statistic Concept measured Comments
RMSSTD Homogeneity of new clusters Value should be small
SPR Homogeneity of merged clusters Value should be small
RS Homogeneity of new clusters Value should be high
CD Homogeneity of merged clusters Value should be small
Nonhierarchical clustering
The data are divided into k partitions or groups, with each partition representing a cluster.
The number of clusters must be known a priori.
Steps
1. Select k initial cluster centroids or seeds, where k is the number of clusters desired.
2. Assign each observation to the cluster to which it is closest.
3. Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule.
4. Stop if there is no reallocation of data points or if the reassignment satisfies the criteria set by the stopping rule. Otherwise go to Step 2.
Nonhierarchical algorithms differ in
the method used for obtaining initial cluster centroids or seeds
the rule used for reassigning observations
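Steps 1–4 can be sketched as a small Lloyd-style k-means loop on the slides' six-subject data, using the first k observations as seeds and squared Euclidean distances (empty clusters are ignored in this sketch):

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
k = 3

# Step 1: take the first k observations as the initial centroids (seeds)
centroids = X[:k].copy()

while True:
    # Step 2: assign each observation to the closest centroid (squared Euclidean)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    # Step 3: recompute centroids from the current assignment
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 4: stop when no reallocation occurs (centroids unchanged)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)  # [0 1 2 2 2 2]: S1 and S2 stay singletons, S3..S6 form one cluster
```

This matches the assignment reached by Algorithm 1 on the six subjects.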
Nonhierarchical clustering
Algorithm 1 steps
1. Select the first k observations as cluster centers.
2. Compute the centroid of each cluster.
3. Reassign each observation by computing its distance from each cluster centroid.

Initial cluster centroids

             Cluster
Variable     1    2    3
Income       5    6   15
Education    5    6   14

Distance from cluster centroids

Observation    1    2    3   Assigned to cluster
S1             0    2  181   1
S2             2    0  145   2
S3           181  145    0   3
S4           221  181    2   3
S5           625  557  136   3
S6           821  745  250   3

Centroid of the three clusters

             Cluster
Variable     1    2     3
Income       5    6  21.5
Education    5    6  17.0

Reassignment of observations

Observation    1    2       3   Previous  Reassignment
S1             0    2  416.25   1         1
S2             2    0  361.25   2         2
S3           181  145   51.25   3         3
S4           221  181   34.25   3         3
S5           625  557   21.25   3         3
S6           821  745   76.25   3         3
Nonhierarchical clustering
Algorithm 2 steps
1. Select the first k observations as cluster seeds.
2. Seeds are replaced as the remaining observations are processed.
3. Reassign each observation by computing its distance from each cluster centroid.

Distance from cluster centroids

Observation    1    2    3   Assigned to cluster
S1             0  181  625   1
S2             2  145  557   1
S3           181    0  136   2
S4           221    2  106   2
S5           625  136    0   3
S6           821  250   26   3

Seed updates:
1. {1}, {2}, {3}
2. {1}, {2}, {3, 4}
3. {1, 2}, {5}, {3, 4}
4. {1, 2}, {5, 6}, {3, 4}

Centroid of the three clusters

             Cluster
Variable      1     2     3
Income      5.5  15.5  27.5
Education   5.5  14.5  19.5

Reassignment of observations

Observation      1      2      3   Previous  Reassignment
S1             0.5  200.5  716.5   1         1
S2             0.5  162.5  644.5   1         1
S3           162.5    0.5  186.5   2         2
S4           200.5    0.5  152.5   2         2
S5           590.5  120.5    6.5   3         3
S6           782.5  230.5    6.5   3         3
Nonhierarchical clustering
Algorithm 3: selecting the initial seeds
Let Sum(i) be the sum of the values of the variables for observation i, and let Max and Min be the largest and smallest of these sums. Observation i is initially assigned to cluster

C_i = trunc\left[ \frac{(Sum(i) - Min) \cdot k}{(Max - Min) \cdot 1.0001} \right] + 1

Reassignment then minimizes the ESS.

Initial assignment

Subject  Income  Education  Sum(i)  Ci  Assigned to cluster
S1        5       5          10     1   1
S2        6       6          12     1   1
S3       15      14          29     2   2
S4       16      15          31     2   2
S5       25      20          45     3   3
S6       30      19          49     3   3
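A sketch of this initial-assignment rule in NumPy, using C_i = trunc[(Sum(i) − Min)·k / ((Max − Min)·1.0001)] + 1; the constant 1.0001 keeps the maximum sum from mapping into cluster k + 1, and the result reproduces the table's Ci column:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
k = 3

sums = X.sum(axis=1)          # Sum(i) for each subject
lo, hi = sums.min(), sums.max()

# C_i = trunc[(Sum(i) - Min) * k / ((Max - Min) * 1.0001)] + 1
Ci = ((sums - lo) * k / ((hi - lo) * 1.0001)).astype(int) + 1
print(Ci)  # [1 1 2 2 3 3]
```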
Centroid of the three clusters

             Cluster
Variable      1     2     3
Income      5.5  15.5  27.5
Education   5.5  14.5  19.5

Change in ESS (for moving S1 to cluster 3) = 3[(5 − 27.5)² + (5 − 19.5)²]/2 − [(5 − 5.5)² + (5 − 5.5)²]/2, where the first term is the increase in ESS for the gaining cluster and the second the decrease for the losing cluster.

Reassignment of observations (change in ESS for moving to each cluster)

Observation       1        2       3   Previous  Reassignment
S1                -    300.5  1074.5   1         1
S2                -   243.75   966.5   1         1
S3            243.5        -   279.5   2         2
S4           300.75        -   228.5   2         2
S5            882.5    177.5       -   3         3
S6           1170.5    585.5       -   3         3
Which clustering method is best?
Hierarchical methods
Advantage: they do not require a priori knowledge of the number of clusters or of the starting partition.
Disadvantage: once an observation is assigned to a cluster, it cannot be reassigned to another cluster.
Nonhierarchical methods
The cluster centers or the initial partition must be identified before the technique can proceed to cluster observations. Nonhierarchical clustering algorithms are, in general, very sensitive to the initial partition.
The k-means algorithm and other nonhierarchical clustering algorithms perform poorly when random initial partitions are used; their performance is much superior when the results from hierarchical methods are used to form the initial partition.
Hierarchical and nonhierarchical techniques should therefore be viewed as complementary clustering techniques rather than as competing techniques.