
Page 1: Clustering

Clustering

Page 2: Clustering

Introduction

• Cluster analysis is the best-known descriptive data mining method. Given a data matrix composed of n observations (rows) and p variables (columns), the objective of cluster analysis is to cluster the observations into groups that are internally homogeneous (internal cohesion) and heterogeneous from group to group (external separation).

Page 3: Clustering

• In order to compare observations, we need to introduce the idea of a distance measure, or proximity, among them.

• Distances are normally used to measure the similarity or dissimilarity between two data objects Xi and Xj. The larger the similarity, the smaller the dissimilarity and hence the smaller the distance between the pair of objects.

Page 4: Clustering

• When the variables of interest are quantitative, the indexes of proximity typically used are called distances.

• If the variables are qualitative, the distance between observations can be measured by indexes of similarity.

• If the data are contained in a contingency table, the chi-squared distance can also be employed.

• There are also indexes of proximity that are used on a mixture of qualitative and quantitative variables.

Page 5: Clustering

• In the numeric domain, a popular measure is the Minkowski distance. For two points xi and xj it is defined as

d(xi, xj) = ( Σk=1..n |xik − xjk|^q )^(1/q)

where q is a positive integer and n is the number of attributes involved. If q = 1, then d is termed the Manhattan distance, while for q = 2 it is called the Euclidean distance.
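
As a concrete illustration, here is a minimal Python sketch (the function name and the sample vectors are ours, not from the source) computing the Minkowski distance and its Manhattan and Euclidean special cases:

```python
def minkowski_distance(x, y, q=2):
    """Minkowski distance between two equal-length numeric vectors.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Illustrative example
x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski_distance(x, y, q=1))  # Manhattan: 5.0
print(minkowski_distance(x, y, q=2))  # Euclidean: sqrt(13) ≈ 3.606
```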

Page 6: Clustering

Euclidean distance

• Consider a data matrix containing only quantitative (or binary) variables. If x and y are rows from the data matrix then a function d(x, y) is said to be a distance between two observations if it satisfies the following properties:

• Non-negativity. d(x, y) ≥ 0, for all x and y.
• Identity. d(x, y) = 0 ⇔ x = y, for all x and y.
• Symmetry. d(x, y) = d(y, x), for all x and y.
• Triangular inequality. d(x, y) ≤ d(x, z) + d(y, z), for all x, y and z.

Page 7: Clustering

• All such pairwise distances can be collected in an n × n distance matrix Δ = [dij], which is symmetric and has zeros on its main diagonal, where the generic element dij is a measure of distance between the row vectors xi and xj.

• The Euclidean distance is the most commonly used distance measure. For any two units indexed by i and j, it is defined as the square root of the sum of squared differences between the corresponding elements of the two row vectors in the p-dimensional Euclidean space:

d(xi, xj) = √( Σk=1..p (xik − xjk)² )

Page 8: Clustering

• The Euclidean distance can be strongly influenced by a single large difference in one dimension of the values, because the square will greatly magnify that difference.

• Dimensions having different scales (e.g. some values measured in centimetres, others in metres) are often the source of these overstated differences.

• To overcome this limitation, the Euclidean distance is often calculated not on the original variables but on useful transformations of them. The most common choice is to standardise the variables.

Page 9: Clustering

• After standardisation, every transformed variable contributes to the calculation of the distance with equal weight. When the variables are standardised, they have zero mean and unit variance; furthermore, it can be shown that, for i, j = 1, ..., p:

• where rij is the correlation coefficient between the observations xi and xj. Thus the Euclidean distance between two standardised observations is a function of the correlation coefficient between them.
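
A small illustrative sketch (the toy data and variable names are ours) of standardising the columns of a data matrix before computing Euclidean distances, using NumPy and SciPy:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data matrix: n = 4 observations, p = 2 variables on very different scales
X = np.array([[170.0, 0.02],
              [180.0, 0.05],
              [165.0, 0.04],
              [175.0, 0.01]])

# Standardise each variable (column): zero mean, unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances, collected in a symmetric distance matrix
D_raw = squareform(pdist(X, metric="euclidean"))   # dominated by the first variable
D_std = squareform(pdist(Z, metric="euclidean"))   # both variables contribute equally
print(D_raw.round(2))
print(D_std.round(2))
```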

Page 10: Clustering

Similarity measures

• Given a finite set of observations ui ∈ U, a function S(ui, uj) = Sij from U × U to ℝ is called an index of similarity if it satisfies the following properties:

• Non-negativity. Sij ≥ 0, for all ui, uj ∈ U.

• Normalisation. Sii = 1, for all ui ∈ U.

• Symmetry. Sij = Sji, for all ui, uj ∈ U.

Unlike distances, the indexes of similarity can be applied to all kinds of variables, including qualitative variables. They are defined with reference to the observation indexes, rather than to the corresponding row vectors, and they assume values in the closed interval [0, 1], making them easy to interpret.

Page 11: Clustering

• Consider data on n visitors to a website, which has P pages. Correspondingly, there are P binary variables, which assume the value 1 if the specific page has been visited, or else the value 0.

• To demonstrate the application of similarity indexes, we now analyse only the data concerning the behaviour of the first two visitors (2 of the n observations) to the website, among the P = 28 web pages that they can visit.

• Table 4.1 summarises the behaviour of the two visitors, treating each page as a binary variable.

Page 12: Clustering

• CP, for co-presence, or positive matches;
• CA, for co-absence, or negative matches;
• PA for presence–absence and AP for absence–presence, where the first letter refers to visitor A and the second to visitor B.

• Note that, of the 28 pages considered, two have been visited by both visitors (CP = 2); 21 pages have been visited by neither A nor B (CA = 21); finally, the frequencies 4 and 1 (PA and AP) indicate the numbers of pages visited by only one of the two visitors.

Page 13: Clustering

Russel–Rao similarity index

• The Russel–Rao similarity index is a function of the co-presences and is equal to the ratio between the number of co-presences and the total number of binary variables considered, P:

Sij = CP / P

• From Table 4.1 we have Sij = 2/28 ≈ 0.07.

Page 14: Clustering

Jaccard similarity index

• This index is the ratio between the number of co-presences and the total number of variables, excluding those that manifest co-absences:

Sij = CP / (P − CA) = CP / (CP + PA + AP)

• Note that this index is undefined if the two visitors or, more generally, the two observations, manifest only co-absences (CA = P).

• In the example above we have Sij = 2/(28 − 21) = 2/7 ≈ 0.29.

Page 15: Clustering

Sokal–Michener similarity index

• This is the ratio between the number of co-presences or co-absences and the total number of variables:

Sij = (CP + CA) / P

• In our example, Sij = (2 + 21)/28 = 23/28 ≈ 0.82.

• Its complement to one (a dissimilarity index) corresponds to the average squared Euclidean distance between the two vectors of binary variables associated with the observations:

1 − Sij = (PA + AP) / P

• It is one of the most commonly used indexes of similarity.
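
All three indexes can be computed directly from the counts of matches. A short sketch (function and variable names are ours), applied to binary vectors with counts like those of Table 4.1:

```python
def binary_similarities(a, b):
    """Russel–Rao, Jaccard and Sokal–Michener indexes for two binary vectors."""
    assert len(a) == len(b)
    P  = len(a)
    CP = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # co-presences
    CA = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # co-absences
    russel_rao     = CP / P
    jaccard        = CP / (P - CA) if P != CA else float("nan")  # undefined if only co-absences
    sokal_michener = (CP + CA) / P
    return russel_rao, jaccard, sokal_michener

# Counts as in Table 4.1: CP = 2, CA = 21, PA = 4, AP = 1, P = 28
visitor_A = [1, 1] + [0] * 21 + [1, 1, 1, 1] + [0]
visitor_B = [1, 1] + [0] * 21 + [0, 0, 0, 0] + [1]
print(binary_similarities(visitor_A, visitor_B))  # ≈ (0.071, 0.286, 0.821)
```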

Page 16: Clustering

Patterns to be clustered are either labelled or unlabelled. Based on this, we have:

• Clustering algorithms which typically group sets of unlabelled patterns. These paradigms are so popular that clustering is often viewed as unsupervised classification of unlabelled patterns.

• Algorithms which cluster labelled patterns. These paradigms are practically important and are called supervised clustering or labelled clustering. Supervised clustering is helpful in identifying clusters within collections of labelled patterns. Abstractions can be obtained in the form of cluster representatives/descriptions which are useful for efficient classification.

Page 17: Clustering

• The process of clustering is carried out so that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in a corresponding sense.

• Ideally, the Euclidean distance between any two points belonging to the same cluster is smaller than that between any two points belonging to different clusters.

Page 18: Clustering
Page 19: Clustering

• Let us consider a two-dimensional data set of 10 vectors given by

• X1=(1,1) X2=(2,1) X3=(1,2) X4=(2,2) X5=(6,1) X6=(7,1) X7=(6,2) X8=(6,7) X9=(7,7) X10=(7,6)

Page 20: Clustering

• Let us say that any two points belong to the same cluster if the distance between them is less than a threshold. Specifically, in this example, we use the squared Euclidean distance to characterise the distance between the points and use a threshold of 5 units to cluster them. The squared Euclidean distance between two points xi and xj is defined as

d²(xi, xj) = Σk=1..d (xik − xjk)²

where d is the dimensionality of the points.

Page 21: Clustering

• It is possible to represent the squared Euclidean distances between all pairs of points using a distance matrix.
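
A minimal sketch of this computation (the greedy grouping step is our own illustration of the thresholding described above): build the matrix of squared Euclidean distances for the ten points and link any two points whose squared distance is below 5 units.

```python
import numpy as np

# The ten two-dimensional patterns X1, ..., X10
X = np.array([(1, 1), (2, 1), (1, 2), (2, 2), (6, 1),
              (7, 1), (6, 2), (6, 7), (7, 7), (7, 6)], dtype=float)

# Matrix of squared Euclidean distances
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
print(D2.astype(int))

# Greedy grouping: put each point into the first existing cluster containing a
# point within squared distance 5. For this well-separated data this recovers
# the three clusters; in general a connected-components pass would be needed.
clusters = []
for i in range(len(X)):
    for c in clusters:
        if any(D2[i, j] < 5 for j in c):
            c.append(i)
            break
    else:
        clusters.append([i])
print([[f"X{j + 1}" for j in c] for c in clusters])
# [['X1', 'X2', 'X3', 'X4'], ['X5', 'X6', 'X7'], ['X8', 'X9', 'X10']]
```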

Page 22: Clustering

Note the following points:

1. In general, the clustering structure may not be so obvious from the distance matrix. For example, if the patterns are considered in the order X1, X5, X2, X8, X3, X6, X9, X4, X7, X10, the sub-matrices corresponding to the three clusters are not easy to visualise.

2. In this example, we have used the squared Euclidean distance to denote the distance between two points. It is possible to use other distance functions as well.

3. Here, we defined each cluster to have patterns such that the distance between any two patterns in a cluster (intra-cluster distance) is less than 5 units and the distance between two points belonging to two different clusters (inter-cluster distance) is greater than 5 units.

Page 23: Clustering
Page 24: Clustering

• Partition generated by clustering is either hard or soft. A hard clustering algorithm generates clusters that are non-overlapping.

• Consider a partition of the set of patterns X into K clusters X1, X2, ..., XK such that X1 ∪ X2 ∪ ... ∪ XK = X and no Xi = ∅. If, in addition, Xi ∩ Xj = ∅ for all i and j, i ≠ j, then we have a hard partition.

• It is possible to generate a soft partition where a pattern belongs to more than one cluster. In such a case, we will have overlapping clusters.

Page 25: Clustering

• In general, for a set of n patterns, it is possible to show that the number of 2-partitions is 2^(n−1) − 1.

• More generally, the number of partitions of n patterns into m blocks (clusters) is the Stirling number of the second kind, given by

S(n, m) = (1/m!) Σi=0..m (−1)^i C(m, i) (m − i)^n

Page 26: Clustering

• Domain knowledge is implicit in clustering. Partitioning is done based on domain knowledge or user-provided information in different forms; this could include the choice of similarity measure, the number of clusters, and threshold values. Further, because of the well-known theorem of the ugly duckling, clustering is not possible without using such extra-logical information.

• The theorem of the ugly duckling states that the number of predicates shared by any two patterns is constant when all possible predicates are used to represent patterns. So, if we judge similarity based on the number of predicates shared, then any two patterns are equally similar.

Page 27: Clustering

• A cluster of points is represented by its centroid or its medoid.

• The centroid is the sample mean of the points in cluster C; it is given by μC = (1/NC) Σx∈C x, where NC is the number of patterns in cluster C.

• The medoid is the most centrally located point in the cluster; more formally, the medoid is that point in the cluster from which the sum of the distances from the points in the cluster is the minimum.

Page 28: Clustering
Page 29: Clustering

• A point which is far away from all of the points in the cluster is an outlier.

• The centroid of the data can shift without any bound based on the location of the outlier; however, the medoid does not shift beyond the boundary of the original cluster irrespective of the location of the outlier. So, clustering algorithms that use medoids are more robust in the presence of noisy patterns or outliers.

Page 30: Clustering

Example

Page 31: Clustering
Page 32: Clustering
Page 33: Clustering

It is possible to have more than one representative per cluster; for example, four extreme points labelled e1, e2, e3, e4 can represent the cluster as shown in the figure. It is also possible to use a logical description to characterise the cluster. For example, using a conjunction of disjunctions, we have the following description of the cluster:

(x = a1, ..., a2) ∧ (y = b1, ..., b3)

Page 34: Clustering

Why is Clustering Important?

• Clusters and their generated descriptions are useful in several decision-making situations such as classification, prediction, etc.

• The number of cluster representatives obtained is smaller than the number of input patterns, so clustering achieves data reduction.

• Clustering can also be applied to the set of features in order to achieve dimensionality reduction, which helps improve classification accuracy at a reduced computational cost.

• One important application of clustering is data re-organisation.

• Another important application of clustering is identifying outliers.

• It is also possible to use clustering for guessing missing values in a pattern matrix.

Page 35: Clustering

• For example, let the value of the jth feature of the ith pattern be missing. We can still cluster the patterns based on features other than the jth one.

• Based on this grouping, we find the cluster to which the ith pattern belongs. The missing value can then be estimated based on the jth feature value of all the other patterns belonging to the cluster.

Page 36: Clustering

The important steps in the clustering process are:

1. Pattern representation
2. Definition of an appropriate similarity/dissimilarity function
3. Selection of a clustering algorithm and its use to generate a partition/description of the clusters
4. Use of these abstractions in decision making

Page 37: Clustering

• At the top, there is a distinction between hard and soft clustering paradigms based on whether the partitions generated are overlapping or not. Hard clustering algorithms are either hierarchical, where a nested sequence of partitions is generated, or partitional where a partition of the given data set is generated.

• Soft clustering algorithms are based on fuzzy sets, rough sets, artificial neural networks (ANNs), or evolutionary algorithms, specifically genetic algorithms (GAs).

Page 38: Clustering

Taxonomy of clustering approaches

Hard Clustering

Soft Clustering

Page 39: Clustering

Partitional Clustering

• Partitional clustering algorithms generate a hard or soft partition of the data.

• The most popular of this category of algorithms is the k-means algorithm.

Page 40: Clustering

k-Means Algorithm

A simple description of the k-means algorithm is given below.

Step 1: Select k out of the given n patterns as the initial cluster centres. Assign each of the remaining n − k patterns to one of the k clusters; a pattern is assigned to its closest centre/cluster.
Step 2: Compute the cluster centres based on the current assignment of patterns.
Step 3: Assign each of the n patterns to its closest centre/cluster.
Step 4: If there is no change in the assignment of patterns to clusters during two successive iterations, then stop; else, go to Step 2.

Page 41: Clustering

k-Means Clustering Algorithm

1. Choose a value of k.
2. Select k objects in an arbitrary fashion. Use these as the initial set of k centroids.
3. Assign each of the objects to the cluster whose centroid is nearest to it.
4. Recalculate the centroids of the k clusters.
5. Repeat steps 3 and 4 until the centroids no longer move.

A minimal implementation sketch of these steps is given below.
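
A minimal NumPy sketch of the steps above (the function name, the random seeding and the tolerance handling are ours; this is not the textbook's code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: arbitrary initial centroids, then alternate
    nearest-centroid assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 2
    labels = np.full(len(X), -1)
    for it in range(max_iter):
        # step 3: assign each object to the cluster with the nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # step 5: assignments unchanged, stop
        labels = new_labels
        # step 4: recompute the centroid of every non-empty cluster
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Usage on the ten points of the earlier example, with k = 3
X = np.array([(1, 1), (2, 1), (1, 2), (2, 2), (6, 1),
              (7, 1), (6, 2), (6, 7), (7, 7), (7, 6)], dtype=float)
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)
```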

Page 42: Clustering

• There are a variety of schemes for selecting the initial cluster centres; these include:

• selecting the first k of the given n patterns,
• selecting k random patterns out of the given n patterns, and
• viewing the initial cluster seed selection as an optimisation problem and using a robust tool to search for the globally optimal solution.

Page 43: Clustering

• An important property of the k-means algorithm is that it implicitly minimises the sum of squared deviations of patterns in a cluster from the centre.

• More formally, if Ci is the ith cluster and μi is its centre, then the criterion function minimised by the algorithm is

J = Σi=1..k Σx∈Ci ‖x − μi‖²

• In clustering literature, it is called the sum-of-squared-error criterion, within-group-error sum-of-squares, or simply squared-error criterion.

Page 44: Clustering

Example

Pages 45–54: Clustering

(Figures illustrating the progress of k-means clustering over successive iterations.)

Page 55: Clustering

• It can be proved that the k-means algorithm will always terminate, but it does not necessarily find the best set of clusters, corresponding to minimising the value of the objective function.

• The initial selection of centroids can significantly affect the result. To overcome this, the algorithm can be run several times for a given value of k, each time with a different choice of the initial k centroids, the set of clusters with the smallest value of the objective function then being taken.

• The k-means algorithm does not guarantee the globally optimal partition.

• The time complexity of the algorithm is O(nkdl), where n is the number of patterns, k is the number of clusters, d is the dimensionality and l is the number of iterations. The space requirement is O(kd). These features make the algorithm very attractive.

• k-means can be applied in applications involving large volumes of data, for example, satellite image data. It is best to use the k-means algorithm when the clusters are hyper-spherical. It does not generate the intended partition if the partition has non-spherical clusters.

Page 56: Clustering

• The objective function which is employed by K-means is called the Sum of Squared Errors (SSE) or Residual Sum of Squares (RSS).

• Given a dataset D = {x1, x2, ..., xN} consisting of N points, let us denote the clustering obtained after applying K-means by C = {C1, C2, ..., Ck, ..., CK}. The SSE for this clustering is defined as

SSE(C) = Σk=1..K Σxi∈Ck ‖xi − ck‖²

where ck is the centroid of cluster Ck. The objective is to find a clustering that minimizes the SSE score. The iterative assignment and update steps of the K-means algorithm aim to minimize the SSE score for the given set of centroids.

Page 57: Clustering

Minimization of Sum of Squared Errors

• K-means clustering is essentially an optimization problem with the goal of minimizing the SSE objective function.

• Let us see why the mean of the points in a cluster is the best prototype representative for that cluster in the K-means algorithm.

• Let Ck denote the kth cluster, xi a point in Ck, and ck the representative of the kth cluster.

• We can solve for the representative of Ck that minimizes the SSE by differentiating the SSE with respect to ck and setting the derivative equal to zero, as sketched below.
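
A compact version of the derivation, written in LaTeX notation (assuming the squared Euclidean distance used in the SSE above):

\[
\frac{\partial\,\mathrm{SSE}}{\partial c_k}
= \frac{\partial}{\partial c_k}\sum_{x_i \in C_k}\lVert x_i - c_k\rVert^2
= -2\sum_{x_i \in C_k}(x_i - c_k) = 0
\quad\Longrightarrow\quad
c_k = \frac{1}{|C_k|}\sum_{x_i \in C_k} x_i .
\]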

Page 58: Clustering

• Hence, the best representative for minimizing the SSE of a cluster is the mean of the points in the cluster. In K-means, the SSE monotonically decreases with each iteration, so the algorithm eventually converges to a local minimum of the SSE.

Page 59: Clustering

Factors Affecting K-Means

• The major factors that can impact the performance of the K-means algorithm are the following:

1. Choosing the initial centroids.
2. Estimating the number of clusters K.

Page 60: Clustering

Popular Initialization Methods

• In his classical paper [33], MacQueen proposed a simple initialization method which chooses K seeds at random. This is the simplest method and has been widely used in the literature.

• The other popular K-means initialization methods which have been successfully used to improve the clustering performance are given below.

Page 61: Clustering

1. Hartigan and Wong [19]: Using the concept of nearest neighbour density, this method suggests that points which are well separated and have a large number of points within their surrounding multi-dimensional sphere are good candidates for initial seeds. The average pair-wise Euclidean distance d1 between points is calculated first. Subsequent points are then chosen in order of decreasing density while simultaneously maintaining a separation of at least d1 from all previously chosen seeds.

Page 62: Clustering

2. Bradley and Fayyad [5]: Choose random subsamples from the data and apply K-means clustering to each of these subsamples using random seeds. The centroids from each of these subsamples are then collected, and a new dataset consisting of only these centroids is created. This new dataset is clustered using these centroids as the initial seeds, and the seed set that yields the minimum SSE is chosen.

Page 63: Clustering

3. K-Means++ [1]: The K-means++ algorithm carefully selects the initial centroids for K-means clustering. It follows a simple probability-based approach: the first centroid is selected uniformly at random; each subsequent centroid is then chosen from the remaining points with probability proportional to its squared distance from the nearest already-selected centroid, so points far from the current centroids are more likely to be picked. The selection continues until K centroids have been chosen, and then K-means clustering is carried out using these centroids as the initial seeds.
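
A short sketch of the seeding step (our own illustrative code, not the reference implementation); in practice, library implementations such as scikit-learn's KMeans use k-means++ seeding by default:

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """Pick k initial centroids: the first uniformly at random, then each next point
    with probability proportional to its squared distance to the nearest chosen seed."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    while len(seeds) < k:
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        probs = d2 / d2.sum()
        seeds.append(X[rng.choice(len(X), p=probs)])
    return np.array(seeds)

# seeds = kmeans_pp_seeds(X, k=3); k-means is then run starting from these centroids.
```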

Page 64: Clustering

Estimating the Number of Clusters

• The problem of estimating the correct number of clusters (K) is one of the major challenges for the K-means clustering. Several researchers have proposed new methods for addressing this challenge in the literature.

1. Calinski–Harabasz Index [6]: The Calinski–Harabasz index is defined as

CH(K) = [B(K)/(K − 1)] / [W(K)/(N − K)]

where N is the number of data points, and B(K) and W(K) are the between-cluster and within-cluster sums of squares, respectively, for a clustering with K clusters. The number of clusters is chosen by maximizing CH(K).

Page 65: Clustering

2. Silhouette Coefficient [26]: This is formulated by considering both the intra- and inter-cluster distances. For a given point xi, first the average of the distances to all other points in the same cluster is calculated; this value is set to ai.

Then, for each cluster that does not contain xi, the average distance of xi to all the data points in that cluster is computed, and the minimum of these averages is set to bi.

Using these two values, the silhouette coefficient of a point is estimated as s(xi) = (bi − ai) / max(ai, bi). To evaluate the quality of a clustering, one can compute the average silhouette coefficient of all the points (the average silhouette width); the value of K maximizing this average can be chosen.
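
Both indices are available in scikit-learn, so a typical scan over candidate values of K looks roughly like this (a sketch; the data X and the range of K are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ch = calinski_harabasz_score(X, labels)
    sil = silhouette_score(X, labels)
    print(k, round(ch, 1), round(sil, 3))
# Choose the K that maximises the chosen index.
```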

Page 66: Clustering

3. Newman and Girvan [40]: In this method, the dendrogram is viewed as a graph, and a betweenness score (used as a dissimilarity measure between the edges) is proposed. The procedure starts by calculating the betweenness score of all the edges in the graph. The edge with the highest betweenness score is then removed, and the betweenness scores of the remaining edges are recomputed; this continues until the final set of connected components is obtained. The cardinality of this set serves as a good estimate for K.

Page 67: Clustering

4. ISODATA [2]: ISODATA was proposed for clustering data based on the nearest centroid method. In this method, K-means is first run on the dataset to obtain the clusters. Clusters are then merged if their distance is less than a threshold φ or if they have fewer than a certain number of points. Similarly, a cluster is split if its within-cluster standard deviation exceeds a user-defined threshold.

Page 68: Clustering

Variations of K-Means

• The simple framework of the K-means algorithm makes it very flexible to modify and build more efficient algorithms on top of it. Some of the variations proposed to the K-means algorithm are based on (i) Choosing different representative prototypes for the clusters (K-medoids, K-medians, K-modes), (ii) choosing better initial centroid estimates (Intelligent K-means, Genetic K-means), and (iii) applying some kind of feature transformation technique (Weighted K-means, Kernel K-means).

Page 69: Clustering

K-Medoids Clustering

• K-medoids is a clustering algorithm which is more resilient to outliers compared to K-means [38]. Similar to K-means, the goal of K-medoids is to find a clustering solution that minimizes a predefined objective function.

• The K-medoids algorithm chooses the actual data points as the prototypes and is more robust to noise and outliers in the data. The K-medoids algorithm aims to minimize the absolute error criterion rather than the SSE.

• Similar to the K-means clustering algorithm, the K-medoids algorithm proceeds iteratively until each representative object is actually the medoid of the cluster.

• The basic K-medoids clustering algorithm is given below.

Page 70: Clustering

• In the K-medoids clustering algorithm, specific cases are considered where an arbitrary random point xi is used to replace a representative point m.

• Following this step, the change in the membership of the points that originally belonged to m is checked. The change in membership of these points can occur in one of two ways.

• These points can now be closer to xi (the new representative point) or to one of the other representative points. The cost of swapping is the resulting change in the absolute-error criterion for K-medoids. For each reassignment operation this cost of swapping is calculated, and it contributes to the overall cost function.

Page 71: Clustering

Algorithm: K-Medoids Clustering

1: Select K points as the initial representative objects (medoids).
2: repeat
3: Assign each point to the cluster with the nearest representative object.
4: Randomly select a non-representative object xi.
5: Compute the total cost S of swapping the representative object m with xi.
6: If S < 0, then swap m with xi to form the new set of K representative objects.
7: until Convergence criterion is met.
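
A rough sketch of this swap-based scheme (our own simplified code; library implementations such as PAM differ in detail), using L1 distances so that the cost is the absolute-error criterion:

```python
import numpy as np

def kmedoids(X, k, max_iter=200, seed=0):
    """Swap-based K-medoids: try replacing a medoid with a random non-medoid
    and keep the swap when it lowers the absolute-error criterion."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise L1 distances
    medoids = list(rng.choice(len(X), size=k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()                   # absolute-error criterion
    for _ in range(max_iter):
        xi = rng.integers(len(X))
        if xi in medoids:
            continue
        m = rng.integers(k)                                  # position of medoid to replace
        candidate = medoids.copy()
        candidate[m] = xi
        new_cost = D[:, candidate].min(axis=1).sum()
        if new_cost < cost:                                  # S < 0: accept the swap
            medoids, cost = candidate, new_cost
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, cost
```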

Page 72: Clustering

• To deal with the problem of executing multiple swap operations while obtaining the final representative points for each cluster, a modification of K-medoids clustering called the Partitioning Around Medoids (PAM) algorithm was proposed [26]. This algorithm operates on the dissimilarity matrix of a given dataset. PAM minimizes the objective function by iteratively swapping non-medoid points and medoids until convergence.

• K-medoids is more robust compared to K-means but the computational complexity of K-medoids is higher and hence is not suitable for large datasets.

• PAM was also combined with a sampling method to give the Clustering LARge Applications (CLARA) algorithm. CLARA draws multiple samples of the data, applies PAM to each sample, and finally returns the best set of medoids found.

Page 73: Clustering

K-Medians Clustering

• K-medians clustering calculates the median of each cluster, as opposed to the mean (as done in K-means). The K-medians algorithm chooses K cluster centers that minimize the sum of a distance measure between each point and its closest cluster center.

• The distance measure used in the K-medians algorithm is the L1-norm, as opposed to the square of the L2-norm used in the K-means algorithm. The criterion function for the K-medians algorithm is

Σk=1..K Σxi∈Ck Σj |xij − medkj|

Page 74: Clustering

• Here xij represents the jth attribute of the instance xi and medkj represents the median of the jth attribute in the kth cluster Ck. K-medians is more robust to outliers than K-means.

• The goal of the K-Medians clustering is to determine those subsets of median points which minimize the cost of assignment of the data points to the nearest medians. The overall outline of the algorithm is similar to that of K-means.

• The two steps that are iterated until convergence are: (i) all the data points are assigned to their nearest median, and (ii) the medians are recomputed feature-wise, using the median of each individual feature within the cluster.

Page 75: Clustering

K-Modes Clustering

• One of the major disadvantages of K-means is its inability to deal with non-numerical attributes [51, 3]. Using certain data transformation methods, categorical data can be transformed into new feature spaces, and then the K-means algorithm can be applied to this newly transformed space to obtain the final clusters.

• However, this method has proven to be very ineffective and does not produce good clusters. It is observed that the SSE function and the usage of the mean are not appropriate when dealing with categorical data. Hence, the K-modes clustering algorithm [21] has been proposed to tackle this challenge.

Page 76: Clustering

Algorithm: K-Modes Clustering

1: Select K initial modes.
2: repeat
3: Form K clusters by assigning each data point to the cluster with the nearest mode, using the matching metric.
4: Recompute the modes of the clusters.
5: until Convergence criterion is met.

Page 77: Clustering

Fuzzy K-Means Clustering

• This is also popularly known as Fuzzy C-Means clustering.

• Performing hard assignments of points to clusters is not feasible in complex datasets where there are overlapping clusters. To extract such overlapping structures, a fuzzy clustering algorithm can be used.

• In the fuzzy C-means clustering algorithm (FCM) [12, 4], the membership of a point to the different clusters can vary between 0 and 1. The SSE objective function for FCM is

SSE = Σk=1..K Σi=1..N wxik^β ‖xi − ck‖²

where β > 1 is the fuzzification exponent.

Page 78: Clustering

Here wxik is the membership weight of point xi belonging to cluster Ck. This weight is used during the update step of fuzzy C-means: the centroid ck of Ck is recomputed as the weighted mean of all the points, with weights wxik^β.
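
A minimal sketch of the two FCM update steps (membership update and weighted centroid update), assuming the standard update formulas with fuzzifier β; the function and variable names are ours:

```python
import numpy as np

def fuzzy_c_means(X, k, beta=2.0, n_iter=100, seed=0):
    """Alternate the fuzzy membership update and the weighted centroid update."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)].copy()      # initial centroids
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # membership update: w_ik proportional to (1/d_ik^2)^(1/(beta-1)), rows normalised
        w = (1.0 / d2) ** (1.0 / (beta - 1))
        w /= w.sum(axis=1, keepdims=True)
        # centroid update: weighted mean with weights w_ik^beta
        wb = w ** beta
        c = (wb.T @ X) / wb.sum(axis=0)[:, None]
    return w, c
```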

Page 79: Clustering

• The basic algorithm works similarly to K-means: the SSE is minimized iteratively by alternately updating the memberships wxik and the centroids ck.

• This process is continued until the convergence of centroids. As in K-means, the FCM algorithm is sensitive to outliers and the final solutions obtained will correspond to the local minimum of the objective function.

• There are further extensions of this algorithm in the literature such as Rough C-means[34] and Possibilistic C-means [30].

Page 80: Clustering

Hierarchical Clustering Algorithms

Page 81: Clustering

Hierarchical Clustering Algorithms

• Hierarchical clustering algorithms [23] were developed to overcome some of the disadvantages associated with flat or partitional-based clustering methods.

• Partitional methods generally require a user predefined parameter K to obtain a clustering solution and they are often nondeterministic in nature.

• Hierarchical algorithms were developed to build a more deterministic and flexible mechanism for clustering the data objects.

Page 82: Clustering

Hierarchical Algorithms

• Hierarchical algorithms produce a nested sequence of data partitions. The sequence can be depicted using a tree structure that is popularly known as a dendrogram.

• The algorithms are either – divisive or – agglomerative.

• The former starts with a single cluster containing all the patterns; at each successive step, a cluster is split. This process continues until each pattern is in a cluster of its own, i.e., we have a collection of singleton clusters.

• Divisive algorithms use a top-down strategy for generating partitions of the data.

Page 83: Clustering

Hierarchical Algorithms

• Agglomerative algorithms, on the other hand, use a bottom-up strategy.

• They start with n singleton clusters when the input data set is of size n, where each input pattern is in a different cluster. At successive levels, the most similar pair of clusters is merged to reduce the size of the partition by 1.

• An important property of agglomerative algorithms is that once two patterns are placed in the same cluster at a level, they remain in the same cluster at all subsequent levels.

• Similarly, in the divisive algorithms, once two patterns are placed in two different clusters at a level, they remain in different clusters at all subsequent levels.

Page 84: Clustering

Divisive Hierarchical Clustering

• Divisive algorithms are either
– polythetic, where the division is based on more than one feature, or
– monothetic, where only one feature is considered at a time.

• A scheme for polythetic clustering involves finding all possible 2-partitions of the data and choosing the best partition.

• Here, the partition with the least sum of the sample variances of the two clusters is chosen as the best. From the resulting partition, the cluster with the maximum sample variance is selected and is split into an optimal 2-partition.

• This process is repeated till we get singleton clusters.

Page 85: Clustering

• The sum of the sample variances is calculated in the following way. If the patterns are split into two partitions with m patterns X1,...,Xm in one cluster and n patterns Y1,...,Yn in the other cluster with the centroids of the two clusters being C1 and C2 respectively, then the sum of the sample variances will be

Page 86: Clustering

• Here, to obtain the optimal 2-partition of a cluster of size n, 2^(n−1) − 1 possible 2-partitions are to be considered and the best of them is chosen; so, O(2^n) effort is required to generate all possible 2-partitions and select the most appropriate one.

• It is possible to use one feature at a time to partition the given data set (monothetic clustering). In such a case, a feature direction is considered and the data is partitioned into two clusters based on the gap in the projected values along that feature direction. That is, the data set is split into two parts at the midpoint of the maximum gap found among the sorted values of the chosen feature. Each of these clusters is then further partitioned sequentially using the remaining features.
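
A small sketch of this single-feature (monothetic) split: sort the values of one feature, find the largest gap, and split at its midpoint. The code and names are illustrative, applied here to the 8 patterns of the following example.

```python
import numpy as np

def monothetic_split(X, feature):
    """Split the data into two clusters at the midpoint of the largest gap
    found among the sorted values of the chosen feature."""
    values = np.sort(X[:, feature])
    gaps = np.diff(values)
    g = gaps.argmax()                              # index of the largest gap
    threshold = (values[g] + values[g + 1]) / 2.0  # midpoint of that gap
    left = X[X[:, feature] <= threshold]
    right = X[X[:, feature] > threshold]
    return threshold, left, right

# The 8 patterns A..H of the example, split first on the x-coordinate (feature 0)
X = np.array([(0.5, 0.5), (2, 1.5), (2, 0.5), (5, 1),
              (5.75, 1), (5, 3), (5.5, 3), (2, 3)], dtype=float)
t, left, right = monothetic_split(X, feature=0)
print(t)      # 3.5: the largest gap on x lies between 2 and 5
print(left)   # A, B, C, H
print(right)  # D, E, F, G
```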

Page 87: Clustering

Example: Monothetic divisive clustering

• There are 8 two-dimensional patterns, as follows: A = (0.5, 0.5); B = (2, 1.5); C = (2, 0.5); D = (5, 1); E = (5.75, 1); F = (5, 3); G = (5.5, 3); H = (2, 3).

Page 88: Clustering

Example: Monothetic divisive clustering

Page 89: Clustering
Page 90: Clustering

• A major problem with this approach is that, in the worst case, the initial data set is split into 2^d clusters by considering all the d features. This many clusters may typically not be acceptable, so an additional merging phase is required.

• In such a phase, a pair of clusters is selected based on some notion of nearness between the two clusters and they are merged into a single cluster. This process is repeated till the clusters are reduced to the required number.

• A popular measure of the nearness between a pair of clusters is the distance between the centroids of the two clusters.

Page 91: Clustering

Algorithm: Basic Divisive Hierarchical Clustering

1: Start with the root node consisting of all the data points.
2: repeat
3: Split the parent node into two parts C1 and C2 using bisecting K-means so as to maximize Ward's distance W(C1, C2).
4: Construct the dendrogram. Among the current leaf clusters, choose the one with the highest squared error as the next parent node.
5: until Singleton leaves are obtained.

Page 92: Clustering

• Sorting the n elements of the data and finding the maximum inter-pattern gap takes O(n log n) time for each feature direction; for the d features under consideration, this effort is O(dn log n).

• So, this scheme can be infeasible for large values of n and d.

Page 93: Clustering

Agglomerative Clustering

• There are different kinds of agglomerative clustering methods which primarily differ from each other in the similarity measures that they employ.

• The widely studied algorithms in this category are the following:
– single link (nearest neighbour),
– complete link (diameter),
– group average (average link),
– centroid similarity, and
– Ward's criterion (minimum variance).

Page 94: Clustering

Agglomerative Clustering

Typically, an agglomerative clustering algorithm goes through the following steps:

Step 1: Compute the similarity/dissimilarity matrix between all pairs of patterns. Initialise each cluster with a distinct pattern.

Step 2: Find the closest pair of clusters and merge them. Update the proximity matrix to reflect the merge.

Step 3: If all the patterns are in one cluster, stop. Else, goto Step 2.

Page 95: Clustering

• Step 1 in the above algorithm requires O(n2) time to compute pair-wise similarities and O(n2) space to store the values, when there are n patterns in the given collection.

• There are several ways of implementing the second step. Some methods require only the distance matrix, while others require the distance matrix plus the original data matrix. The following methods require only the distance matrix (they are stated with reference to two clusters, C1 and C2, containing n1 and n2 observations respectively).

• Single linkage. The distance between two clusters is defined as the minimum of the n1n2 distances between each observation of cluster C1 and each observation of cluster C2:

d(C1,C2) = min(drs), with r ∈ C1, s ∈ C2.

Page 96: Clustering

• Complete linkage. The distance between two clusters is defined as the maximum of the n1n2 distances between each observation of a cluster and each observation of the other cluster:

d(C1, C2) = max(drs), with r ∈ C1, s ∈ C2.

• Average linkage. The distance between two clusters is defined as the arithmetic average of the n1n2 distances between each of the observations of one cluster and each of the observations of the other cluster:

d(C1, C2) = (1 / (n1 n2)) Σr∈C1 Σs∈C2 drs.

Page 97: Clustering

Single Link

• In single link clustering[36, 46], the similarity of two clusters is the similarity between their most similar (nearest neighbor) members. This method intuitively gives more importance to the regions where clusters are closest, neglecting the overall structure of the cluster.

• Hence, this method falls under the category of a local similarity-based clustering method. Because of its local behavior, single linkage is capable of effectively clustering non-elliptical, elongated shaped groups of data objects.

• However, one of the main drawbacks of this method is its sensitivity to noise and outliers in the data.

Page 98: Clustering

Complete Link

• Complete link clustering [27] measures the similarity of two clusters as the similarity of their most dissimilar members. This is equivalent to choosing the cluster pair whose merge has the smallest diameter.

• As this method takes the cluster structure into consideration it is non-local in behavior and generally obtains compact shaped clusters.

• However, similar to single link clustering, this method is also sensitive to outliers.

Page 99: Clustering

Example: Agglomerative Clustering

• The single-link algorithm can be explained with the help of the data shown in the previous example.

• The dendrogram corresponding to the single-link algorithm is shown below.

• Note that there are 8 clusters to start with, where each cluster has one element.

• The distance matrix using city-block distance or Manhattan distance is given in Table 9.9.
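
A sketch using SciPy's hierarchical clustering routines to reproduce this computation (plotting the dendrogram requires matplotlib; the variable names are ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform

labels = list("ABCDEFGH")
X = np.array([(0.5, 0.5), (2, 1.5), (2, 0.5), (5, 1),
              (5.75, 1), (5, 3), (5.5, 3), (2, 3)], dtype=float)

# City-block (Manhattan) distance matrix, as in Table 9.9
D = squareform(pdist(X, metric="cityblock"))
print(np.round(D, 2))

# Single-link agglomerative clustering and its dendrogram
Z = linkage(pdist(X, metric="cityblock"), method="single")
dendrogram(Z, labels=labels)
plt.show()
```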

Page 100: Clustering


Page 101: Clustering
Page 102: Clustering
Page 103: Clustering
Page 104: Clustering

Group Averaged and Centroid Agglomerative Clustering

• Group Averaged Agglomerative Clustering (GAAC) considers the similarity between all pairs of points across the two clusters and thereby diminishes the drawbacks associated with the single and complete link methods.

• Let two clusters Ca and Cb be merged, so that the resulting cluster is Ca∪b = Ca ∪ Cb. The new centroid for this cluster is

ca∪b = (Na ca + Nb cb) / (Na + Nb)

• where Na and Nb are the cardinalities of the clusters Ca and Cb, respectively.

• The similarity measure for GAAC is calculated as follows:

Page 105: Clustering

• The distance between two clusters is the average of all the pair-wise distances between the data points in these two clusters. Hence, this measure is expensive to compute especially when the number of data objects becomes large. Centroid-based agglomerative clustering, on the other hand, calculates the similarity between two clusters by measuring the similarity between their centroids.

• The primary difference between GAAC and Centroid agglomerative clustering is that, GAAC considers all pairs of data objects for computing the average pair-wise similarity, whereas centroid-based agglomerative clustering uses only the centroid of the cluster to compute the similarity between two different clusters.

Page 106: Clustering

Ward's Criterion

• Ward's criterion [49, 50] was proposed to compute the distance between two clusters during agglomerative clustering. The process of using Ward's criterion for cluster merging in agglomerative clustering is also called Ward's agglomeration.

• It uses the K-means squared error criterion to determine the distance. For any two clusters, Ca and Cb, Ward's criterion is calculated by measuring the increase in the value of the SSE criterion for the clustering obtained by merging them into Ca ∪ Cb.

• Ward's criterion is defined as follows:

W(Ca, Cb) = (Na Nb / (Na + Nb)) ‖ca − cb‖²

• So Ward's criterion can be interpreted as the squared Euclidean distance between the centroids of the merged clusters Ca and Cb, weighted by a factor proportional to the product of the cardinalities of the merged clusters.

Page 107: Clustering

Algorithm: Agglomerative Hierarchical Clustering

1: Compute the dissimilarity matrix between all the data points.

2: repeat
3: Merge the two closest clusters Ca and Cb into Ca∪b = Ca ∪ Cb. Set the new cluster's cardinality to Na∪b = Na + Nb.
4: Insert a new row and column containing the distances between the new cluster Ca∪b and the remaining clusters.
5: until Only one maximal cluster remains.

Page 108: Clustering

• Hierarchical clustering algorithms are computationally expensive.

• The agglomerative algorithms require the computation and storage of a similarity or dissimilarity matrix, which has O(n²) time and space requirements. They can be used in applications where hundreds of patterns are to be clustered. However, when the data sets are larger, these algorithms are not feasible because of the non-linear time and space demands.

• It may not be easy to visualise a dendrogram corresponding to 1000 patterns.

• Similarly, divisive algorithms require time that is exponential in the number of patterns or in the number of features. So, they too do not scale well to large-scale problems involving millions of patterns.

Page 109: Clustering

Density-Based Methods

Page 110: Clustering

• Partitioning and hierarchical methods are designed to find spherical-shaped clusters.

• They have difficulty finding clusters of arbitrary shape such as the “S” shape and oval clusters.

• Density-based clustering methods can discover clusters of nonspherical shape.

Page 111: Clustering

DBSCAN: Density-Based Clustering Based on Connected Regions with High Density

• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighbourhoods. It connects core objects and their neighbourhoods to form dense regions as clusters.

• A user-specified parameter ε > 0 is used to specify the radius of a neighbourhood we consider for every object.

• The ε-neighbourhood of an object o is the space within a radius ε centered at o.

Page 112: Clustering

• Due to the fixed neighbourhood size parameterized by ε, the density of a neighbourhood can be measured simply by the number of objects in the neighbourhood.

• To determine whether a neighbourhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions.

• An object is a core object if the ε-neighbourhood of the object contains at least MinPts objects.

Page 113: Clustering

• Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ε and MinPts. The clustering task is thereby reduced to using core objects and their neighbourhoods to form dense regions, where the dense regions are clusters.

• For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ε and MinPts) if p is within the ε-neighbourhood of q. Clearly, an object p is directly density-reachable from another object q if and only if q is a core object and p is in the ε-neighbourhood of q.

• Using the directly density-reachable relation, a core object can "bring" all objects from its ε-neighbourhood into a dense region.
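
For completeness, a typical invocation of DBSCAN via scikit-learn (the values of eps and min_samples, playing the roles of ε and MinPts, are illustrative and must be tuned for the data at hand):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 1), (2, 1), (1, 2), (2, 2), (6, 1),
              (7, 1), (6, 2), (6, 7), (7, 7), (7, 6)], dtype=float)

# eps plays the role of the neighbourhood radius ε, min_samples the role of MinPts
db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(db.labels_)   # cluster index per point; -1 marks noise/outliers
```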

Page 114: Clustering
Page 115: Clustering

Application of Cluster Analysis

• Data reduction
• Hypothesis generation and testing
• Prediction based on groups
• Finding K-nearest neighbours
• Outlier detection

Page 116: Clustering

References:

• “Pattern Recognition - An Algorithmic Approach” – M. Narasimha Murty, V. Susheela Devi.

• “Applied Data Mining for Business and Industry” – Paolo Giudici, Silvia Figini; 2nd Edition.

• “Principles of Data Mining” – Max Bramer.

• “Data Mining – Multimedia, Soft Computing and Bioinformatics” – Sushmita Mitra, Tinku Acharya.

• “Data Clustering: Algorithms and Applications” – Charu C. Aggarwal, Chandan K. Reddy.

Page 117: Clustering

References

[1] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] G. H. Ball and D. J. Hall. ISODATA, a novel method of data analysis and pattern classification. Technical report, DTIC Document, 1965.

[3] P. Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, J. Kogan, C. Nicholas, and M. Teoulle, Eds., Springer, Berlin Heidelberg, pages 25–71, 2006.

[4] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, 1981.

[5] P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning, volume 66. San Francisco, CA, USA, 1998.

[6] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics—Theory and Methods, 3(1):1–27, 1974.

Page 118: Clustering

[7] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.

[8] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.

[9] T. H. Cormen. Introduction to Algorithms. MIT Press, 2001.

[10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 551–556. ACM, 2004.

[11] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis, Vol. 3, Wiley, New York, 1973.

Page 119: Clustering

[12] J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57, 1973.

[13] C. Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the International Conference on Machine Learning (ICML), pages 147–153, 2003.

[14] D. Fisher. Optimization and simplification of hierarchical clusterings. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD), pages 118–123, 1995.

[15] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2):139–172, 1987.

[16] J. C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 18(1):54–64, 1969.

Page 120: Clustering

[17] S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. In ACM SIGMOD Record, volume 27, pages 73–84. ACM, 1998.

[18] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th International Conference on Data Engineering, pages 512–521. IEEE, 1999.

[19] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.

[20] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li. Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):657–668, 2005.

Page 121: Clustering

[21] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304, 1998.

[22] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

[23] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.

[24] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.

[25] G. Karypis, E. H. Han, and V. Kumar. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75, 1999.

Page 122: Clustering

[26] L. Kaufman, P.J. Rousseeuw, et al. Finding Groups in Data: An Introduction to Cluster Analysis, volume 39. Wiley Online Library, 1990.

[27] B. King. Step-wise clustering procedures. Journal of the American Statistical Association, 62(317):86–101, 1967.

[28] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.

[29] K. Krishna and M. N. Murty. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(3):433–439, 1999.

[30] R. Krishnapuram and J. M. Keller. The possibilistic C-means algorithm: Insights and recommendations. IEEE Transactions on Fuzzy Systems, 4(3):385–393, 1996.

Page 123: Clustering

[31] G. N. Lance and W. T. Williams. A general theory of classificatory sorting strategies II. Clustering systems. The Computer Journal, 10(3):271–277, 1967.

[32] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[33] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability,volume 1, pages 281–297, Berkeley, CA, USA, 1967.

[34] P. Maji and S. K. Pal. Rough set based generalized fuzzy c-means algorithm and quantitative indices. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37(6):1529–1540, 2007.

[35] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

Page 124: Clustering

[36] L. L. McQuitty. Elementary linkage analysis for isolating orthogonal and oblique types and typal relevancies. Educational and Psychological Measurement, 17(2):207–229, 1957.

[37] G. W. Milligan. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46(2):187–199, 1981.

[38] B. G. Mirkin. Clustering for Data Mining: A Data Recovery Approach, volume 3. CRC Press, Boca Raton, FL, 2005.

[39] R. Mojena. Hierarchical grouping methods and stopping rules: An evaluation. The Computer Journal, 20(4):359–363, 1977.

[40] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113+, 2003.

Page 125: Clustering

[41] C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313–1325, 1995.

[42] D. Pelleg and A. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734, San Francisco, CA, USA, 2000.

[43] G. Rudolph. Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96–101, 1994.

[44] B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

Page 126: Clustering

[45] S. Z. Selim and M. A. Ismail. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1):81–87, 1984.

[46] P. H. A. Sneath and R. R. Sokal. Numerical taxonomy. Nature, 193:855–860, 1962.

[47] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston, MA, USA, 2000.

[48] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.

Page 127: Clustering

[49] J. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

[50] D. Wishart. An algorithm for hierarchical classifications. Biometrics, 25(1):165–170, 1969.

[51] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

[52] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10):977–987, 2001.

[53] C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20(1):68–86, 1971.
[53] C.T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20(1):68–86, 1971