
Advanced Review

Objective function-based clustering

Lawrence O. Hall∗

Clustering is typically applied for data exploration when there are no or very few labeled data available. The goal is to find groups or clusters of like data. The clusters will be of interest to the person applying the algorithm. An objective function-based clustering algorithm tries to minimize (or maximize) a function such that the clusters that are obtained when the minimum/maximum is reached are homogeneous. One needs to choose a good set of features and the appropriate number of clusters to generate a good partition of the data into maximally homogeneous groups. Objective functions for clustering are introduced. Clustering algorithms generated from the given objective functions are shown, with a number of examples of widely used approaches discussed. © 2012 Wiley Periodicals, Inc.

How to cite this article: WIREs Data Mining Knowl Discov 2012, 2: 326–339. doi: 10.1002/widm.1059

INTRODUCTION

Consider the case in which you are given an electronic repository of news articles. You want to try to determine the future direction of a commodity like oil, but do not want to sift through the 50,000 articles by hand. You would like to have them grouped into categories and then you could browse the appropriate ones. You might use the count of words appearing in the document as features. Having words such as commodity or oil or wheat appear multiple times in an article would be good clues as to what it was concerned with.

Clustering can do the grouping into categories for you. Objective function-based clustering is one way of accomplishing the grouping. In this type of clustering algorithm, there is a function (the objective function) which you try to minimize or maximize. The examples or objects to be partitioned into clusters are described by a set of s features. To begin, we will think of all features as being continuous numeric values such that we can measure distances between them.

A challenge of using objective function-based clustering lies in the fact that it is an optimization problem.1,2 As such, it is sensitive to the initialization that is provided. This means that you can get different final partitions of the data depending upon where you start.

∗Correspondence to: [email protected]

Department of Computer Science and Engineering, University of South Florida, Tampa, FL, USA

DOI: 10.1002/widm.1059

In this article, we will cover a selection of objective function-based clustering algorithms. There are many such algorithms and we will focus on the more basic ones and build on them to illustrate more complex approaches. The document clustering example above can be well solved by some, but not all, algorithms described here. Latent Dirichlet allocation3 is a more complex algorithm which can do the document clustering task well, but is not covered here because it requires that more advanced concepts be introduced in limited space. The algorithms here produce clusters of data that are, ideally, homogeneous or all from the same (unknown) class. We say the result of clustering the data is a partition into a set of k clusters. The partition may be hard, in which all examples belong to only one cluster; soft, in which examples may belong to multiple clusters; or probabilistic, in which each example has a probability of belonging to each cluster.

Some critical issues when clustering data are how many clusters to search for; a reasonable initialization of the algorithm; how to define distances between examples; computational complexity, or run time; and how to tell when a partition of the data is good. As with any kind of data analysis, one must be concerned with noise, for which robust clustering approaches exist,4–6 although noise is not a focus here. They include reformulations of the objective function to incorporate different distance functions and different integrations with robust statistics.


The choice of the number of clusters to search for can be domain driven, but when data driven it is inter-related to the question of determining that a partition is good. Cluster or partition validity metrics are useful for determining whether a partition into k = 2 or k = 3 or k = 4 clusters, for example, is best.

The most straightforward idea for evaluation of a partition, as well as creating one, is to look at how tight the clusters are and how well separated they are. Oversimplifying, we prefer clusters which have small standard deviations from their centroid (or mean) and whose centroids are more distant from each other.

In what follows, the k-means algorithm will first be introduced in detail as the baseline example of objective function-based clustering algorithms. For a broader look at how k-means and related algorithms fit together in a common mathematical framework, see Ref 7. Then a selected set of other clustering algorithms will be discussed in detail and related to k-means. There will be a brief discussion of validity metrics and distance measures.

k-MEANS CLUSTERING

One of the earliest objective function-based clustering approaches was called k-means clustering.8–10 The k refers to the number of clusters that are desired. The idea behind the objective function for k-means is both simple and elegant: minimize the within-cluster scatter and maximize the between-cluster scatter. So, you try to minimize the distances between examples assigned to a cluster and maximize the distances between examples assigned to different clusters.

The equation for the k-means objective function is

J_1(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} D(x_j, v_i),   (1)

where

• x_j ∈ X represents one of n feature vectors of dimension s;

• v_i ∈ V is an s-dimensional cluster center, representing the average value of the examples assigned to a cluster;

• U is the k × n matrix of example assignments to cluster centers; u_{ij} ∈ U with u_{ij} ∈ {0, 1} indicates whether the jth example belongs to the ith cluster;

• k is the number of clusters;

• and D(x_j, v_i) = ‖x_j − v_i‖^2 is the distance; for example, the Euclidean distance, which is also known as the L2 norm.

1. Initialize the k cluster centers V^0 and set T = 1.

2. For all x_j:
   u_{ij} = 1 if i = \arg\min_i D(x_j, v_i), and 0 otherwise.   (2)

3. For i = 1 to k:
   v_i = \sum_{j=1}^{n} u_{ij} x_j \Big/ \sum_{j=1}^{n} u_{ij}.

4. If ‖V^T − V^{T−1}‖ < ε stop; else set T = T + 1 and go to 2.

FIGURE 1 | k-Means clustering algorithm.

In Eq. (1), we add up the distances between the examples assigned to a cluster and the corresponding cluster center. The value J_1 is to be minimized. The way to accomplish the minimization is to fix V or U, calculate the other, and then reverse the process. This requires an initialization, to which the algorithm is quite sensitive.11–13 The good news is that the algorithm is guaranteed to converge.14 The k-means clustering algorithm is shown in Figure 1. We have used bold notation to indicate that x, v are vectors and U, V are matrices. We are going to drop the bold in what follows for convenience, expecting the reader will recall they are vectors or matrices.
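As a concrete illustration of this alternating scheme, here is a minimal NumPy sketch of the algorithm in Figure 1 (the function and variable names are my own, not from the article):

import numpy as np

def kmeans(X, k, max_iter=100, eps=1e-6, seed=0):
    """Minimal k-means: alternate the Eq. (2) assignments and center updates."""
    rng = np.random.default_rng(seed)
    # Initialize V^0 by picking k distinct examples at random.
    V = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: u_ij = 1 for the nearest center (squared Euclidean D).
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # n x k
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned examples.
        V_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else V[i]
                          for i in range(k)])
        if np.linalg.norm(V_new - V) < eps:    # stop when the centers barely move
            V = V_new
            break
        V = V_new
    return labels, V

Running this from several random seeds and keeping the partition with the smallest value of J_1 is a simple way to cope with the sensitivity to initialization discussed above.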

Now, we have our first objective-based clustering algorithm defined. We can use it to cluster a four-dimensional, three-class dataset as an illustrative example. Here, the Weka data mining tool15 has been used to cluster and display the Iris data.16 The dataset shown in Figure 2 describes Iris plants and has 150 examples, with 50 examples in each of three classes. There were four numeric features measured: sepal length, sepal width, petal length, and petal width. The projection here is into two dimensions, petal length and petal width. You can see it looks like there might be two classes, as two overlap in Figure 2(a) with one clearly separate. In Figure 2(b), you see a partition of the data into three classes and in Figure 2(c) a different partition (from a different initialization, thus illustrating the sensitivity to initialization).

It is important to note that the expert who created the Iris dataset recognized the three classes of Iris flowers. However, the features recorded do not necessarily disambiguate the classes with 100% accuracy. This is true for many more complex real-world problems. The point is that, without labels, the features may give us a different number of classes than the known number. In this case, we might want better features (or a different algorithm).

FIGURE 2 | The Iris data (a) with labels, (b) clustered by k-means with a good initialization, and (c) clustered by k-means with a bad initialization.

Note that for the Iris data some claim the features really only allow for two clusters.13,17 Real-world ground truth tells us that there are three clusters or classes. Which is correct? Perhaps both, with the given features?

FUZZY k-MEANS

If you allow an example x_j to partially belong to more than one cluster with a membership in cluster i of μ_i(x_j) ∈ [0, 1], this is called a fuzzy membership.18 Using fuzzy memberships allows the creation of the fuzzy k-means (FKM) algorithm, in which each example has some membership in each cluster. The algorithm was originally called fuzzy c-means. Like k-means, an educated guess of the number of clusters using domain knowledge is required of the user.

The objective function for FKM is Jm in Eq. (3):

J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} w_j \, u_{ij}^m \, D(x_j, v_i),   (3)

where u_{ij} is the membership value of the jth example, x_j, in the ith cluster; v_i is the ith cluster centroid; n is the number of examples; k is the number of clusters; and m controls the fuzziness of the memberships, with values very close to one causing the partition to be nearly crisp, approximating k-means. Higher values cause fuzzier partitions, spreading the memberships across more clusters.

D(x_j, v_i) = ‖x_j − v_i‖^2 is the norm, for example, the Euclidean distance.

w_j is the weight of the jth example. For FKM, w_j = 1, ∀j. We will use this value, which is not typically shown, later.


1. l = 0.

2. Initialize the cluster centers (v_i's) to get V^0.

3. Repeat
   • l = l + 1,
   • calculate U^l according to Eq. (4),
   • calculate V^l according to Eq. (5).

4. Until ‖V^l − V^{l−1}‖ < ε.

FIGURE 3 | FKM algorithm.

U and V can be calculated as

u_{ij} = \frac{ D(x_j, v_i)^{1/(1-m)} }{ \sum_{l=1}^{k} D(x_j, v_l)^{1/(1-m)} },   (4)

v_i = \frac{ \sum_{j=1}^{n} w_j (u_{ij})^m x_j }{ \sum_{j=1}^{n} w_j (u_{ij})^m }.   (5)

The clustering algorithm is shown in Figure 3. There is some extra computation when compared to k-means, and we must choose the value for m. There are many papers that describe approaches to choosing m,19,20 some of which are automatic. A default choice of m = 2 often works reasonably well as a starting point. There is a convergence theorem21 for FKM that shows that it ends up in local minima or saddle points, but will stop iterating.
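For readers who want to see Eqs (4) and (5) in code, the following NumPy sketch (my own illustrative implementation, with w_j = 1 for all examples) alternates the two updates until the centers stop moving:

import numpy as np

def fuzzy_kmeans(X, k, m=2.0, max_iter=100, eps=1e-6, seed=0):
    """Fuzzy k-means (fuzzy c-means): alternate Eq. (4) and Eq. (5)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Squared Euclidean distances D(x_j, v_i), shape (k, n).
        d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)              # avoid division by zero
        # Eq. (4): u_ij proportional to D(x_j, v_i)^(1/(1-m)).
        U = d2 ** (1.0 / (1.0 - m))
        U /= U.sum(axis=0, keepdims=True)    # memberships sum to 1 over clusters
        # Eq. (5) with w_j = 1: membership-weighted means of the examples.
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
        if np.linalg.norm(V_new - V) < eps:
            V = V_new
            break
        V = V_new
    return U, V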

This algorithm is very useful if you know you have examples that truly are a mixture of classes. It also allows for the easy identification of examples that do not fit well into a single cluster and provides information on how well any example fits a cluster. An interesting case of its use is with the previously mentioned Iris dataset. FKM in many experiments (over 5000 random initializations) always converged to the same final partition, which had 16 errors when searching for three clusters, as shown in Figure 2. k-Means converged to one of three partitions, with the most likely one the same as FKM's.13 The other two were local extrema that resulted in significantly higher values of J_1, one of which is shown in Figure 2. Of course, both algorithms are sensitive to initialization and for other datasets FKM will not always converge to the same solution.

The FKM algorithm with the Euclidean distance function has a bias toward hyperspherical clusters of equal size. If you change the distance function,22,23 hyperellipsoidal clusters can be found.

EXPECTATION MAXIMIZATION CLUSTERING

Consider the case in which you want to assign probabilities to whether an example is in one cluster or another. We might have the case that x_5 belongs to cluster A with a probability of 0.9, cluster B with a probability of 0, and cluster C with a probability of 0.1 for a three-cluster or class problem. Note, without labels the cluster designations are arbitrary. The clustering algorithm to use is based on the idea of expectation maximization24 or finding the maximum likelihood solution for estimating the model parameters. We are going to give a simplified, clustering-focused version of the algorithm here; more general details can be found in Refs 15 and 24. The algorithm does come with a convergence proof25 which guarantees that we can find a solution (although not necessarily the optimal solution) under conditions that usually are not violated by real data.

We want to find the set of parameters Θ that maximizes the log likelihood of generating our data X. Now Θ will consist of our probability function for examples belonging to classes which, in the simplest case, requires us to find the centroid of clusters and the standard deviation of them. More generally, the necessary parameters to maximize Eq. (6) are found:

\Theta = \arg\max_{\Theta} \sum_{i=1}^{N} \log\big( P(x_i \mid \Theta) \big).   (6)

Let p_j be the probability of class j. Let z_{ij} = 1 if example i belongs to class j, and 0 otherwise.

Now our objective function will be

L(X, \Theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log\big( p_j \cdot P(x_i \mid j) \big).   (7)

Now, how do we calculate P(x_i | j)? A simple formulation is given in Eq. (8) using a Gaussian-based distance:

P(x_i \mid j) = f(x_i; \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}.   (8)

This works for roughly balanced classes with a spherical shape in feature space. A more general description, which works better for data that does not fit the constraints of the previous sentence, is given in Ref 26. Our objective function depends on μ and σ. We observe that

p_j = \sum_{i=1}^{n} P(x_i \mid j) \big/ n.   (9)


1. Initialize the μ_j's and σ_j's by running one iteration of k-means and taking the k cluster centers and their standard deviations. Initialize ε and set L^0 = −ε.

2. Repeat

3. E-step: calculate P(x_i | j) as in Eq. (8), where L is calculated from Eq. (7).

4. M-step:
   n_j = \sum_{i=1}^{n} P(x_i \mid j), \quad 1 ≤ j ≤ k
   p_j = n_j / n
   \mu_j = \sum_{i=1}^{n} P(x_i \mid j)\, x_i \big/ n_j, \quad 1 ≤ j ≤ k
   \sigma_j = \sqrt{ \sum_{i=1}^{n} P(x_i \mid j)(x_i − \mu_j)^2 \Big/ \sum_{i=1}^{n} P(x_i \mid j) }

5. Until |L^t − L^{t−1}| < ε.

FIGURE 4 | The EM clustering algorithm.

FIGURE 5 | The Iris data clustered by the EM algorithm.

The EM algorithm is shown in Figure 4. We have applied the EM algorithm as implemented in Weka to the Iris data. A projection of the partition obtained when searching for three classes is shown in Figure 5. The final partition differs, albeit slightly, from that found in Figure 2. However, it is interesting that on even this simple dataset there are disagreements which, unsurprisingly, involve the two overlapping classes.
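The following NumPy sketch is one way to realize Figure 4. It is a simplified, isotropic-Gaussian version with one scalar standard deviation per cluster in the spirit of Eq. (8), it uses a random initialization rather than the single k-means pass suggested in Figure 4, and all names are illustrative rather than drawn from a reference implementation:

import numpy as np

def em_cluster(X, k, max_iter=100, eps=1e-6, seed=0):
    """Simplified EM clustering with one scalar variance per cluster."""
    rng = np.random.default_rng(seed)
    n, s = X.shape
    # Crude initialization: random centers, a pooled standard deviation, equal priors.
    mu = X[rng.choice(n, size=k, replace=False)].copy()
    sigma = np.full(k, X.std() + 1e-6)
    p = np.full(k, 1.0 / k)
    prev_L = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities from the isotropic Gaussian density.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # n x k
        dens = np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma) ** s
        weighted = p * dens                                             # p_j * P(x_i | j)
        L = np.log(weighted.sum(axis=1) + 1e-300).sum()                 # log-likelihood
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, and standard deviations.
        nj = resp.sum(axis=0)
        p = nj / n
        mu = (resp.T @ X) / nj[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma = np.sqrt((resp * d2).sum(axis=0) / (s * nj)) + 1e-6
        if abs(L - prev_L) < eps:
            break
        prev_L = L
    return resp, mu, sigma, p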

POSSIBILISTIC k-MEANS

The clustering algorithm discussed in this section was designed to be able to cluster data like that shown in Figure 6. Visually, there is a bar with a spherical object at each end. A person would most likely say there are three clusters: the two spheres and the linear bar. The problem for the clustering algorithms discussed thus far is that they must find very different shapes (spherical and linear) and are not designed to do so. The possibilistic k-means (PKM) algorithm can find nonspherical clusters together with ellipsoidal or spherical clusters.27 The algorithm, originally named possibilistic c-means, is also significantly more noise tolerant than FKM.28,29

FIGURE 6 | A three-cluster problem that is difficult.

The approach is more computationally complex and requires some attention to parameter setting.28 The innovation is to view the examples as being possible members of clusters. Possibility theory is utilized to create the objective function.30 So, an example might have a possibility of 1 (potentially complete belonging) to more than one cluster. The membership value can also be viewed as the compatibility of the assignment of an example to a cluster.

The objective function for PKM looks like that for FKM with an extra term and some different constraints on the membership values, as shown in Eq. (10). The second term forces the u_{ij} to be as large as possible to avoid the trivial solution:

J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m D(x_j, v_i) + \sum_{i=1}^{k} \eta_i \sum_{j=1}^{n} (1 − u_{ij})^m,   (10)

where the η_i are suitable positive numbers and u_{ij} is the membership value of the jth example, x_j, in the ith cluster such that u_{ij} ∈ [0, 1], 0 < \sum_{j=1}^{n} u_{ij} ≤ n, and \max_i u_{ij} > 0 ∀j. Critically, the memberships are now not constrained to add to 1 across classes, with the only requirement being that every example have nonzero membership (or possibility) for at least one cluster. v_i is the ith cluster centroid; n is the number of examples; k is the number of clusters; m affects the possibilistic membership values, with values closer to 1 making the results closer to a hard partition (which has memberships of only 0 or 1); and D(x_j, v_i) = ‖x_j − v_i‖^2 is the norm, such as the Euclidean distance.

1. Run FKM for one full iteration. Choose m and k.

2. l = 0.

3. Initialize the cluster centers (v_i's) from FKM to get V^0.

4. Estimate η_i using Eq. (12).

5. Repeat
   • l = l + 1,
   • calculate U^l according to Eq. (11),
   • calculate V^l according to Eq. (5).

6. Until ‖V^l − V^{l−1}‖ < ε.

Now, if you want to know the shapes of the possibility distributions, first reestimate η_i with Eq. (13). This is optional.

1. Repeat
   • l = l + 1,
   • calculate U^l according to Eq. (11),
   • calculate V^l according to Eq. (5).

2. Until ‖V^l − V^{l−1}‖ < ε.

FIGURE 7 | PKM algorithm.

Now, the calculation for the cluster centers is still done by Eq. (5) with w_j = 1 ∀j. The calculation for the possibilistic memberships is given by Eq. (11):

u_{ij} = \frac{1}{ 1 + \left( \frac{D(x_j, v_i)}{\eta_i} \right)^{1/(m−1)} }.   (11)

The value of η_i has the effect of determining the distance at which an example's membership becomes 0.5. It should be chosen based on the bandwidth of the desired membership distribution for a cluster. In practice, a value proportional to the average fuzzy intracluster distance can work, as in Eq. (12). The authors of the approach note that R = 1 is a typical choice:

\eta_i = R \, \frac{ \sum_{j=1}^{n} (u_{ij})^m D(x_j, v_i) }{ \sum_{j=1}^{n} (u_{ij})^m }.   (12)
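A minimal sketch of one possibilistic pass, assuming U and V come from a prior FKM run as Figure 7 prescribes, might look as follows (illustrative NumPy code, not the authors' implementation):

import numpy as np

def pkm_step(X, U, V, m=2.0, R=1.0):
    """One possibilistic update: eta from Eq. (12), memberships from Eq. (11),
    centers from Eq. (5) with w_j = 1. U, V come from a prior FKM run."""
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)      # k x n distances
    Um = U ** m
    eta = R * (Um * d2).sum(axis=1) / Um.sum(axis=1)             # Eq. (12)
    # Eq. (11): typicality depends only on the distance to that cluster and eta_i.
    T = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))
    Tm = T ** m
    V_new = (Tm @ X) / Tm.sum(axis=1, keepdims=True)             # Eq. (5)
    return T, V_new, eta

Iterating this step until the centers stop moving gives the first stage of Figure 7; the optional second stage re-estimates η over an α-cut of the memberships, as in Eq. (13), and repeats.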

The PKM algorithm is quite sensitive to the choice of m.28,29 You can fix the η_i's or calculate them each iteration. When fixed, you have guaranteed convergence. The algorithm that will allow you to find clusters such as in Figure 6 benefits from the use of Eq. (13) to generate η after convergence is achieved using Eq. (12). Consider Λ_i to contain all the memberships for the ith cluster. Then (Λ_i)_α contains all membership values above α and, in terms of possibility theory, is called an α-cut. So, for example, with α ≥ 0.5 you get a good set of members that have a pretty strong affinity for the cluster:

\eta_i = R \, \frac{ \sum_{x_j \in (\Lambda_i)_\alpha} D(x_j, v_i) }{ |(\Lambda_i)_\alpha| }.   (13)

The algorithm for PKM is shown in Figure 7. A nice advantage of this algorithm is its performance when your dataset is noisy, as well as its ability to nicely extract different shapes, although this typically requires a distance function that is a bit more complex than the Euclidean distance. Two other distance functions that can be used are shown in Eqs (14) and (16). The scaled Mahalanobis distance31 allows for nonspherical shapes. If you have spherical shells potentially in your data, then the distance measure32 shown in Eq. (16) can be effective. However, the introduction of the radius into the distance measure requires a new set of updated equations for finding the cluster centers, which can be found in Ref 27:

D_{ij} = |F_i|^{1/n} (x_j − v_i)^T F_i^{−1} (x_j − v_i),   (14)


1. Given R = [r_{ij}], initialize 2 ≤ k < n, and initialize U^0 ∈ M_k with u_{ij} ∈ {0, 1}, the constraints of Eq. (17) holding, and T = 1.

2. Calculate the k mean vectors v_i to create V^T as
   v_i = (u_{i1}, u_{i2}, \ldots, u_{in})^T \Big/ \sum_{j=1}^{n} u_{ij}.   (19)

3. Update U^T using Eq. (2), where the distance is
   D(x_j, v_i) = (R v_i)_j − (v_i^T R v_i)/2.   (20)

4. If ‖U^T − U^{T−1}‖ < ε stop; else set T = T + 1 and go to 2.

FIGURE 8 | Relational k-means clustering algorithm.

In Eq. (14), F_i is the fuzzy covariance matrix of cluster i and can be updated with Eq. (15):

F_i = \frac{ \sum_{j=1}^{n} u_{ij}^m (x_j − v_i)(x_j − v_i)^T }{ \sum_{j=1}^{n} u_{ij}^m },   (15)

D_{ij} = \big( \|x_j − v_i\| − r_i \big)^2,   (16)

where r_i is the radius of cluster i. With new calculations to find the cluster centers, this results in an algorithm called possibilistic k-shells. It is quite effective if you have a cluster within a cluster (like a big O containing a small o).

There are a number of alternative formulations of possibilistic clustering, such as that in Ref 33, where the authors argue their approach is less sensitive to parameter setting. In Ref 34, a mutual cluster repulsion term is introduced to solve the technical problem of coincident cluster centers providing the best minimization, and it introduces some other potentially useful properties.

The Mahalanobis distance measure can be used in k-means (with just the covariance matrix), FKM, and EM as well. In FKM, the use of Eq. (14) as the distance measure gives the so-called GK clustering algorithm,23 which is known for its ability to capture hyperellipsoidal clusters.
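As an illustration of Eqs (14) and (15), here is a hedged NumPy sketch of the fuzzy covariance matrices and the scaled Mahalanobis distance. The determinant exponent here uses the feature dimension s, the convention in the Gustafson-Kessel formulation, and the small ridge added for invertibility is my own safeguard rather than part of the cited methods:

import numpy as np

def fuzzy_covariance(X, U, V, m=2.0):
    """Fuzzy covariance matrices F_i of Eq. (15), one per cluster."""
    Um = U ** m                                    # k x n membership weights
    F = []
    for i in range(len(V)):
        diff = X - V[i]                            # n x s deviations from center i
        Fi = (Um[i][:, None] * diff).T @ diff / Um[i].sum()
        F.append(Fi + 1e-9 * np.eye(X.shape[1]))   # small ridge keeps F_i invertible
    return np.array(F)

def scaled_mahalanobis(X, V, F):
    """Scaled Mahalanobis distance in the spirit of Eq. (14), per example/cluster pair."""
    k, n = len(V), len(X)
    D = np.empty((k, n))
    for i in range(k):
        diff = X - V[i]
        Finv = np.linalg.inv(F[i])
        scale = np.linalg.det(F[i]) ** (1.0 / X.shape[1])   # determinant scaling
        D[i] = scale * np.einsum('nj,jk,nk->n', diff, Finv, diff)
    return D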

RELATIONAL CLUSTERING

How do we cluster data for which the attributes are not all numeric? Relational clustering is one approach that can be used stand-alone or in an ensemble of clustering algorithms.35,36 Cluster ensembles can provide other options for dealing with mixed attribute types. What if all we know is how examples relate to one another in terms of how similar they are? Relational clustering works when, for all x_i, x_j ∈ X, ρ(x_i, x_j) = r_{ij} ∈ [0, 1]. We can think of ρ as a binary fuzzy relation. R = [r_{ij}] is a fuzzy relation (or a typical relation matrix if r_{ij} ∈ {0, 1}). Relational clustering algorithms are typically associated with graphs because R can always be viewed as the adjacency matrix of a weighted digraph on the n examples (nodes) in X.

Graph clustering is not typically addressed with an objective function-based clustering approach.37 However, there are relational versions of k-means and FKM38 which are applicable to graphs and will be discussed. First, our U matrix of example memberships in clusters can be put into a context that allows for both k-means and FKM to be described with one objective function:

M_{fk} = \Big\{ U \in R^{k \times n} \;\Big|\; 0 ≤ u_{ij} ≤ 1; \; \sum_{i=1}^{k} u_{ij} = 1 \text{ for } 1 ≤ j ≤ n; \; \sum_{j=1}^{n} u_{ij} > 0 \text{ for } 1 ≤ i ≤ k \Big\}.   (17)

In Eq. (17) the memberships can be fuzzy or in {0, 1}, called crisp. The same constraints as for FKM and k-means hold. Our objective function is

J_{Rm}(U) = \sum_{i=1}^{k} \left( \frac{ \sum_{j=1}^{n} \sum_{l=1}^{n} u_{ij}^m u_{il}^m r_{jl} }{ 2 \sum_{t=1}^{n} u_{it}^m } \right),   (18)

where m ≥ 1. If we have numeric data, we can create r_{il} = δ_{il}^2 = ‖x_i − x_l‖^2 for some distance function. The square just ensures a positive number. The algorithm for relational k-means is shown in Figure 8.

This algorithm has a convergence proof based on the nonrelational case. It allows us to do graph clustering using an objective function-based clustering algorithm. The fuzzy version (with the memberships relaxed to be in [0, 1]) is shown in Figure 9. It also converges38 and provides a second option for relational clustering with objective function-based algorithms.

1. Given R = [r_{ij}], initialize 2 ≤ k < n, and initialize U^0 ∈ M_k with u_{ij} ∈ [0, 1], the constraints of Eq. (17) holding, and T = 1. Choose m > 1.

2. Calculate the k mean vectors v_i to create V^T as
   v_i = (u_{i1}^m, u_{i2}^m, \ldots, u_{in}^m)^T \Big/ \sum_{j=1}^{n} u_{ij}^m.   (21)

3. Update U^T using Eq. (4), where the distance is
   D(x_j, v_i) = (R v_i)_j − (v_i^T R v_i)/2.   (22)

4. If ‖U^T − U^{T−1}‖ < ε stop; else set T = T + 1 and go to 2.

FIGURE 9 | Relational fuzzy k-means clustering algorithm.
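A compact NumPy sketch of the relational fuzzy k-means loop of Figure 9 follows (illustrative code under the assumption that R is a symmetric dissimilarity matrix; the names are mine):

import numpy as np

def relational_fkm(R, k, m=2.0, max_iter=100, eps=1e-6, seed=0):
    """Relational fuzzy k-means on a dissimilarity matrix R (Figure 9 sketch)."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    U = rng.random((k, n))
    U /= U.sum(axis=0, keepdims=True)              # random fuzzy partition in M_fk
    for _ in range(max_iter):
        Um = U ** m
        V = Um / Um.sum(axis=1, keepdims=True)     # Eq. (21): relational "centers"
        # Eq. (22): D(x_j, v_i) = (R v_i)_j - v_i^T R v_i / 2
        RV = V @ R                                 # row i is (R v_i)^T for symmetric R
        D = RV - 0.5 * np.einsum('ij,ij->i', V, RV)[:, None]
        D = np.fmax(D, 1e-12)                      # guard against zeros/tiny negatives
        U_new = D ** (1.0 / (1.0 - m))
        U_new /= U_new.sum(axis=0, keepdims=True)  # Eq. (4) applied to relational D
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U

A crisp labeling, if one is wanted, can then be taken as U.argmax(axis=0).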

Another approach to fuzzy relational clustering is fuzzy medoid clustering.39 A set of k fuzzy representatives from the data (medoids) are found to minimize the dissimilarity in the created clusters. The approach typically requires less computational time than that described here.

ADJUSTING THE PERFORMANCE OF OBJECTIVE FUNCTION-BASED ALGORITHMS

The algorithms discussed thus far are very good clustering algorithms. However, for the most part, they have important limitations. With the exception of PKM, noise will have a strong negative effect on them. Unless otherwise noted, there is a built-in bias for hyperspherical clusters of equal size. That is a problem if you have, for example, a cluster of interest which is rare and, hence, small.

To get different cluster shapes, the distance measure can be changed. We have seen an example of the Mahalanobis distance and discussed the easy-to-compute Euclidean distance. There are lots of other choices, such as those given in Ref 40. Any that involve the use of the covariance matrix of a cluster, or some variation of it, typically will not require changes in the clustering algorithm.

For example, we can change our probability calculation in EM to be as follows:

P(x_i \mid j) = f(x_i; \mu_j, \Sigma_j) = \frac{ \exp\!\left\{ -\tfrac{1}{2} (x_i − \mu_j)^T \Sigma_j^{−1} (x_i − \mu_j) \right\} }{ (2\pi)^{s/2} \, |\Sigma_j|^{1/2} },   (23)

where Σ_j is the covariance matrix for the jth cluster, s is the number of features, and μ_j is the centroid (average) of cluster j. This gives us ellipsoidal clusters. Now, if we want to have different shapes for clusters, we can look at parameterizations of the covariance matrix. An eigenvalue decomposition is

\Sigma_j = \lambda_j D_j A_j D_j^T,   (24)

where D_j is the orthogonal matrix of eigenvectors, A_j is a diagonal matrix whose elements are proportional to the eigenvalues of Σ_j, and λ_j is a scalar value.41

This formulation can be used in k-means, FKM, and PKM (with a fuzzy covariance matrix for the latter two). It is a very flexible formulation.

The orientation of the principal components of Σ_j is determined by D_j, and the shape of the density contours is determined by A_j. Then λ_j specifies the volume of the corresponding ellipsoid, which is proportional to λ_j^s |A_j|. The orientation, volume, and shape of the distributions can be estimated from the data and allowed to vary for each cluster (although you can constrain them to be the same for all).
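The decomposition of Eq. (24) is easy to compute for any estimated cluster covariance. The following NumPy sketch (my own, normalizing A_j to unit determinant so that λ_j carries the volume) recovers the three factors:

import numpy as np

def decompose_covariance(Sigma):
    """Split a covariance matrix into volume (lam), orientation (D), and shape (A),
    following the lambda_j D_j A_j D_j^T parameterization of Eq. (24)."""
    eigvals, D = np.linalg.eigh(Sigma)             # D: orthogonal eigenvector matrix
    eigvals = eigvals[::-1]                        # sort eigenvalues descending
    D = D[:, ::-1]
    s = len(eigvals)
    lam = np.prod(eigvals) ** (1.0 / s)            # volume factor, so det(A) = 1
    A = np.diag(eigvals / lam)                     # normalized diagonal shape matrix
    return lam, D, A

# Reconstruction check: Sigma is (approximately) lam * D @ A @ D.T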

In Ref 22, a fuzzified version of Eq. (23) is given and modifications are made to FKM to result in the so-called Gath–Geva clustering algorithm. The algorithm also has a built-in method to discover the number of clusters in the data, which will be addressed in what follows.

1. l = 0.

2. Initialize U^0, perhaps with one iteration of FKM or k-means. Initialize m.

3. Repeat
   • l = l + 1,
   • For FKM: calculate U^l according to Eq. (4) using Eq. (29) for the distance. Note that the distances depend on the previous membership values. You might add a step to calculate them all to improve computation time, if you like.
   • For k-means: calculate U^l according to Eq. (2) using Eq. (29) for the distance with m = 1.

4. Until ‖U^l − U^{l−1}‖ < ε.

FIGURE 10 | Kernel-based k-means/FKM algorithm.

Kernel-Based Clustering

Another interesting way to change distance functions is to use the 'kernel trick'42 associated with support vector machines, which are trained with labeled data. A very simplified explanation of the idea, which is well explained by Burges,43 is the following. Consider projecting the data into a different space, where it may be more simply separable. From a clustering standpoint, we might think of a three-cluster problem where the clusters are touching in some dimensions. When projected, they may nicely group for any clustering algorithm to find them.44

Now consider Φ: R^s → H to be a nonlinear mapping function from the original input space to a high-dimensional feature space H. By applying the nonlinear mapping function Φ, the dot product x_i · x_j in our original space is mapped to Φ(x_i) · Φ(x_j) in feature space. The key notion in kernel-based learning is that the mapping function Φ need not be explicitly specified. The kernel function K(x_i, x_j) in the original space R^s can be used to calculate the dot product Φ(x_i) · Φ(x_j).

First, we introduce three (of many) potential kernels which satisfy Mercer's condition43:

K(x_i, x_j) = (x_i \cdot x_j + b)^d,   (25)

where d is the degree of the polynomial and b is some constant offset;

K(x_i, x_j) = e^{-\frac{\|x_i − x_j\|^2}{2\sigma^2}},   (26)

where σ^2 is a variance parameter;

K(x_i, x_j) = \tanh\big( \alpha (x_i \cdot x_j) + \beta \big),   (27)

where α and β are constants that shape the sigmoid function.

Practically, the kernel function K(x_i, x_j) can be integrated into the distance function of a clustering algorithm, which changes the update equations.45,46 The most general approach is to construct the cluster center prototypes in kernel space46 because it allows for more kernel functions to be used. Here, we will take a look at hard and FKM approaches to objective function-based clustering with kernels. Now our distance becomes D(x_j, v_i) = ‖Φ(x_j) − v_i‖^2. So our objective function reads as

J_{K,m} = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m \, \| \Phi(x_j) − v_i \|^2.   (28)

For k-means, we just need m = 1 with u_{ij} ∈ {0, 1} as usual. We see our objective function has Φ(x_j) in it, so our update equations and distance equation must change. Before discussing the modified algorithm, we introduce the distance function.

Our distance for the fuzzy case will be as shown in Eq. (29). It is simple to modify for k-means. It is important to note that the cluster centers themselves do not show up in the distance equation:

D(x_j, v_i) = K(x_j, x_j) − \frac{ 2 \sum_{l=1}^{n} u_{il}^m K(x_l, x_j) }{ \sum_{l=1}^{n} u_{il}^m } + \frac{ \sum_{q=1}^{n} \sum_{l=1}^{n} u_{iq}^m u_{il}^m K(x_q, x_l) }{ \left( \sum_{q=1}^{n} u_{iq}^m \right)^2 }.   (29)

The algorithm for k-means and FKM is then shown in Figure 10.
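To illustrate Eq. (29), the sketch below computes the Gram matrix for the Gaussian kernel of Eq. (26) and then the kernel-space distances for every cluster/example pair. Plugging these distances into Eq. (4) (or the crisp assignment of Eq. (2) with m = 1) gives the loop of Figure 10. This is illustrative NumPy code, not a reference implementation:

import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gram matrix for the Gaussian kernel of Eq. (26)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_distances(K, U, m=2.0):
    """Kernel-space distances D(x_j, v_i) of Eq. (29) for all clusters and examples."""
    Um = U ** m                                     # k x n membership weights
    norm = Um.sum(axis=1, keepdims=True)            # k x 1
    term2 = 2 * (Um @ K) / norm                     # 2 * sum_l u_il^m K(x_l, x_j) / sum
    term3 = np.einsum('iq,ql,il->i', Um, K, Um)[:, None] / norm ** 2
    return np.diag(K)[None, :] - term2 + term3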

Choosing the Right Number of Clusters: Cluster Validity

There are many functions47–49 that can be applied to evaluate the partitions produced by a clustering algorithm using different numbers of cluster centers. The silhouette criterion50 and fuzzy silhouette criterion51 measure the similarity of objects to others in their own cluster against the nearest object from another. A nice comparison for k-means and some hierarchical approaches is found in Ref 52. Perhaps the simplest (but far from the only) approach is that shown in Figure 11. Here you start with 2 clusters and run the clustering algorithm to completion (or enough iterations to have a reasonable partition), then increase to 3, 4, . . ., MC clusters and repeat the process. A validity metric is applied to each partition and can be used to pick out the 'right' one according to it. This will typically be determined at the point where the declining or increasing value of the metric changes direction. Note that you cannot use the objective function itself because it will prefer many clusters (sometimes as many as there are examples).

1. Set the initial number of clusters I, typically 2. Set the maximum number of clusters MC, with MC << n unless something is unusual. Initialize T = I, k = 0. Choose the validity metric to be used and parameters for it.

2. While (T ≤ MC and k == 0) do

3. Cluster the data into T clusters.

4. k = checkvalidity  /* Returns the number of clusters if applicable, or 0. */

5. T = T + 1

Return: IF (k == 0) return MC ELSE return k.

FIGURE 11 | Finding the right number of clusters with a partition validity metric. Any validity metric that applies to a particular objective function-based clustering algorithm can be applied.
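One hedged sketch of the Figure 11 procedure, assuming the scikit-learn library is available and using its k-means together with the silhouette criterion50 as the validity metric, is shown below. Unlike the early-stopping checkvalidity of Figure 11, this version simply scores every candidate partition and keeps the best one:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, max_clusters=10, seed=0):
    """Cluster with T = 2..MC clusters and keep the partition the metric likes best."""
    best_k, best_score = None, -np.inf
    for t in range(2, max_clusters + 1):
        labels = KMeans(n_clusters=t, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)     # higher is better for the silhouette
        if score > best_score:
            best_k, best_score = t, score
    return best_k, best_score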

A very good, simple cluster validity metric for fuzzy partitions of data is the Xie–Beni index [Eq. (30)].53,54 It uses the value m with u_{ij}, and you can set m to the same value as used in clustering or simply a default (say m = 2). The search is for the smallest S.

S = \frac{ \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m D(x_j, v_i) }{ n \cdot \min_{i \neq j} \{ D(v_i, v_j) \} }.   (30)
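A direct NumPy transcription of Eq. (30), with the squared Euclidean distance used for D, might look like this (illustrative code):

import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni index S of Eq. (30); smaller values indicate a better fuzzy partition."""
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)     # D(x_j, v_i), k x n
    num = ((U ** m) * d2).sum()
    centre_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(centre_d2, np.inf)                         # exclude i == j
    return num / (len(X) * centre_d2.min())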

The generalized Dunn index47 has proved to be a good validity metric for k-means partitions. One good version of it is shown in Eq. (31), with a nice discussion given in Ref 47. It looks at between-cluster scatter versus within-cluster scatter, measured by the biggest intracluster distance.

S_{gd} = \min_{1 \le s \le k} \left\{ \min_{1 \le t \le k,\, t \neq s} \left\{ \frac{ \sum_{z \in v_s,\, w \in v_t} D(z, w) }{ |v_s| \cdot |v_t| \, \max_{1 \le l \le k} \big( \max_{x, y \in v_l} D(x, y) \big) } \right\} \right\}.   (31)
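The following sketch evaluates Eq. (31) for a crisp partition, using the Euclidean distance for D and average linkage between clusters (illustrative NumPy code; the function name is my own):

import numpy as np

def generalized_dunn(X, labels):
    """Generalized Dunn index of Eq. (31): average linkage between clusters divided
    by the largest cluster diameter; larger values indicate a better partition."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest intracluster distance over all clusters (the max term in Eq. (31)).
    diameter = max(
        (((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).max()) ** 0.5
        for c in clusters)
    best = np.inf
    for a in range(len(clusters)):
        for b in range(len(clusters)):
            if a == b:
                continue
            d = (((clusters[a][:, None, :] - clusters[b][None, :, :]) ** 2)
                 .sum(axis=2)) ** 0.5
            avg_link = d.mean()               # sum of D(z, w) over |v_s| * |v_t| pairs
            best = min(best, avg_link / diameter)
    return best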

A relatively new approach to determining the number of clusters involves the user of the clustering algorithm. These approaches are called visual assessment techniques.55,56 The examples or objects are ordered to reflect dissimilarities (rows and columns in the dissimilarity matrix are moved). The newly ordered matrix of pairwise example dissimilarities is displayed as an intensity image. Clusters are indicated by dark blocks of pixels along the diagonal. The viewer can then decide how many clusters he or she sees.

Scaling Clustering Algorithms to Large Datasets

The clustering algorithms we have discussed take a significant amount of time when the number of examples or features or both are large. For example, the k-means run time requires checking the distance of n examples of dimension s against k cluster centers, with an update of the (n × k) U matrix during each iteration. The run time is proportional to (nsk + nk)t for t iterations. Using big O notation,57 the average run time is O(nskt). This is linear in n, it is true, but the distances are computationally costly to compute. As n gets very large, we would like to reduce the time required to accomplish clustering.

Clustering can be sped up with distributed approaches58–60 where the algorithm is generalized to work on multiple processors. An early approach to speeding up k-means clustering is given in Ref 61. They provide a four-step process, shown in Figure 12, to allow just one pass through the data, assuming a maximum-sized piece of memory can be used to store data. An advantage of this approach is that you are only loading the data from disk one time, meaning it can be much larger than the available memory. The clustering is done in step 2.


1. Obtain the next available (possibly random) sample from the dataset and put it in free memory space. Initially, you may fill it all.

2. Update the current model using the data in memory.

3. Based on the updated model, classify the loaded examples as ones that (a) need to be retained in memory, (b) can be discarded with updates to the sufficient statistics, or (c) can be reduced via compression and summarized as sufficient statistics.

4. Determine if the stopping criteria are satisfied. If so, terminate; else go to 1.

FIGURE 12 | An approach to speeding up k-means (applied in step 2) with one pass through the data.

Now, to speed up FKM, the single-pass algorithm can be used.62,63 It makes use of the weights shown in Eq. (3). The approach is pretty simple. Break the data into c chunks.

(1) Cluster the first chunk.

(2) Create weights for the cluster centers based on the amount of membership assigned to them with Eq. (32). Here, n_d is the number of examples being clustered for a chunk of data.

(3) Bring in the next chunk of data and the k weighted cluster centers from the previous step and apply FKM.

(4) Go to step 2 until all chunks are processed.

So, if c = 10, in the first step you process 10% of the n examples and n_d = 0.1n. In the next 9 steps, there are n_d = 0.1n + k examples to cluster:

w_j = \sum_{l=1}^{n_d} u_{jl} w_l, \quad 1 ≤ j ≤ k.   (32)
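Putting the pieces together, here is a hedged sketch of the single-pass scheme: a weighted FKM on each chunk, with the previous centers carried forward as weighted examples and their new weights computed by Eq. (32). This is illustrative code under my own naming (and it assumes the first chunk holds at least k examples), not the implementation of Refs 62 and 63:

import numpy as np

def weighted_fkm(X, w, k, V, m=2.0, iters=50):
    """Weighted FKM: Eq. (4) memberships and Eq. (5) centers with example weights w."""
    for _ in range(iters):
        d2 = np.fmax(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2), 1e-12)
        U = d2 ** (1.0 / (1.0 - m))
        U /= U.sum(axis=0, keepdims=True)
        WUm = w * U ** m                               # apply per-example weights
        V = (WUm @ X) / WUm.sum(axis=1, keepdims=True)
    return U, V

def single_pass_fkm(X, k, chunks=10, m=2.0, seed=0):
    """Single-pass FKM: cluster chunk by chunk, carrying weighted centers forward."""
    rng = np.random.default_rng(seed)
    pieces = np.array_split(rng.permutation(X), chunks)
    V = pieces[0][rng.choice(len(pieces[0]), size=k, replace=False)]
    carried_w = None
    for chunk in pieces:
        if carried_w is None:
            data, w = chunk, np.ones(len(chunk))       # first chunk: plain FKM
        else:
            # Previous centers join the next chunk as k weighted "examples".
            data = np.vstack([chunk, V])
            w = np.concatenate([np.ones(len(chunk)), carried_w])
        U, V = weighted_fkm(data, w, k, V, m=m)
        carried_w = (U * w).sum(axis=1)                # Eq. (32): new center weights
    return V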

SUMMARY AND DISCUSSION

Objective function-based clustering describes an approach to grouping unlabeled data in which there is a global function to be minimized or maximized to find the best data partition. The approach is sensitive to initialization, and a brief example of this was given. It is necessary to specify the number of clusters for these approaches, although algorithms can easily be built22 to incorporate methods to determine the number of clusters.

Four major approaches to objective function-based clustering have been covered: k-means, FKM, PKM, and expectation maximization. Relational clustering can be done with any of the major clustering algorithms by using a matrix of distances between examples based on some relation. We have briefly discussed the importance of the distance measure in determining the shapes of the clusters that can be found.

A kernel-based approach to objective function-based clustering has been introduced. It has the promise, with the choice of the right kernel function, of allowing almost any shape of cluster to be found.

A section on how to determine the right number of clusters (cluster validity) was included. It shows just two of many validity measures; however, they both have performed well.

With the exception of PKM, the other approaches discussed here are sensitive to noise. Most clustering problems are not highly noisy. If yours is, and you want to use objective function-based techniques, take a look at Refs 4–6, which contain modified algorithms that are essential for this problem.

For a clustering problem, one needs to think about the data and choose a type of algorithm. The expected shape of clusters, number of features, amount of noise, whether examples of mixed classes exist, and amount of data are among the critical considerations. You can choose an expected value for k, the number of clusters, or use a validity function to tell you the best number. You might want to try multiple initializations and take the lowest (highest) value of the objective function, which indicates the best partition. There are a number of publicly available clustering algorithms that can be tried. In particular, the freely available Weka15 data mining tool has several, including k-means and EM. A couple of fuzzy clustering algorithms are also available.64,65

There are a lot of approaches that are not discussed here. This includes time series clustering.66–68 Some other notable ones are in Refs 69–74. They all contain some advances that might be helpful for your problem. Happy clustering.


ACKNOWLEDGMENTS

This work was partially supported by grant 1U01CA143062-01, Radiomics of NSCLC, from the National Institutes of Health.

REFERENCES

1. Jain A, Dubes R. Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall; 1988.
2. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv 1999, 31:264–323.
3. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003, 3:993–1022.
4. Dave RN, Krishnapuram R. Robust clustering methods: a unified view. IEEE Trans Fuzzy Syst 1997, 5:270–293.
5. Kim J, Krishnapuram R, Dave R. Application of the least trimmed squares technique to prototype-based clustering. Pattern Recognit Lett 1996, 17:633–641.
6. Wu K-L, Yang M-S. Alternative c-means clustering algorithms. Pattern Recognit 2002, 35:2267–2278.
7. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman divergences. J Mach Learn Res 2005, 6:1705–1749.
8. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Los Angeles, CA: University of California Press; 1967, 281–297.
9. Jain AK. Data clustering: 50 years beyond k-means. Pattern Recognit Lett 2010, 31:651–666. Award winning papers from the 19th International Conference on Pattern Recognition (ICPR).
10. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Boston, MA; August 20–23, 2000.
11. Redmond SJ, Heneghan C. A method for initialising the k-means clustering algorithm using kd-trees. Pattern Recognit Lett 2007, 28:965–973.
12. He J, Lan M, Tan C-L, Sung S-Y, Low H-B. Initialization of cluster refinement algorithms: a review and comparative study. In: 2004 IEEE International Joint Conference on Neural Networks; 2004.
13. Hall LO, Ozyurt IB, Bezdek JC. Clustering with a genetically optimized approach. IEEE Trans Evolut Comput 1999, 3:103–112.
14. Selim SZ, Ismail MA. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 1984, PAMI-6(1):81–87.
15. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Morgan Kaufmann; 2005.
16. Bezdek JC, Keller JM, Krishnapuram R, Kuncheva LI, Pal NR. Will the real iris data please stand up? IEEE Trans Fuzzy Syst 1999, 7:368–369.
17. Kothari R, Pitts D. On finding the number of clusters. Pattern Recognit Lett 1999, 20:405–416.
18. Kandel A. Fuzzy Mathematical Techniques With Applications. Boston, MA: Addison-Wesley; 1986.
19. Wu K-L. Analysis of parameter selections for fuzzy c-means. Pattern Recognit 2012, 45:407–415. http://dx.doi.org/10.1016/j.patcog.2011.07.01
20. Yu J, Cheng Q, Huang H. Analysis of the weighting exponent in the FCM. IEEE Trans Syst Man Cybern, Part B: Cybern 2004, 34:634–639.
21. Bezdek J, Hathaway R, Sobin M, Tucker W. Convergence theory for fuzzy c-means: counterexamples and repairs. IEEE Trans Syst Man Cybern 1987, 17:873–877.
22. Gath I, Geva AB. Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1989, 11:773–780.
23. Gustafson DE, Kessel WC. Fuzzy clustering with a fuzzy covariance matrix. In: Proc IEEE CDC; 1979, 761–766.
24. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 1977, 39:1–38.
25. Wu CFJ. On the convergence properties of the EM algorithm. Ann Stat 1983, 11:95–103.
26. Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 1998, 41:579–588.
27. Krishnapuram R, Keller JM. A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1993, 1:98–110.
28. Krishnapuram R, Keller JM. The possibilistic c-means algorithm: insights and recommendations. IEEE Trans Fuzzy Syst 1996, 4:385–393.
29. Barni M, Cappellini V, Mecocci A. Comments on "A possibilistic approach to clustering". IEEE Trans Fuzzy Syst 1996, 4:393–396.
30. Dubois D, Prade H. Possibility theory, probability theory and multiple-valued logics: a clarification. Ann Math Artif Intell 2001, 32:35–66.


31. Sung K-K, Poggio T. Example-based learning for view-based human face detection. IEEE Trans Pattern Anal Mach Intell 1998, 20:39–51.
32. Krishnapuram R, Nasraoui O, Frigui H. The fuzzy c spherical shells algorithm: a new approach. IEEE Trans Neural Netw 1992, 3:663–671.
33. Yang M-S, Wu K-L. Unsupervised possibilistic clustering. Pattern Recognit 2006, 39:5–21.
34. Timm H, Borgelt C, Doring C, Kruse R. An extension to possibilistic fuzzy cluster analysis. Fuzzy Sets Syst 2004, 147:3–16.
35. Ghosh J, Acharya A. Cluster ensembles. WIREs Data Min Knowl Discov 2011, 1:305–315.
36. Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 2002, 3:583–617.
37. Schaeffer SE. Graph clustering. Comput Sci Rev 2007, 1:27–64.
38. Hathaway RJ, Davenport JW, Bezdek JC. Relational duals of the c-means clustering algorithms. Pattern Recognit 1989, 22:205–212.
39. Krishnapuram R, Joshi A, Nasraoui O, Yi L. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Trans Fuzzy Syst 2001, 9:595–607.
40. Aggarwal C, Hinneburg A, Keim D. On the surprising behavior of distance metrics in high dimensional space. Lecture Notes in Computer Science. Springer; 2001, 420–434.
41. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49:803–821.
42. Hofmann T, Scholkopf B, Smola AJ. Kernel methods in machine learning. Ann Stat 2008, 36:1171–1220.
43. Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998, 2:121–167.
44. Kim D-W, Lee KY, Lee DH, Lee KH. Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognit 2005, 38:607–611.
45. Heo G, Gader P. An extension of global fuzzy c-means using kernel methods. In: 2010 IEEE International Conference on Fuzzy Systems (FUZZ); 2010, 1–6.
46. Chen L, Chen CLP, Lu M. A multiple-kernel fuzzy c-means algorithm for image segmentation. IEEE Trans Syst Man Cybern, Part B: Cybern 2011, 99:1–12.
47. Bezdek JC, Pal NR. Some new indexes of cluster validity. IEEE Trans Syst Man Cybern, Part B: Cybern 1998, 28:301–315.
48. Wang J-S, Chiang J-C. A cluster validity measure with outlier detection for support vector clustering. IEEE Trans Syst Man Cybern, Part B: Cybern 2008, 38:78–89.
49. Pal NR, Bezdek JC. On cluster validity for the fuzzy c-means model. IEEE Trans Fuzzy Syst 1995, 3:370–379.
50. Kaufman L, Rousseeuw P. Finding Groups in Data. New York: John Wiley & Sons; 1990.
51. Campello RJGB, Hruschka ER. A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets Syst 2006, 157:2858–2875.
52. Vendramin L, Campello RJGB, Hruschka ER. Relative clustering validity criteria: a comparative overview. Stat Anal Data Min 2010, 3:209–235.
53. Xie XL, Beni G. A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1991, 13:841–847.
54. Pal NR, Bezdek JC. Correction to "On cluster validity for the fuzzy c-means model". IEEE Trans Fuzzy Syst 1997, 5:152–153.
55. Bezdek J, Hathaway R. VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of the International Joint Conference on Neural Networks; 2002, 2225–2230.
56. Bezdek J, Hathaway R, Huband J. Visual assessment of clustering tendency for rectangular dissimilarity matrices. IEEE Trans Fuzzy Syst 2007, 15:890–903.
57. Sedgewick R, Flajolet P. An Introduction to the Analysis of Algorithms. Boston, MA: Addison-Wesley; 1995.
58. Kargupta H, Huang W, Sivakumar K, Johnson E. Distributed clustering using collective principal component analysis. Knowl Inf Syst 2001, 3:422–448.
59. Kriegel H-P, Kroger P, Pryakhin A, Schubert M. Effective and efficient distributed model-based clustering. In: IEEE International Conference on Data Mining; 2005, 258–265.
60. Olman V, Mao F, Wu H, Xu Y. Parallel clustering algorithm for large data sets with applications in bioinformatics. IEEE/ACM Trans Comput Biol Bioinf 2009, 6:344–352.
61. Bradley PS, Fayyad U, Reina C. Scaling clustering algorithms to large databases. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining; 1998, 9–15.
62. Hore P, Hall LO, Goldgof DB. Single pass fuzzy c means. In: IEEE International Fuzzy Systems Conference, FUZZ-IEEE 2007; 2007, 1–7.
63. Hore P, Hall L, Goldgof D, Gu Y, Maudsley A, Darkazanli A. A scalable framework for segmenting magnetic resonance images. J Sig Process Syst 2009, 54:183–203.
64. Eschrich S, Ke J, Hall LO, Goldgof DB. Fast accurate fuzzy clustering through data reduction. IEEE Trans Fuzzy Syst 2003, 11:262–270.
65. Hore P, Hall LO, Goldgof DB, Gu Y. Scalable clustering code. Available at: http://www.csee.usf.edu/hall/scalable. (Accessed April 26, 2012).
66. D'Urso P. Fuzzy clustering for data time arrays with inlier and outlier time trajectories. IEEE Trans Fuzzy Syst 2005, 13:583–604.
67. Coppi R, D'Urso P. Fuzzy unsupervised classification of multivariate time trajectories with the Shannon entropy regularization. Comput Stat Data Anal 2006, 50:1452–1477.

68. Liao TW. Clustering of time series data - a survey. Pattern Recognit 2005, 38:1857–1874.
69. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96. New York: ACM; 1996, 103–114.
70. Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1998, 73–84.
71. Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: Proceedings of the International Conference on Very Large Data Bases; 2003.
72. Gupta C, Grossman R. GenIc: a single pass generalized incremental algorithm for clustering. In: Proceedings of the Fourth SIAM International Conference on Data Mining (SDM 04); 2004, 22–24.
73. Dhillon IS, Mallela S, Kumar R. A divisive information theoretic feature clustering algorithm for text classification. J Mach Learn Res 2003, 3:1265–1287.
74. Linde Y, Buzo A, Gray R. An algorithm for vector quantizer design. IEEE Trans Commun 1980, 28:84–95.
