

Research Article
A Clustering Method for Data in Cylindrical Coordinates

Kazuhisa Fujita 1,2

1 University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan
2 Department of Computer and Information Engineering, National Institute of Technology, Tsuyama College, 654-1 Numa, Tsuyama, Okayama 708-8506, Japan

Correspondence should be addressed to Kazuhisa Fujita; k-z@nerve.pc.uec.ac.jp

Received 7 April 2017; Revised 18 July 2017; Accepted 16 August 2017; Published 27 September 2017

Academic Editor: Ivan Giorgio

Copyright © 2017 Kazuhisa Fujita. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose a new clustering method for data in cylindrical coordinates based on the k-means. The goal of the k-means family is to maximize an optimization function, which requires a similarity. Thus, we need a new similarity to obtain the new clustering method for data in cylindrical coordinates. In this study, we first derive a new similarity for the new clustering method by assuming a particular probabilistic model. A data point in cylindrical coordinates has radius, azimuth, and height. We assume that the azimuth is sampled from a von Mises distribution and that the radius and the height are independently generated from isotropic Gaussian distributions. We derive the new similarity from the log likelihood of the assumed probability distribution. Our experiments demonstrate that the proposed method using the new similarity can appropriately partition synthetic data defined in cylindrical coordinates. Furthermore, we apply the proposed method to color image quantization and show that the method successfully quantizes a color image with respect to the hue element.

1. Introduction

Clustering is an important technique in many areas such as data analysis, data visualization, image processing, and pattern recognition. The most popular and useful clustering method is the k-means. The k-means uses the Euclidean distance as its clustering criterion and partitions data into k clusters. The Euclidean distance is a reasonable measurement for data sampled from an isotropic Gaussian distribution. We cannot always obtain a good clustering result using the k-means because not all data distributions are isotropic Gaussian distributions.

The present study focuses on data in cylindrical coordinates. Data in cylindrical coordinates have a periodic element, so clustering methods using the Euclidean distance will lead to an improper analysis of the data. Furthermore, a clustering method using the Euclidean distance may not be able to extract meaningful centroids. For example, if a distribution in cylindrical coordinates has a remarkably curved crescent shape, the centroid of the distribution calculated by the k-means may not lie on the data distribution. However, there are no clustering methods optimized for data in cylindrical coordinates.

Cylindrical data are found in many fields such as image processing, meteorology, and biology. Movements of plants and animals, and wind direction paired with another environmental measure, are typical examples of cylindrical data [1]. The most popular example of data in cylindrical coordinates is color defined in the HSV color model. An HSV color has three attributes: hue (direction), saturation (radius), and value, which means brightness (height). The HSV color model can represent hue information and has a more natural correspondence to human vision than the RGB color model [2]. A clustering method for cylindrical coordinates is therefore useful in many fields, especially image processing.

The purpose of this study is to develop a new clustering method for data in cylindrical coordinates based on the k-means. We first derive a new similarity for clustering data in cylindrical coordinates, assuming that the data are sampled from a probabilistic model that is the product of a von Mises distribution and Gaussian distributions. We propose a new clustering method with this new similarity for data in cylindrical coordinates. Using numerical experiments, we demonstrate that the proposed method can partition synthetic data. Furthermore, we evaluate the performance of the proposed method for real world data. Finally, we apply the proposed method to color image quantization and demonstrate that it can quantize a color image according to the hue.

Hindawi, Mathematical Problems in Engineering, Volume 2017, Article ID 3696850, 11 pages. https://doi.org/10.1155/2017/3696850

2. Related Works

The most commonly used clustering method is the k-means [3], which is one of the top 10 most common algorithms used in data mining [4]. The k-means has been applied to various fields because it is fast, simple, and easy to understand. It uses the Euclidean distance as a clustering criterion and assumes that the data are sampled from a mixture of isotropic Gaussian distributions. Thus, we can apply the k-means to data sampled from a mixture of isotropic Gaussian distributions, but the k-means is not appropriate for data generated from other distributions. Data in cylindrical coordinates have periodic characteristics, so the k-means will be inappropriate as a clustering method for such data.

We can cluster periodic data distributed on an n-dimensional sphere surface using the spherical k-means (sk-means). Dhillon and Modha [5] and Banerjee et al. [6, 7] developed the sk-means for clustering high dimensional text data. It is a k-means based method that uses the cosine similarity as the criterion for clustering. The sk-means assumes that the data are sampled from a mixture of von Mises-Fisher distributions with the same concentration parameters and the same mixture weights. However, we cannot apply the sk-means to data that have direction, radius, and height. To appropriately partition these data, we need a different nonlinear separation method.

There are many methods for achieving nonlinear separation. One method is the kernel k-means [8], which partitions the data points in a higher-dimensional feature space after they are mapped to that space using a nonlinear function. Spectral clustering [9] is another popular modern nonlinear clustering method, which uses the eigenvectors of a similarity (kernel) matrix to partition data points. Support vector clustering [10] is inspired by the support vector machine [11]. These nonlinear clustering methods based on the kernel method can provide reasonable clustering results for non-Gaussian data. However, they can hardly provide meaningful statistics because they perform the clustering in a feature space. This is a problem when we also want to determine some features of the data, as in color image quantization. Furthermore, we must experimentally select the optimal kernel function and its parameters.

Clustering methods are frequently used for color image quantization. Color image quantization reduces the number of colors in an image and plays an important role in applications such as image segmentation [12], image compression [13], and color feature extraction [14]. A color quantization technique consists of two stages: the palette design stage and the pixel mapping stage. These stages can be respectively regarded as calculating the centroids and assigning a data point to a cluster. Many researchers have developed color quantization methods, including median cut [15], the k-means [16], the fuzzy c-means [17, 18], self-organizing maps [19–21], and particle swarm optimization [22]. However, color quantization is generally performed in the RGB color space, and the HSV color space is rarely adopted.

3. Methodology

3.1. Assumed Probabilistic Distribution. A data point in cylindrical coordinates x is represented by x = (r, u, z) with ‖u‖ = 1 and r ≥ 0, where r, u, and z are called the radius, azimuth, and height, respectively. In this study, we represent the azimuth as a unit vector u = (u_x, u_y) to simplify calculating the cosine similarity. Here, each element of x is assumed to be independent and identically distributed. Let a data point in cylindrical coordinates x = (r, u, z) be generated by a probability density function (pdf) of the form

p(x | θ) = vM(u | μ_u, κ) N(r | μ_r, σ_r²) N(z | μ_z, σ_z²)   (1)

where θ = {μ_u, μ_r, μ_z, κ, σ_r², σ_z²}, and vM(·) and N(·) are pdfs of a von Mises distribution and an isotropic Gaussian distribution, respectively. The pdf of a von Mises distribution vM(·) has the form

vM(u | μ_u, κ) = (1 / (2π I₀(κ))) exp(κ μ_uᵀ u)   (2)

where μ_u is the mean of the azimuth with ‖μ_u‖ = 1, κ is the concentration parameter, and I₀(·) is the modified Bessel function of the first kind (order 0). The pdf of an isotropic Gaussian distribution N(·) has the form

N(y | μ, σ²) = (1 / (√(2π) σ)) exp(−(y − μ)² / (2σ²))   (3)

where μ is the mean and σ² is the variance. μ_r and μ_z are the means of the radius and height, respectively, and σ_r² and σ_z² are the variances of the radius and height, respectively. Thus, the density p(x | θ) can be written as

p(x | θ) = (1 / (4π² σ_r σ_z I₀(κ))) exp(κ μ_uᵀ u − (r − μ_r)² / (2σ_r²) − (z − μ_z)² / (2σ_z²))   (4)

We can estimate the parameters of the density p(x | θ) using maximum likelihood estimation. Let the data set X = {x₁, …, x_N} be generated from the density p(X | θ). The log likelihood function of p(X | θ) is

ln p(X | θ) = Σ_{i=1}^{N} ln p(x_i | θ)
            = Σ_{i=1}^{N} (−ln(4π² σ_r σ_z I₀(κ)) + κ μ_uᵀ u_i − (r_i − μ_r)² / (2σ_r²) − (z_i − μ_z)² / (2σ_z²))   (5)


Maximizing this equation subject to μ_uᵀ μ_u = 1, we find the maximum likelihood estimates μ̂_u, μ̂_r, μ̂_z, A(κ̂), σ̂_r², and σ̂_z², obtained from

μ̂_u = Σ_{i=1}^{N} u_i / ‖Σ_{i=1}^{N} u_i‖,
A(κ̂) = R̄,
μ̂_r = (1/N) Σ_{i=1}^{N} r_i,
σ̂_r² = (1/N) Σ_{i=1}^{N} (r_i − μ̂_r)²,
μ̂_z = (1/N) Σ_{i=1}^{N} z_i,
σ̂_z² = (1/N) Σ_{i=1}^{N} (z_i − μ̂_z)²,   (6)

where A(κ̂) is

A(κ̂) = I₀′(κ̂) / I₀(κ̂) = ‖Σ_{i=1}^{N} u_i‖ / N = R̄   (7)

It is difficult to estimate the concentration parameter κ because an analytic solution cannot be obtained using the maximum likelihood estimate; we can only calculate the ratio of the Bessel functions. We approximate κ using the numerical method proposed by Sra [23] because it produces the most accurate estimates of κ compared to other methods.

We estimate κ using the recursive function

κ_n = κ_{n−1} − (A_p(κ_{n−1}) − R̄) / (1 − A_p(κ_{n−1})² − ((p − 1)/κ_{n−1}) A_p(κ_{n−1}))   (8)

where n is the iteration number. The recursive calculation terminates when |κ_n − κ_{n−1}| < ε_κ. In this study, ε_κ = 0.001. We calculate κ₀ using the method proposed by Banerjee et al. [6]:

κ₀ = R̄(2 − R̄²) / (1 − R̄²)   (9)
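For reference, the recursion (8) with the initialization (9) can be sketched in a few lines of Python (an illustrative sketch: the function name and tolerances are our own choices, and `scipy.special.iv` supplies the modified Bessel functions, with A_p(κ) = I_{p/2}(κ)/I_{p/2−1}(κ), so A_2 = I₁/I₀ in the two-dimensional von Mises case):

```python
from scipy.special import iv  # modified Bessel function of the first kind

def estimate_kappa(R_bar, p=2, eps_kappa=1e-3, max_iter=100):
    """Approximate the concentration parameter kappa from the mean
    resultant length R_bar with Sra's recursion (8), initialized by
    the approximation (9) of Banerjee et al."""
    # Initial value, Eq. (9): kappa_0 = R_bar (p - R_bar^2) / (1 - R_bar^2)
    kappa = R_bar * (p - R_bar**2) / (1.0 - R_bar**2)
    for _ in range(max_iter):
        # A_p(kappa) = I_{p/2}(kappa) / I_{p/2-1}(kappa); for p = 2 this is I_1/I_0
        Ap = iv(p / 2.0, kappa) / iv(p / 2.0 - 1.0, kappa)
        # One step of the recursion, Eq. (8)
        kappa_new = kappa - (Ap - R_bar) / (1.0 - Ap**2 - ((p - 1.0) / kappa) * Ap)
        if abs(kappa_new - kappa) < eps_kappa:
            return kappa_new
        kappa = kappa_new
    return kappa
```

Note that as R̄ approaches 1 the denominator of (8) becomes small, so in practice R̄ should be clipped slightly below 1 before calling such a routine.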

3.2. Cylindrical k-Means. The k-means family uses a particular similarity to decide whether a data point belongs to a cluster. The Euclidean distance (a dissimilarity) is most frequently used by the k-means family and, moreover, is derived from the log likelihood of an isotropic Gaussian distribution. Therefore, the k-means using the Euclidean distance will be able to appropriately partition data sampled from isotropic Gaussian distributions, but not from other distributions. We must develop a new similarity for data in cylindrical coordinates because the k-means family clusters by maximizing the sum of similarities between the centroid of a cluster and the data points that belong to the cluster. In this study, we obtain the optimal similarity for partitioning data in cylindrical coordinates from an assumed pdf.

First, to develop a k-means based method for data in cylindrical coordinates (the cylindrical k-means, or cyk-means), we obtain a new similarity measure for data in cylindrical coordinates by assuming a probability distribution. Assume that a data point x in a cluster that has a centroid m = (μ_r, μ_u, μ_z) is sampled from the probability distribution p(x | θ) denoted by (4), where θ = (m, κ, σ_r, σ_z). The natural logarithm of p(x | θ) is

ln p(x | θ) = κ μ_uᵀ u − (r − μ_r)² / (2σ_r²) − (z − μ_z)² / (2σ_z²) + ln C   (10)

where C is a normalizing constant given by

C = 1 / (4π² σ_r σ_z I₀(κ))   (11)

Here, we ignore the normalizing constant ln C to obtain

S(x, m) = κ μ_uᵀ u − (r − μ_r)² / (2σ_r²) − (z − μ_z)² / (2σ_z²)   (12)

In this study, this equation is used as the similarity for the cyk-means. S(x, m) denotes the similarity between the data point x and the centroid m. The terms in (12) consist of the cosine similarity and the Euclidean similarities, and the new similarity is a weighted sum of these similarities. The weights indicate the concentrations of the distributions. This similarity can also be considered a simplified log likelihood.
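Concretely, the similarity (12) can be computed as follows (a sketch; the function name is our own, and the azimuth is passed as a 2D unit vector as in Section 3.1):

```python
import numpy as np

def similarity(x, m, kappa, sigma_r, sigma_z):
    """Similarity S(x, m) of Eq. (12) between a data point
    x = (r, u, z) and a centroid m = (mu_r, mu_u, mu_z),
    where u and mu_u are 2D unit vectors representing azimuths."""
    r, u, z = x
    mu_r, mu_u, mu_z = m
    cos_term = kappa * np.dot(mu_u, u)                 # weighted cosine similarity
    r_term = (r - mu_r) ** 2 / (2.0 * sigma_r ** 2)    # radial squared error
    z_term = (z - mu_z) ** 2 / (2.0 * sigma_z ** 2)    # height squared error
    return cos_term - r_term - z_term
```

A data point that coincides with the centroid attains the maximum value, κ.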

The cyk-means partitions data points in cylindrical coordinates into K clusters using the same procedure as the k-means. Let X = {x₁, …, x_N} be a set of data points in cylindrical coordinates. Let m_j = (μ_rj, μ_uj, μ_zj) be the centroid of the jth cluster. Using the similarity S(x_i, m_j), the objective function J is

J = Σ_{i=1}^{N} Σ_{j=1}^{K} γ_ij S(x_i, m_j)
  = Σ_{i=1}^{N} Σ_{j=1}^{K} γ_ij (κ_j μ_ujᵀ u_i − (r_i − μ_rj)² / (2σ_rj²) − (z_i − μ_zj)² / (2σ_zj²))   (13)

where γ_ij is a binary indicator value: if the ith data point belongs to the jth cluster, γ_ij = 1; otherwise, γ_ij = 0. The aim of the cyk-means is to maximize the objective function J. The process to maximize the objective function is the same as that of the k-means and is described as follows.

(1) Fix K and initialize m_j.
(2) Assign each data point to the cluster that has the most similar centroid.
(3) Estimate the parameters of the clusters.
(4) Return to Step (2) if the cluster assignment of data points changes or the difference between the values of the objective function from the current and last iterations is more than a threshold ε_J. Otherwise, terminate the procedure.


Input: Set X of data points in cylindrical coordinates
Output: A clustering of X
(1) Initialize θ_j, j = 1, …, K
(2) repeat
(3)   Set γ_ij = 0, i = 1, …, N, j = 1, …, K
(4)   Assign data points to clusters:
(5)   for i = 1 to N do
(6)     j ← argmax_{j′} (κ_{j′} μ_{uj′}ᵀ u_i − (r_i − μ_{rj′})² / (2σ_{rj′}²) − (z_i − μ_{zj′})² / (2σ_{zj′}²))
(7)     γ_ij = 1
(8)   end for
(9)   Estimate parameters:
(10)  for j = 1 to K do
(11)    N_j = Σ_{i=1}^{N} γ_ij
(12)    μ_rj = (1/N_j) Σ_{i=1}^{N} γ_ij r_i
(13)    μ_uj = Σ_{i=1}^{N} γ_ij u_i / ‖Σ_{i=1}^{N} γ_ij u_i‖
(14)    μ_zj = (1/N_j) Σ_{i=1}^{N} γ_ij z_i
(15)    σ_rj² = (1/N_j) Σ_{i=1}^{N} γ_ij (r_i − μ_rj)²
(16)    σ_zj² = (1/N_j) Σ_{i=1}^{N} γ_ij (z_i − μ_zj)²
(17)    R̄_j = ‖Σ_{i=1}^{N} γ_ij u_i‖ / N_j
(18)    κ_j = R̄_j (2 − R̄_j²) / (1 − R̄_j²)
(19)    repeat
(20)      κ_j ← κ_j − (A_p(κ_j) − R̄_j) / (1 − A_p(κ_j)² − ((p − 1)/κ_j) A_p(κ_j))
(21)    until convergence
(22)  end for
(23) until convergence

Algorithm 1: cyk-means.

In this study, we use ε_J = |0.001 × J_n|, where J_n is the objective function of the nth iteration. Algorithm 1 shows the details of the cyk-means algorithm. From (6), the elements of the centroid vector m_j = (μ_rj, μ_uj, μ_zj) of the jth cluster are

μ_rj = (1/N_j) Σ_{i=1}^{N} γ_ij r_i,
μ_uj = Σ_{i=1}^{N} γ_ij u_i / ‖Σ_{i=1}^{N} γ_ij u_i‖,
μ_zj = (1/N_j) Σ_{i=1}^{N} γ_ij z_i,   (14)

where N_j is the number of data points in the jth cluster (N_j = Σ_{i=1}^{N} γ_ij). The other values used to calculate the objective function are

R̄_j = ‖Σ_{i=1}^{N} γ_ij u_i‖ / N_j,
σ_rj² = (1/N_j) Σ_{i=1}^{N} γ_ij (r_i − μ_rj)²,
σ_zj² = (1/N_j) Σ_{i=1}^{N} γ_ij (z_i − μ_zj)²   (15)

κ_j is approximated by Sra's method using the ratio of the Bessel functions, R̄_j.


Input: Set X of data points in cylindrical coordinates
Output: A clustering of X
(1) Initialize θ_j, j = 1, …, K
(2) repeat
(3)   Set γ_ij = 0, i = 1, …, N, j = 1, …, K
(4)   Assign data points to clusters:
(5)   for i = 1 to N do
(6)     j ← argmax_{j′} (κ_{j′} μ_{uj′}ᵀ u_i − (r_i − μ_{rj′})² / (2σ_{rj′}²) − (z_i − μ_{zj′})² / (2σ_{zj′}²))
(7)     γ_ij = 1
(8)   end for
(9)   Estimate parameters:
(10)  for j = 1 to K do
(11)    N_j = Σ_{i=1}^{N} γ_ij
(12)    μ_rj = (1/N_j) Σ_{i=1}^{N} γ_ij r_i
(13)    μ_uj = Σ_{i=1}^{N} γ_ij u_i / ‖Σ_{i=1}^{N} γ_ij u_i‖
(14)    μ_zj = (1/N_j) Σ_{i=1}^{N} γ_ij z_i
(15)  end for
(16) until convergence

Algorithm 2: Fixed cyk-means.

The cyk-means method has many parameters. The k-means method for data in three-dimensional Cartesian coordinates has only 3K parameters, the product of the number of centroid vectors and the number of dimensions. However, the cyk-means has 7K parameters, the product of the number of clusters and the number of parameters per cluster. The parameters of the jth cluster are μ_rj, μ_uj (two dimensions), μ_zj, σ_rj, κ_j, and σ_zj. Because the cyk-means has more degrees of freedom, the dead unit problem (i.e., empty clusters) will frequently occur if the initialization is not appropriate.

3.3. Fixed cyk-Means. Model based clustering methods have various problems, such as the dead unit and initial value problems. One reason for this is that the log likelihood equation can have many local optima [9]. If a model has more parameters, these problems tend to be more frequent. In the fixed cyk-means, the concentration parameter κ and the variances σ² are fixed at particular values. As a consequence, the fixed cyk-means has 4K parameters. Fixing the parameters decreases the complexity of the model and makes these problems less frequent. Algorithm 2 shows the fixed cyk-means algorithm.

3.4. Computational Complexity. Assigning data points to clusters has a complexity of O(KN) per iteration. We must estimate six parameters: we obtain three μs, two σs, and κ in O(6KN + K t_κmax) time per iteration, where t_κmax is the convergence time of κ. Therefore, the total computational complexity per iteration is O(7KN + K t_κmax). The complexity of the fixed cyk-means is O(5KN) per iteration, so the cyk-means is approximately 1.5 times as complex as the fixed cyk-means.

4. Experimental Results

In our experiments, we use Python and its libraries (NumPy, SciPy, and scikit-learn) to implement the proposed methods.

4.1. Synthetic Data. In this subsection, we demonstrate that the cyk-means and the fixed cyk-means can partition synthetic data defined in cylindrical coordinates. The dataset used in this experiment has three clusters, as shown in Figure 1(a). The data points in each cluster are generated from the probability distribution denoted by (4) with the parameters shown in Table 1. Figures 1(b), 1(c), and 1(d) show the clustering results of the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means, respectively. We can see that the cyk-means and the fixed cyk-means properly partition the dataset into each cluster. On the other hand, the k-means regards the two upper right clusters as one cluster and unsuccessfully partitions the dataset. Table 2 shows the parameters estimated by the cyk-means, the fixed cyk-means, and the k-means. Only the cyk-means estimates the concentration parameters and the variances, and the values it estimates are close to the true values. The cyk-means most accurately estimates the number of data points in each cluster. The fixed cyk-means estimates all the means most accurately, and the cyk-means also approximates them well. These results show that the cyk-means and the fixed cyk-means estimate all the means with sufficient accuracy.

Figure 1: (a) Scatter plot of the synthetic dataset including three clusters. (b), (c), (d) Clustering results of the cyk-means, the fixed cyk-means, and the k-means. The three clusters are shown with circles, triangles, and crosses.

Table 1: Parameters of the dataset.

j   N_j   μ_θj   κ_j   μ_rj   σ_rj   μ_zj   σ_zj
1   200    0      8    1      0.2     0.5   0.1
2   300   −2      4    1.5    0.15   −0.5   0.2
3   400    2     12    0.5    0.1     1     0.2

j is a cluster number; μ_θj is the azimuth of the centroid of the jth cluster, μ_θj = arctan(u_yj/u_xj).

In the next experiment, we examine the effectiveness of the proposed methods (the cyk-means and the fixed cyk-means with κ = 15, σ_r = 0.1, and σ_z = 0.1) compared to the k-means and the kernel k-means with a radial basis function. The parameter of the radial basis function is γ = 0.1. The synthetic data have K clusters and are defined in cylindrical coordinates. The number of data points in each cluster is 200. The mean azimuth of the jth cluster, arctan(u_yj/u_xj), is a random number in [0, 2π). The concentration parameter κ_j is a random number in [5, 30). The mean radius of the jth cluster μ_rj is a random number in [1, 4). The mean height of the jth cluster μ_zj is a random number in [−1.5, 1.5). The standard deviations σ_rj and σ_zj are random numbers in [0.05, 0.2).
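The sampling procedure above can be sketched directly from the model (4) (an illustrative sketch; the function name is our own, NumPy's built-in von Mises sampler draws the azimuths, and the parameter ranges follow the text):

```python
import numpy as np

def make_synthetic_cylindrical(K, n_per_cluster=200, seed=0):
    """Sample K clusters from the product model of Eq. (4):
    azimuth ~ von Mises, radius and height ~ Gaussian,
    with cluster parameters drawn from the ranges in the text."""
    rng = np.random.default_rng(seed)
    thetas, rs, zs, labels = [], [], [], []
    for j in range(K):
        mu_theta = rng.uniform(0.0, 2.0 * np.pi)   # mean azimuth in [0, 2*pi)
        kappa    = rng.uniform(5.0, 30.0)          # concentration in [5, 30)
        mu_r     = rng.uniform(1.0, 4.0)           # mean radius in [1, 4)
        mu_z     = rng.uniform(-1.5, 1.5)          # mean height in [-1.5, 1.5)
        sigma_r  = rng.uniform(0.05, 0.2)          # radial std in [0.05, 0.2)
        sigma_z  = rng.uniform(0.05, 0.2)          # height std in [0.05, 0.2)
        thetas.append(rng.vonmises(mu_theta, kappa, n_per_cluster))
        rs.append(rng.normal(mu_r, sigma_r, n_per_cluster))
        zs.append(rng.normal(mu_z, sigma_z, n_per_cluster))
        labels.append(np.full(n_per_cluster, j))
    return (np.concatenate(rs), np.concatenate(thetas),
            np.concatenate(zs), np.concatenate(labels))
```

NumPy's von Mises sampler returns angles in [−π, π], so the sampled azimuths may need to be wrapped if angles in [0, 2π) are required.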

Figure 2 shows the relationship between the number of clusters and the adjusted rand index (ARI). The ARI evaluates the performance of clustering algorithms [24]; when ARI = 1, all data points belong to their true clusters. The figure shows that the cyk-means has the largest ARI in almost all cases. The fixed cyk-means performs better than the kernel k-means and the k-means. The k-means performs the worst. In conclusion, the cyk-means most accurately partitions synthetic data defined in cylindrical coordinates, and the fixed cyk-means also performs well.


Table 2: Parameters estimated by the cyk-means, the fixed cyk-means, and the k-means.

Method            j   N_j      μ_θj          κ_j     μ_rj     σ_rj     μ_zj      σ_zj
cyk-means         1   200      −1.960×10⁻³   7.523   0.9746   0.1942    0.5065   0.09417
                  2   300      −1.997        3.979   1.495    0.1492   −0.5070   0.1866
                  3   400       1.986        12.69   0.4966   0.1004    1.005    0.1997
Fixed cyk-means   1   200.2    −5.124×10⁻⁵   —       1.001    —         0.4995   —
                  2   299.9    −2.000        —       1.501    —        −0.5056   —
                  3   400       1.996        —       0.4998   —         0.9995   —
k-means           1   184.85   −0.6627      —       1.091    —         0.1871   —
                  2   253.65   −1.974       —       1.358    —        −0.4974   —
                  3   461.5     1.704       —       0.4406   —         0.9477   —

The results are the mean of twenty runs with randomly generated initial values. j is a cluster number; μ_θj is the azimuth of the centroid of the jth cluster, μ_θj = arctan(u_yj/u_xj). The best estimations are in bold.

Figure 2: Relationship between the number of clusters K and the adjusted rand index (ARI). The vertical and horizontal axes indicate the ARI and the number of clusters K, respectively. The results are the mean of 200 runs on randomly generated synthetic data for each K.

4.2. Real World Data. We show the performance of the proposed methods on the iris dataset (http://mlearn.ics.uci.edu/databases/iris) and the segmentation benchmark dataset (http://www.ntu.edu.sg/home/asjfcai/Benchmark_Website/benchmark_index.html) [25]. The iris dataset has 150 data points in three classes of irises. Each data point consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The segmentation benchmark dataset consists of 100 images from the Berkeley segmentation database [26] and ground truths generated by manual labeling.

Table 3 shows the ARI scores of the cyk-means, the fixed cyk-means, the k-means, and the kernel k-means for the iris dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1; the parameter γ of the radial basis function of the kernel k-means is 0.01. In this experiment, we use only three attributes of the iris dataset because the proposed methods are specialized for 3-dimensional data. Furthermore, we transform the resulting three-attribute dataset into a zero-mean dataset. In all cases, the performance of the cyk-means is lower than that of the other methods. Conversely, in almost all cases, the performance of the fixed cyk-means is the best. However, the difference in performance between the fixed cyk-means, the k-means, and the kernel k-means is not large.

Table 4 shows the ARI scores of the cyk-means, the fixed cyk-means, and the k-means for seven images in the segmentation benchmark dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1. To evaluate the performance of the cyk-means and the fixed cyk-means, we convert the images from RGB color to HSV color. When we cluster the dataset with the k-means, we use images represented in RGB color and in HSV color. In this experiment, we compare a clustering result with a ground truth using the ARI score, and we set the number of clusters K to the number of segments in the ground truth. In all cases, the fixed cyk-means stably shows good performance. The cyk-means shows either much better or much worse performance than the other methods; in other words, its performance is unstable. This instability is likely caused by the cyk-means being more easily trapped in a local minimum because it has more parameters.

4.3. Application to Color Image Quantization. We apply the cyk-means and the fixed cyk-means to color image quantization and compare the results with those of the k-means. Before quantization, we convert the images processed by the proposed methods from RGB color space to HSV color space, whereas an image processed by the k-means is represented in RGB. Figure 3 contains the four test images from the Berkeley segmentation database [26] and their quantization results. The original color images have sizes of 481 × 321 or 321 × 481 and are quantized into three colors. The quantization results are generated by the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means. The color of a pixel in a quantized image is the value of the centroid of the cluster that contains the pixel.
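The conversion from an RGB pixel to the cylindrical point used by the proposed methods can be sketched as follows. This sketch uses Python's standard colorsys module; the mapping (hue → azimuth unit vector, saturation → radius, value → height) follows the HSV description in the introduction, and the function name is ours:

```python
import colorsys
import math

def rgb_to_cylindrical(r8, g8, b8):
    """Map an 8-bit RGB pixel to the cylindrical point (r, u, z):
    hue -> azimuth unit vector u, saturation -> radius r, value -> height z."""
    h, s, v = colorsys.rgb_to_hsv(r8 / 255.0, g8 / 255.0, b8 / 255.0)
    theta = 2.0 * math.pi * h                   # hue as an angle in [0, 2*pi)
    u = (math.cos(theta), math.sin(theta))      # unit azimuth vector
    return s, u, v

# Pure red sits at hue angle 0 with full saturation and value.
radius, u, height = rgb_to_cylindrical(255, 0, 0)
```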


Table 3: Comparison of the performance of the four methods on the iris dataset.

Attributes                              | cyk-means | Fixed cyk-means | k-means | Kernel k-means
Sepal length, sepal width, petal length | 0.5543    | 0.6961          | 0.6569  | 0.6614
Sepal width, petal length, petal width  | 0.5627    | 0.7900          | 0.7462  | 0.7590
Petal length, petal width, sepal length | 0.4179    | 0.7114          | 0.7302  | 0.7359
Petal width, sepal length, sepal width  | 0.5171    | 0.6121          | 0.6104  | 0.6099

ARI values are the mean of 200 runs with random initial values. The best scores are in bold. "Attributes" indicates the three attributes used in clustering.

Table 4: Comparison of the performance of the four methods on the segmentation benchmark dataset.

Number         | cyk-means | Fixed cyk-means | k-means (HSV) | k-means (RGB)
12003 (K = 4)  | 0.6397    | 0.3744          | 0.3695        | 0.2277
69020 (K = 2)  | 0.1611    | 0.5247          | 0.07680       | 0.1406
80099 (K = 2)  | 0.9205    | 0.7960          | 0.6544        | 0.04890
103029 (K = 3) | 0.4697    | 0.5934          | 0.1794        | 0.09143
106047 (K = 2) | 0.3210    | 0.2988          | 0.04155       | 0.02332
310007 (K = 7) | 0.1548    | 0.3338          | 0.3075        | 0.3396
353013 (K = 4) | 0.6125    | 0.5993          | 0.5020        | 0.4683

ARI values are the mean of 200 runs with random initial values. The best scores are in bold. "Number" indicates the image number; "k-means (HSV)" and "k-means (RGB)" indicate that the k-means clusters the HSV and RGB color images, respectively.

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results of the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers with the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and to part of the green peppers.

Image 299091 consists of sky with cloud, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white cloud into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful: it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three; thus, the cyk-means produces a dead unit. This happens because the concentration parameter and the variances become small and large, respectively, when the data distribution visually consists of only a few clusters. As a result, a few clusters absorb all data points, and dead units (empty clusters) appear even if we set the number of clusters to a large value. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.

Furthermore, the initial parameters κ and σ of the fixed cyk-means can control the quantization results. Figure 4 shows the quantization results generated by the fixed cyk-means with different parameters. The original image in Figure 4 consists of two objects: a red fish and the arms of an anemone. The fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1 cannot extract the red fish, as shown in the middle image of Figure 4. However, the fixed cyk-means with κ = 50, σ_r = 0.5, and σ_z = 0.5 extracts the red fish, as shown in the right image of Figure 4. This is because a large κ and/or large variances relatively increase the weight of the cosine similarity term of (12), so the clustering focuses more on the hue element.

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. The quantization results of the fixed cyk-means differ from those generated by the k-means because the Euclidean metric cannot account for the hue.

5. Conclusion and Discussion

In this study, we develop the cyk-means and the fixed cyk-means, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results on real world data show that the fixed cyk-means performs as well as or better than the k-means and


Figure 3: Quantization results. The first column contains the original images. The second, third, and fourth columns contain the quantization results generated by the cyk-means, the fixed cyk-means, and the k-means, respectively. All original images are clustered with K = 3.

the kernel k-means. In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

The experiments partitioning synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good estimates of the parameters and a good clustering of the data. The results of the second experiment show that the cyk-means performs best when clustering synthetic data. However, in the experiments using real world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or a small cluster containing few data points. Thus, the cyk-means may not be appropriate for practical applications.

The fixed cyk-means will be an effective method for practical applications. The fixed cyk-means is stable and performs well when applied to clustering of synthetic data and real world data and to color image quantization. Furthermore, the fixed cyk-means rarely produces dead units because it has fewer parameters than the cyk-means. The fixed cyk-means also requires less computational time than the cyk-means while producing similar results.

In future work, we will improve the performance of the proposed methods. Like the k-means, the proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization.


Figure 4: Quantization results of the fixed cyk-means with different parameters. The left image is the original. The middle image is quantized with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the right image with κ = 50, σ_r = 0.5, and σ_z = 0.5; both use K = 3.

The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves the clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28–30]; it inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit to the input and to unlearn the weights of the second winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve the performance and reduce the effect of these intrinsic problems.
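The k-means++ seeding step referenced above (so-called D² sampling) can be sketched as follows. This is our sketch of Arthur and Vassilvitskii's seeding procedure [27], not code from the paper:

```python
import numpy as np

def kmeans_pp_seeds(X, K, rng):
    """k-means++ seeding: choose each new center with probability
    proportional to the squared distance to the nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]            # first center: uniform at random
    for _ in range(K - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                      # D^2 sampling weights
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# Three duplicated, well-separated locations: the seeds must hit all three,
# because already-covered points have zero sampling probability.
X = np.array([[0., 0.], [0., 0.], [10., 10.], [10., 10.], [20., 0.], [20., 0.]])
centers = kmeans_pp_seeds(X, 3, np.random.default_rng(0))
```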

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing the algorithm.

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.
[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943–954, 2014.
[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.
[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.
[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 19–28, New York, NY, USA, August 2003.
[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.
[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.
[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849–856, MIT Press, Cambridge, Mass, USA, 2001.
[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2002.
[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144–152, New York, NY, USA, July 1992.
[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800–810, 2001.
[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205–215, 1998.
[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92–105, 2008.
[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.
[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260–271, 2011.
[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785–1791, 2002.
[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8–16, 2015.
[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in Vision, Image and Signal Processing, IEE Proceedings, vol. 151, pp. 511–524.
[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237–249, 2005.
[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1–19, 2014.
[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261–269, 2005.
[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.
[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper 'Principal component analysis for clustering gene expression data'," Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.
[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), July 2013.
[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.
[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.
[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of the IEEE International Conference on Neural Networks (ICNN), pp. 117–124, vol. 1, San Diego, CA, USA, 1993.
[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 531–540, December 2010.
[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79–104, 2012.
[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.
[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565–574, 2003.



proposed method to color image quantization and demonstrate that it can quantize a color image according to the hue.

2. Related Works

The most commonly used clustering method is the k-means [3], which is one of the top 10 most common algorithms used in data mining [4]. The k-means has been applied to various fields because it is fast, simple, and easy to understand. It uses the Euclidean distance as a clustering criterion and assumes that the data are sampled from a mixture of isotropic Gaussian distributions. Thus, we can apply the k-means to data sampled from a mixture of isotropic Gaussian distributions, but the k-means is not appropriate for data generated from other distributions. Data in cylindrical coordinates have periodic characteristics, so the k-means will be inappropriate as a clustering method for such data.

We can cluster periodic data distributed on an n-dimensional sphere surface using the spherical k-means (sk-means). Dhillon and Modha [5] and Banerjee et al. [6, 7] developed the sk-means for clustering high dimensional text data. It is a k-means based method that uses the cosine similarity as the criterion for clustering. The sk-means assumes that the data are sampled from a mixture of von Mises-Fisher distributions with the same concentration parameters and the same mixture weights. However, we cannot apply the sk-means to data that have direction, radius, and height. To appropriately partition these data, we need a different nonlinear separation method.

There are many methods for achieving nonlinear separation. One method is the kernel k-means [8], which partitions the data points in a higher-dimensional feature space after they are mapped to the feature space using a nonlinear function. Spectral clustering [9] is another popular modern nonlinear clustering method, which uses the eigenvectors of a similarity (kernel) matrix to partition data points. Support vector clustering [10] is inspired by the support vector machine [11]. These nonlinear clustering methods based on the kernel method can provide reasonable clustering results for non-Gaussian data. However, they can hardly provide meaningful statistics because they perform the clustering in a feature space. This is a problem when we also want to determine some features of the data, as in color image quantization. Furthermore, we must experimentally select the optimal kernel function and its parameters.

Clustering methods are frequently used for color image quantization. Color image quantization reduces the number of colors in an image and plays an important role in applications such as image segmentation [12], image compression [13], and color feature extraction [14]. A color quantization technique consists of two stages: the palette design stage and the pixel mapping stage. These stages can be regarded, respectively, as calculating the centroids and assigning a data point to a cluster. Many researchers have developed color quantization methods, including median cut [15], the k-means [16], the fuzzy c-means [17, 18], self-organizing maps [19–21], and particle swarm optimization [22]. However, color quantization is generally performed in the RGB color space; the HSV color space is rarely adopted.

3. Methodology

3.1. Assumed Probabilistic Distribution. A data point in cylindrical coordinates x is represented by x = (r, u, z) with ‖u‖ = 1 and r ≥ 0, where r, u, and z are called the radius, azimuth, and height, respectively. In this study, we represent the azimuth as a unit vector u = (u_x, u_y) to simplify the calculation of the cosine similarity. Each element of x is assumed to be independent and identically distributed. Let a data point in cylindrical coordinates x = (r, u, z) be generated by a probability density function (pdf) of the form

$$p(\mathbf{x} \mid \theta) = \mathrm{vM}(\mathbf{u} \mid \boldsymbol{\mu}_u, \kappa)\, N(r \mid \mu_r, \sigma_r^2)\, N(z \mid \mu_z, \sigma_z^2) \tag{1}$$

where θ = {μ_u, μ_r, μ_z, κ, σ_r², σ_z²}, and vM(·) and N(·) are the pdfs of a von Mises distribution and an isotropic Gaussian distribution, respectively. The pdf of a von Mises distribution vM(·) has the form

$$\mathrm{vM}(\mathbf{u} \mid \boldsymbol{\mu}_u, \kappa) = \frac{1}{2\pi I_0(\kappa)} \exp\left(\kappa \boldsymbol{\mu}_u^{\mathrm{T}} \mathbf{u}\right) \tag{2}$$

where μ_u is the mean of the azimuth with ‖μ_u‖ = 1, κ is the concentration parameter, and I₀(·) is the modified Bessel function of the first kind of order 0. The pdf of an isotropic Gaussian distribution N(·) has the form

$$N(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) \tag{3}$$

where μ is the mean and σ² is the variance; μ_r and μ_z are the means of the radius and height, respectively, and σ_r² and σ_z² are the variances of the radius and height, respectively. Thus, the density p(x | θ) can be written as

$$p(\mathbf{x} \mid \theta) = \frac{1}{4\pi^2 \sigma_r \sigma_z I_0(\kappa)} \exp\left(\kappa \boldsymbol{\mu}_u^{\mathrm{T}} \mathbf{u} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(z - \mu_z)^2}{2\sigma_z^2}\right) \tag{4}$$
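As a sanity check, the log of the density (4) can be evaluated directly. The sketch below uses NumPy's `np.i0` for the modified Bessel function I₀; the tuple layout of θ and the function name are our choices:

```python
import numpy as np

def log_density(x, theta):
    """log p(x | theta) of (4), for x = (r, u, z) with u a 2-vector, ||u|| = 1.
    theta = (mu_u, mu_r, mu_z, kappa, var_r, var_z)."""
    r, u, z = x
    mu_u, mu_r, mu_z, kappa, var_r, var_z = theta
    log_norm = -np.log(4 * np.pi**2 * np.sqrt(var_r * var_z) * np.i0(kappa))
    return (log_norm + kappa * np.dot(mu_u, u)
            - (r - mu_r)**2 / (2 * var_r)
            - (z - mu_z)**2 / (2 * var_z))

# The density is maximal at the mode (r, u, z) = (mu_r, mu_u, mu_z).
theta = ((1.0, 0.0), 2.0, 0.5, 5.0, 0.04, 0.04)
at_mode = log_density((2.0, np.array([1.0, 0.0]), 0.5), theta)
off_mode = log_density((2.3, np.array([0.0, 1.0]), 0.8), theta)
```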

We can estimate the parameters of the density p(x | θ) using maximum likelihood estimation. Let the data set X = {x₁, ..., x_N} be generated from the density p(X | θ). The log likelihood function of p(X | θ) is

$$\ln p(X \mid \theta) = \sum_{i=1}^{N} \ln p(\mathbf{x}_i \mid \theta) = \sum_{i=1}^{N} \left( -\ln\left(4\pi^2 \sigma_r \sigma_z I_0(\kappa)\right) + \kappa \boldsymbol{\mu}_u^{\mathrm{T}} \mathbf{u}_i - \frac{(r_i - \mu_r)^2}{2\sigma_r^2} - \frac{(z_i - \mu_z)^2}{2\sigma_z^2} \right) \tag{5}$$


Maximizing this equation subject to μ_u^T μ_u = 1, we find that the maximum likelihood estimates μ̂_u, μ̂_r, μ̂_z, A(κ̂), σ̂_r, and σ̂_z are obtained from

$$\hat{\boldsymbol{\mu}}_u = \frac{\sum_{i=1}^{N} \mathbf{u}_i}{\left\| \sum_{i=1}^{N} \mathbf{u}_i \right\|}, \qquad A(\hat{\kappa}) = \bar{R},$$
$$\hat{\mu}_r = \frac{1}{N} \sum_{i=1}^{N} r_i, \qquad \hat{\sigma}_r^2 = \frac{1}{N} \sum_{i=1}^{N} (r_i - \hat{\mu}_r)^2,$$
$$\hat{\mu}_z = \frac{1}{N} \sum_{i=1}^{N} z_i, \qquad \hat{\sigma}_z^2 = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu}_z)^2 \tag{6}$$

where A(κ) is

$$A(\kappa) = \frac{I_0'(\kappa)}{I_0(\kappa)} = \frac{\left\| \sum_{i=1}^{N} \mathbf{u}_i \right\|}{N} = \bar{R} \tag{7}$$

It is difficult to estimate the concentration parameter κ because an analytic solution cannot be obtained by maximum likelihood estimation; we can only calculate the ratio of the Bessel functions. We approximate κ using the numerical method proposed by Sra [23] because it produces the most accurate estimates of κ compared with other methods. We estimate κ using the recursion

$$\kappa_n = \kappa_{n-1} - \frac{A_p(\kappa_{n-1}) - \bar{R}}{1 - A_p(\kappa_{n-1})^2 - \dfrac{p-1}{\kappa_{n-1}} A_p(\kappa_{n-1})} \tag{8}$$

where n is the iteration number. The recursive calculation terminates when |κ_n − κ_{n−1}| < ε_κ; in this study, ε_κ = 0.001. We calculate the initial value κ₀ using the method proposed by Banerjee et al. [6]:

$$\kappa_0 = \frac{\bar{R}(2 - \bar{R}^2)}{1 - \bar{R}^2} \tag{9}$$
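The estimation procedure of (8) and (9) can be sketched as follows, for the circular case p = 2, where A₂(κ) = I₁(κ)/I₀(κ). The series-based Bessel evaluation is our simplification and is adequate for moderate κ:

```python
from math import factorial

def bessel_i(nu, x, terms=60):
    """Modified Bessel function I_nu(x), integer nu, via its power series."""
    return sum((x / 2.0) ** (2 * k + nu) / (factorial(k) * factorial(k + nu))
               for k in range(terms))

def estimate_kappa(r_bar, tol=1e-3):
    """Estimate the von Mises concentration from the mean resultant length
    r_bar, using the closed-form start (9) and the Newton iteration (8)."""
    kappa = r_bar * (2 - r_bar**2) / (1 - r_bar**2)    # initial guess (9)
    while True:
        a = bessel_i(1, kappa) / bessel_i(0, kappa)    # A_2(kappa) = I1/I0
        kappa_new = kappa - (a - r_bar) / (1 - a**2 - a / kappa)
        if abs(kappa_new - kappa) < tol:
            return kappa_new
        kappa = kappa_new
```

For a quick check, feeding in the mean resultant length implied by a known κ should recover that κ.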

3.2. Cylindrical k-Means. The k-means family uses a particular similarity to decide whether a data point belongs to a cluster. The Euclidean distance (a dissimilarity) is most frequently used by the k-means family and, moreover, is derived from the log likelihood of an isotropic Gaussian distribution. Therefore, the k-means using the Euclidean distance can appropriately partition data sampled from isotropic Gaussian distributions but not from other distributions. We must develop a new similarity for data in cylindrical coordinates because the k-means family clusters by maximizing the sum of similarities between the centroid of a cluster and the data points that belong to the cluster. In this study, we obtain the optimal similarity for partitioning data in cylindrical coordinates from an assumed pdf.

First, to develop a k-means based method for data in cylindrical coordinates (the cylindrical k-means, cyk-means), we obtain a new similarity measure for data in cylindrical coordinates by assuming a probability distribution. Assume that a data point x in a cluster with centroid m = (μ_r, μ_u, μ_z) is sampled from the probability distribution p(x | θ) given by (4), where θ = (m, κ, σ_r, σ_z). The natural logarithm of p(x | θ) is

$$\ln p(\mathbf{x} \mid \theta) = \kappa \boldsymbol{\mu}_u^{\mathrm{T}} \mathbf{u} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(z - \mu_z)^2}{2\sigma_z^2} + \ln C \tag{10}$$

where C is a normalizing constant given by

$$C = \frac{1}{4\pi^2 \sigma_r \sigma_z I_0(\kappa)} \tag{11}$$

Here, we ignore the normalizing constant ln C to obtain

$$S(\mathbf{x}, \mathbf{m}) = \kappa \boldsymbol{\mu}_u^{\mathrm{T}} \mathbf{u} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(z - \mu_z)^2}{2\sigma_z^2} \tag{12}$$
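The similarity (12) and the resulting assignment rule can be sketched as follows; the function names and argument layout are ours:

```python
import numpy as np

def similarity(x, m, kappa, var_r, var_z):
    """S(x, m) of (12): weighted cosine similarity on the azimuth plus
    Gaussian log-likelihood terms for radius and height."""
    r, u, z = x
    mu_r, mu_u, mu_z = m          # centroid m = (mu_r, mu_u, mu_z)
    return (kappa * np.dot(mu_u, u)
            - (r - mu_r)**2 / (2 * var_r)
            - (z - mu_z)**2 / (2 * var_z))

def assign(x, centroids, kappas, vars_r, vars_z):
    """Assignment step: index of the most similar centroid."""
    scores = [similarity(x, m, k, vr, vz)
              for m, k, vr, vz in zip(centroids, kappas, vars_r, vars_z)]
    return int(np.argmax(scores))

# A point at radius 1, azimuth 0, height 0 matches the first centroid.
x = (1.0, np.array([1.0, 0.0]), 0.0)
centroids = [(1.0, np.array([1.0, 0.0]), 0.0),
             (3.0, np.array([-1.0, 0.0]), 1.0)]
j = assign(x, centroids, [5.0, 5.0], [0.01, 0.01], [0.01, 0.01])
```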

In this study, this equation is used as the similarity for the cyk-means; S(x, m) denotes the similarity between the data point x and the centroid m. The terms of (12) consist of the cosine similarity and the Euclidean similarities, and the new similarity is a weighted sum of these similarities, where the weights indicate the concentrations of the distributions. This similarity can also be considered a simplified log likelihood.

The cyk-means partitions data points in cylindrical coordinates into K clusters using the same procedure as the k-means. Let X = {x₁, ..., x_N} be a set of data points in cylindrical coordinates, and let m_j = (μ_rj, μ_uj, μ_zj) be the centroid of the jth cluster. Using the similarity S(x_i, m_j), the objective function J is

$$J = \sum_{i=1}^{N} \sum_{j=1}^{K} \gamma_{ij} S(\mathbf{x}_i, \mathbf{m}_j) = \sum_{i=1}^{N} \sum_{j=1}^{K} \gamma_{ij} \left( \kappa_j \boldsymbol{\mu}_{uj}^{\mathrm{T}} \mathbf{u}_i - \frac{(r_i - \mu_{rj})^2}{2\sigma_{rj}^2} - \frac{(z_i - \mu_{zj})^2}{2\sigma_{zj}^2} \right) \tag{13}$$

where γ_ij is a binary indicator variable: if the ith data point belongs to the jth cluster, γ_ij = 1; otherwise, γ_ij = 0. The aim of the cyk-means is to maximize the objective function J. The process for maximizing the objective function is the same as that of the k-means and is described as follows:

(1) Fix K and initialize m_j.
(2) Assign each data point to the cluster with the most similar centroid.
(3) Estimate the parameters of the clusters.
(4) Return to Step (2) if the cluster assignment of any data point changes or if the difference between the values of the objective function in the current and last iterations is more than a threshold ε_J; otherwise, terminate the procedure.


Input: Set X of data points in cylindrical coordinates
Output: A clustering of X
(1)  Initialize θ_j, j = 1, ..., K
(2)  repeat
(3)    Set γ_ij = 0, i = 1, ..., N, j = 1, ..., K
(4)    // Assign data points to clusters
(5)    for i = 1 to N do
(6)      j ← argmax_{j'} ( κ_{j'} μ_{uj'}^T u_i − (r_i − μ_{rj'})² / (2σ_{rj'}²) − (z_i − μ_{zj'})² / (2σ_{zj'}²) )
(7)      γ_ij = 1
(8)    end for
(9)    // Estimate parameters
(10)   for j = 1 to K do
(11)     N_j = Σ_{i=1}^N γ_ij
(12)     μ_rj = (1/N_j) Σ_{i=1}^N γ_ij r_i
(13)     μ_uj = Σ_{i=1}^N γ_ij u_i / ‖Σ_{i=1}^N γ_ij u_i‖
(14)     μ_zj = (1/N_j) Σ_{i=1}^N γ_ij z_i
(15)     σ_rj² = (1/N_j) Σ_{i=1}^N γ_ij (r_i − μ_rj)²
(16)     σ_zj² = (1/N_j) Σ_{i=1}^N γ_ij (z_i − μ_zj)²
(17)     R̄_j = ‖Σ_{i=1}^N γ_ij u_i‖ / N_j
(18)     κ_j = R̄_j (2 − R̄_j²) / (1 − R̄_j²)
(19)     repeat
(20)       κ_j ← κ_j − (A_p(κ_j) − R̄_j) / (1 − A_p(κ_j)² − ((p − 1)/κ_j) A_p(κ_j))
(21)     until convergence
(22)   end for
(23) until convergence

Algorithm 1: cyk-means.

In this study, we use ε_J = |0.001 × J_n|, where J_n is the objective function of the n-th iteration. Algorithm 1 shows the details of the cyk-means algorithm. From (6), the elements of the centroid vector m_j = (μ_rj, μ_uj, μ_zj) of the j-th cluster are

\mu_{r_j} = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} r_i, \qquad
\mu_{u_j} = \frac{\sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i}{\left\| \sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i \right\|}, \qquad
\mu_{z_j} = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} z_i, \quad (14)

where N_j is the number of data points in the j-th cluster (N_j = Σ_{i=1}^{N} γ_ij). The other values used to calculate the objective function are

R_j = \frac{\left\| \sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i \right\|}{N_j}, \qquad
\sigma_{r_j}^2 = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} (r_i - \mu_{r_j})^2, \qquad
\sigma_{z_j}^2 = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} (z_i - \mu_{z_j})^2. \quad (15)

The concentration parameter κ_j is approximated by Sra's method using the ratio of the Bessel functions, R_j.
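A runnable sketch of this estimation (function names are ours; `scipy.special.iv` provides the modified Bessel function of the first kind, and p = 2 because the azimuth is a unit vector in the plane). It starts from the closed-form guess of Eq. (9) and refines it with the Newton-type iteration of Eq. (8):

```python
import numpy as np
from scipy.special import iv

def A_p(kappa, p=2):
    """Bessel-function ratio A_p(kappa) = I_{p/2}(kappa) / I_{p/2-1}(kappa)."""
    return iv(p / 2, kappa) / iv(p / 2 - 1, kappa)

def estimate_kappa(R_bar, p=2, eps_kappa=1e-3, max_iter=100):
    """Approximate the concentration parameter from the mean resultant
    length R_bar, following Banerjee et al. (init) and Sra (iteration)."""
    kappa = R_bar * (p - R_bar ** 2) / (1 - R_bar ** 2)  # Eq. (9) for p = 2
    for _ in range(max_iter):
        A = A_p(kappa, p)
        kappa_next = kappa - (A - R_bar) / (1 - A ** 2 - ((p - 1) / kappa) * A)
        if abs(kappa_next - kappa) < eps_kappa:  # Eq. (8) stopping rule
            return kappa_next
        kappa = kappa_next
    return kappa
```

For example, feeding back R̄ = A_2(κ) for a known κ recovers that κ to within the stopping tolerance.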


Input: Set X of data points in cylindrical coordinates
Output: A clustering of X
(1) Initialize θ_j, j = 1, …, K
(2) repeat
(3)   Set γ_ij = 0, i = 1, …, N, j = 1, …, K
(4)   Assign data points to clusters:
(5)   for i = 1 to N do
(6)     j ← argmax_{j′} (κ_{j′} μ_{u_{j′}}^T u_i − (r_i − μ_{r_{j′}})² / (2σ²_{r_{j′}}) − (z_i − μ_{z_{j′}})² / (2σ²_{z_{j′}}))
(7)     γ_ij = 1
(8)   end for
(9)   Estimate parameters:
(10)  for j = 1 to K do
(11)    N_j = Σ_{i=1}^{N} γ_ij
(12)    μ_rj = (1/N_j) Σ_{i=1}^{N} γ_ij r_i
(13)    μ_uj = Σ_{i=1}^{N} γ_ij u_i / ‖Σ_{i=1}^{N} γ_ij u_i‖
(14)    μ_zj = (1/N_j) Σ_{i=1}^{N} γ_ij z_i
(15)  end for
(16) until convergence

Algorithm 2: Fixed cyk-means.

The cyk-means method has many parameters. The k-means method for data in three-dimensional Cartesian coordinates has only 3K parameters, the product of the number of centroid vectors and the number of dimensions. However, the cyk-means has 7K parameters, the product of the number of clusters and the number of parameters per cluster. The parameters of the j-th cluster are μ_rj, μ_uj (two dimensions), μ_zj, σ_rj, κ_j, and σ_zj. Because the cyk-means has more degrees of freedom, the dead unit problem (i.e., empty clusters) will frequently occur if the initialization is not optimal.

3.3. Fixed cyk-Means. Model-based clustering methods have various problems, such as dead units and sensitivity to initial values. One reason for this is that the log likelihood equation can have many local optima [9]. If a model has more parameters, these problems tend to be more frequent. In the fixed cyk-means, the concentration parameter κ and the variances σ² are fixed at particular values. As a consequence, the fixed cyk-means has 4K parameters. Fixing the parameters decreases the complexity of the model and makes these problems less frequent. Algorithm 2 shows the fixed cyk-means algorithm.
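A compact Python sketch of Algorithm 2 (the array layout, the initialization scheme, and the function name are our own assumptions, not from the paper): κ, σ_r, and σ_z stay fixed, so only the 4K mean parameters are re-estimated.

```python
import numpy as np

def fixed_cyk_means(theta, r, z, K, kappa, sigma_r, sigma_z,
                    n_iter=100, init=None, seed=0):
    """Sketch of the fixed cyk-means: alternate assignment by the similarity
    of Eq. (12) with re-estimation of the mean direction, radius, and height."""
    theta = np.asarray(theta)
    r, z = np.asarray(r, float), np.asarray(z, float)
    u = np.column_stack([np.cos(theta), np.sin(theta)])  # azimuths as unit vectors
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(theta), K, replace=False) if init is None else np.asarray(init)
    mu_u, mu_r, mu_z = u[idx].copy(), r[idx].copy(), z[idx].copy()
    labels = np.full(len(theta), -1)
    for _ in range(n_iter):
        # assignment step: maximize the similarity of Eq. (12)
        S = (kappa * u @ mu_u.T
             - (r[:, None] - mu_r[None, :]) ** 2 / (2.0 * sigma_r ** 2)
             - (z[:, None] - mu_z[None, :]) ** 2 / (2.0 * sigma_z ** 2))
        new_labels = S.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
        # update step: re-estimate only the 4K mean parameters (Algorithm 2)
        for j in range(K):
            members = labels == j
            if not members.any():
                continue  # dead unit: leave its centroid untouched
            s = u[members].sum(axis=0)
            mu_u[j] = s / np.linalg.norm(s)
            mu_r[j] = r[members].mean()
            mu_z[j] = z[members].mean()
    return labels, mu_u, mu_r, mu_z
```

The dead-unit branch simply leaves an empty cluster's centroid where it is, mirroring the fact that Algorithm 2 has no re-seeding step.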

3.4. Computational Complexity. Assigning data points to clusters has a complexity of O(KN) per iteration. We must estimate six parameters per cluster: three μs, two σ²s, and κ, obtained in O(6KN + K t_{κmax}) time per iteration, where t_{κmax} is the convergence time of the κ iteration. Therefore, the total computational complexity per iteration is O(7KN + K t_{κmax}). The complexity of the fixed cyk-means is O(5KN) per iteration, so the cyk-means is approximately 1.5 times as complex as the fixed cyk-means.

4. Experimental Results

In our experiments, we use Python and its libraries (NumPy, SciPy, and scikit-learn) to implement the proposed methods.

4.1. Synthetic Data. In this subsection, we demonstrate that the cyk-means and the fixed cyk-means can partition synthetic data defined in cylindrical coordinates. The dataset used in this experiment has three clusters, as shown in Figure 1(a). The data points in each cluster are generated from the probability distribution denoted by (4) with the parameters shown in Table 1. Figures 1(b), 1(c), and 1(d) show the clustering results of the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means, respectively. The cyk-means and the fixed cyk-means properly partition the dataset into its clusters. On the other hand, the k-means regards the two upper-right clusters as one cluster and unsuccessfully partitions the dataset. Table 2 shows the parameters estimated by the cyk-means, the fixed cyk-means, and the k-means. Only the cyk-means estimates the concentration parameters and the variances, and the values it estimates are close to the true values. The cyk-means most accurately estimates the number of data points in each cluster. The fixed cyk-means most accurately estimates all the means, and the cyk-means also estimates them closely. These results show that both the cyk-means and the fixed cyk-means estimate all the means with sufficient accuracy.

Figure 1: (a) Scatter plot of the synthetic dataset including three clusters. (b), (c), (d) Clustering results of the cyk-means, the fixed cyk-means, and the k-means, respectively. The three clusters are shown with circles, triangles, and crosses.

Table 1: Parameters of the dataset.

j   N_j   μ_θj   κ_j   μ_rj   σ_rj   μ_zj   σ_zj
1   200    0      8    1      0.2     0.5   0.1
2   300   −2      4    1.5    0.15   −0.5   0.2
3   400    2     12    0.5    0.1     1     0.2

j is a cluster number; μ_θj is the azimuth of the centroid of the j-th cluster, μ_θj = arctan(u_yj / u_xj).
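The synthetic clusters of Table 1 can be sampled directly from the model of Eq. (4); a sketch (the helper name is ours) using NumPy's von Mises sampler:

```python
import numpy as np

def sample_cluster(n, mu_theta, kappa, mu_r, sigma_r, mu_z, sigma_z, rng):
    """Draw n points (theta, r, z): a von Mises azimuth with independent
    Gaussian radius and height, as in Eq. (4)."""
    theta = rng.vonmises(mu_theta, kappa, n)
    r = rng.normal(mu_r, sigma_r, n)
    z = rng.normal(mu_z, sigma_z, n)
    return np.column_stack([theta, r, z])

rng = np.random.default_rng(0)
# the three clusters of Table 1: (N_j, mu_theta_j, kappa_j, mu_r_j, sigma_r_j, mu_z_j, sigma_z_j)
params = [(200, 0.0, 8.0, 1.0, 0.2, 0.5, 0.1),
          (300, -2.0, 4.0, 1.5, 0.15, -0.5, 0.2),
          (400, 2.0, 12.0, 0.5, 0.1, 1.0, 0.2)]
data = np.vstack([sample_cluster(*p, rng) for p in params])  # shape (900, 3)
```

`Generator.vonmises` returns azimuths in [−π, π], which matches the periodic first coordinate of the cylindrical representation.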

In the next experiment, we examine the effectiveness of the proposed methods (the cyk-means and the fixed cyk-means with κ = 15, σ_r = 0.1, and σ_z = 0.1) compared to the k-means and the kernel k-means with a radial basis function. The parameter of the radial basis function is γ = 0.1. The synthetic data have K clusters and are defined in cylindrical coordinates. The number of data points in each cluster is 200. The mean azimuth of the j-th cluster, arctan(u_y/u_x), is a random number in [0, 2π]. The concentration parameter κ_j is a random number in [5, 30]. The mean radius of the j-th cluster, μ_rj, is a random number in [1, 4]. The mean height of the j-th cluster, μ_zj, is a random number in [−1.5, 1.5]. The standard deviations σ_rj and σ_zj are random numbers in [0.05, 0.2].

Figure 2 shows the relationship between the number of clusters and the adjusted Rand index (ARI). The ARI evaluates the performance of clustering algorithms [24]; when ARI = 1, all data points belong to their true clusters. The figure shows that the cyk-means has the largest ARI in almost all cases. The fixed cyk-means performs better than the kernel k-means and the k-means, and the k-means performs the worst. In conclusion, the cyk-means most accurately partitions synthetic data defined in cylindrical coordinates, and the fixed cyk-means also performs well.
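The ARI can be computed with scikit-learn, which the experiments already use; note that it is invariant to a permutation of cluster labels, so only the partition matters:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 2]
pred_labels = [1, 1, 1, 0, 0, 2]  # same partition, cluster ids permuted
print(adjusted_rand_score(true_labels, pred_labels))  # 1.0
```

A perfect score of 1 therefore means the predicted clustering reproduces the true grouping exactly, regardless of which integer names each cluster.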


Table 2: Parameters estimated by the cyk-means, the fixed cyk-means, and the k-means.

Method            j   N_j      μ_θj          κ_j     μ_rj     σ_rj     μ_zj      σ_zj
cyk-means         1   200      −1.960×10⁻³   7.523   0.9746   0.1942   0.5065    0.09417
                  2   300      −1.997        3.979   1.495    0.1492   −0.5070   0.1866
                  3   400      1.986         12.69   0.4966   0.1004   1.005     0.1997
Fixed cyk-means   1   200.2    −5.124×10⁻⁵   —       1.001    —        0.4995    —
                  2   299.9    −2.000        —       1.501    —        −0.5056   —
                  3   400      1.996         —       0.4998   —        0.9995    —
k-means           1   184.85   −0.6627       —       1.091    —        0.1871    —
                  2   253.65   −1.974        —       1.358    —        −0.4974   —
                  3   461.5    1.704         —       0.4406   —        0.9477    —

The results are the mean of twenty runs with randomly generated initial values. j is a cluster number; μ_θj is the azimuth of the centroid of the j-th cluster, μ_θj = arctan(u_yj / u_xj). The best estimations are shown in bold in the original table.

Figure 2: Relationship between the number of clusters K and the adjusted Rand index (ARI) for the cyk-means, the fixed cyk-means, the kernel k-means, and the k-means. The vertical and horizontal axes indicate the ARI and the number of clusters K, respectively. The results are the mean of 200 runs on randomly generated synthetic data for each K.

4.2. Real World Data. We show the performance of the proposed methods on the iris dataset (http://mlearn.ics.uci.edu/databases/iris) and the segmentation benchmark dataset (http://www.ntu.edu.sg/home/asjfcai/Benchmark_Website/benchmark_index.html) [25]. The iris dataset has 150 data points in three classes of irises. Each data point consists of four attributes: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). The segmentation benchmark dataset consists of 100 images from the Berkeley segmentation database [26] and ground truths generated by manual labeling.

Table 3 shows the ARI scores of the cyk-means, the fixed cyk-means, the k-means, and the kernel k-means for the iris dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1; γ of the radial basis function of the kernel k-means is 0.01. In this experiment, we use only three attributes of the iris dataset because the proposed methods are specialized for 3-dimensional data. Furthermore, we transform the three-attribute dataset into a zero-mean dataset. In all cases, the performance of the cyk-means is lower than that of the other methods. Conversely, in almost all cases, the performance of the fixed cyk-means is the best. However, the differences in performance between the fixed cyk-means, the k-means, and the kernel k-means are not large.

Table 4 shows the ARI scores of the cyk-means, the fixed cyk-means, and the k-means for seven images in the segmentation benchmark dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1. To evaluate the performance of the cyk-means and the fixed cyk-means, we convert the images from RGB color to HSV color. When we cluster the dataset with the k-means, we use images represented in both RGB color and HSV color. In this experiment, we compare each clustering result with a ground truth using the ARI score, and we set the number of clusters K to the number of segments in the ground truth. In all cases, the fixed cyk-means stably shows good performance. The cyk-means shows either much better or much worse performance than the other methods; in other words, its performance is unstable. This instability is likely caused by the cyk-means being more easily trapped in a local optimum because of its larger number of parameters.

4.3. Application to Color Image Quantization. We apply the cyk-means and the fixed cyk-means to color image quantization and compare the results to those of the k-means. Images quantized by the proposed methods are converted from RGB color space to HSV color space before quantization, whereas an image processed by the k-means is represented in RGB. Figure 3 contains the four test images from the Berkeley segmentation database [26] and their quantization results. The original color images have sizes of 481 × 321 or 321 × 481 and are quantized into three colors. The quantization results are generated by the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means. The color of a pixel in a quantized image is the value of the centroid of the cluster that contains the pixel.
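For illustration, an HSV pixel can be mapped to the cylindrical representation the proposed methods expect (hue → azimuth, saturation → radius, value → height); the helper name and exact layout here are our own assumptions:

```python
import numpy as np

def hsv_to_cylindrical(h, s, v):
    """Map HSV values (each in [0, 1]) to cylindrical coordinates: hue becomes
    the azimuth (stored as a 2D unit vector u so the cosine similarity of
    Eq. (12) applies), saturation the radius r, and value the height z."""
    theta = 2.0 * np.pi * np.asarray(h, float)
    u = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    return u, np.asarray(s, float), np.asarray(v, float)
```

In a full pipeline, an RGB image would first be converted with, e.g., `matplotlib.colors.rgb_to_hsv`, and the resulting (u, r, z) arrays clustered as in Section 3.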


Table 3: Comparison of the performance of the four methods on the iris dataset.

Attributes                                 cyk-means   Fixed cyk-means   k-means   Kernel k-means
Sepal length, sepal width, petal length    0.5543      0.6961            0.6569    0.6614
Sepal width, petal length, petal width     0.5627      0.7900            0.7462    0.7590
Petal length, petal width, sepal length    0.4179      0.7114            0.7302    0.7359
Petal width, sepal length, sepal width     0.5171      0.6121            0.6104    0.6099

The ARI values are the mean of 200 runs with random initial values. The best estimations are shown in bold in the original table. "Attributes" indicates the three attributes used in clustering.

Table 4: Comparison of the performance of the four methods on the segmentation benchmark dataset.

Number           cyk-means   Fixed cyk-means   k-means (HSV)   k-means (RGB)
12003 (K = 4)    0.6397      0.3744            0.3695          0.2277
69020 (K = 2)    0.1611      0.5247            0.07680         0.1406
80099 (K = 2)    0.9205      0.7960            0.6544          0.04890
103029 (K = 3)   0.4697      0.5934            0.1794          0.09143
106047 (K = 2)   0.3210      0.2988            0.04155         0.02332
310007 (K = 7)   0.1548      0.3338            0.3075          0.3396
353013 (K = 4)   0.6125      0.5993            0.5020          0.4683

The ARI values are the mean of 200 runs with random initial values. The best estimations are shown in bold in the original table. "Number" indicates the number of the image. "k-means (HSV)" and "k-means (RGB)" indicate that the k-means clusters HSV and RGB color images, respectively.

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results of the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers and the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and to part of the green peppers.

Image 299091 consists of some sky with cloud, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white cloud into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful: it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three; thus, the cyk-means produces a dead unit. This is because the concentration parameter becomes small and the variances become large when the distribution of data points visually consists of only a few clusters. As a result, a few clusters absorb all the data points, and dead units (empty clusters) appear even if we set the number of clusters to a large value. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.

Furthermore, the parameters κ and σ of the fixed cyk-means can control the quantization results. Figure 4 shows the quantization results generated by the fixed cyk-means with different parameters. The original image in Figure 4 contains two objects: a red fish and the arms of an anemone. The fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1 cannot extract the red fish, as shown in the middle image of Figure 4. However, the fixed cyk-means with κ = 50, σ_r = 0.5, and σ_z = 0.5 extracts the red fish, as shown in the right image of Figure 4. This is because a large κ and/or large variances relatively increase the weight of the cosine similarity term of (12), and consequently the clustering focuses more on the hue element.

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. Its quantization results differ from those generated by the k-means because the Euclidean metric cannot take the hue into account.

5. Conclusion and Discussion

In this study, we develop the cyk-means and the fixed cyk-means, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results on real world data show that the fixed cyk-means has equal or better performance than the k-means and the kernel k-means. In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

Figure 3: Quantization results. The first column contains the original images. The second, third, and fourth columns contain the quantization results generated by the cyk-means, the fixed cyk-means, and the k-means, respectively. All original images are clustered with K = 3.

The experiments partitioning synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good estimates of the parameters and good clusterings of the data. The results of the second experiment show that the cyk-means performs the best when clustering synthetic data. However, in the experiments using real world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or small clusters containing few data points. Thus, the cyk-means may not be appropriate for practical applications.

The fixed cyk-means will be an effective method for practical applications. It is stable and performs well when applied to the clustering of synthetic data and real world data and to color image quantization. Furthermore, the fixed cyk-means rarely produces dead units because it has fewer parameters than the cyk-means, and it requires less computational time than the cyk-means while producing similar results.

Figure 4: Quantization results of the fixed cyk-means with different parameters. The left image is the original; the middle and right images are quantized with K = 3, using κ = 25, σ_r = 0.1, and σ_z = 0.1 (middle) and κ = 50, σ_r = 0.5, and σ_z = 0.5 (right).

In future work, we will improve the performance of the proposed methods. The proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization, similar to the k-means. The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28–30]; it inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit to the input and to unlearn the weights of the second winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve their performance and reduce the effect of these intrinsic problems.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing the algorithm.

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.

[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943–954, 2014.

[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.

[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 19–28, New York, NY, USA, August 2003.

[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.

[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849–856, MIT Press, Cambridge, Mass, USA, 2001.

[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2002.

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144–152, New York, NY, USA, July 1992.

[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800–810, 2001.

[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205–215, 1998.

[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92–105, 2008.

[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.

[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260–271, 2011.

[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785–1791, 2002.

[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8–16, 2015.

[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in Proceedings of the Vision, Image and Signal Processing, IEE Proceedings, vol. 151, pp. 511–524.

[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237–249, 2005.

[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1–19, 2014.

[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261–269, 2005.

[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.

[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper 'Principal component analysis for clustering gene expression data'," Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.

[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), July 2013.

[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.

[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.

[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of the 1993 IEEE International Conference on Neural Networks (ICNN '93), vol. 1, pp. 117–124, San Diego, CA, USA, March 1993.

[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 531–540, December 2010.

[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79–104, 2012.

[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.

[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565–574, 2003.


Page 3: ResearchArticle A Clustering Method for Data in ...downloads.hindawi.com/journals/mpe/2017/3696850.pdf · ResearchArticle A Clustering Method for Data in Cylindrical Coordinates KazuhisaFujita1,2

Mathematical Problems in Engineering 3

Maximizing this equation subject to 120583119879119906120583119906 = 1 we find themaximum likelihood estimates 119906 120583119903 120583119911 119860(120581) 119903 and 119911obtained from

119906 = sum119873119894=1 u11989410038171003817100381710038171003817sum119873119894=1 u11989410038171003817100381710038171003817 119860 (120581) = 119877120583119903 = 1119873

119873sum119894=1

1199031198942119903 = 1119873

119873sum119894=1

(119903119894 minus 120583119903)2 120583119911 = 1119873

119873sum119894=1

1199111198942119911 = 1119873

119873sum119894=1

(119911119894 minus 120583119911)2

(6)

where 119860(120581) is119860 (120581) = 11986810158400 (120581)1198680 (120581) = 10038171003817100381710038171003817sum119873119894=1 u11989410038171003817100381710038171003817119873 = 119877 (7)

It is difficult to estimate the concentrate parameter 120581because an analytic solution cannot be obtained using themaximum likelihood estimate and we can only calculate theratio of the Bessel functions We approximate 120581 using thenumerical method proposed by Sra [23] because it producesthe most accurate estimates for 120581 (compared to other meth-ods)

We estimate 120581 using the recursive function

120581119899 = 120581119899minus1minus 119860119901 (120581119899minus1) minus 119877

1 minus 119860119901 (120581119899minus1)2 minus ((119901 minus 1) 120581119899minus1) 119860119901 (120581119899minus1) (8)

where 119899 is the iteration number The recursive calculationsterminate when |120581119899 minus 120581119899minus1| lt 120576120581 In this study 120576120581 = 0001 Wecalculate 1205810 using the method proposed by Banerjee et al [6]1205810 is

$$\kappa_0 = \frac{R(2 - R^2)}{1 - R^2}. \tag{9}$$
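Sra's recursion (8) with the Banerjee et al. initialization (9) is short to implement. The sketch below assumes $p = 2$ (the circular von Mises case) and uses SciPy's modified Bessel function `iv`; the function name `estimate_kappa` is ours:

```python
from scipy.special import iv  # modified Bessel function of the first kind

def estimate_kappa(R, p=2, eps=1e-3):
    """Newton iteration of Eq. (8) for the concentration parameter,
    initialized with the Banerjee et al. approximation of Eq. (9)."""
    def A_p(k):
        # ratio of Bessel functions; for p = 2 this is I_1(k) / I_0(k)
        return iv(p / 2, k) / iv(p / 2 - 1, k)
    kappa = R * (2 - R**2) / (1 - R**2)  # kappa_0, Eq. (9)
    while True:
        step = (A_p(kappa) - R) / (
            1 - A_p(kappa)**2 - ((p - 1) / kappa) * A_p(kappa))
        if abs(step) < eps:  # |kappa_n - kappa_{n-1}| < eps_kappa
            return kappa - step
        kappa -= step
```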

3.2. Cylindrical k-Means. The k-means family uses a particular similarity to decide whether a data point belongs to a cluster. The Euclidean distance (a dissimilarity) is most frequently used by the k-means family and, moreover, is derived from the log likelihood of an isotropic Gaussian distribution. Therefore, the k-means using the Euclidean distance will be able to appropriately partition data sampled from isotropic Gaussian distributions, but not from other distributions. We must develop a new similarity for data in cylindrical coordinates because the k-means family clusters by maximizing the sum of similarities between the centroid of a cluster and the data points that belong to the cluster. In this study, we obtain the optimal similarity for partitioning data in cylindrical coordinates from an assumed pdf.

First, to develop a k-means-based method for data in cylindrical coordinates (cylindrical k-means, cyk-means), we obtain a new similarity measure for data in cylindrical coordinates by assuming a probability distribution. Assume that a data point $\mathbf{x}$ in a cluster that has a centroid $\mathbf{m} = (\mu_r, \mu_u, \mu_z)$ is sampled from the probability distribution $p(\mathbf{x} \mid \theta)$ denoted by (4), where $\theta = (\mathbf{m}, \kappa, \sigma_r, \sigma_z)$. The natural logarithm of $p(\mathbf{x} \mid \theta)$ is

$$\ln p(\mathbf{x} \mid \theta) = \kappa \mu_u^T \mathbf{u} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(z - \mu_z)^2}{2\sigma_z^2} + \ln C, \tag{10}$$

where $C$ is a normalizing constant given by

$$C = \frac{1}{4\pi^2 \sigma_r \sigma_z I_0(\kappa)}. \tag{11}$$

Here, we ignore the normalizing constant $\ln C$ to obtain

$$S(\mathbf{x}, \mathbf{m}) = \kappa \mu_u^T \mathbf{u} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(z - \mu_z)^2}{2\sigma_z^2}. \tag{12}$$

In this study, this equation is used as the similarity for the cyk-means. $S(\mathbf{x}, \mathbf{m})$ denotes the similarity between the data point $\mathbf{x}$ and the centroid $\mathbf{m}$. The terms of (12) consist of the cosine similarity and the Euclidean similarities, and the new similarity is a weighted sum of these similarities. The weights indicate the concentrations of the distributions. This similarity can also be considered a simplified log likelihood.
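The similarity (12) is a one-liner in code. This is an illustrative sketch; the tuple layout of `x` and `m` is our assumption:

```python
import numpy as np

def similarity(x, m, kappa, var_r, var_z):
    """Similarity S(x, m) of Eq. (12): a kappa-weighted cosine similarity on
    the azimuth plus variance-weighted Euclidean similarities on r and z.
    x = (u, r, z) and m = (mu_u, mu_r, mu_z); u and mu_u are 2-D unit vectors."""
    u, r, z = x
    mu_u, mu_r, mu_z = m
    return (kappa * float(np.dot(mu_u, u))
            - (r - mu_r) ** 2 / (2 * var_r)
            - (z - mu_z) ** 2 / (2 * var_z))
```

A point lying exactly at the centroid attains the maximum value $\kappa$, since the cosine term is 1 and both quadratic penalties vanish.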

The cyk-means partitions data points in cylindrical coordinates into $K$ clusters using the same procedure as the k-means. Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ be a set of data points in cylindrical coordinates. Let $\mathbf{m}_j = (\mu_{rj}, \mu_{uj}, \mu_{zj})$ be the centroid of the $j$th cluster. Using the similarity $S(\mathbf{x}_i, \mathbf{m}_j)$, the objective function $J$ is

$$J = \sum_{i=1}^{N} \sum_{j=1}^{K} \gamma_{ij} S(\mathbf{x}_i, \mathbf{m}_j) = \sum_{i=1}^{N} \sum_{j=1}^{K} \gamma_{ij} \left( \kappa_j \mu_{uj}^T \mathbf{u}_i - \frac{(r_i - \mu_{rj})^2}{2\sigma_{rj}^2} - \frac{(z_i - \mu_{zj})^2}{2\sigma_{zj}^2} \right), \tag{13}$$

where $\gamma_{ij}$ is a binary indicator variable: if the $i$th data point belongs to the $j$th cluster, $\gamma_{ij} = 1$; otherwise, $\gamma_{ij} = 0$. The aim of the cyk-means is to maximize the objective function $J$. The process to maximize the objective function is the same as that of the k-means and is described as follows:

(1) Fix $K$ and initialize $\mathbf{m}_j$.
(2) Assign each data point to the cluster that has the most similar centroid.
(3) Estimate the parameters of the clusters.
(4) Return to Step (2) if the cluster assignment of data points changes, or if the difference in the values of the objective function between the current and last iterations is more than a threshold $\varepsilon_J$. Otherwise, terminate the procedure.


Input: Set $X$ of data points in cylindrical coordinates
Output: A clustering of $X$
(1) Initialize $\theta_j$, $j = 1, \ldots, K$
(2) repeat
(3)   Set $\gamma_{ij} = 0$, $i = 1, \ldots, N$, $j = 1, \ldots, K$
(4)   // Assign data points to clusters
(5)   for $i = 1$ to $N$ do
(6)     $j \leftarrow \arg\max_{j'} \left( \kappa_{j'} \mu_{uj'}^T \mathbf{u}_i - \frac{(r_i - \mu_{rj'})^2}{2\sigma_{rj'}^2} - \frac{(z_i - \mu_{zj'})^2}{2\sigma_{zj'}^2} \right)$
(7)     $\gamma_{ij} = 1$
(8)   end for
(9)   // Estimate parameters
(10)  for $j = 1$ to $K$ do
(11)    $N_j = \sum_{i=1}^{N} \gamma_{ij}$
(12)    $\mu_{rj} = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} r_i$
(13)    $\mu_{uj} = \sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i / \left\|\sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i\right\|$
(14)    $\mu_{zj} = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} z_i$
(15)    $\sigma_{rj}^2 = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} (r_i - \mu_{rj})^2$
(16)    $\sigma_{zj}^2 = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} (z_i - \mu_{zj})^2$
(17)    $R_j = \left\|\sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i\right\| / N_j$
(18)    $\kappa_j = R_j (2 - R_j^2) / (1 - R_j^2)$
(19)    repeat
(20)      $\kappa_j \leftarrow \kappa_j - \frac{A_p(\kappa_j) - R_j}{1 - A_p(\kappa_j)^2 - ((p-1)/\kappa_j) A_p(\kappa_j)}$
(21)    until convergence
(22)  end for
(23) until convergence

Algorithm 1: cyk-means.
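A compact sketch of Algorithm 1 follows. The random-data-point initialization, the `init_idx` override, the small variance floors, and the cap on $R_j$ near 1 are our assumptions (the paper does not prescribe them), and for brevity the $\kappa$ update uses only the Banerjee initialization (9) without Sra's refinement:

```python
import numpy as np

def cyk_means(theta, r, z, K, n_iter=100, seed=0, init_idx=None):
    """Sketch of Algorithm 1. theta, r, z: 1-D arrays of azimuth, radius,
    and height. Returns labels and the cluster parameters."""
    rng = np.random.default_rng(seed)
    N = len(theta)
    U = np.column_stack([np.cos(theta), np.sin(theta)])
    idx = (np.asarray(init_idx) if init_idx is not None
           else rng.choice(N, K, replace=False))
    mu_u = U[idx].copy()
    mu_r, mu_z = r[idx].astype(float), z[idx].astype(float)
    kappa = np.ones(K)
    var_r = np.full(K, r.var() + 1e-9)  # small floor avoids division by zero
    var_z = np.full(K, z.var() + 1e-9)
    labels = np.zeros(N, dtype=int)
    for it in range(n_iter):
        # lines (5)-(8): assign each point to the cluster maximizing Eq. (12)
        S = (kappa * (U @ mu_u.T)
             - (r[:, None] - mu_r) ** 2 / (2 * var_r)
             - (z[:, None] - mu_z) ** 2 / (2 * var_z))
        new_labels = S.argmax(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # lines (10)-(22): re-estimate the parameters of each cluster
        for j in range(K):
            mask = labels == j
            if not mask.any():  # dead unit: keep the previous parameters
                continue
            s = U[mask].sum(axis=0)
            mu_u[j] = s / np.linalg.norm(s)
            mu_r[j], mu_z[j] = r[mask].mean(), z[mask].mean()
            var_r[j] = r[mask].var() + 1e-9
            var_z[j] = z[mask].var() + 1e-9
            R = min(np.linalg.norm(s) / mask.sum(), 0.999)  # guard 1 - R^2
            kappa[j] = R * (2 - R**2) / (1 - R**2)  # Banerjee init only
    return labels, (mu_u, mu_r, mu_z, kappa, var_r, var_z)
```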

In this study, we use $\varepsilon_J = |0.001 \times J_n|$, where $J_n$ is the value of the objective function at the $n$th iteration. Algorithm 1 shows the details of the cyk-means algorithm. From (6), the elements of the centroid vector $\mathbf{m}_j = (\mu_{rj}, \mu_{uj}, \mu_{zj})$ of the $j$th cluster are

$$\mu_{rj} = \frac{1}{N_j}\sum_{i=1}^{N} \gamma_{ij} r_i, \qquad \mu_{uj} = \frac{\sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i}{\left\|\sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i\right\|}, \qquad \mu_{zj} = \frac{1}{N_j}\sum_{i=1}^{N} \gamma_{ij} z_i, \tag{14}$$

where $N_j$ is the number of data points in the $j$th cluster ($N_j = \sum_{i=1}^{N} \gamma_{ij}$). The other values used to calculate the objective function are

$$R_j = \frac{\left\|\sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i\right\|}{N_j}, \qquad \sigma_{rj}^2 = \frac{1}{N_j}\sum_{i=1}^{N} \gamma_{ij} (r_i - \mu_{rj})^2, \qquad \sigma_{zj}^2 = \frac{1}{N_j}\sum_{i=1}^{N} \gamma_{ij} (z_i - \mu_{zj})^2. \tag{15}$$

$\kappa_j$ is approximated by Sra's method using the ratio of the Bessel functions, $R_j$.


Input: Set $X$ of data points in cylindrical coordinates
Output: A clustering of $X$
(1) Initialize $\theta_j$, $j = 1, \ldots, K$
(2) repeat
(3)   Set $\gamma_{ij} = 0$, $i = 1, \ldots, N$, $j = 1, \ldots, K$
(4)   // Assign data points to clusters
(5)   for $i = 1$ to $N$ do
(6)     $j \leftarrow \arg\max_{j'} \left( \kappa_{j'} \mu_{uj'}^T \mathbf{u}_i - \frac{(r_i - \mu_{rj'})^2}{2\sigma_{rj'}^2} - \frac{(z_i - \mu_{zj'})^2}{2\sigma_{zj'}^2} \right)$
(7)     $\gamma_{ij} = 1$
(8)   end for
(9)   // Estimate parameters
(10)  for $j = 1$ to $K$ do
(11)    $N_j = \sum_{i=1}^{N} \gamma_{ij}$
(12)    $\mu_{rj} = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} r_i$
(13)    $\mu_{uj} = \sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i / \left\|\sum_{i=1}^{N} \gamma_{ij} \mathbf{u}_i\right\|$
(14)    $\mu_{zj} = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} z_i$
(15)  end for
(16) until convergence

Algorithm 2: Fixed cyk-means.

The cyk-means method has many parameters. The k-means method for data in three-dimensional Cartesian coordinates has only $3K$ parameters, the product of the number of centroid vectors and the number of dimensions. However, the cyk-means has $7K$ parameters, the product of the number of clusters and the number of parameters per cluster. The parameters of the $j$th cluster are $\mu_{rj}$, $\mu_{uj}$ (two dimensions), $\mu_{zj}$, $\sigma_{rj}$, $\kappa_j$, and $\sigma_{zj}$. Because the cyk-means has more degrees of freedom, the dead unit problem (i.e., empty clusters) will frequently occur if the initial values are not optimal.

3.3. Fixed cyk-Means. Model-based clustering methods suffer from various problems, such as dead units and sensitivity to initial values. One reason for this is that the log likelihood equation can have many local optima [9]. If a model has more parameters, these problems tend to occur more frequently. In the fixed cyk-means, the concentration parameter $\kappa$ and the variances $\sigma^2$ are fixed at particular values. As a consequence, the fixed cyk-means has $4K$ parameters. Fixing the parameters decreases the complexity of the model and mitigates these problems. Algorithm 2 shows the fixed cyk-means algorithm.
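One assignment step of the fixed variant can be sketched as follows. The values $\kappa = 15$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$ follow the experiments in Section 4, and the function name is ours:

```python
import numpy as np

def fixed_cyk_assignment(U, r, z, mu_u, mu_r, mu_z,
                         kappa=15.0, sigma_r=0.1, sigma_z=0.1):
    """One assignment step of Algorithm 2: kappa and the standard deviations
    are held at user-chosen constants, so only the 4K centroid parameters
    are ever re-estimated. U: (N, 2) unit vectors; mu_u: (K, 2) centroids."""
    S = (kappa * (U @ mu_u.T)
         - (r[:, None] - mu_r) ** 2 / (2 * sigma_r**2)
         - (z[:, None] - mu_z) ** 2 / (2 * sigma_z**2))
    return S.argmax(axis=1)  # index of the most similar centroid per point
```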

3.4. Computational Complexity. Assigning data points to clusters has a complexity of $O(KN)$ per iteration. We must estimate six parameters: we obtain the three $\mu$'s, the two $\sigma$'s, and $\kappa$ in $O(6KN + K t_{\kappa\max})$ time per iteration, where $t_{\kappa\max}$ is the convergence time of $\kappa$. Therefore, the total computational complexity per iteration is $O(7KN + K t_{\kappa\max})$. The complexity of the fixed cyk-means is $O(5KN)$ per iteration, so the cyk-means is approximately 1.5 times as complex as the fixed cyk-means.

4. Experimental Results

In our experiments, we use Python and its libraries (NumPy, SciPy, and scikit-learn) to implement the proposed methods.

4.1. Synthetic Data. In this subsection, we demonstrate that the cyk-means and the fixed cyk-means can partition synthetic data defined in cylindrical coordinates. The dataset used in this experiment has three clusters, as shown in Figure 1(a). The data points in each cluster are generated from the probability distribution denoted by (4) with the parameters shown in Table 1. Figures 1(b), 1(c), and 1(d) show the clustering results of the cyk-means, the fixed cyk-means with $\kappa = 25$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$, and the k-means, respectively. We can see that the cyk-means and the fixed cyk-means properly partition the dataset into each cluster. On the other hand, the k-means regards the two upper right clusters as one cluster and unsuccessfully partitions the dataset. Table 2 shows the parameters estimated by the cyk-means, the fixed cyk-means, and the k-means. Only the cyk-means can estimate the concentration parameters and the variances. The values of the concentration parameters and the variances estimated by the cyk-means are close to the true values. The cyk-means most appropriately estimates the


Figure 1: (a) Scatter plot of the synthetic dataset including three clusters. (b), (c), (d) Clustering results of the cyk-means, the fixed cyk-means, and the k-means, respectively. The three clusters are shown with circles, triangles, and crosses.

Table 1: Parameters of the dataset.

j | N_j | mu_theta_j | kappa_j | mu_rj | sigma_rj | mu_zj | sigma_zj
1 | 200 | 0 | 8 | 1 | 0.2 | 0.5 | 0.1
2 | 300 | -2 | 4 | 1.5 | 0.15 | -0.5 | 0.2
3 | 400 | 2 | 12 | 0.5 | 0.1 | 1 | 0.2

$j$ is a cluster number. $\mu_{\theta j}$ is the azimuth of the centroid of the $j$th cluster, $\mu_{\theta j} = \arctan(u_{yj}/u_{xj})$.

number of data points in each cluster. The fixed cyk-means estimates all the means most accurately, and the cyk-means also estimates all the means closely. These results show that the cyk-means and the fixed cyk-means estimate all the means with sufficient accuracy.

In the next experiment, we examine the effectiveness of the proposed methods (the cyk-means and the fixed cyk-means with $\kappa = 15$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$) compared to the k-means and the kernel k-means with a radial basis function. The parameter of the radial basis function is $\gamma = 0.1$. The synthetic data have $K$ clusters and are defined in cylindrical coordinates. The number of data points in each cluster is 200. The mean azimuth of the $j$th cluster, $\arctan(u_{yj}/u_{xj})$, is a random number in $[0, 2\pi)$. The concentration parameter $\kappa_j$ is a random number in $[5, 30)$. The mean radius of the $j$th cluster, $\mu_{rj}$, is a random number in $[1, 4)$. The mean height of the $j$th cluster, $\mu_{zj}$, is a random number in $[-1.5, 1.5)$. The standard deviations $\sigma_{rj}$ and $\sigma_{zj}$ are random numbers in $[0.05, 0.2)$.
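Generating such a cluster from the assumed model (azimuth from a von Mises distribution, radius and height from Gaussians) is straightforward with NumPy; the helper below is an illustrative sketch, with the function name being ours:

```python
import numpy as np

def sample_cluster(n, mu_theta, kappa, mu_r, sigma_r, mu_z, sigma_z, seed=0):
    """Draw n points from the assumed model: azimuth ~ von Mises(mu_theta,
    kappa), radius ~ N(mu_r, sigma_r^2), height ~ N(mu_z, sigma_z^2)."""
    rng = np.random.default_rng(seed)
    theta = rng.vonmises(mu_theta, kappa, n)
    return theta, rng.normal(mu_r, sigma_r, n), rng.normal(mu_z, sigma_z, n)

# e.g., cluster 1 of Table 1: N = 200, mu_theta = 0, kappa = 8, ...
theta, r, z = sample_cluster(200, 0.0, 8.0, 1.0, 0.2, 0.5, 0.1)
```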

Figure 2 shows the relationship between the number of clusters and the adjusted Rand index (ARI). ARI evaluates the performance of clustering algorithms [24]. When ARI = 1, all data points belong to true clusters. The figure shows that the cyk-means has the largest ARI in almost all cases. The fixed cyk-means performs better than the kernel k-means and the k-means. The k-means performs the worst. In conclusion, the cyk-means most accurately partitions synthetic data defined in cylindrical coordinates, and the fixed cyk-means also performs well.
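ARI is invariant to permutations of cluster labels, and scikit-learn (already used in the experiments) exposes it as `adjusted_rand_score`. A minimal illustration of the two boundary behaviors mentioned above:

```python
from sklearn.metrics import adjusted_rand_score

# identical partitions (up to a relabeling of cluster ids) score exactly 1
perfect = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])

# a partition unrelated to the reference scores near (or below) zero
poor = adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1])
```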


Table 2: Parameters estimated by the cyk-means, the fixed cyk-means, and the k-means.

Method | j | N_j | mu_theta_j | kappa_j | mu_rj | sigma_rj | mu_zj | sigma_zj
cyk-means | 1 | 200 | -1.960e-3 | 7.523 | 0.9746 | 0.1942 | 0.5065 | 0.09417
cyk-means | 2 | 300 | -1.997 | 3.979 | 1.495 | 0.1492 | -0.5070 | 0.1866
cyk-means | 3 | 400 | 1.986 | 12.69 | 0.4966 | 0.1004 | 1.005 | 0.1997
Fixed cyk-means | 1 | 200.2 | -5.124e-5 | — | 1.001 | — | 0.4995 | —
Fixed cyk-means | 2 | 299.9 | -2.000 | — | 1.501 | — | -0.5056 | —
Fixed cyk-means | 3 | 400 | 1.996 | — | 0.4998 | — | 0.9995 | —
k-means | 1 | 184.85 | -0.6627 | — | 1.091 | — | 0.1871 | —
k-means | 2 | 253.65 | -1.974 | — | 1.358 | — | -0.4974 | —
k-means | 3 | 461.5 | 1.704 | — | 0.4406 | — | 0.9477 | —

The results are the mean of twenty runs on randomly generated initial values. $j$ is a cluster number. $\mu_{\theta j}$ is the azimuth of the centroid of the $j$th cluster, $\mu_{\theta j} = \arctan(u_{yj}/u_{xj})$. The best estimations are bold in the original table.

Figure 2: Relationship between the number of clusters $K$ and the adjusted Rand index (ARI). The vertical and horizontal axes indicate ARI and the number of clusters $K$, respectively. The results are the mean of 200 runs on randomly generated synthetic data for each $K$.

4.2. Real World Data. We show the performances of the proposed methods for the iris dataset (http://mlearn.ics.uci.edu/databases/iris) and the segmentation benchmark dataset (http://www.ntu.edu.sg/home/asjfcai/Benchmark_Website/benchmark_index.html) [25]. The iris dataset has 150 data points of three classes of irises. Each data point consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The segmentation benchmark dataset consists of 100 images from the Berkeley segmentation database [26] and ground truths generated by manual labeling.

Table 3 depicts the ARI scores of the cyk-means, the fixed cyk-means, the k-means, and the kernel k-means for the iris dataset. The parameters of the fixed cyk-means are $\kappa = 15$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$. The parameter $\gamma$ of the radial basis function of the kernel k-means is 0.01. In this experiment, we use only three attributes of the iris dataset because the proposed methods are specialized for 3-dimensional data. Furthermore, we transform this three-attribute dataset into a zero-mean dataset. In all cases, the performance of the cyk-means is lower than the other methods. Conversely, in almost all cases, the performance of the fixed cyk-means is the best. However, the difference in performance between the fixed cyk-means, the k-means, and the kernel k-means is not large.

Table 4 shows the ARI scores of the cyk-means, the fixed cyk-means, and the k-means for seven images in the segmentation benchmark dataset. The parameters of the fixed cyk-means are $\kappa = 15$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$. To evaluate the performances of the cyk-means and the fixed cyk-means, we convert images from RGB color to HSV color. When we cluster the dataset by the k-means, we use images represented by RGB color and HSV color. In this experiment, we compare a clustering result with a ground truth using the ARI score. We set the number of clusters $K$ to the number of segments in a ground truth. In all cases, the fixed cyk-means stably shows good performance. The cyk-means shows much better or much worse performance than the other methods; in other words, the cyk-means shows unstable performance. This instability is likely caused by the cyk-means being more easily trapped in a local optimum because of its larger number of parameters.

4.3. Application to Color Image Quantization. We apply the cyk-means and the fixed cyk-means to color image quantization and compare the results to those using the k-means. We convert images quantized by the proposed methods from RGB color space to HSV color space before quantization, whereas an image processed by the k-means is represented using RGB. Figure 3 contains the four test images from the Berkeley segmentation database [26] and their quantization results. The original color images have sizes of 481 × 321 or 321 × 481 and are quantized into three colors. These quantization results are generated by the cyk-means, the fixed cyk-means with $\kappa = 25$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$, and the k-means. The color of a pixel in the quantized image represents the value of the centroid of the cluster that contains the pixel.
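One plausible mapping of an HSV pixel to the cylindrical coordinates used by the proposed methods (hue as azimuth, saturation as radius, value as height, cf. Section 1) can be sketched with the standard library's `colorsys`; the exact scaling is our assumption, as the paper does not spell it out:

```python
import colorsys
import numpy as np

def pixel_to_cylindrical(rgb):
    """Map an RGB pixel (components in [0, 1]) to cylindrical coordinates:
    hue -> azimuth (periodic), saturation -> radius, value -> height."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)  # each component in [0, 1]
    return 2 * np.pi * h, s, v           # hue scaled to an angle

theta, r, z = pixel_to_cylindrical((1.0, 0.0, 0.0))  # pure red: hue 0
```

With every pixel mapped this way, the image can be clustered exactly as in Section 3, and each pixel is recolored with its cluster centroid.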


Table 3: Comparison of performances of the four methods using the iris dataset.

Attributes | cyk-means | Fixed cyk-means | k-means | Kernel k-means
Sepal length, sepal width, petal length | 0.5543 | 0.6961 | 0.6569 | 0.6614
Sepal width, petal length, petal width | 0.5627 | 0.7900 | 0.7462 | 0.7590
Petal length, petal width, sepal length | 0.4179 | 0.7114 | 0.7302 | 0.7359
Petal width, sepal length, sepal width | 0.5171 | 0.6121 | 0.6104 | 0.6099

ARI values are the mean of 200 runs with random initial values. The best estimations are bold in the original table. "Attributes" indicates the three attributes used in clustering.

Table 4: Comparison of performances of the four methods using the segmentation benchmark dataset.

Number | cyk-means | Fixed cyk-means | k-means (HSV) | k-means (RGB)
12003 (K = 4) | 0.6397 | 0.3744 | 0.3695 | 0.2277
69020 (K = 2) | 0.1611 | 0.5247 | 0.07680 | 0.1406
80099 (K = 2) | 0.9205 | 0.7960 | 0.6544 | 0.04890
103029 (K = 3) | 0.4697 | 0.5934 | 0.1794 | 0.09143
106047 (K = 2) | 0.3210 | 0.2988 | 0.04155 | 0.02332
310007 (K = 7) | 0.1548 | 0.3338 | 0.3075 | 0.3396
353013 (K = 4) | 0.6125 | 0.5993 | 0.5020 | 0.4683

ARI values are the mean of 200 runs with random initial values. The best estimations are bold in the original table. "Number" indicates the number of the image. "k-means (HSV)" and "k-means (RGB)" indicate that we cluster HSV and RGB color images by the k-means, respectively.

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results using the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers and the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and part of the green peppers.

Image 299091 consists of some sky with cloud, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white cloud into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful: it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three; thus, the cyk-means produces a dead unit. This is because the concentration parameter becomes small and the variances become large if a distribution of data points visually appears to consist of a few clusters. Thus, a few clusters include all data points, and dead units (empty clusters) appear even if we fix the number of clusters to a large number. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.

Furthermore, the parameters $\kappa$ and the $\sigma$'s of the fixed cyk-means can control the quantization results. Figure 4 shows the quantization results generated by the fixed cyk-means using different parameters. The original image in Figure 4 consists of two objects: the red fish and the arms of an anemone. The fixed cyk-means with $\kappa = 25$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$ cannot extract the red fish, as shown in the middle image of Figure 4. However, the fixed cyk-means with $\kappa = 50$, $\sigma_r = 0.5$, and $\sigma_z = 0.5$ extracts the red fish, as shown in the right image of Figure 4. This is because a large $\kappa$ and/or large variances relatively increase the cosine similarity term of (12), and consequently clustering is more focused on the hue element.

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. The quantization results of the fixed cyk-means differ from those generated by the k-means because the Euclidean metric cannot consider the hue.

5. Conclusion and Discussion

In this study, we develop the cyk-means and the fixed cyk-means methods, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results using real world data show that the fixed cyk-means has equal or better performance than the k-means and


Figure 3: Quantization results. The first column contains the original images. The second, third, and fourth columns contain the quantization results generated by the cyk-means, the fixed cyk-means, and the k-means, respectively. All original images are clustered with $K = 3$.

the kernel k-means. In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

The experiments that partitioned synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good estimates of the parameters and good clusterings of the data. The results of the second experiment show that the cyk-means performs best when clustering synthetic data. However, in the experiments using real world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or a small cluster containing few data points. Thus, the cyk-means may not be appropriate for actual applications.

The fixed cyk-means will be an effective method for actual applications. The fixed cyk-means is stable and performs well when we apply it to clustering of synthetic data, real world data, and color image quantization. Furthermore, the fixed cyk-means hardly produces dead units because it has fewer parameters than the cyk-means. The fixed cyk-means requires less computational time than the cyk-means while giving similar results.

In future work, we will improve the performance of the proposed methods. The proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization, similar to the k-means.


Figure 4: Quantization results of the fixed cyk-means with different parameters. The left image is the original. The middle image is quantized with $\kappa = 25$, $\sigma_r = 0.1$, and $\sigma_z = 0.1$; the right image with $\kappa = 50$, $\sigma_r = 0.5$, and $\sigma_z = 0.5$. Both are quantized with $K = 3$.

The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves the clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28–30]; it inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit to the input and to unlearn the weights of the second winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve the performance and reduce the effect of these intrinsic problems.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing the algorithm.

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.

[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943–954, 2014.

[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.

[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 19–28, New York, NY, USA, August 2003.

[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.

[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849–856, MIT Press, Cambridge, Mass, USA, 2001.

[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2002.

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144–152, New York, NY, USA, July 1992.

[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800–810, 2001.

[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205–215, 1998.

[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92–105, 2008.

[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.

[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260–271, 2011.

[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785–1791, 2002.

[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8–16, 2015.

[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in IEE Proceedings: Vision, Image and Signal Processing, vol. 151, pp. 511–524.

[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237–249, 2005.

[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1–19, 2014.

[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261–269, 2005.

[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.

[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper 'Principal component analysis for clustering gene expression data'," Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.

[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), July 2013.

[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.

[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.

[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of the 1993 IEEE International Conference on Neural Networks (ICNN '93), vol. 1, pp. 117–124, San Diego, CA, USA, March 1993.

[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 531–540, December 2010.

[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79–104, 2012.

[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.

[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565–574, 2003.


4 Mathematical Problems in Engineering

Input: set X of data points in cylindrical coordinates
Output: a clustering of X
(1)  Initialize θ_j, j = 1, ..., K
(2)  repeat
(3)      Set γ_ij = 0, i = 1, ..., N, j = 1, ..., K
(4)      Assign data points to clusters:
(5)      for i = 1 to N do
(6)          j ← argmax_{j′} (κ_{j′} μ_{uj′}^T u_i − (r_i − μ_{rj′})² / (2σ²_{rj′}) − (z_i − μ_{zj′})² / (2σ²_{zj′}))
(7)          γ_ij = 1
(8)      end for
(9)      Estimate parameters:
(10)     for j = 1 to K do
(11)         N_j = Σ_{i=1}^N γ_ij
(12)         μ_rj = (1/N_j) Σ_{i=1}^N γ_ij r_i
(13)         μ_uj = (Σ_{i=1}^N γ_ij u_i) / ‖Σ_{i=1}^N γ_ij u_i‖
(14)         μ_zj = (1/N_j) Σ_{i=1}^N γ_ij z_i
(15)         σ²_rj = (1/N_j) Σ_{i=1}^N γ_ij (r_i − μ_rj)²
(16)         σ²_zj = (1/N_j) Σ_{i=1}^N γ_ij (z_i − μ_zj)²
(17)         R_j = ‖Σ_{i=1}^N γ_ij u_i‖ / N_j
(18)         κ_j = R_j (2 − R_j²) / (1 − R_j²)
(19)         repeat
(20)             κ_j ← κ_j − (A_p(κ_j) − R_j) / (1 − A_p(κ_j)² − ((p − 1)/κ_j) A_p(κ_j))
(21)         until convergence
(22)     end for
(23) until convergence

Algorithm 1: cyk-means.

In this study, we use ε_J = |0.001 × J_n|, where J_n is the objective function of the nth iteration. Algorithm 1 shows the details of the cyk-means algorithm. From (6), the elements of the centroid vector m_j = (μ_rj, μ_uj, μ_zj) of the jth cluster are

    μ_rj = (1/N_j) Σ_{i=1}^N γ_ij r_i,
    μ_uj = (Σ_{i=1}^N γ_ij u_i) / ‖Σ_{i=1}^N γ_ij u_i‖,
    μ_zj = (1/N_j) Σ_{i=1}^N γ_ij z_i,                      (14)

where N_j is the number of data points in the jth cluster (N_j = Σ_{i=1}^N γ_ij). The other values used to calculate the objective function are

    R_j = ‖Σ_{i=1}^N γ_ij u_i‖ / N_j,
    σ²_rj = (1/N_j) Σ_{i=1}^N γ_ij (r_i − μ_rj)²,
    σ²_zj = (1/N_j) Σ_{i=1}^N γ_ij (z_i − μ_zj)².           (15)

κ_j is approximated by Sra's method [23], which solves A_p(κ_j) = R_j, where A_p is a ratio of modified Bessel functions.
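As an illustration, this concentration update (lines (18)–(20) of Algorithm 1) can be written with SciPy's modified Bessel function `iv`; the function name `estimate_kappa` and the fixed iteration count are our own choices, not from the paper:

```python
from scipy.special import iv  # modified Bessel function of the first kind


def estimate_kappa(R, p=2, n_iters=10):
    """Approximate the von Mises(-Fisher) concentration kappa from the
    mean resultant length R via Sra's Newton iterations. p = 2 is the
    circular (azimuth) case used by the cyk-means."""
    # Initial guess R(p - R^2)/(1 - R^2); for p = 2 this is line (18).
    kappa = R * (p - R**2) / (1 - R**2)
    for _ in range(n_iters):
        A = iv(p / 2, kappa) / iv(p / 2 - 1, kappa)  # Bessel ratio A_p(kappa)
        kappa -= (A - R) / (1 - A**2 - ((p - 1) / kappa) * A)
    return kappa
```

For example, for a cluster whose true concentration is κ = 8, the mean resultant length is A₂(8) = I₁(8)/I₀(8), and the iteration recovers κ ≈ 8.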


Input: set X of data points in cylindrical coordinates
Output: a clustering of X
(1)  Initialize θ_j, j = 1, ..., K
(2)  repeat
(3)      Set γ_ij = 0, i = 1, ..., N, j = 1, ..., K
(4)      Assign data points to clusters:
(5)      for i = 1 to N do
(6)          j ← argmax_{j′} (κ_{j′} μ_{uj′}^T u_i − (r_i − μ_{rj′})² / (2σ²_{rj′}) − (z_i − μ_{zj′})² / (2σ²_{zj′}))
(7)          γ_ij = 1
(8)      end for
(9)      Estimate parameters:
(10)     for j = 1 to K do
(11)         N_j = Σ_{i=1}^N γ_ij
(12)         μ_rj = (1/N_j) Σ_{i=1}^N γ_ij r_i
(13)         μ_uj = (Σ_{i=1}^N γ_ij u_i) / ‖Σ_{i=1}^N γ_ij u_i‖
(14)         μ_zj = (1/N_j) Σ_{i=1}^N γ_ij z_i
(15)     end for
(16) until convergence

Algorithm 2: Fixed cyk-means.

The cyk-means method has many parameters. The k-means method for data in three-dimensional Cartesian coordinates has only 3K parameters, the product of the number of centroid vectors and the number of dimensions. However, the cyk-means has 7K parameters, the product of the number of clusters and the number of parameters per cluster. The parameters of the jth cluster are μ_rj, μ_uj (two dimensions), μ_zj, σ_rj, κ_j, and σ_zj. Because the cyk-means has more degrees of freedom, the dead unit problem (i.e., empty clusters) will frequently occur if the initial values are not optimal.

3.3. Fixed cyk-Means. Model-based clustering methods have various problems, such as the dead unit and initial value problems. One reason for this is that the log likelihood equation can have many local optima [9]. If a model has more parameters, these problems tend to be more frequent. In the fixed cyk-means, the concentration parameter κ and the variances σ² are fixed at particular values. As a consequence, the fixed cyk-means has 4K parameters. Fixing these parameters decreases the complexity of the model and makes the problems above less frequent. Algorithm 2 shows the fixed cyk-means algorithm.
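The update rules of Algorithm 2 are simple enough to fit in a short NumPy sketch. The following is our own minimal illustration; the function name `fixed_cyk_means`, the random-data-point initialization, and the fixed iteration count in place of a convergence test are assumptions of this sketch, not the author's reference implementation:

```python
import numpy as np


def fixed_cyk_means(theta, r, z, K, kappa=15.0, sigma_r=0.1, sigma_z=0.1,
                    n_iters=50, seed=0):
    """Minimal sketch of the fixed cyk-means (Algorithm 2): the
    concentration kappa and the variances are held fixed; only the
    cluster means are re-estimated."""
    rng = np.random.default_rng(seed)
    theta, r, z = map(np.asarray, (theta, r, z))
    u = np.column_stack([np.cos(theta), np.sin(theta)])  # azimuths as unit vectors
    # Initialize the centroids at randomly chosen data points.
    idx = rng.choice(len(theta), size=K, replace=False)
    mu_u, mu_r, mu_z = u[idx], r[idx].astype(float), z[idx].astype(float)
    for _ in range(n_iters):
        # Assignment step: maximize the log-likelihood-based similarity (line (6)).
        s = (kappa * u @ mu_u.T
             - (r[:, None] - mu_r) ** 2 / (2 * sigma_r ** 2)
             - (z[:, None] - mu_z) ** 2 / (2 * sigma_z ** 2))
        labels = np.argmax(s, axis=1)
        # Update step: re-estimate the mean direction, radius, and height.
        for j in range(K):
            members = labels == j
            if not members.any():
                continue  # dead unit: keep the previous centroid
            su = u[members].sum(axis=0)
            mu_u[j] = su / np.linalg.norm(su)
            mu_r[j], mu_z[j] = r[members].mean(), z[members].mean()
    return labels
```

The fixed `seed` makes the sketch deterministic; in practice the initialization would be rerun several times, as in the experiments of Section 4.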

3.4. Computational Complexity. Assigning data points to clusters has a complexity of O(KN) per iteration. We must estimate six parameters: we obtain three μs, two σ²s, and κ in O(6KN + K t_κmax) time per iteration, where t_κmax is the convergence time of κ. Therefore, the total computational complexity per iteration is O(7KN + K t_κmax). The complexity of the fixed cyk-means is O(5KN) per iteration, so the cyk-means is approximately 1.5 times as complex as the fixed cyk-means.

4. Experimental Results

In our experiments, we use Python and its libraries (NumPy, SciPy, and scikit-learn) to implement the proposed methods.

4.1. Synthetic Data. In this subsection, we demonstrate that the cyk-means and the fixed cyk-means can partition synthetic data defined in cylindrical coordinates. The dataset used in this experiment has three clusters, as shown in Figure 1(a). The data points in each cluster are generated from the probability distribution denoted by (4) with the parameters shown in Table 1. Figures 1(b), 1(c), and 1(d) show the clustering results of the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means, respectively. We can see that the cyk-means and the fixed cyk-means properly partition the dataset into each cluster. On the other hand, the k-means regards the two upper right clusters as one cluster and unsuccessfully partitions the dataset. Table 2 shows the parameters estimated by the cyk-means, the fixed cyk-means, and the k-means. Only the cyk-means estimates the concentration parameters and the variances, and the values it estimates are close to the true values. The cyk-means most appropriately estimates the


Figure 1: (a) Scatter plot of the synthetic dataset including three clusters. (b), (c), (d) Clustering results of the cyk-means, the fixed cyk-means, and the k-means, respectively. The three clusters are shown with circles, triangles, and crosses.

Table 1: Parameters of the dataset.

j   N_j   μ_θj   κ_j   μ_rj   σ_rj   μ_zj   σ_zj
1   200    0      8    1      0.2     0.5   0.1
2   300   −2      4    1.5    0.15   −0.5   0.2
3   400    2     12    0.5    0.1     1     0.2

j is a cluster number; μ_θj is the azimuth of the centroid of the jth cluster, μ_θj = arctan(u_yj / u_xj).

number of data points in each cluster. The fixed cyk-means estimates all the means most accurately, and the cyk-means also estimates them closely. These results show that both the cyk-means and the fixed cyk-means estimate all the means with sufficient accuracy.

In the next experiment, we examine the effectiveness of the proposed methods (the cyk-means and the fixed cyk-means with κ = 15, σ_r = 0.1, and σ_z = 0.1) compared to the k-means and the kernel k-means with a radial basis function. The parameter of the radial basis function is γ = 0.1. The synthetic data have K clusters and are defined in cylindrical coordinates. The number of data points in each cluster is 200. The mean azimuth of the jth cluster, arctan(u_yj/u_xj), is a random number in [0, 2π). The concentration parameter κ_j is a random number in [5, 30). The mean radius of the jth cluster, μ_rj, is a random number in [1, 4). The mean height of the jth cluster, μ_zj, is a random number in [−1.5, 1.5). The standard deviations σ_rj and σ_zj are random numbers in [0.05, 0.2).

Figure 2 shows the relationship between the number of clusters and the adjusted Rand index (ARI). ARI evaluates the performance of clustering algorithms [24]; when ARI = 1, all data points belong to their true clusters. The figure shows that the cyk-means has the largest ARI in almost all cases. The fixed cyk-means performs better than the kernel k-means and the k-means, and the k-means performs the worst. In conclusion, the cyk-means most accurately partitions synthetic data defined in cylindrical coordinates, and the fixed cyk-means also performs well.
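ARI as used here is available directly in scikit-learn, one of the libraries listed at the start of Section 4. The snippet below is only an illustration of its label-permutation invariance, not part of the paper's experiments:

```python
from sklearn.metrics import adjusted_rand_score

# ARI compares two partitions while ignoring how the labels are named:
# a relabeled but otherwise identical partition still scores 1.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]  # same partition, permuted label names
print(adjusted_rand_score(true_labels, pred_labels))  # 1.0

pred_bad = [0, 0, 0, 1, 1, 1]  # true clusters split/merged across predictions
print(adjusted_rand_score(true_labels, pred_bad))  # below 1
```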


Table 2: Parameters estimated by the cyk-means, the fixed cyk-means, and the k-means.

Method            j   N_j      μ_θj            κ_j     μ_rj     σ_rj     μ_zj      σ_zj
cyk-means         1   200      −1.960 × 10⁻³   7.523   0.9746   0.1942    0.5065   0.09417
                  2   300      −1.997          3.979   1.495    0.1492   −0.5070   0.1866
                  3   400       1.986         12.69    0.4966   0.1004    1.005    0.1997
Fixed cyk-means   1   200.2    −5.124 × 10⁻⁵   —       1.001    —         0.4995   —
                  2   299.9    −2.000          —       1.501    —        −0.5056   —
                  3   400       1.996          —       0.4998   —         0.9995   —
k-means           1   184.85   −0.6627         —       1.091    —         0.1871   —
                  2   253.65   −1.974          —       1.358    —        −0.4974   —
                  3   461.5     1.704          —       0.4406   —         0.9477   —

The results are the mean of twenty runs on randomly generated initial values. j is a cluster number; μ_θj is the azimuth of the centroid of the jth cluster, μ_θj = arctan(u_yj / u_xj).

Figure 2: Relationship between the number of clusters K and the adjusted Rand index (ARI). The vertical and horizontal axes indicate ARI and the number of clusters K, respectively. The results are the mean of 200 runs on randomly generated synthetic data for each K.

4.2. Real World Data. We show the performances of the proposed methods on the iris dataset (http://mlearn.ics.uci.edu/databases/iris) and the segmentation benchmark dataset (http://www.ntu.edu.sg/home/asjfcai/Benchmark_Website/benchmark_index.html) [25]. The iris dataset has 150 data points of three classes of irises. Each data point consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The segmentation benchmark dataset consists of 100 images from the Berkeley segmentation database [26] and ground truths generated by manual labeling.

Table 3 depicts the ARI scores of the cyk-means, the fixed cyk-means, the k-means, and the kernel k-means for the iris dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1; γ of the radial basis function of the kernel k-means is 0.01. In this experiment, we use only three attributes of the iris dataset because the proposed methods are specialized for three-dimensional data. Furthermore, we transform the three-attribute dataset into a zero-mean dataset. In all cases, the performance of the cyk-means is lower than the other methods. Conversely, in almost all cases, the performance of the fixed cyk-means is the best. However, the difference in performance between the fixed cyk-means, the k-means, and the kernel k-means is not large.

Table 4 shows the ARI scores of the cyk-means, the fixed cyk-means, and the k-means for seven images in the segmentation benchmark dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1. To evaluate the performances of the cyk-means and the fixed cyk-means, we convert images from RGB color to HSV color. When we cluster the dataset by the k-means, we use images represented by both RGB color and HSV color. In this experiment, we compare a clustering result with a ground truth using the ARI score, and we set the number of clusters K to the number of segments in the ground truth. In all cases, the fixed cyk-means stably shows good performance. The cyk-means shows either much better or much worse performance than the other methods; in other words, its performance is unstable. This instability is likely caused by the cyk-means being more easily trapped in a local minimum because it has more parameters.

4.3. Application to Color Image Quantization. We apply the cyk-means and the fixed cyk-means to color image quantization and compare the results to those using the k-means. We convert images quantized by the proposed methods from RGB color space to HSV color space before quantization, whereas an image processed by the k-means is represented using RGB. Figure 3 contains the four test images from the Berkeley segmentation database [26] and their quantization results. The original color images have sizes of 481 × 321 or 321 × 481 and are quantized into three colors. These quantization results are generated by the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means. The color of a pixel in the quantized image represents the value of the centroid of the cluster that contains the pixel.
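For reference, the HSV-to-cylindrical correspondence used here (hue → azimuth, saturation → radius, value → height) can be sketched with Python's standard library; the function name `rgb_to_cylindrical` is our own:

```python
import colorsys
import math


def rgb_to_cylindrical(red, green, blue):
    """Map an RGB pixel (components in [0, 1]) to cylindrical coordinates:
    hue -> azimuth theta, saturation -> radius, value -> height."""
    h, s, v = colorsys.rgb_to_hsv(red, green, blue)
    theta = 2.0 * math.pi * h  # hue in [0, 1) becomes an angle in [0, 2*pi)
    return theta, s, v


print(rgb_to_cylindrical(1.0, 0.0, 0.0))  # pure red: (0.0, 1.0, 1.0)
```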


Table 3: Comparison of the performances of the four methods using the iris dataset.

Attributes                                  cyk-means   Fixed cyk-means   k-means   Kernel k-means
Sepal length, sepal width, petal length     0.5543      0.6961            0.6569    0.6614
Sepal width, petal length, petal width      0.5627      0.7900            0.7462    0.7590
Petal length, petal width, sepal length     0.4179      0.7114            0.7302    0.7359
Petal width, sepal length, sepal width      0.5171      0.6121            0.6104    0.6099

ARI values are the mean of 200 runs with random initial values. "Attributes" indicates the three attributes used in clustering.

Table 4: Comparison of the performances of the four methods using the segmentation benchmark dataset.

Number            cyk-means   Fixed cyk-means   k-means (HSV)   k-means (RGB)
12003 (K = 4)     0.6397      0.3744            0.3695          0.2277
69020 (K = 2)     0.1611      0.5247            0.07680         0.1406
80099 (K = 2)     0.9205      0.7960            0.6544          0.04890
103029 (K = 3)    0.4697      0.5934            0.1794          0.09143
106047 (K = 2)    0.3210      0.2988            0.04155         0.02332
310007 (K = 7)    0.1548      0.3338            0.3075          0.3396
353013 (K = 4)    0.6125      0.5993            0.5020          0.4683

ARI values are the mean of 200 runs with random initial values. "Number" indicates the number of the image. "k-means (HSV)" and "k-means (RGB)" indicate that we cluster HSV and RGB color images by the k-means, respectively.

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results using the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers and the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and part of the green peppers.

Image 299091 consists of some sky with cloud, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white cloud into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful: it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three; thus, the cyk-means makes a dead unit. This is because the concentration parameter becomes small and the variances become large when the data distribution visually appears to consist of only a few clusters. As a result, a few clusters include all the data points, and dead units (empty clusters) appear even if we fix the number of clusters to a large number. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.

Furthermore, the parameters κ and σ of the fixed cyk-means can control the quantization results. Figure 4 shows the quantization results generated by the fixed cyk-means using different parameters. The original image in Figure 4 consists of two objects: the red fish and the arms of an anemone. The fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1 cannot extract the red fish, as shown in the middle image of Figure 4. However, the fixed cyk-means with κ = 50, σ_r = 0.5, and σ_z = 0.5 extracts the red fish, as shown in the right image of Figure 4. This is because a large κ and/or large variances relatively increase the cosine similarity term of (12), and consequently clustering is more focused on the hue element.

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. Its quantization results differ from those generated by the k-means because the Euclidean metric cannot consider the hue.

5. Conclusion and Discussion

In this study, we develop the cyk-means and the fixed cyk-means, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results using real world data show that the fixed cyk-means has equal or better performance than the k-means and


Figure 3: Quantization results. The first column contains the original images. The second, third, and fourth columns contain the quantization results generated by the cyk-means, the fixed cyk-means, and the k-means, respectively. All original images are clustered with K = 3.

the kernel k-means. In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

The experiments that partitioned synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good estimates of the parameters and clusters the data well. The results of the second experiment show that the cyk-means performs the best when clustering synthetic data. However, in the experiments using real world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or small clusters containing few data points. Thus, the cyk-means may not be appropriate for actual applications.

The fixed cyk-means will be an effective method for actual applications. It is stable and performs well when we apply it to clustering of synthetic data, real world data, and color image quantization. Furthermore, the fixed cyk-means hardly makes dead units because it has fewer parameters than the cyk-means, and it requires less computational time than the cyk-means while producing similar results.

In future work, we will improve the performance of the proposed methods. The proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization, similar to the k-means.


Figure 4: Quantization results of the fixed cyk-means with different parameters. The left image is the original. The middle and the right images are quantized with K = 3, using κ = 25, σ_r = 0.1, σ_z = 0.1 and κ = 50, σ_r = 0.5, σ_z = 0.5, respectively.

The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves the clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28–30]; it inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit to the input and to unlearn the weights of the 2nd winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve the performance and reduce the effect of these intrinsic problems.
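The k-means++ seeding mentioned above can be sketched as follows. This is a generic NumPy illustration (the function name `kmeans_pp_init` is ours), not code from the paper; the same seeding could supply initial centroids for the proposed methods:

```python
import numpy as np


def kmeans_pp_init(X, K, rng=None):
    """k-means++ seeding (Arthur & Vassilvitskii): the first center is
    chosen uniformly at random, and each later center is sampled with
    probability proportional to the squared distance to the nearest
    center chosen so far."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Far-apart points are thus more likely to be chosen as distinct initial centers, which reduces the chance that two centroids start inside the same cluster.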

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing the algorithm.

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.

[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943–954, 2014.

[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.

[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 19–28, New York, NY, USA, August 2003.

[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.

[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849–856, MIT Press, Cambridge, Mass, USA, 2001.

[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2002.

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144–152, New York, NY, USA, July 1992.

[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800–810, 2001.

[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205–215, 1998.

[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92–105, 2008.

[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.

[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260–271, 2011.

[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785–1791, 2002.

[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8–16, 2015.

[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in IEE Proceedings: Vision, Image and Signal Processing, vol. 151, pp. 511–524.

[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237–249, 2005.

[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1–19, 2014.

[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261–269, 2005.

[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.

[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper 'Principal component analysis for clustering gene expression data'," Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.

[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), July 2013.

[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.

[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.

[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of 1993 IEEE International Conference on Neural Networks (ICNN '93), vol. 1, pp. 117–124, San Diego, CA, USA, March 1993.

[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 531–540, December 2010.

[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79–104, 2012.

[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.

[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565–574, 2003.


Mathematical Problems in Engineering 5

Input: Set X of data points in cylindrical coordinates
Output: A clustering of X
(1) Initialize θ_j, j = 1, ..., K
(2) repeat
(3)   Set γ_ij = 0, i = 1, ..., N, j = 1, ..., K
(4)   // Assign data points to clusters
(5)   for i = 1 to N do
(6)     j ← argmax_j′ ( κ_j′ μ_uj′^T u_i − (r_i − μ_rj′)² / (2σ_rj′²) − (z_i − μ_zj′)² / (2σ_zj′²) )
(7)     γ_ij = 1
(8)   end for
(9)   // Estimate parameters
(10)  for j = 1 to K do
(11)    N_j = Σ_{i=1}^{N} γ_ij
(12)    μ_rj = (1/N_j) Σ_{i=1}^{N} γ_ij r_i
(13)    μ_uj = Σ_{i=1}^{N} γ_ij u_i / ‖Σ_{i=1}^{N} γ_ij u_i‖
(14)    μ_zj = (1/N_j) Σ_{i=1}^{N} γ_ij z_i
(15)  end for
(16) until convergence

Algorithm 2: Fixed cyk-means.
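As a concrete illustration, the loop of Algorithm 2 can be sketched in Python with NumPy (the paper's stated implementation stack). This is a minimal sketch rather than the author's code: the random initialization, the convergence test, and all function and variable names are our assumptions.

```python
import numpy as np

def fixed_cyk_means(theta, r, z, K, kappa=15.0, sigma_r=0.1, sigma_z=0.1,
                    max_iter=100, seed=0):
    """Minimal sketch of the fixed cyk-means (Algorithm 2).

    theta, r, z: azimuth, radius, and height of the N data points.
    kappa, sigma_r, sigma_z: fixed concentration and standard deviations.
    """
    rng = np.random.default_rng(seed)
    N = len(theta)
    U = np.column_stack([np.cos(theta), np.sin(theta)])  # unit vectors u_i
    # Initialize centroids from K randomly chosen data points
    idx = rng.choice(N, size=K, replace=False)
    mu_u, mu_r, mu_z = U[idx].copy(), r[idx].copy(), z[idx].copy()
    labels = np.full(N, -1)
    for _ in range(max_iter):
        # Step (6): similarity of every point to every cluster, as an (N, K) array
        sim = (kappa * (U @ mu_u.T)
               - (r[:, None] - mu_r) ** 2 / (2 * sigma_r ** 2)
               - (z[:, None] - mu_z) ** 2 / (2 * sigma_z ** 2))
        new_labels = sim.argmax(axis=1)
        if np.array_equal(new_labels, labels):  # assignments unchanged: converged
            break
        labels = new_labels
        # Steps (11)-(14): re-estimate the cluster parameters
        for j in range(K):
            members = labels == j
            if not members.any():
                continue  # a dead unit (empty cluster): leave its centroid unchanged
            s = U[members].sum(axis=0)
            mu_u[j] = s / np.linalg.norm(s)  # mean direction on the circle
            mu_r[j] = r[members].mean()
            mu_z[j] = z[members].mean()
    return labels, mu_u, mu_r, mu_z
```

With K = 1 the update reduces to the circular mean of the azimuths and the arithmetic means of r and z, which makes the machinery easy to check by hand.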

The cyk-means method has many parameters. The k-means method for data in three-dimensional Cartesian coordinates has only 3K parameters, the product of the number of centroid vectors and the number of dimensions. However, the cyk-means has 7K parameters, the product of the number of clusters and the number of parameters per cluster. The parameters of the jth cluster are μ_uj (two dimensions), μ_rj, σ_rj, κ_j, μ_zj, and σ_zj. Because the cyk-means has more degrees of freedom, the dead unit problem (i.e., empty clusters) will frequently occur if the initial values are not appropriate.

3.3. Fixed cyk-Means. Model-based clustering methods have various problems, such as the dead unit and initial value problems. One reason for this is that the log likelihood equation can have many local optima [9]. If a model has more parameters, these problems tend to occur more frequently. In the fixed cyk-means, the concentration parameter κ and the variances σ² are fixed at particular values. As a consequence, the fixed cyk-means has 4K parameters. Fixing the parameters decreases the complexity of the model and makes these problems less frequent. Algorithm 2 shows the fixed cyk-means algorithm.

3.4. Computational Complexity. Assigning data points to clusters has a complexity of O(KN) per iteration. We must estimate six parameters: we obtain three μs, two σs, and κ in O(6KN + K t_κmax) time per iteration, where t_κmax is the convergence time of κ. Therefore, the total computational complexity per iteration is O(7KN + K t_κmax). The complexity of the fixed cyk-means is O(5KN) per iteration, so the cyk-means is approximately 1.5 times as complex as the fixed cyk-means.

4. Experimental Results

In our experiments, we use Python and its libraries (NumPy, SciPy, and scikit-learn) to implement the proposed methods.

4.1. Synthetic Data. In this subsection, we demonstrate that the cyk-means and the fixed cyk-means can partition synthetic data defined in cylindrical coordinates. The dataset used in this experiment has three clusters, as shown in Figure 1(a). The data points in each cluster are generated from the probability distribution denoted by (4) with the parameters shown in Table 1. Figures 1(b), 1(c), and 1(d) show the clustering results of the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means, respectively. We can see that the cyk-means and the fixed cyk-means properly partition the dataset into each cluster. On the other hand, the k-means regards the two upper-right clusters as one cluster and unsuccessfully partitions the dataset. Table 2 shows the parameters estimated by the cyk-means, the fixed cyk-means, and the k-means. Only the cyk-means estimates the concentration parameters and the variances, and the values it estimates are close to the true values. The cyk-means most appropriately estimates the



Figure 1: (a) Scatter plot of the synthetic dataset including three clusters. (b), (c), (d) Clustering results of the cyk-means, the fixed cyk-means, and the k-means. The three clusters are shown with circles, triangles, and crosses.

Table 1: Parameters of the dataset.

j   N_j   μ_θj   κ_j   μ_rj   σ_rj   μ_zj   σ_zj
1   200   0      8     1      0.2    0.5    0.1
2   300   −2     4     1.5    0.15   −0.5   0.2
3   400   2      12    0.5    0.1    1      0.2

j is a cluster number; μ_θj is the azimuth of the centroid of the jth cluster, μ_θj = arctan(u_yj / u_xj).

number of data points in each cluster. The fixed cyk-means estimates all the means most accurately, and the cyk-means also calculates all the means closely. These results show that both the cyk-means and the fixed cyk-means estimate all the means with sufficient accuracy.
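As an aside, the generation of this dataset (a von Mises azimuth and Gaussian radius and height per cluster, with the parameters of Table 1) can be sketched in Python with NumPy; the function and variable names here are ours.

```python
import numpy as np

def sample_cluster(n, mu_theta, kappa, mu_r, sigma_r, mu_z, sigma_z, rng):
    """Draw n points of one cylindrical cluster: von Mises azimuth,
    Gaussian radius, and Gaussian height."""
    theta = rng.vonmises(mu_theta, kappa, n)
    r = rng.normal(mu_r, sigma_r, n)
    z = rng.normal(mu_z, sigma_z, n)
    return theta, r, z

rng = np.random.default_rng(0)
# Cluster parameters (N_j, mu_theta_j, kappa_j, mu_r_j, sigma_r_j, mu_z_j, sigma_z_j), from Table 1
params = [(200, 0.0, 8, 1.0, 0.2, 0.5, 0.1),
          (300, -2.0, 4, 1.5, 0.15, -0.5, 0.2),
          (400, 2.0, 12, 0.5, 0.1, 1.0, 0.2)]
clusters = [sample_cluster(*p, rng) for p in params]
theta = np.concatenate([c[0] for c in clusters])  # 900 azimuths in total
```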

In the next experiment, we examine the effectiveness of the proposed methods (the cyk-means and the fixed cyk-means with κ = 15, σ_r = 0.1, and σ_z = 0.1) compared to the k-means and the kernel k-means with a radial basis function.

The parameter of the radial basis function is γ = 0.1. The synthetic data have K clusters and are defined in cylindrical coordinates. The number of data points in each cluster is 200. The mean azimuth of the jth cluster, arctan(u_y/u_x), is a random number in [0, 2π). The concentration parameter κ_j is a random number in [5, 30). The mean radius of the jth cluster, μ_rj, is a random number in [1, 4). The mean height of the jth cluster, μ_zj, is a random number in [−1.5, 1.5). The standard deviations σ_rj and σ_zj are random numbers in [0.05, 0.2).

Figure 2 shows the relationship between the number of clusters and the adjusted Rand index (ARI). ARI evaluates the performance of clustering algorithms [24]; when ARI = 1, all data points belong to their true clusters. The figure shows that the cyk-means has the largest ARI in almost all cases. The fixed cyk-means performs better than the kernel k-means and the k-means, and the k-means performs the worst. In conclusion, the cyk-means most accurately partitions synthetic data defined in cylindrical coordinates, and the fixed cyk-means also performs well.
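The ARI can be computed with scikit-learn, one of the libraries the experiments use. The toy labelings below are our own illustration of two of its properties: invariance to a renaming of cluster indices, and a score of 0 for an uninformative constant partition.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
same_partition = [1, 1, 1, 0, 0, 0, 2, 2, 2]   # identical grouping, renamed labels
one_cluster = [0] * 9                          # everything in a single cluster

print(adjusted_rand_score(true_labels, same_partition))  # 1.0
print(adjusted_rand_score(true_labels, one_cluster))     # 0.0
```

An ARI of 1 therefore means the recovered partition matches the ground truth up to a permutation of the cluster indices, which is exactly the sense of "all data points belong to their true clusters" above.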


Table 2: Parameters estimated by the cyk-means, the fixed cyk-means, and the k-means.

Method            j   N_j      μ_θj           κ_j    μ_rj    σ_rj    μ_zj     σ_zj
cyk-means         1   200      −1.960 × 10⁻³  7.523  0.9746  0.1942  0.5065   0.09417
                  2   300      −1.997         3.979  1.495   0.1492  −0.5070  0.1866
                  3   400      1.986          12.69  0.4966  0.1004  1.005    0.1997
Fixed cyk-means   1   200.2    −5.124 × 10⁻⁵  —      1.001   —       0.4995   —
                  2   299.9    −2.000         —      1.501   —       −0.5056  —
                  3   400      1.996          —      0.4998  —       0.9995   —
k-means           1   184.85   −0.6627        —      1.091   —       0.1871   —
                  2   253.65   −1.974         —      1.358   —       −0.4974  —
                  3   461.5    1.704          —      0.4406  —       0.9477   —

The results are the mean of twenty runs with randomly generated initial values. j is a cluster number; μ_θj is the azimuth of the centroid of the jth cluster, μ_θj = arctan(u_yj / u_xj). The best estimations are in bold.


Figure 2: Relationship between the number of clusters K and the adjusted Rand index (ARI) for the cyk-means, the fixed cyk-means, the kernel k-means, and the k-means. The vertical and horizontal axes indicate ARI and the number of clusters K, respectively. The results are the mean of 200 runs on randomly generated synthetic data for each K.

4.2. Real World Data. We show the performances of the proposed methods for the iris dataset (http://mlearn.ics.uci.edu/databases/iris/) and the segmentation benchmark dataset (http://www.ntu.edu.sg/home/asjfcai/Benchmark_Website/benchmark_index.html) [25]. The iris dataset has 150 data points in three classes of irises. Each data point consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The segmentation benchmark dataset consists of 100 images from the Berkeley segmentation database [26] and ground truths generated by manual labeling.

Table 3 depicts the ARI scores of the cyk-means, the fixed cyk-means, the k-means, and the kernel k-means for the iris dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1; γ of the radial basis function of the kernel k-means is 0.01. In this experiment, we use only three attributes of the iris dataset because the proposed methods are specialized for 3-dimensional data. Furthermore, we transform the three-attribute dataset into a zero-mean dataset. In all cases, the performance of the cyk-means is lower than that of the other methods. Conversely, in almost all cases, the performance of the fixed cyk-means is the best. However, the differences in performance between the fixed cyk-means, the k-means, and the kernel k-means are not large.

Table 4 shows the ARI scores of the cyk-means, the fixed cyk-means, and the k-means for seven images in the segmentation benchmark dataset. The parameters of the fixed cyk-means are κ = 15, σ_r = 0.1, and σ_z = 0.1. To evaluate the performances of the cyk-means and the fixed cyk-means, we convert the images from RGB color to HSV color. When we cluster the dataset by the k-means, we use images represented by RGB color and by HSV color. In this experiment, we compare a clustering result with a ground truth using the ARI score. We set the number of clusters K to the number of segments in the ground truth. In all cases, the fixed cyk-means stably shows good performance. The cyk-means shows much better or much worse performance than the other methods; in other words, its performance is unstable. This instability is likely caused by the cyk-means being more easily trapped in a local minimum because it has more parameters.

4.3. Application to Color Image Quantization. We apply the cyk-means and the fixed cyk-means to color image quantization and compare the results to those using the k-means. For the proposed methods, we convert images from RGB color space to HSV color space before quantization, whereas an image processed by the k-means is represented using RGB. Figure 3 contains the four test images from the Berkeley segmentation database [26] and their quantization results. The original color images have sizes of 481 × 321 or 321 × 481 and are quantized into three colors. The quantization results are generated by the cyk-means, the fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1, and the k-means. The color of a pixel in the quantized image is the value of the centroid of the cluster that contains the pixel.
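The RGB-to-HSV conversion used before quantization maps naturally onto cylindrical coordinates, with hue as the azimuth, saturation as the radius, and value as the height, as described in the Introduction. A minimal sketch with Python's standard colorsys module; the hue-to-angle scaling and the function name are our assumptions:

```python
import colorsys
import math

def rgb_to_cylindrical(rgb):
    """Map an RGB triple (floats in [0, 1]) to cylindrical coordinates:
    hue -> azimuth theta, saturation -> radius r, value -> height z."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return 2 * math.pi * h, s, v  # hue scaled to an angle in [0, 2*pi)

theta, r, z = rgb_to_cylindrical((1.0, 0.0, 0.0))  # pure red: theta = 0, r = 1, z = 1
```

Applying this conversion per pixel yields the (θ, r, z) data points that the cyk-means and the fixed cyk-means then cluster.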


Table 3: Comparison of the performances of the four methods on the iris dataset.

Attributes                                cyk-means   Fixed cyk-means   k-means   Kernel k-means
Sepal length, sepal width, petal length   0.5543      0.6961            0.6569    0.6614
Sepal width, petal length, petal width    0.5627      0.7900            0.7462    0.7590
Petal length, petal width, sepal length   0.4179      0.7114            0.7302    0.7359
Petal width, sepal length, sepal width    0.5171      0.6121            0.6104    0.6099

ARI values are the mean of 200 runs with random initial values. The best results are in bold. "Attributes" indicates the three attributes used in clustering.

Table 4: Comparison of the performances of the four methods on the segmentation benchmark dataset.

Number           cyk-means   Fixed cyk-means   k-means (HSV)   k-means (RGB)
12003 (K = 4)    0.6397      0.3744            0.3695          0.2277
69020 (K = 2)    0.1611      0.5247            0.07680         0.1406
80099 (K = 2)    0.9205      0.7960            0.6544          0.04890
103029 (K = 3)   0.4697      0.5934            0.1794          0.09143
106047 (K = 2)   0.3210      0.2988            0.04155         0.02332
310007 (K = 7)   0.1548      0.3338            0.3075          0.3396
353013 (K = 4)   0.6125      0.5993            0.5020          0.4683

ARI values are the mean of 200 runs with random initial values. The best results are in bold. "Number" indicates the number of the image; "k-means (HSV)" and "k-means (RGB)" indicate clustering the HSV and the RGB color images, respectively, by the k-means.

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results of the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers and the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and to part of the green peppers.

Image 299091 consists of some sky with cloud, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white cloud into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful: it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three; thus, the cyk-means produces a dead unit. This is because the concentration parameter becomes small and the variances become large when a distribution of data points visually appears to consist of only a few clusters. As a result, a few clusters absorb all data points, and dead units (empty clusters) appear even if we set the number of clusters to a large value. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.

Furthermore, the initial parameters κ and σ of the fixed cyk-means can control the quantization results. Figure 4 shows the quantization results generated by the fixed cyk-means with different parameters. The original image in Figure 4 consists of two objects: a red fish and the arms of an anemone. The fixed cyk-means with κ = 25, σ_r = 0.1, and σ_z = 0.1 cannot extract the red fish, as shown in the middle image of Figure 4. However, the fixed cyk-means with κ = 50, σ_r = 0.5, and σ_z = 0.5 extracts the red fish, as shown in the right image of Figure 4. This is because a large κ and/or large variances relatively increase the cosine similarity term of (12), and consequently the clustering focuses more on the hue element.

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. The quantization results of the fixed cyk-means differ from those generated by the k-means because the Euclidean metric cannot take the hue into account.

5. Conclusion and Discussion

In this study, we develop the cyk-means and the fixed cyk-means, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results using real world data show that the fixed cyk-means has equal or better performance than the k-means and



Figure 3: Quantization results. The first column contains the original images. The second, third, and fourth columns contain the quantization results generated by the cyk-means, the fixed cyk-means, and the k-means, respectively. All original images are clustered with K = 3.

the kernel k-means. In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

The experiments that partitioned synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good estimates of the parameters and a good clustering of the data. The results of the second experiment show that the cyk-means performs the best when clustering synthetic data. However, in the experiment using real world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or a small cluster containing few data points. Thus, the cyk-means may not be appropriate for actual applications.

The fixed cyk-means will be an effective method for actual applications. The fixed cyk-means is stable and performs well when we apply it to clustering of synthetic data, real world data, and color image quantization. Furthermore, the fixed cyk-means hardly ever makes dead units because the number of its parameters is smaller than that of the cyk-means. The fixed cyk-means also requires less computational time than the cyk-means while producing similar results.

In future work, we will improve the performance of the proposed methods. Like the k-means, the proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization.


Figure 4: Quantization results of the fixed cyk-means with different parameters. The left image is the original; the middle and the right images are quantized with K = 3 (middle: κ = 25, σ_r = 0.1, σ_z = 0.1; right: κ = 50, σ_r = 0.5, σ_z = 0.5).

The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves the clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28-30]. It inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit to the input and to unlearn the weights of the second winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve the performance and reduce the effect of these intrinsic problems.
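For reference, the k-means++ seeding of Arthur and Vassilvitskii can be sketched as follows. This is our own Python rendition (function and variable names are ours), and it could equally be used to initialize the centroids of the proposed methods; it assumes the data contain at least K distinct points.

```python
import numpy as np

def kmeans_pp_init(X, K, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(K - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Because far-away points are more likely to be chosen, the initial centers tend to be spread across the clusters, which mitigates both the ill-initialization and the dead unit problems discussed above.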

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing the algorithm.

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.

[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943-954, 2014.

[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, University of California Press, Berkeley, Calif, USA, 1967.

[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2008.

[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143-175, 2001.

[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pp. 19-28, New York, NY, USA, August 2003.

[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345-1382, 2005.

[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780-784, 2002.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849-856, MIT Press, Cambridge, Mass, USA, 2001.

[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2002.

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "Training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144-152, New York, NY, USA, July 1992.

[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800-810, 2001.

[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205-215, 1998.

[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92-105, 2008.

[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297-307, 1982.

[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260-271, 2011.

[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785-1791, 2002.

[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8-16, 2015.

[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in Vision, Image and Signal Processing, IEE Proceedings, vol. 151, pp. 511-524.

[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237-249, 2005.

[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1-19, 2014.

[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261-269, 2005.

[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, vol. 27, no. 1, pp. 177-190, 2012.

[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper "Principal component analysis for clustering gene expression data"," Bioinformatics, vol. 17, no. 9, pp. 763-774, 2001.

[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo, ICME 2013, July 2013.

[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416-423, July 2001.

[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 1027-1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.

[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of 1993 IEEE International Conference on Neural Networks (ICNN '93), vol. 1, pp. 117-124, San Diego, CA, USA, March 1993.

[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining, ICDM 2010, pp. 531-540, December 2010.

[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79-104, 2012.

[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636-649, 1993.

[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565-574, 2003.


Page 6: ResearchArticle A Clustering Method for Data in ...downloads.hindawi.com/journals/mpe/2017/3696850.pdf · ResearchArticle A Clustering Method for Data in Cylindrical Coordinates KazuhisaFujita1,2

6 Mathematical Problems in Engineering

minus15minus10

minus0500

0510

minus15minus10

minus0500

0510

15

minus10

minus05

00

05

10

15

(a)

minus15minus10

minus0500

0510

minus15minus10

minus0500

0510

15

minus10

minus05

00

05

10

15

cyk-means

(b)

minus15minus10

minus0500

0510

minus15minus10

minus0500

0510

15

minus10

minus05

00

05

10

15

Fixed cyk-means

(c)

minus15minus10

minus0500

0510

minus15minus10

minus0500

0510

15

minus10

minus05

00

05

10

15

k-means(d)

Figure 1 (a) Scatter plot of the synthetic dataset including three clusters (b) (c) (d) Clustering results of the cyk-means the fixed cyk-meansand the 119896-means The three clusters are shown with circles triangles and crosses

Table 1 Parameters of dataset

119895 119873119895 120583120579119895 120581119895 120583119903119895 120590119903119895 120583119911119895 1205901199111198951 200 0 8 1 02 05 012 300 minus2 4 15 015 minus05 023 400 2 12 05 01 1 02119895 is a cluster number 120583120579119895 is the azimuth of the centroid of the 119895th cluster120583120579119895 = arctan(119906119910119895119906119909119895)

Nj denotes the number of data points in each cluster. The fixed cyk-means estimates all the means most accurately, and the cyk-means also estimates them closely. These results show that the cyk-means and the fixed cyk-means both estimate all the means with sufficient accuracy.

In the next experiment we examine the effectiveness of the proposed methods (the cyk-means and the fixed cyk-means with κ = 15, σr = 0.1, and σz = 0.1) compared to the k-means and the kernel k-means with a radial basis function. The parameter of the radial basis function is γ = 0.1. The synthetic data have K clusters and are defined in cylindrical coordinates. The number of data points in each cluster is 200. The mean azimuth of the jth cluster, arctan(uy/ux), is a random number in [0, 2π). The concentration parameter κj is a random number in [5, 30]. The mean radius of the jth cluster μrj is a random number in [1, 4]. The mean height of the jth cluster μzj is a random number in [−1.5, 1.5]. The standard deviations σrj and σzj are random numbers in [0.05, 0.2].
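A dataset of this form can be generated directly with Python's standard library, since `random.vonmisesvariate` draws von Mises samples. The following sketch uses the parameter ranges stated above; the function names are ours, not the paper's:

```python
import math
import random

def sample_cylindrical_cluster(n, mu_theta, kappa, mu_r, sigma_r, mu_z, sigma_z):
    """Draw n points (theta, r, z): theta ~ von Mises, r and z ~ Gaussian."""
    return [(random.vonmisesvariate(mu_theta, kappa),
             random.gauss(mu_r, sigma_r),
             random.gauss(mu_z, sigma_z)) for _ in range(n)]

def random_dataset(K):
    """Synthetic dataset with K clusters, parameters drawn from the stated ranges."""
    data, labels = [], []
    for j in range(K):
        mu_theta = random.uniform(0.0, 2.0 * math.pi)   # mean azimuth
        kappa    = random.uniform(5.0, 30.0)            # concentration parameter
        mu_r     = random.uniform(1.0, 4.0)             # mean radius
        mu_z     = random.uniform(-1.5, 1.5)            # mean height
        sigma_r  = random.uniform(0.05, 0.2)
        sigma_z  = random.uniform(0.05, 0.2)
        data.extend(sample_cylindrical_cluster(200, mu_theta, kappa,
                                               mu_r, sigma_r, mu_z, sigma_z))
        labels.extend([j] * 200)
    return data, labels

data, labels = random_dataset(3)
print(len(data), len(set(labels)))  # 600 3
```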

Figure 2 shows the relationship between the number of clusters and the adjusted Rand index (ARI). ARI evaluates the performance of clustering algorithms [24]; when ARI = 1, all data points belong to their true clusters. The figure shows that the cyk-means has the largest ARI in almost all cases. The fixed cyk-means performs better than the kernel k-means and the k-means, and the k-means performs the worst. In conclusion, the cyk-means most accurately partitions synthetic data defined in cylindrical coordinates, and the fixed cyk-means also performs well.
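For reference, the ARI used above can be computed from the contingency table of the true and predicted labelings. The following is a minimal pure-Python sketch of the standard formula (an illustration, not code from the paper):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (Index - Expected) / (Max - Expected), computed from the
    contingency table of the two labelings."""
    n = len(labels_true)
    pair_sum = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (pair_sum - expected) / (max_index - expected)

# Identical partitions up to renaming of cluster labels give ARI = 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```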


Table 2: Parameters estimated by the cyk-means, the fixed cyk-means, and the k-means.

Method            j   Nj       μθj              κj      μrj      σrj      μzj       σzj
cyk-means         1   200      −1.960 × 10^−3   7.523   0.9746   0.1942    0.5065   0.09417
                  2   300      −1.997           3.979   1.495    0.1492   −0.5070   0.1866
                  3   400       1.986           12.69   0.4966   0.1004    1.005    0.1997
Fixed cyk-means   1   200.2    −5.124 × 10^−5   —       1.001    —         0.4995   —
                  2   299.9    −2.000           —       1.501    —        −0.5056   —
                  3   400       1.996           —       0.4998   —         0.9995   —
k-means           1   184.85   −0.6627          —       1.091    —         0.1871   —
                  2   253.65   −1.974           —       1.358    —        −0.4974   —
                  3   461.5     1.704           —       0.4406   —         0.9477   —

The results are the mean of twenty runs on randomly generated initial values. j is a cluster number. μθj is the azimuth of the centroid of the jth cluster, μθj = arctan(uyj/uxj).


Figure 2: Relationship between the number of clusters K and the adjusted Rand index (ARI). The vertical and horizontal axes indicate ARI and the number of clusters K, respectively. The four curves correspond to the cyk-means, the fixed cyk-means, the kernel k-means, and the k-means. The results are the mean of 200 runs on randomly generated synthetic data for each K.

4.2. Real World Data. We show the performances of the proposed methods for the iris dataset (http://mlearn.ics.uci.edu/databases/iris) and the segmentation benchmark dataset (http://www.ntu.edu.sg/home/asjfcai/Benchmark_Website/benchmark_index.html) [25]. The iris dataset has 150 data points of three classes of irises. Each data point consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The segmentation benchmark dataset consists of 100 images from the Berkeley segmentation database [26] and ground truths generated by manual labeling.

Table 3 depicts the ARI scores of the cyk-means, the fixed cyk-means, the k-means, and the kernel k-means for the iris dataset. The parameters of the fixed cyk-means are κ = 15, σr = 0.1, and σz = 0.1; γ of the radial basis function of the kernel k-means is 0.01. In this experiment we use only three attributes of the iris dataset because the proposed methods are specialized for 3-dimensional data. Furthermore, we transform this three-attribute dataset into a zero-mean dataset. In all cases the performance of the cyk-means is lower than the other methods. Conversely, in almost all cases the performance of the fixed cyk-means is the best. However, the difference in performance between the fixed cyk-means, the k-means, and the kernel k-means is not large.

Table 4 shows the ARI scores of the cyk-means, the fixed cyk-means, and the k-means for seven images in the segmentation benchmark dataset. The parameters of the fixed cyk-means are κ = 15, σr = 0.1, and σz = 0.1. To evaluate the performances of the cyk-means and the fixed cyk-means, we convert images from RGB color to HSV color. When we cluster the dataset by the k-means, we use images represented by RGB color and HSV color. In this experiment we compare a clustering result with a ground truth using the ARI score, and we set the number of clusters K to the number of segments in the ground truth. In all cases the fixed cyk-means stably shows good performance. The cyk-means performs much better or much worse than the other methods; in other words, the cyk-means shows unstable performance. This instability is likely caused by the cyk-means being more easily trapped in a local minimum because it has more parameters.

4.3. Application to Color Image Quantization. We apply the cyk-means and the fixed cyk-means to color image quantization and compare the results to those using the k-means. We convert images from RGB color space to HSV color space before quantization with the proposed methods, whereas an image processed by the k-means is represented using RGB. Figure 3 contains the four test images from the Berkeley segmentation database [26] and their quantization results. The original color images have sizes of 481 × 321 or 321 × 481 and are quantized into three colors. These quantization results are generated by the cyk-means, the fixed cyk-means with κ = 25, σr = 0.1, and σz = 0.1, and the k-means. The color of a pixel in the quantized image is the value of the centroid of the cluster that contains the pixel.
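As a concrete illustration of this procedure, the sketch below maps HSV pixels to cylindrical coordinates (θ = 2πH, r = S, z = V) and runs a fixed cyk-means style loop. The similarity is our reading of the paper's log-likelihood criterion (a von Mises term for the azimuth plus Gaussian terms for radius and height, with additive constants dropped); all function names and the optional `init` argument are ours, not the paper's:

```python
import colorsys
import math
import random

def to_cylindrical(rgb):
    """Map an RGB pixel (components in 0..1) to (theta, r, z) = (2*pi*H, S, V)."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return (2.0 * math.pi * h, s, v)

def similarity(point, centroid, kappa, sigma_r, sigma_z):
    """Similarity from the assumed log likelihood: a von Mises term for the
    azimuth and Gaussian terms for radius and height (constants dropped)."""
    theta, r, z = point
    mu_theta, mu_r, mu_z = centroid
    return (kappa * math.cos(theta - mu_theta)
            - (r - mu_r) ** 2 / (2.0 * sigma_r ** 2)
            - (z - mu_z) ** 2 / (2.0 * sigma_z ** 2))

def fixed_cyk_means(points, K, kappa=25.0, sigma_r=0.1, sigma_z=0.1,
                    iters=50, init=None):
    centroids = list(init) if init else random.sample(points, K)
    for _ in range(iters):
        # Assignment step: each point joins its most similar centroid.
        assign = [max(range(K),
                      key=lambda j: similarity(p, centroids[j],
                                               kappa, sigma_r, sigma_z))
                  for p in points]
        # Update step: circular mean for the azimuth, arithmetic means for r, z.
        for j in range(K):
            members = [p for p, a in zip(points, assign) if a == j]
            if not members:
                continue  # dead unit: keep the previous centroid
            sx = sum(math.cos(t) for t, _, _ in members)
            sy = sum(math.sin(t) for t, _, _ in members)
            centroids[j] = (math.atan2(sy, sx),
                            sum(r for _, r, _ in members) / len(members),
                            sum(z for _, _, z in members) / len(members))
    return centroids, assign
```

The update step uses the circular mean (atan2 of summed sines and cosines) for the azimuth, which is why the curved-distribution centroid problem described in the introduction does not arise here.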


Table 3: Comparison of performances of the four methods using the iris dataset.

Attributes                                 cyk-means   Fixed cyk-means   k-means   Kernel k-means
Sepal length, sepal width, petal length    0.5543      0.6961*           0.6569    0.6614
Sepal width, petal length, petal width     0.5627      0.7900*           0.7462    0.7590
Petal length, petal width, sepal length    0.4179      0.7114            0.7302    0.7359*
Petal width, sepal length, sepal width     0.5171      0.6121*           0.6104    0.6099

ARI values are the mean of 200 runs with random initial values; the best score in each row is marked with an asterisk. "Attributes" indicates the three attributes used in clustering.

Table 4: Comparison of performances of the four methods using the segmentation benchmark dataset.

Number           cyk-means   Fixed cyk-means   k-means (HSV)   k-means (RGB)
12003 (K = 4)    0.6397*     0.3744            0.3695          0.2277
69020 (K = 2)    0.1611      0.5247*           0.07680         0.1406
80099 (K = 2)    0.9205*     0.7960            0.6544          0.04890
103029 (K = 3)   0.4697      0.5934*           0.1794          0.09143
106047 (K = 2)   0.3210*     0.2988            0.04155         0.02332
310007 (K = 7)   0.1548      0.3338            0.3075          0.3396*
353013 (K = 4)   0.6125*     0.5993            0.5020          0.4683

ARI values are the mean of 200 runs with random initial values; the best score in each row is marked with an asterisk. "Number" indicates the number of the image. "k-means (HSV)" and "k-means (RGB)" indicate that we cluster HSV and RGB color images by the k-means, respectively.

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results using the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers and the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and part of the green peppers.

Image 299091 consists of sky with clouds, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white clouds into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful: it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three; thus the cyk-means produces a dead unit. This is because the concentration parameter becomes small and the variances become large if a distribution of data points visually appears to consist of only a few clusters. As a result, a few clusters include all data points, and dead units (empty clusters) appear even if we fix the number of clusters to a large number. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.

Furthermore, the initial parameters κ and the σs of the fixed cyk-means can control the quantization results. Figure 4 shows the quantization results generated by the fixed cyk-means using different parameters. The original image in Figure 4 consists of two objects: the red fish and the arms of an anemone. The fixed cyk-means with κ = 25, σr = 0.1, and σz = 0.1 cannot extract the red fish, as shown in the middle image of Figure 4. However, the fixed cyk-means with κ = 50, σr = 0.5, and σz = 0.5 extracts the red fish, as shown in the right image of Figure 4. This is because a large κ and/or large variances relatively increase the cosine similarity term of (12), and consequently clustering focuses more on the hue element.

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. The quantization results of the fixed cyk-means differ from those generated by the k-means because the Euclidean metric cannot take the hue into account.

5. Conclusion and Discussion

In this study we develop the cyk-means and the fixed cyk-means, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results using real world data show that the fixed cyk-means has equal or better performance than the k-means and the kernel k-means.



Figure 3: Quantization results. The first column contains the original images. The second, third, and fourth columns contain the quantization results generated by the cyk-means, the fixed cyk-means, and the k-means, respectively. All original images are clustered with K = 3.

In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

The experiments that partitioned synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good estimates of the parameters and good clusterings of the data. The results of the second experiment show that the cyk-means performs the best when clustering synthetic data. However, in the experiment using real world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or a small cluster containing few data points. Thus the cyk-means may not be appropriate for actual applications.

The fixed cyk-means will be an effective method for actual applications. The fixed cyk-means is stable and performs well when we apply it to clusterings of synthetic data, real world data, and color image quantization. Furthermore, the fixed cyk-means hardly makes dead units because the number of its parameters is smaller than that of the cyk-means. The fixed cyk-means also requires less computational time than the cyk-means while producing similar results.

In future work we will improve the performance of the proposed methods. The proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization, similar to the k-means.


Figure 4: Quantization results of the fixed cyk-means with different parameters. The left image is the original. The middle image (κ = 25, σr = 0.1, σz = 0.1) and the right image (κ = 50, σr = 0.5, σz = 0.5) are quantized with K = 3.

The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves the clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28–30]. It inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit to the input and to unlearn the weights of the 2nd winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve the performance and reduce the effect of these intrinsic problems.
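The k-means++ seeding idea mentioned above, choosing each new center with probability proportional to its squared distance from the nearest already-chosen center, can be sketched in a few lines (a generic illustration, not code from [27]):

```python
import random

def kmeans_pp_seeds(points, K):
    """k-means++ seeding: the first center is uniform; each further center is
    drawn with probability proportional to the squared Euclidean distance
    from its nearest already-chosen center."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [random.choice(points)]
    while len(centers) < K:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(random.choices(points, weights=weights, k=1)[0])
    return centers

seeds = kmeans_pp_seeds([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

With well-separated groups, the second seed almost surely comes from the far group, which is what makes the subsequent iterations less sensitive to initialization.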

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing the algorithm.

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.

[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943–954, 2014.

[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.

[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 19–28, New York, NY, USA, August 2003.

[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.

[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849–856, MIT Press, Cambridge, Mass, USA, 2001.

[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2002.

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144–152, New York, NY, USA, July 1992.

[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800–810, 2001.

[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205–215, 1998.

[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92–105, 2008.

[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.

[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260–271, 2011.

[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785–1791, 2002.

[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8–16, 2015.

[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in IEE Proceedings: Vision, Image and Signal Processing, vol. 151, pp. 511–524.

[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237–249, 2005.

[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1–19, 2014.

[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261–269, 2005.

[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of Is(x)," Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.

[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper 'Principal component analysis for clustering gene expression data'," Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.

[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), July 2013.

[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.

[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.

[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of the 1993 IEEE International Conference on Neural Networks (ICNN '93), vol. 1, pp. 117–124, San Diego, CA, USA, March 1993.

[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 531–540, December 2010.

[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79–104, 2012.

[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.

[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565–574, 2003.


The 119896-means++ method proposed by Athur and Vassilvitskii[27] solves the ill-initialization problem of 119896-means andimproves the clustering performance by obtaining an initialset of cluster centers that is close to the optimal solutionThe conscience mechanism improves the performance ofcompetitive learning and clustering algorithms [28ndash30] Itinserts a bias into the competition process so that eachunit can win the competition with equal probability Xu etal [31] proposed an algorithm based on competitive learningcalled rival penalized competitive learning [2 32] whichdetermines the appropriate number of clusters and solves thedead unit problem The strategy of rival penalized competi-tive learning is to adapt the weights of the winning unit to theinput and to unlearn the weights of the 2nd winner Byincorporating the approaches in these algorithms into theproposed methods we will improve the performance andreduce the effect of the intrinsic problems

Conflicts of Interest

The author declares that there are no conflicts of interestregarding the publication of this article

Acknowledgments

The author would like to thank Toya Teramoto of the Univer-sity of Electro-Communications for testing their algorithm

References

[1] C M Anderson-Cook and B S Otieno ldquoCylindrical datardquoEcological Statistics 2013

[2] N-Y An and C-M Pun ldquoColor image segmentation usingadaptive color quantization and multiresolution texture char-acterizationrdquo Signal Image and Video Processing vol 8 no 5pp 943ndash954 2014

[3] J B MacQueen ldquoSome methods for classification and analysisof multivariate observationsrdquo in Proceedings of the 5th BerkeleySymposium onMathematical Statistics and Probability vol 1 pp281ndash297 University of California Press Berkeley Calif USA1967

[4] XWu V Kumar J R Quinlan et al ldquoTop 10 algorithms in dataminingrdquo Knowledge and Information Systems vol 14 no 1 pp1ndash37 2008

[5] I S Dhillon and D S Modha ldquoConcept decompositions forlarge sparse text data using clusteringrdquo Machine Learning vol42 no 1-2 pp 143ndash175 2001

[6] A Banerjee I Dhillon J Ghosh and S Sra ldquoGenerativemodel-based clustering of directional datardquo in Proceedings ofthe 9th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining KDD rsquo03 pp 19ndash28 New York NYUSA August 2003

[7] A Banerjee I SDhillon J Ghosh and S Sra ldquoClustering on theunit hypersphere using vonMises-Fisher distributionsrdquo Journalof Machine Learning Research vol 6 pp 1345ndash1382 2005

[8] M Girolami ldquoMercer kernel-based clustering in feature spacerdquoIEEE Transactions on Neural Networks vol 13 no 3 pp 780ndash784 2002

[9] A Y Ng M I Jordan and Y Weiss ldquoOn spectral clusteringanalysis and an algorithmrdquo in Advances in Neural InformationProcessing Systems pp 849ndash856 MIT Press Cambridge MassUSA 2001

[10] A Ben-Hur D Horn H T Siegelmann and V Vapnik ldquoSup-port vector clusteringrdquo Journal of Machine Learning Researchvol 2 pp 125ndash137 2002

Mathematical Problems in Engineering 11

[11] B E Boser I M Guyon and V N Vapnik ldquoTraining algorithmfor optimal margin classifiersrdquo in Proceedings of the 5th AnnualACMWorkshop on Computational LearningTheory (COLT rsquo92)pp 144ndash152 New York NY USA July 1992

[12] Y Deng and B S Manjunath ldquoUnsupervised segmentation ofcolor-texture regions in images and videordquo IEEE Transactionson Pattern Analysis and Machine Intelligence vol 23 no 8 pp800ndash810 2001

[13] C-K Yang and W-H Tsai ldquoColor image compression usingquantization thresholding and edge detection techniques allbased on the moment-preserving principlerdquo Pattern Recogni-tion Letters vol 19 no 2 pp 205ndash215 1998

[14] N-C Yang W-H Chang C-M Kuo and T-H Li ldquoA fastMPEG-7 dominant color extraction with new similarity mea-sure for image retrievalrdquo Journal of Visual Communication andImage Representation vol 19 no 2 pp 92ndash105 2008

[15] P Heckbert ldquoColor image quantization for frame buffer dis-playrdquo Computer Graphics vol 16 no 3 pp 297ndash307 1982

[16] M Emre Celebi ldquoImproving the performance of k-means forcolor quantizationrdquo Image and Vision Computing vol 29 no 4pp 260ndash271 2011

[17] D Ozdemir and L Akarun ldquoA fuzzy algorithm for color quan-tization of imagesrdquo Pattern Recognition vol 35 no 8 pp 1785ndash1791 2002

[18] X Zhao Y Li and Q Zhao ldquoMahalanobis distance based onfuzzy clustering algorithm for image segmentationrdquo DigitalSignal Processing A Review Journal vol 43 pp 8ndash16 2015

[19] A Atsalakis N Papamarkos N Kroupis D Soudris and AThanailakis ldquoColour quantisation technique based on imagedecomposition and its embedded system implementationrdquo inProceedings of the Vision Image and Signal Processing IEEProceedings vol 151 pp 511ndash524

[20] C-H Chang P Xu R Xiao and T Srikanthan ldquoNew adaptivecolor quantization method based on self-organizing mapsrdquoIEEE Transactions on Neural Networks vol 16 no 1 pp 237ndash249 2005

[21] E J Palomo and E Domınguez ldquoHierarchical color quan-tization based on self-organizationrdquo Journal of MathematicalImaging and Vision vol 49 no 1 pp 1ndash19 2014

[22] M G Omran A P Engelbrecht and A Salman ldquoA color imagequantization algorithm based on particle swarm optimizationrdquoInformatica vol 29 no 3 pp 261ndash269 2005

[23] S Sra ldquoA short note on parameter approximation for vonMises-Fisher distributions and a fast implementation of 119868119904(119909)rdquoComputational Statistics vol 27 no 1 pp 177ndash190 2012

[24] K Y Yeung and W L Ruzzo ldquoDetails of the adjusted randindex and clustering algorithms supplement to the paperldquoPrincipal component analysis for clustering gene expressiondatardquordquo Bioinformatics vol 17 no 9 pp 763ndash774 2001

[25] H Li J Cai T N A Nguyen and J Zheng ldquoA benchmark forsemantic image segmentationrdquo in Proceedings of the 2013 IEEEInternational Conference on Multimedia and Expo ICME 2013July 2013

[26] D Martin C Fowlkes D Tal and J Malik ldquoA database ofhuman segmented natural images and its application to evalu-ating segmentation algorithms andmeasuring ecological statis-ticsrdquo in Proceedings of the 8th International Conference onComputer Vision pp 416ndash423 July 2001

[27] D Arthur and S Vassilvitskii ldquok-means++ the advantagesof careful seedingrdquo in Proceedings of the Eighteenth AnnualACM-SIAM Symposium on Discrete Algorithms SODA rsquo07

Society for Industrial and Applied Mathematics pp 1027ndash1035Philadelphia PA USA 2007

[28] D DeSieno ldquoAdding a conscience to competitive learningrdquo inProceedings of 1993 IEEE International Conference on NeuralNetworks (ICNN rsquo93) pp 117ndash124 vol1 San Diego CA USAMarch 1993

[29] C-D Wang J-H Lai and J-Y Zhur ldquoA conscience on-linelearning approach for kernel-based clusteringrdquo inProceedings ofthe 10th IEEE International Conference on Data Mining ICDM2010 pp 531ndash540 December 2010

[30] C-D Wang J-H Lai and J-Y Zhu ldquoConscience online learn-ing An efficient approach for robust kernel-based clusteringrdquoKnowledge and Information Systems vol 31 no 1 pp 79ndash1042012

[31] L Xu A Krzyzak and E Oja ldquoRival penalized competitivelearning for clustering analysis rbf net and curve detectionrdquoIEEE Transactions on Neural Networks vol 4 no 4 pp 636ndash649 1993

[32] TM Nair C L Zheng J L Fink R O Stuart andM GribskovldquoRival penalized competitive learning (RPCL) a topology-determining algorithm for analyzing gene expression datardquoComputational Biology and Chemistry vol 27 no 6 pp 565ndash574 2003

Submit your manuscripts athttpswwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: ResearchArticle A Clustering Method for Data in ...downloads.hindawi.com/journals/mpe/2017/3696850.pdf · ResearchArticle A Clustering Method for Data in Cylindrical Coordinates KazuhisaFujita1,2

8 Mathematical Problems in Engineering

Table 3. Comparison of the performances of the four methods using the iris dataset.

Attributes                              | cyk-means | Fixed cyk-means | k-means | Kernel k-means
Sepal length, sepal width, petal length | 0.5543    | 0.6961          | 0.6569  | 0.6614
Sepal width, petal length, petal width  | 0.5627    | 0.7900          | 0.7462  | 0.7590
Petal length, petal width, sepal length | 0.4179    | 0.7114          | 0.7302  | 0.7359
Petal width, sepal length, sepal width  | 0.5171    | 0.6121          | 0.6104  | 0.6099

ARI values are the means of 200 runs with random initial values. The best estimates are in bold. "Attributes" indicates the three attributes used in clustering.

Table 4. Comparison of the performances of the four methods using the segmentation benchmark dataset.

Number         | cyk-means | Fixed cyk-means | k-means (HSV) | k-means (RGB)
12003 (K = 4)  | 0.6397    | 0.3744          | 0.3695        | 0.2277
69020 (K = 2)  | 0.1611    | 0.5247          | 0.07680       | 0.1406
80099 (K = 2)  | 0.9205    | 0.7960          | 0.6544        | 0.04890
103029 (K = 3) | 0.4697    | 0.5934          | 0.1794        | 0.09143
106047 (K = 2) | 0.3210    | 0.2988          | 0.04155       | 0.02332
310007 (K = 7) | 0.1548    | 0.3338          | 0.3075        | 0.3396
353013 (K = 4) | 0.6125    | 0.5993          | 0.5020        | 0.4683

ARI values are the means of 200 runs with random initial values. The best estimates are in bold. "Number" indicates the number of the image; "k-means (HSV)" and "k-means (RGB)" indicate that we cluster HSV and RGB color images with the k-means, respectively.
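The scores in Tables 3 and 4 are adjusted Rand index (ARI) values, following Yeung and Ruzzo [24]. As a reference for how such scores are computed, the following is a minimal, self-contained sketch; the function name and structure are illustrative, not taken from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two flat cluster labelings.

    1.0 means identical partitions (up to label permutation); values
    near 0 mean agreement no better than chance; negative is possible.
    """
    n = len(labels_a)
    pair = Counter(zip(labels_a, labels_b))   # contingency-table cells
    a = Counter(labels_a)                     # row sums
    b = Counter(labels_b)                     # column sums
    sum_comb = sum(comb(c, 2) for c in pair.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total          # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                 # degenerate labelings
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` returns 1.0, since ARI is invariant to relabeling the clusters.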

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results using the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers with the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and to part of the green peppers.

Image 299091 consists of sky with cloud, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white cloud into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful: it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three; thus, the cyk-means produces a dead unit. This occurs because the concentration parameter becomes small and the variances become large when the distribution of data points visually consists of only a few clusters. As a result, a few clusters absorb all data points, and dead units (empty clusters) appear even if we fix the number of clusters to a large value. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.
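The dead-unit behavior described above can be made concrete with a toy hard-assignment step. The sketch below is our own illustration, not the paper's implementation: each point joins the cluster whose center is most similar to it, and any cluster that attracts no points is reported as a dead unit (empty cluster).

```python
def assign_and_find_dead_units(points, centers, similarity):
    """One hard-assignment step of a k-means-style loop.

    Each point joins the cluster whose center is most similar to it;
    clusters that attract no points are "dead units" (empty clusters).
    """
    members = {j: [] for j in range(len(centers))}
    for p in points:
        best = max(members, key=lambda j: similarity(p, centers[j]))
        members[best].append(p)
    dead = [j for j, pts in members.items() if not pts]
    return members, dead

# Toy 1-D example: a badly placed third center never wins a point.
sim = lambda p, c: -(p - c) ** 2        # negative squared distance
_, dead = assign_and_find_dead_units([0.0, 0.1, 0.2], [0.0, 0.1, 100.0], sim)
# dead == [2]
```

Once a cluster is empty, its parameters are never updated again, which is exactly the failure mode seen for image 295087.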

Furthermore, the initial parameters κ and σ of the fixed cyk-means can control the quantization results. Figure 4 shows quantization results generated by the fixed cyk-means using different parameters. The original image in Figure 4 consists of two objects: the red fish and the arms of an anemone. The fixed cyk-means with κ = 2.5, σ_r = 0.1, and σ_z = 0.1 cannot extract the red fish, as shown in the middle image of Figure 4. However, the fixed cyk-means with κ = 5.0, σ_r = 0.5, and σ_z = 0.5 extracts the red fish, as shown in the right image of Figure 4. This is because a large κ and/or large variances relatively increase the cosine similarity term of (12), and consequently clustering focuses more on the hue element.
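Equation (12) itself is not reproduced in this excerpt. The sketch below is our reading of the similarity, based on the description in the abstract (the log likelihood of a von Mises distribution on the azimuth times isotropic Gaussians on the radius and height) with constant normalization terms dropped; the function and argument names are illustrative. It shows why a large κ or large σ values shift the balance toward the cosine (hue) term.

```python
import math

def cyk_similarity(theta, r, z, mu, r_c, z_c, kappa, sigma_r, sigma_z):
    """Hypothetical cylindrical similarity: a von Mises term on the
    azimuth plus Gaussian terms on radius and height. Normalization
    constants are omitted; they do not change the arg-max over clusters."""
    return (kappa * math.cos(theta - mu)
            - (r - r_c) ** 2 / (2.0 * sigma_r ** 2)
            - (z - z_c) ** 2 / (2.0 * sigma_z ** 2))

# A point that matches cluster A in hue but not in radius, versus
# cluster B that matches in radius but not in hue. With kappa = 5.0 and
# sigma = 0.5 the hue agreement dominates and A wins; with kappa = 2.5
# and sigma = 0.1 the radius mismatch dominates and B wins.
point = (1.0, 0.8, 0.5)                      # (theta, r, z)
a_big = cyk_similarity(*point, 1.0, 0.5, 0.5, 5.0, 0.5, 0.5)
b_big = cyk_similarity(*point, 2.5, 0.8, 0.5, 5.0, 0.5, 0.5)
```

Under the small-parameter setting the same comparison reverses, which matches the behavior reported for Figure 4.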

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. The quantization results of the fixed cyk-means differ from those generated by the k-means; that is because the Euclidean metric cannot take the hue into account.

Figure 3. Quantization results. The first column contains the original images. The second, third, and fourth columns contain the quantization results generated by the cyk-means, the fixed cyk-means, and the k-means, respectively. All original images are clustered with K = 3.

5. Conclusion and Discussion

In this study, we develop the cyk-means and the fixed cyk-means, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results using real-world data show that the fixed cyk-means has performance equal to or better than that of the k-means and the kernel k-means. In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

The experiments that partitioned synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good estimates of the parameters and good clusterings of the data. The results of the second experiment show that the cyk-means performs best when clustering synthetic data. However, in the experiment using real-world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or small clusters containing few data points. Thus, the cyk-means may not be appropriate for practical applications.

The fixed cyk-means will be an effective method for practical applications. The fixed cyk-means is stable and performs well when we apply it to clustering of synthetic data, real-world data, and color image quantization. Furthermore, the fixed cyk-means hardly ever produces dead units because it has fewer free parameters than the cyk-means. The fixed cyk-means also requires less computational time than the cyk-means while producing similar results.

In future work, we will improve the performance of the proposed methods. The proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization, similarly to the k-means.

Figure 4. Quantization results of the fixed cyk-means with different parameters. The left image is the original. The middle (κ = 2.5, σ_r = 0.1, σ_z = 0.1) and right (κ = 5.0, σ_r = 0.5, σ_z = 0.5) images are quantized with K = 3.

The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28–30]. It inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit toward the input and to unlearn the weights of the second winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve the performance and reduce the effect of these intrinsic problems.
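As a concrete reference for the seeding strategy of [27], here is a minimal sketch of k-means++ initialization; this is our own illustrative code, not the authors' implementation.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: the first center is chosen uniformly at
    random; each subsequent center is drawn with probability
    proportional to the squared distance (D^2) to the nearest
    already-chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest current center.
        d2 = [min(sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                  for c in centers) for p in points]
        total = sum(d2)
        if total == 0.0:        # all points coincide with chosen centers
            centers.append(rng.choice(points))
            continue
        # Weighted draw: walk the cumulative D^2 mass until it crosses x.
        x, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= x:
                centers.append(p)
                break
    return centers
```

Because far-away points carry most of the D^2 mass, two well-separated blobs almost always each receive a seed, which is what makes this initialization robust.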

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing the algorithm.

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.
[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943–954, 2014.
[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.
[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.
[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 19–28, New York, NY, USA, August 2003.
[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.
[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.
[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849–856, MIT Press, Cambridge, Mass, USA, 2001.
[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2002.
[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144–152, New York, NY, USA, July 1992.
[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800–810, 2001.
[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205–215, 1998.
[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92–105, 2008.
[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.
[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260–271, 2011.
[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785–1791, 2002.
[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8–16, 2015.
[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in IEE Proceedings: Vision, Image and Signal Processing, vol. 151, pp. 511–524.
[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237–249, 2005.
[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1–19, 2014.
[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261–269, 2005.
[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.
[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper 'Principal component analysis for clustering gene expression data'," Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.
[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), July 2013.
[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.
[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.
[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of the IEEE International Conference on Neural Networks (ICNN), vol. 1, pp. 117–124, San Diego, CA, USA.
[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 531–540, December 2010.
[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79–104, 2012.
[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.
[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565–574, 2003.

Submit your manuscripts athttpswwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: ResearchArticle A Clustering Method for Data in ...downloads.hindawi.com/journals/mpe/2017/3696850.pdf · ResearchArticle A Clustering Method for Data in Cylindrical Coordinates KazuhisaFujita1,2

Mathematical Problems in Engineering 9

Original cyk-means fixed cyk-means k-means

1180

3525

098

2990

9129

5087

Figure 3 Quantization resultsThe first column contains the original imagesThe second third and fourth columns contain the quantizationresults generated by the cyk-means the fixed cyk-means and the 119896-means respectively All original images are clustered with 119870 = 3

the kernel 119896-means In the final experiment the proposedmethods are applied to color image quantization andsuccessfully quantize a color image with respect to the hueelement

The experiments that partitioned synthetic data demon-strate the effectiveness of the cyk-means In the first exper-imental results the cyk-means produces good estimates ofthe parameters and clustering data The results of the secondexperiment show that the cyk-means performs the best whenclustering synthetic data However in the experiment usingreal world data we find that the cyk-means did not providegood clustering results Furthermore the results of the colorimage quantization suggest that the flexibility of the cyk-means often produces dead units or a small cluster containing

few data points Thus the cyk-means may not be appropriatefor actual applications

Thefixed cyk-meanswill be an effectivemethod for actualapplicationsThe fixed cyk-means is stable and performs wellwhen we apply it to clusterings of synthetic data real worlddata and color image quantization Furthermore the fixedcyk-means hardly makes dead units because the number ofits parameters is smaller than the cyk-means The fixed cyk-means requires less computational time than the cyk-meanswith similar results

In future work we will improve the performance of theproposed methods The proposed methods are exposed tothe ill-initialization problem andor the dead unit problemcaused by an incorrect initialization similar to 119896-means

10 Mathematical Problems in Engineering

Original k = 25 r = 01 z = 01 k = 50 r = 05 z = 05

Figure 4 Quantization results of the fixed cyk-means with different parameters The left image is the original The middle and the rightimages are quantized with119870 = 3

The 119896-means++ method proposed by Athur and Vassilvitskii[27] solves the ill-initialization problem of 119896-means andimproves the clustering performance by obtaining an initialset of cluster centers that is close to the optimal solutionThe conscience mechanism improves the performance ofcompetitive learning and clustering algorithms [28ndash30] Itinserts a bias into the competition process so that eachunit can win the competition with equal probability Xu etal [31] proposed an algorithm based on competitive learningcalled rival penalized competitive learning [2 32] whichdetermines the appropriate number of clusters and solves thedead unit problem The strategy of rival penalized competi-tive learning is to adapt the weights of the winning unit to theinput and to unlearn the weights of the 2nd winner Byincorporating the approaches in these algorithms into theproposed methods we will improve the performance andreduce the effect of the intrinsic problems

Conflicts of Interest

The author declares that there are no conflicts of interestregarding the publication of this article

Acknowledgments

The author would like to thank Toya Teramoto of the Univer-sity of Electro-Communications for testing their algorithm

References

[1] C. M. Anderson-Cook and B. S. Otieno, "Cylindrical data," Ecological Statistics, 2013.

[2] N.-Y. An and C.-M. Pun, "Color image segmentation using adaptive color quantization and multiresolution texture characterization," Signal, Image and Video Processing, vol. 8, no. 5, pp. 943–954, 2014.

[3] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[4] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.

[6] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Generative model-based clustering of directional data," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 19–28, New York, NY, USA, August 2003.

[7] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.

[8] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, pp. 849–856, MIT Press, Cambridge, Mass, USA, 2001.

[10] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2002.

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT '92), pp. 144–152, New York, NY, USA, July 1992.

[12] Y. Deng and B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8, pp. 800–810, 2001.

[13] C.-K. Yang and W.-H. Tsai, "Color image compression using quantization, thresholding, and edge detection techniques all based on the moment-preserving principle," Pattern Recognition Letters, vol. 19, no. 2, pp. 205–215, 1998.

[14] N.-C. Yang, W.-H. Chang, C.-M. Kuo, and T.-H. Li, "A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 92–105, 2008.

[15] P. Heckbert, "Color image quantization for frame buffer display," Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.

[16] M. Emre Celebi, "Improving the performance of k-means for color quantization," Image and Vision Computing, vol. 29, no. 4, pp. 260–271, 2011.

[17] D. Ozdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, no. 8, pp. 1785–1791, 2002.

[18] X. Zhao, Y. Li, and Q. Zhao, "Mahalanobis distance based on fuzzy clustering algorithm for image segmentation," Digital Signal Processing: A Review Journal, vol. 43, pp. 8–16, 2015.

[19] A. Atsalakis, N. Papamarkos, N. Kroupis, D. Soudris, and A. Thanailakis, "Colour quantisation technique based on image decomposition and its embedded system implementation," in IEE Proceedings: Vision, Image and Signal Processing, vol. 151, pp. 511–524.

[20] C.-H. Chang, P. Xu, R. Xiao, and T. Srikanthan, "New adaptive color quantization method based on self-organizing maps," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 237–249, 2005.

[21] E. J. Palomo and E. Domínguez, "Hierarchical color quantization based on self-organization," Journal of Mathematical Imaging and Vision, vol. 49, no. 1, pp. 1–19, 2014.

[22] M. G. Omran, A. P. Engelbrecht, and A. Salman, "A color image quantization algorithm based on particle swarm optimization," Informatica, vol. 29, no. 3, pp. 261–269, 2005.

[23] S. Sra, "A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x)," Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.

[24] K. Y. Yeung and W. L. Ruzzo, "Details of the adjusted Rand index and clustering algorithms, supplement to the paper 'Principal component analysis for clustering gene expression data'," Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.

[25] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng, "A benchmark for semantic image segmentation," in Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), July 2013.

[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.

[27] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.

[28] D. DeSieno, "Adding a conscience to competitive learning," in Proceedings of the 1993 IEEE International Conference on Neural Networks (ICNN '93), vol. 1, pp. 117–124, San Diego, CA, USA, March 1993.

[29] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 531–540, December 2010.

[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience online learning: an efficient approach for robust kernel-based clustering," Knowledge and Information Systems, vol. 31, no. 1, pp. 79–104, 2012.

[31] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.

[32] T. M. Nair, C. L. Zheng, J. L. Fink, R. O. Stuart, and M. Gribskov, "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data," Computational Biology and Chemistry, vol. 27, no. 6, pp. 565–574, 2003.

