
Research Article
An Improved Clustering Algorithm of Tunnel Monitoring Data for Cloud Computing

Luo Zhong,1 KunHao Tang,1,2 Lin Li,1,3 Guang Yang,1 and JingJing Ye1

1 Department of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China
2 Department of Computer and Information Science, Hunan Institute of Technology School, Hunan 421002, China
3 Collaborative Innovation Center of Hubei Province for Information Technology & Service of Elementary Education, Hubei 430070, China

Correspondence should be addressed to KunHao Tang; kunhaotang@sina.com

Received 7 November 2013; Accepted 4 December 2013; Published 2 April 2014

Academic Editors: W. Sun, G. Zhang, and J. Zhou

Copyright © 2014 Luo Zhong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the rapid development of urban construction, the number of urban tunnels is increasing and the data they produce are becoming more and more complex. As a result, traditional clustering algorithms cannot handle the mass data of the tunnel. To solve this problem, an improved parallel clustering algorithm based on k-means is proposed. It is a clustering algorithm that processes data with MapReduce in a cloud computing environment. It not only handles mass data but is also more efficient. Moreover, it computes the average dissimilarity degree of each cluster in order to clean the abnormal data.

1. Introduction

At present, with the rapid development of municipal construction in our country, urban tunnels are developing progressively, and the tunnel data are becoming more and more complex. The clustering algorithms based on partition are simple and accurate and have been widely applied in scientific research and production practice. As a classical partition-based clustering algorithm, k-means has always been a hot topic. However, taking time, space, and data volume complexity [1] into account, tunnel data have already reached the level of massive data, and the traditional k-means clustering algorithm has been unable to deal with them.

Given the characteristics of k-means, it is imperative to use parallel computing based on MapReduce on a cloud platform. The concept of cloud computing was proposed by Google; it is a computing model that combines several technologies and has characteristics such as high reliability, high extensibility, and very large scale. After that, IBM also brought out a cloud computing platform [2] so that clients could be ready to use it.

A cloud platform using Hadoop technology can realize large-scale distributed computing and processing. It is mainly made up of the Hadoop Distributed File System and the MapReduce programming model [3], which are widely used in cloud computing.

In addition, the MapReduce model is a distributed parallel programming model proposed by Google Labs. It can handle problems involving large datasets on a computer cluster, which has made it a mainstream parallel data processing model for cloud computing platforms [4]. The MapReduce model can also easily solve the problems that k-means cannot handle, such as processing large data and the large serial overhead.

City tunnel monitoring data take a variety of forms; they include the instantaneous traffic flow per lane, speed, occupancy rate, wind speed, temperature, humidity, light intensity, and CO concentration [5]. They are multidimensional data from different equipment, which ensures the effectiveness of the experimental data [6]. With the continuous development of data processing technology, people have begun to analyze huge amounts of statistical data [7] using various data mining techniques and tools.



Figure 1: The interaction of HDFS and MapReduce (Map tasks read input blocks from HDFS, the Shuffle task uses local storage, and the Reduce task writes the result back to HDFS).

The purpose of clustering data is to use relevant technology to classify the data extracted from various data sources while finding the abnormal data.

However, the tunnel monitoring system has high real-time requirements. At the same time, the collected data are stored densely and the amount of stored data is large [8]. The amount of data is so large that a serial approach can neither find the abnormal data nor cluster them [9]. Therefore, we use distributed processing technology to deal with the data by cloud computing [10].

2. The Introduction of Cloud Platforms

The cloud platform we normally refer to is Hadoop, a platform that is easy to develop on and that processes mass data in parallel [11]. It is mainly composed of a distributed file system called HDFS and a calculation model called MapReduce [12].

2.1. Hadoop Distributed File System (HDFS). HDFS adopts a master/slave structure [13] and is principally composed of the Client, the Name Node, the Secondary Name Node, and the Data Nodes. Every HDFS cluster consists of a unique management node (the Name Node) and a certain number of data nodes (the Data Nodes), and every node is an ordinary PC. HDFS is very similar to a stand-alone file system: it can also build directories, create, copy, and delete files, check file contents [14], and so forth. HDFS is a distributed file system with high fault tolerance that can provide high-throughput data access [15], which makes it fairly suitable for applications with large datasets. The interaction process between HDFS and MapReduce is shown in Figure 1.
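As an aside, the basic HDFS file operations mentioned above (building directories and creating, copying, deleting, and inspecting files) are usually driven through the standard `hadoop fs` shell commands of a configured Hadoop client. The short Python sketch below simply wraps a few of those commands with `subprocess`; the paths are illustrative assumptions, not taken from the paper.

```python
import subprocess

def hdfs(*args):
    """Run a standard 'hadoop fs' shell command and return its output."""
    result = subprocess.run(["hadoop", "fs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Illustrative paths only; any HDFS directory would do.
    hdfs("-mkdir", "-p", "/tunnel/input")             # build a directory
    hdfs("-put", "monitoring.csv", "/tunnel/input")   # copy a local file into HDFS
    print(hdfs("-ls", "/tunnel/input"))               # list the directory
    print(hdfs("-cat", "/tunnel/input/monitoring.csv")[:200])  # check file content
    hdfs("-rm", "/tunnel/input/monitoring.csv")       # delete the file
```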

2.2. The Calculation Model of MapReduce. MapReduce is a highly efficient distributed programming model [16] developed by Google. It abstracts a problem to a high level and makes it simple, and it is mainly used for large-scale (TB) data processing [17]. The work process of MapReduce is mainly composed of three parts: Map, Shuffle, and Reduce.

(1) The process of Map: Map reads data from HDFS [18]; the data are divided into several independent shards (splits), which are parsed into ⟨key, value⟩ pairs through the iteration of the Map function.

(2) The process of Shuffle: the provisional results generated by the Map phase are sorted according to the key value, the temporary data (key/value pairs) produced by Map are divided into several groups (partitions), and finally every partition is given to a Reduce Task for processing in the form of ⟨key, value list⟩.

(3) The process of Reduce: the ⟨key, value list⟩ pairs are read from the remote nodes and merged through the reduce function, and in the end the final result is transferred into HDFS [19]. (A simplified in-memory sketch of these three phases is given below.)
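To make the three phases concrete, the following minimal pure-Python sketch simulates Map, Shuffle, and Reduce in memory on a toy word-count job. It only illustrates the ⟨key, value⟩ flow described above; it is not Hadoop code, and the sample input is an assumption.

```python
from collections import defaultdict

# (1) Map: parse each input split into intermediate <key, value> pairs.
def map_phase(splits):
    for split in splits:
        for word in split.split():
            yield word, 1

# (2) Shuffle: group the intermediate pairs by key and sort them into <key, value list>.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

# (3) Reduce: merge each value list into a final result.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped}

if __name__ == "__main__":
    splits = ["tunnel data tunnel", "data cloud"]
    print(reduce_phase(shuffle_phase(map_phase(splits))))
    # {'cloud': 1, 'data': 2, 'tunnel': 2}
```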

3. The Traditional Clustering Algorithm and Its Parallelization

3.1. The Traditional (Serial) k-Means Clustering Algorithm. So far, the commonly used clustering analysis algorithms fall into the following five categories: algorithms based on partition, algorithms based on hierarchy, algorithms based on density, algorithms based on grid, and algorithms based on models [20]. Studies show that the tunnel data storage density is large, so algorithms based on data partition are more applicable to tunnels, and k-means is a typical representative of partition-based algorithms. The k-means algorithm is a partition clustering method [21] which regards the average value as the center of a cluster; the specific steps of the serial k-means clustering method are as follows:

(1) randomly select k objects from the original dataset and treat them as the initial cluster centers;

(2) calculate the distances between the remaining objects and the centers and assign each object to the class of the closest center;

(3) for each class, work out the mean of all its objects as the new center;

(4) repeat (2) and (3) until the standard measurement function converges (a minimal serial sketch of these steps is given below).
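The following minimal Python sketch corresponds to steps (1)-(4) of the serial algorithm; the random seed, the convergence tolerance, and the toy data are assumptions chosen for illustration.

```python
import random
import math

def euclidean(x, y):
    # Formula (1): d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def serial_kmeans(data, k, max_iter=100, tol=1e-4, seed=0):
    random.seed(seed)
    centers = random.sample(data, k)            # step (1): random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for point in data:                      # step (2): assign to the closest center
            idx = min(range(k), key=lambda c: euclidean(point, centers[c]))
            clusters[idx].append(point)
        new_centers = []
        for c, cluster in enumerate(clusters):  # step (3): mean of each cluster
            if cluster:
                new_centers.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's old center
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                         # step (4): stop when the centers converge
            break
    return centers, clusters

if __name__ == "__main__":
    points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 11.0), (0.5, 0.6)]
    centers, clusters = serial_kmeans(points, k=2)
    print(centers)
```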

3.2. The Parallel k-Means Clustering Algorithm. Although the sequential k-means algorithm is simple and convenient, it still has many defects. The time complexity of k-means is O(n * k * t), where n is the number of objects, k denotes the number of clusters, and t represents the number of iterations [22]. Thus it can be seen that the time complexity of the sequential algorithm is very high.


Figure 2: The execution process of Map (input splits from HDFS are parsed into ⟨k, v⟩ pairs, processed by Map(), and passed as ⟨k′, v′⟩ pairs to Partition()).

Figure 3: The execution process of Shuffle (Partition() groups the ⟨k′, v′⟩ pairs, Combine() merges them locally, and the key/value lists are passed to Reduce() via local storage).

The iterative replacement part consumes the most time in this operation, and it lends itself to parallel processing [23]. The MapReduce process of parallel k-means is as follows.

(1) The Process of Map. The Map phase of the MapReduce process is performed by several Map functions, one per input split. Each Map function calculates the distance between every data object in its input and the current cluster centers and tags each data object with the corresponding cluster number, that is, the number of the cluster center closest to the object. Its input consists of all the data objects waiting to be clustered and the cluster centers from the previous iteration (or the initial centers), in the form of ⟨key, value⟩ pairs: the key corresponds to the offset of the current data sample relative to the starting point, while the value corresponds to the vector of strings of the current data sample, composed of the coordinates of each dimension. After the Map phase, the result is output as ⟨key, value⟩ pairs containing the cluster number and the recorded value.

Its execution process is shown in Figure 2.
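A minimal mapper sketch in this spirit is shown below in Python, written as it might be for Hadoop Streaming (which feeds each input record on standard input and expects tab-separated key/value output). The centers file name and the comma-separated record format are assumptions, not taken from the paper.

```python
#!/usr/bin/env python3
"""k-means mapper sketch: emit <closest-center-id, record> for every input record.
Assumes the current centers are available in a local file 'centers.txt',
one comma-separated coordinate vector per line (an illustrative convention)."""
import sys
import math

def load_centers(path="centers.txt"):
    with open(path) as f:
        return [[float(x) for x in line.split(",")] for line in f if line.strip()]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def main():
    centers = load_centers()
    for line in sys.stdin:                      # value: one data sample per line
        line = line.strip()
        if not line:
            continue
        point = [float(x) for x in line.split(",")]
        # Tag the record with the number of the closest cluster center.
        cluster_id = min(range(len(centers)), key=lambda c: distance(point, centers[c]))
        print(f"{cluster_id}\t{line}")          # emit <cluster number, record value>

if __name__ == "__main__":
    main()
```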

(2) The Process of Shuffle. There are two main functions, Partition and Combine. The input of this phase is the key/value pairs worked out during the Map phase, and its main purpose, before entering the Reduce phase, is to combine the data locally in order to reduce the communication cost during iteration. First of all, the Partition function sorts the provisional results produced during the Map phase and divides the temporary data (key/value pairs) produced by Map into several groups (partitions); the input key corresponds to the cluster number, while the values correspond to the vectors of strings composed of the dimensional coordinates. Then the Combine function combines each group and counts the number of data samples and the sum of all the data coordinates; the outputs, in the form of ⟨key, value list⟩ pairs, are given to different Reduce functions. Its execution process is shown in Figure 3.

(3) The Process of Reduce. The Reduce function accepts the ⟨key, value list⟩ pairs worked out by the Combine function. Among them, the key carries the number of data samples, and the value list corresponds to the vectors of strings composed of the dimensional coordinates. By adding the coordinate vectors of every dimension and combining them with the number of data points, the new cluster center is worked out and passed, in the form of key/value pairs, into the next MapReduce process.
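A corresponding reducer sketch is given below; it merges the partial counts and coordinate sums for each cluster and emits the new cluster center for the next iteration. The text format again follows the sketches above and is an assumption.

```python
#!/usr/bin/env python3
"""k-means reducer sketch: merge partial (count, sum) pairs per cluster id and
emit the new cluster center (coordinate sums divided by the total count)."""
import sys
from collections import defaultdict

def main():
    totals = defaultdict(int)
    sums = {}
    for line in sys.stdin:                       # input: "cluster_id \t count \t s1,s2,...,sn"
        if not line.strip():
            continue
        cluster_id, count, coord_sum = line.rstrip("\n").split("\t")
        coords = [float(x) for x in coord_sum.split(",")]
        totals[cluster_id] += int(count)
        if cluster_id not in sums:
            sums[cluster_id] = coords
        else:
            sums[cluster_id] = [a + b for a, b in zip(sums[cluster_id], coords)]
    for cluster_id, coords in sums.items():
        center = [s / totals[cluster_id] for s in coords]
        print(f"{cluster_id}\t{','.join(map(str, center))}")  # new center for the next iteration

if __name__ == "__main__":
    main()
```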

4. The Good Center Parallel k-Means (GCPK) Algorithm

4.1. The Calculation Method of the Dissimilarity Degree. This experiment uses the Euclidean distance as the measurement standard to calculate the similarity degree among the data in a cluster. In this article, the average dissimilarity degree refers to the average Euclidean distance between the data within a cluster and the cluster center.


The lower the average dissimilarity degree, the more similar the data within the cluster; in other words, the clustering results are more ideal. The Euclidean distance formula is

$$ d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}. \quad (1) $$

Here, m denotes the dimension of the data vector.
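As a small illustration of this measure, the hypothetical helper below computes the average and maximum dissimilarity degree of one cluster, assuming the cluster is given as a list of coordinate vectors together with its center.

```python
import math

def dissimilarity(point, center):
    # Formula (1): Euclidean distance between a data record and its cluster center.
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

def cluster_dissimilarity(cluster, center):
    """Return (average, maximum) dissimilarity degree of one cluster."""
    distances = [dissimilarity(p, center) for p in cluster]
    return sum(distances) / len(distances), max(distances)

# Example: a small 2-dimensional cluster around the center (1.0, 1.0).
print(cluster_dissimilarity([(1.0, 1.2), (0.8, 1.0), (1.3, 0.9)], (1.0, 1.0)))
```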

4.2. The Defects of the Parallel k-Means Clustering Algorithm. The parallel k-means clustering algorithm solves the problem of the iteration time spent by k-means very well, but it does not handle the problem of selecting the initial centers of k-means. In order to choose appropriate initial centers, reduce unnecessary iterations, decrease the running time, and improve efficiency for tunnel data, we propose an improved parallel k-means algorithm, the good center parallel k-means algorithm, abbreviated GCPK. In the dataset, the algorithm first finds points that basically divide the data evenly, in other words, points that are appropriate as good initial centers, and then parallelizes the k-means clustering. Finally, based on the clustering results, the abnormal data are analyzed.

4.3. The Implementation of the GCPK Algorithm

(1) Getting the Initial Centers through the Relative Centroid. k-means is a typical partition-based clustering method; because the centers are replaced constantly [24], the selection of the initial centers has a great influence on the k-means clustering algorithm [25]. In order to decrease the number of replacements and make k-means clustering more efficient, it is necessary to find initial centers that are close to the real centers and roughly divide the dataset.

Because the tunnel data are nonnegative, the tunnel dataset can be regarded as a multidimensional data graphic composed of n-dimensional records. Regarding each record as a point given by its coordinates in each dimension, find the point Q1 = (i_1, i_2, ..., i_n) that is closest to the origin of the coordinates and the point Q2 = (j_1, j_2, ..., j_n) that is farthest away from the origin.

Connect Q1 and Q2 with a line segment and divide the segment into k + 1 equal parts (this k is the same as the k in k-means); regarding each record as a particle, there are k equal division points on the segment (not including the endpoints). The coordinate of the t-th (0 < t < k + 1, t ∈ Z) division point is (P_{t1}, P_{t2}, ..., P_{tn}), where n denotes the number of dimensions, so that
$$ P_{tm} = i_m + \frac{t}{k+1}\,(j_m - i_m), \quad m = 1, 2, \ldots, n. $$
These k points are the points in the center positions that divide the multidimensional data graphic evenly, and we regard them as the initial center points.

(2) The Specific Implementation of GCPK Parallel Clustering.

(1) Upload the dataset to HDFS.

(2) Upload the initial centers to HDFS.

(3) According to the original data and the centers from the previous iteration (or the initial centers), compute the current iteration with the Map function.

(4) Calculate the number and the weighted sum of the points in the same cluster and recompute the center of each cluster.

(5) Continue the iterative process until the maximum number of iterations is exceeded or all the clusters have converged; at this point, calculate the average dissimilarity degree and the maximum dissimilarity degree within each cluster and, in addition, the average dissimilarity degree and the maximum dissimilarity degree over the whole clustering. (A local sketch of this driver loop is given below.)
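Steps (3)-(5) can be sketched locally as the driver loop below. A real run would submit one MapReduce job per iteration over HDFS, so the in-memory functions here only illustrate the control flow; the iteration limit, the tolerance, and the toy data are assumptions.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gcpk_iterate(data, centers, max_iter=20, tol=1e-4):
    """Driver sketch for steps (3)-(5): iterate until convergence or the
    iteration limit, then report per-cluster average/maximum dissimilarity."""
    for _ in range(max_iter):                          # step (5): bounded iteration
        clusters = [[] for _ in centers]
        for point in data:                             # step (3): one "Map" pass
            idx = min(range(len(centers)), key=lambda c: euclidean(point, centers[c]))
            clusters[idx].append(point)
        new_centers = [                                # step (4): counts + sums -> new centers
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
        converged = max(euclidean(a, b) for a, b in zip(centers, new_centers)) < tol
        centers = new_centers
        if converged:
            break
    stats = []
    for center, cluster in zip(centers, clusters):     # step (5): dissimilarity report
        dists = [euclidean(p, center) for p in cluster] or [0.0]
        stats.append((sum(dists) / len(dists), max(dists)))
    return centers, stats

data = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.5), (7.5, 8.0)]
print(gcpk_iterate(data, centers=[(0.0, 0.0), (9.0, 9.0)]))
```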

(3) The Analysis of Abnormal Data. Download the clustering result from HDFS to the local file system and convert the binary serialization format file into a text file. Analyze the result file to obtain the average dissimilarity degree and the maximum dissimilarity degree of every cluster, count the radius of every dimension, and combine the weight of each dimension to clean the abnormal data. Because the amount of data is huge, it is impossible to identify the abnormal data by manual inspection or by simple methods; it is necessary to use cloud computing technology to clean the abnormal data from the clustering results. The steps are as follows (a sketch is given after the list).

(1) Read the average dissimilarity degree D_i of every cluster and the dimension radius R_j, where i = 1, 2, ..., k (k is the number of clusters) and j = 1, 2, ..., n (n is the number of dimensions).

(2) Calculate the weight Q_j within the cluster: Q_j = t_j / n, where t_j is the number of times the data of dimension j occur and n is the total number of dimensions.

(3) Work out the reasonable range of the dissimilarity degree dis(i) within each cluster:
$$ \operatorname{dis}(i) = \sqrt{D(i)^2 + \sum_{j=1}^{n} Q_j R_j^2}, \quad i = 1, 2, \ldots, k, \quad (2) $$
where k denotes the number of clusters.

(4) Test every data object obtained from MapReduce according to the Euclidean distance between the data object and its cluster center (its dissimilarity degree) and the reasonable range of the dissimilarity degree within the cluster, and judge whether the data object is abnormal or not. If its dissimilarity degree is beyond the reasonable range, it is abnormal, and its number is recorded.

5. Experimental Environment and Results Analysis

5.1. Experimental Environment

5.1.1. Single-Machine Environment. This paper uses MATLAB 2010 in a single-machine environment.


Figure 4: k values for the experiments (average dissimilarity degree versus the number of clusters k, from 3 to 16, for DataSet1 to DataSet4).

The specific environment is an Intel(R) Core(TM) i3 CPU at 2.53 GHz, 4 GB of memory, and a 64-bit operating system.

5.1.2. Distributed Environment. The distributed experiments use the open source cloud computing platform Hadoop, version 1.1.2. The specific cloud cluster is as follows: one machine runs as the Name Node and Job Tracker, while the other 19 run as Data Nodes and Task Trackers. The experimental cloud platform uses VM technology; the specific software configuration is VMware Workstation Full 9.0.2, 64-bit Ubuntu 12.04.3, JDK 1.7.0_21, and the Linux version of Eclipse.

5.1.3. Data Source. The data for the clustering analysis come from the tunnel monitoring data, including traffic flow, temperature, wind speed, illumination intensity, CO concentration, speed, and so on.

5.2. The Experimental Data. This experiment adopts five datasets gathered from the tunnel monitoring system of Wuhan Fruit Lake: DataSet1, DataSet2, DataSet3, DataSet4, and DataSet5. DataSet1 contains 42,993 seven-dimensional tunnel data records and can be used for both the single-machine and the distributed experiments. DataSet2 to DataSet5 contain, respectively, 42,775, 65,993, 100,000,000, and 2,000,000,000 thirty-dimensional tunnel data records.

5.3. Analysis of Experimental Results

Experiment 1. There are 42,993 tunnel data records with dimension 7, and the ideal number of clusters for these data is 5. Under the single-machine condition, MATLAB was used to run the serial k-means clustering algorithm [26], the Clara clustering algorithm [27], and the RICA clustering algorithm [28]; in the distributed cluster, the parallel k-means clustering algorithm was run with Hadoop [29], and the tunnel data were also clustered with GCPK. The results of these clustering algorithms are shown in Table 1.

Experiment 2. For k-means clustering, the initial value of k has a great influence on the clustering results.

Table 1: Dissimilarity degree contrast on DataSet1.

                                Serial k-means   Clara      RICA      Parallel k-means   GCPK
Cluster (k)                     5                5          5         5                  5
Average dissimilarity degree    23.2027          21.8868    19.0924   16.0240            16.0226
Maximum dissimilarity degree    105.2276         113.2348   96.5631   78.2311            77.3419
Abnormal finding                83.22            89.36      89.53     93.23              93.3

Note: the average and maximum dissimilarity degrees are the mean values over the clusters of each algorithm.

Table 2: Clustering time.

Dataset    Cluster k   Serial k-means   Parallel k-means   GCPK
DataSet2   10          20 min 18 s      7 min 50 s         7 min 22 s
DataSet3   10          27 min 35 s      9 min 21 s         8 min 9 s
DataSet4   12          incalculable     18 min 33 s        16 min 20 s
DataSet5   12          incalculable     29 min 9 s         23 min 15 s

Traditional clustering cannot determine k for huge amounts of data. This experiment applied the GCPK algorithm to four datasets, DataSet1, DataSet2, DataSet3, and DataSet4, choosing different k values for each of them to conduct the experiments. The results are shown in Figure 4.

Experiment 3. The tunnel monitoring data collected in DataSet2 to DataSet5 (with dimension 30) were clustered under the distributed environment with the parallel k-means algorithm and the improved parallel k-means algorithm (GCPK) separately; the cluster number k was set to the ideal value found through GCPK. The time cost of the clustering is shown in Table 2.

6. The Conclusion and Expectation

Experimental results show that GCPK, the improved parallel k-means clustering algorithm, achieves better clustering results than the traditional clustering algorithms and has higher efficiency. It is very convenient for cleaning the abnormal data from the clustering results; for massive data such as tunnel data, it has shown strong practicability. At present, due to the limited data form, we are only able to do preliminary clustering of structured documents. We will do further research on protecting the file data after clustering and on processing unstructured documents in the future.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

This research was undertaken as part of Project 61003130 funded by the National Natural Science Foundation of China, the Collaborative Innovation Center of Hubei Province for Information Technology and Service of Elementary Education, the Wuhan Technology Department Innovation Project (2013070204005), the Hengyang Technology Department (2012KG76), and the Hunan Science and Technology Plan Project (2013GK3037).

References

[1] G. J. Li, "The scientific value in the study of big data," Computer Science Communication, vol. 8, no. 9, pp. 8–15, 2012.

[2] K. Chen and W.-M. Zheng, "Cloud computing: system instances and current research," Journal of Software, vol. 20, no. 5, pp. 1337–1348, 2009.

[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[4] X. P. Jing, C. H. Li, and W. Xiang, "The effectuation of Bayes book clustering algorithm under cloud platform," Computer Application, vol. 31, no. 9, pp. 2551–2554, 2011.

[5] R. Guo and X. H. Zhao, "The research of urban traffic information service system based on database technology," Information Communication, vol. 2, pp. 105–109, 2013.

[6] Q. J. Zhang, L. Zhong, J. L. Yuan, and Q. B. Wang, "Design and realization of a city tunnel environment monitoring system based on observer pattern," in Proceedings of the 2nd International Conference on Electric Technology and Civil Engineering (ICETCE '12), no. 5, pp. 639–642, 2012.

[7] D. Ben-Arieh and D. K. Gullipalli, "Data envelopment analysis of clinics with sparse data: fuzzy clustering approach," Computers & Industrial Engineering, vol. 63, no. 1, pp. 13–21, 2012.

[8] L. Zhong, L. Li, J. L. Yuan, and H. Xia, "The storage method and system of urban tunnel monitoring data based on XML," patent 2011104589952, China.

[9] C. Liu, "No default categories for large amount of data clustering algorithm research," vol. 4, pp. 1–6, 2012.

[10] Q. Deng and Q. Chen, "Cloud computing and its key technologies," Development and Application of High Performance Computing, vol. 1, no. 2, p. 6, 2009.

[11] S. Owen and R. Anil, Hadoop in Action, Manning Publications, 2012.

[12] M. Armbrust and A. Fox, "Above the clouds: a Berkeley view of cloud computing," Tech. Rep., University of California at Berkeley, Berkeley, Calif, USA, 2009.

[13] L. Dong, "Hadoop technology insider," Mechanical Industry Publication, vol. 5, pp. 31–36, 2013.

[14] R. Chao, "Research and application of Hadoop based on cloud computing," Tech. Rep., Chongqing University, 2011.

[15] S. Thu, G. Bhang, and E. Li, "The improvement of k-means algorithm based on Hadoop platform oriented file data set," Harbin Normal University Journal of Natural Science, vol. 1, no. 28, 2012.

[16] J. Zhang, H. Huang, and J. Wang, "Manifold learning for visualizing and analyzing high-dimensional data," IEEE Intelligent Systems, vol. 25, no. 4, pp. 54–61, 2010.

[17] R. Duo, A. Ganglia, and D. Chen, "Spread neighbor clustering for large data sets," Computer Engineering, vol. 36, no. 23, pp. 22–24, 2010.

[18] E. Ca and Y. Chou, "Algorithm design and implementation based on graphs," Computer Engineering, vol. 38, no. 24, pp. 14–16, 2012.

[19] D. Y. Ha, "The new clustering algorithm design applied to large data sets," Guiyang Teachers College Journal, vol. 28, no. 1, pp. 67–71, 2011.

[20] H. Wang, "Clustering algorithm based on grid research," Tech. Rep., Nanjing Normal University, Nanjing, China, 2011.

[21] B. Chi, "k-means research and improvement of the clustering algorithm," Tech. Rep. 4, 2012.

[22] A. Chou and Y. Yu, "k-means clustering algorithm," Computer Technology and Development, vol. 21, no. 2, pp. 62–65, 2011.

[23] B. L. Han, K. Wang, G. F. Ganglia, et al., "An improved initial clustering center selection algorithm based on k-means," Computer Engineering and Application, vol. 46, no. 17, pp. 151–152, 2010.

[24] W. Chao, H. F. Ma, L. Yang Cu, et al., "The parallel k-means clustering algorithm design and research based on cloud computing platform Hadoop," Computer Science, vol. 10, no. 38, pp. 166–167, 2011.

[25] R. L. Ferreira Cordeiro, C. Traina Jr., A. J. Machado Traina, J. Lopez, U. Kang, and C. Faloutsos, "Clustering very large multidimensional datasets with MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 690–698, New York, NY, USA, 2011.

[26] L. Zhang, Y. Chen, and J. Zhang, "A k-means algorithm based on density," Application Research on Computer, vol. 28, no. 11, pp. 4071–4085, 2011.

[27] M. X. Duan, "Clara clustering algorithm combined with the improvement," Computer Engineer and Application, vol. 46, no. 22, pp. 210–212, 2010.

[28] K. H. Tang, L. Zhong, L. Li, and G. Yang, "Urban tunnel clean up the dirty data based on Clara optimization clustering," Journal of Applied Sciences, vol. 13, pp. 1980–1983, 2013.

[29] M. H. Zhang, "Analysis and study of data mining algorithm based on Hadoop," Tech. Rep. 3, 2012.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 2: Research Article An Improved Clustering Algorithm of ...downloads.hindawi.com/journals/tswj/2014/630986.pdf · more and more complex. It results in the fact that the traditional clustering

2 The Scientific World Journal

Map task1

Map task2

Map task3

Reduce task

Local storage

Shuffle task

01

01

01

HDFS

HDFS

Input block1

Input block2

Input block3

Figure 1 The interaction figure of HDFS and MapReduce

is to use relevant technology to classify the extracted datafrom various data sources while finding the abnormal data

However the tunnel monitoring system demands real-time higher At the same time the collected data storagedensity is intensive while the data storage is also large [8] Itis such a large amount of data that it cannot use serializingapproach to find out the abnormal data Moreover it is notable to cluster [9] Therefore we use the distributed process-ing technology to deal with data by cloud computing [10]

2 The Introduction of Cloud Platforms

Thecloud platformwe normally refer to is the platformwhichcan develop easily and deal with processing of mass data inparallel Hadoop [11] It is mainly composed of the distributedfile system which is called HDFS and the calculation modelwhich is called MapReduce [12]

21 Hadoop Distributed File System (HDFS) HDFS hasadopted the masterslave structure [13] which is principallycomposed by Client Name Node Secondary Name Nodeand Data Node Every HDFS cluster consists of a uniquemanagement node (the Name Node) and a certain numberof data nodes (the Secondary Node) and every node is anordinary PC HDFS is very similar to the stand alone italso can build directory create copy delete files check thefile content [14] and so forth HDFS is a distributed filesystem which possesses high fault-tolerant it can providehigh throughput data access [15] which is fairly suitablefor the application in large datasets The interaction processbetween HDFS and MapReduce is shown in Figure 1

22 The Calculation Model of MapReduce MapReduce is ahighly efficient distributed programming model [16] whichis developed by Google It can abstract problems highly andmake the problembecome simple and it ismainly particularlyused for large-scale (TB) data processing [17] The workprocess of MapReduce is mainly composed of three partsMap Shuffle and Reduce

(1) The process of Map Map reads data from the HDFS[18] and the data is divided into several independentshards (Split) and through theMap function iterationparsed into keyvalue pair ⟨keyvalue⟩

(2) The process of Shuffle Sorting provisional resultsis generated by the Map phase according to the

key value dividing the map production temporarydata (keyvalue) into several groups (partition) andfinally every partition will give a the Reduce Taskprocessing in the form of ⟨key value list⟩

(3) The process of Reduce reading from the remotenode ⟨key value list⟩ and merging them through thereduce function and in the end transferring the finalresult into the HDFS [19]

3 Traditional Clustering Algorithm andIts Parallel

31 Traditional (Serializing) 119896-Means Clustering AlgorithmSo far the commonly used clustering analysis algorithms arecomposed of the following five categories the algorithmsbased on classification the algorithms based on the hierarchythe algorithms based ondensity the algorithms based on gridand the algorithms based on model-based [20] Studies showthat tunnel data storage density is large so algorithms basedon data partition are more applicable to tunnel But 119896-meansis a typical representative of algorithms based on partition 119896-means algorithm a partition clustering method [21] whichregards an average value as the center of the cluster and thespecific steps of serializing 119896-means clustering method are asfollows

(1) selecting several objects from the original datasetrandomly and treating them as the initial clusteringcenter

(2) to calculate the distances between the other objectsand the center and assign them to the class of theclosest center

(3) for each class working out themeans of all the objectsas a new center

(4) repeat (2) and (3) until the standard measurementfunction begins to converge

32The Parallel 119896-Means Clustering Algorithm Although thesequential 119896-means algorithm is simple and convenient a lotof defects still exist The time complexity of 119896-means is119874(119899 lowast119896lowast119905) in addition 119899 is the number of all the objects 119896 denotesthe cluster number of clustering and 119905 represents the numberof iterations [22]Thus it can be seen that the time complexityof sequential algorithm is very high the iteratively replaced

The Scientific World Journal 3

Input split

k1v1

HDFSk2v2

k3v3

Map()

Map()

Map()

Partition()

Partition()

Partition()

k1998400v1998400

k 2998400v2998400

k 3998400v3998400

Figure 2 The execution figure of Map

Partition()

Partition()

Partition()

Combine() Reduce()

Local storage

Keyvalue list

0

1

2

0

1

2

0

1

2

k 1998400v1998400

k 2998400v2998400

k 3998400v3998400

Figure 3 The execution figure of Shuffle

part consumes the most time during this operation and it iseasier for parallel processing [23]TheMapReduce process ofparallel 119896-means is as follows

(1) The Process of Map The Map phase of the MapReduceprocess is performed by several Map functions Accordingto the number of input splits there will be Map functionwhich has some inputs to calculate the distance betweenevery data object and the current cluster center Furthermorewe can tag the corresponding cluster number for data objectsaccording to the distance The cluster number is the numberof cluster center which is closest to the object Its input is thatall the data objects waiting on clustering and the clusteringcenter during the previous iteration (or the initial center) itsinput is in the form of (keyvalue) In keyvalue pairs keyis corresponding to the offset in current data samples whichis relative to the starting point while value is correspondingto the vector of strings in the current data sample whichis composed of each dimension coordinate After the Mapphase the result will be in the form of keyvalue pairs tooutput the clustering cluster number and recorded value

Its execution process is shown in Figure 2

(2) The Process of Shuffle There are two main functionsPartition and Combine Finishing it the input during thisphase is the keyvalue pair which was worked out duringthe Map phase and it is mainly used before entering theReduce phase combining the data locally in order to reducethe consumption of data communication in the processof iteration First of all the Partition function sorts theprovisional result which was produced during theMap phase

and divides the temporary data (keyvalue pairs) which wasproduced by map into several groups (Partition) The keywhich was input is corresponding to the cluster number ofclustering while values are corresponding to vector of stringswhich are composed by dimensional coordinates Then theCombine functionwill combine group and count the numberof data samples and the sum of all the data coordinates inaddition the outputs which were in the form of keyvaluepair (keyvalue list) are given to different Reduce functionsIts execution process is shown in Figure 3

(3) The Process of Reduce Reduce function accepts thekeyvalue pairs (key and value list) which were worked outby Combine function Among them corresponding to thekey is the number of data samples and the values list iscorresponding to the vector of strings which are composedby dimensional coordinates By adding string vectors of everydimensional coordinates and combining with the number ofthe data to work out the new cluster center which is input inthe form of keyvalue pairs into the nextMapReduce process

4 The Good Center Parallel 119870-Means(GCPK) Algorithm

41 The Calculation Method of Dissimilarity Degree Thisexperiment uses Euclidean distance as a measurement stan-dard to calculate the similarity degree among data in thecluster In this article the average dissimilarity degree men-tioned refers to the average Euclidean distance between thedata within the cluster and the cluster center If the averagedissimilarity degree is lower the data within the cluster will

4 The Scientific World Journal

be more similar in other words the results of clustering aremore ideal The Euclidean computationalformula is

119889 (119909 119910) = radic

119898

sum

119894=1

(119909119894minus 119910119894)2

(1)

The ldquo119898rdquo denotes the dimension value of the array

42TheDefects of Parallel119870-Means Clustering Algorithm 119870-means parallel clustering algorithm solves the problem of theiteration time that 119896-means spends very well but it does nothandle the selection problem of 119896-means initial center Inorder to choose appropriate initial center reduce unnecessaryiteration and decrease the operation time and improveefficiency in view of the tunnel data we proposed the 119870-means parallel improved algorithm Good center parallel 119896-means algorithm whose abbreviation is GCPK algorithm Inthe dataset the algorithm can find out the algorithm firstlyfind out a point as the center which can separate data onaverage basically in other words it is appropriate to be thegood center for the initial center and then to parallelize the kmeans clustering Finally based on the clustering results theabnormal data will be analyzed

43 The Implementation of GCPK Algorithm

(1) To Get the Initial Center through Relative Centroid 119896-means clustering method is a typical method based onclustering in the 119896-means clustering method because ofreplacing the center constantly [24] the selection of initialcenter has a great influence on 119896-means clustering algorithm[25] In order to decrease the times of replacement andmakethe 119896-means clustering become more efficient it is necessaryto find the initial center which is relative to the center androughly divide the dataset

Because the tunnel data has the characteristics of non-negativity tunnel data can be regarded as multidimensionaldata graphics which is composed of 119899-dimensional dataRegarding each data as their own dimensional coordinatesfind out the point 1198761 (119894

1 1198942 119894

119899) which is closest to the

origin of coordinates and the point 1198762 (1198951 1198952 119895

119899) which

is farthest away from the originTo connect1198761 and1198762with a line segment and divide the

line segments into 119896 + 1 parts (this 119896 here is the same as the119896 in 119896-means) regard each data as a particle and there are119896 equal diversion points on the line segment (not includingthe endpoint) Furthermore the coordinate of the 119905 (0 lt 119905 lt119896 + 1 119905 isin 119885) equal diversion point is (119875

1199051 1198751199052 119875

119905119899) and

119899 denotes the number of dimensions So 119875119905119898= (119898119896)(119894

119898+

119895119898)119898 = 1 2 119899 These 119896 points obtained are the points in

the center position which can average the multidimensionaldata graphics and we regard the 119896 points as the initial centerpoints

(2) The Specific Implementation of GCPK Parallel Clustering

(1) Upload the dataset to the HDFS(2) Upload the initial center to the HDFS

(3) According to the original data and the previous itera-tion (or initial center) calculate the current iterationwith the Map function

(4) Calculate the sum of the pointsrsquo number and weightin the same cluster and recount the clustering centerof cluster

(5) Do not control the iterative process until the max-imum number of iterations is exceeding or all theclustering has been convergent at this time calculatethe average dissimilarity degree and the maximumdissimilarity degree within each cluster of all clustersin addition count the average dissimilarity degreeand the maximum dissimilarity degree in the wholecluster

(3) The Analysis of Abnormal Data Download the resultof clustering to local through HDFS and convert binaryserialization format file into a text file Analyzing the resultsfile and getting the average dissimilarity degree and themaxi-mumdissimilarity degree in every cluster counting the radiusof every dimension combining the weight of each dimensiondata for cleaning abnormal data Because the amounts of dataare huge and the data quantity is too large it is unable tocomplete artificial recognition and use the simple method toidentify the abnormal data It is necessary to use cloud com-puting technology to clean the abnormal clustering results

(1) Read the average dissimilarity degree 119863119894in every

cluster and the dimension radius 119877119895 Furthermore

119894 = 1 2 119896 and 119896 are the clusterrsquos numbers 119895 =1 2 119899 and 119899 are the dimensions

(2) Calculate the weight 119876119895 which is within the cluster

119876119895= 119905119895119899 The ldquo119905rdquo is the number of times in which

the data of dimension 119895 come out the ldquo119899rdquo is the totaldimension

(3) Work out the reasonable range of dissimilarity degreedis(119894) in clusters in addition

dis (119894) = radic119863(119894)2 +119899

sum

119895=1

119876119895lowast 1198772119895

(2)

119894 = 1 2 119896 119896 denotes the number of all clusters(4) Test the data object which is got from MapReduce

according to the data objects Euclidean distance ofthe cluster center (dissimilarity degree) and the rea-sonable range of dissimilarity degree within clustersand to judge whether the data is abnormal or notIf dissimilarity degrees are beyond reasonable rangewe can know it is the abnormal data and record thenumber

5 Experimental Environment andResults Analysis

51 Experimental Environment

511 Single-Machine Environment This paper uses MAT-LAB2010 single-machine environment the operations of the

The Scientific World Journal 5

0

20

40

60

80

3 5 7 9 10 11 12 14 16

Cluster k

Aver

age p

hase

thin

DataSet1DataSet2

DataSet3DataSet4

Figure 4 119896 values for experiments

specific environment are the Intel(R) Core(TM) i3 CPU253GHz 4GB of memory and a 64-bit operating system

512 Distributed Environment In this paper the experimentof distributed environment uses open source cloud comput-ing platform hadoop whose version is 112 Specific cloudcluster is as follows one machine runs as the Name Nodeand Job Tracker while the other 19 run as a Catano andTask Tracker The experimental cloud platform uses the VMtechnology specific software configuration is Cream wareworkstation-full-902 a 64-bit Ubuntu-12043 JDK versionis jdk170 21 and eclipse version is Linux version

513 Data Source The data for clustering analysis are fromthe tunnel monitoring data including flow temperaturewind speed illumination intensity CO concentration speedwind speed and so on

52The Experimental Data This experiment has adopted thefive datasets which were gathered from tunnel monitoringsystem ofWuhan Fruit Lake respectively they wereDataSet1DataSet2 DataSet3 DataSet4 and DataSet5 Furthermorethe seventh dimensions tunnel data record has been covered42993 times by the DataSet1 which can be used for singleexperiment and distributed experimentThe thirtieth dimen-sions tunnel data record has been covered respectively 42775times article 65993 times article 100 000 000 times andarticle 2000 000 000 times by DataSet2ndashDataSet5

53 Analysis of Experimental Result

Experiment 1 There are 42993 tunnel data whose dimensionvalue is 7 there are 42993 tunnel data whose dimension valueis 7 the ideal value of these data is 5 under the conditionof single machine MATLAB was used to complete 119896-meansserial clustering algorithm [26] Clara clustering algorithm[27] and RICA clustering algorithm [28] in the case of thedistributed cluster to accomplish 119896-means parallel clusteringalgorithm with hadoop [29] and to cluster the tunnel datawith GCPK the results are shown in Table 1 under thecondition of two kinds of the clustering algorithm

Experiment 2 For 119896-means clustering the initial value ofk has great influence on the clustering results and the

Table 1 Thin contrast of DataSet1

Serializing119896-means Clara RICA Parallel

119896-means GCPK

Cluster (119896) 5 5 5 5 5Averagephase thin 232027 218868 190924 160240 160226

Maximumphase thin 1052276 1132348 965631 782311 773419

Abnormalfinding 8322 8936 8953 9323 933

Note the average phase thin and maximum phase thin are the mean valuesof each cluster

Table 2 Clustering time

Dataset cluster119896-means

Serializing119896-means

Parallel119896-means GCPK

DataSet2 10 20min 18 s 7min 50 s 7min 22 sDataSet3 10 27min 35 s 9min 21 s 8min 9 sDataSet4 12 incalculable 18min 33 s 16min 20 sDataSet5 12 incalculable 29min 9 s 23min 15 s

traditional clustering cannot analyze the k in the hugeamounts of data This experiment adopted GCPK algorithmrespectively In terms of four datasets which respectively areDataSet1 DataSet2 DataSet3 and DataSet4 we choose thedifferent k values for them to conduct experiments And theresults are shown in Figure 4

Experiment 3 According to the tunnel monitoring datacollected from DataSet2 to DataSet5 (the dimension valueis 30) under the distributed circumstances clustering with119896-means parallel algorithm and 119896-means parallel improvedalgorithm severally the cluster number k would find idealvalue through GCPK the time cost in Clustering was shownat Table 2

6 The Conclusion and Expectation

Experimental results show that GCPK the improved algo-rithm of 119896-means parallel clustering algorithm has a betterclustering result than the traditional clustering algorithm andhas a higher efficiency It is very convenient to clean theabnormal data for the clustering results in terms of massivedata such as tunnel data it has shown its strong practicabilityAt present due to the limited data formwe are only able to dopreliminary clustering about structured document Howeverafter the relevant documents for clustering of data protectionand the processing of unstructured documents we will dofurther researches on the file data protection after clusteringand the processing of unstructured documents in the future

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

6 The Scientific World Journal

Acknowledgments

This research was undertaken as part of Project 61003130funded by National Natural Science Foundation of ChinaCollaborative Innovation Center of Hubei Province forInformation Technology and Service of Elementary Educa-tion Wuhan Technology Department Innovation Project(2013070204005) Hengyang Technology Department(2012KG76) and Hunan Science and Technology Plan Pro-ject (2013GK3037)

References

[1] G J Li ldquoThe scientific value in the study of big datardquo ComputerScience Communication vol 8 no 9 pp 8ndash15 2012

[2] K Chen andW-M Zheng ldquoCloud computing system instancesand current researchrdquo Journal of Software vol 20 no 5 pp1337ndash1348 2009

[3] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[4] X P Jing C H Li and W Xiang ldquoThe effectuation of Bayesbook clustering algorithm under cloud platformrdquo ComputerApplication vol 31 no 9 pp 2551ndash2554 2011

[5] R Guo and X H Zhao ldquoThe research of urban traffic informa-tion service system based on database technologyrdquo InformationCommunication vol 2 pp 105ndash109 2013

[6] Q J Zhang L Zhong J L Yuan and Q B Wang ldquoDesignand realization of a city tunnel environment monitoring systembased on observer patternrdquo in Proceedings of the 2nd Interna-tional Conference on Electric Technology and Civil Engineering(ICETCE rsquo12) no 5 pp 639ndash642 2012

[7] D Ben-Arieh and D K Gullipalli ldquoData envelopment analysisof clinics with sparse data fuzzy clustering approachrdquo Comput-ers amp Industrial Engineering vol 63 no 1 pp 13ndash21 2012

[8] L Zhong L Li J L Yuan and H Xia ldquoThe storage method andsystem of urban tunnel monitoring data based of XMLrdquo patent2011104589952 China

[9] C Liu ldquoNo default categories for large amount of data clusteringalgorithm researchrdquo vol 4 pp 1ndash6 2012

[10] Q Deng and Q Chen ldquoCloud computing and its key technolo-giesrdquo Development and Application of High Performance Com-puting vol 1 no 2 p 6 2009

[11] S Owen and R Anil Shadoof in Action Manning Publications2012

[12] M Armbrust and A Fox ldquoAbove the clouds a Berkeley viewof cloud computingrdquo Tech Rep University of California atBerkeley Berkeley Calif USA 2009

[13] L Dong ldquoShadoof technology insiderrdquo Mechanical IndustryPublication vol 5 pp 31ndash36 2013

[14] R Chao ldquoResearch and application of Shadoof based on cloudcomputingrdquo Tech Rep Chongqing University 2011

[15] SThuG Bhang and E Li ldquoThe improvement of k-means algo-rithm based on Hadoop platform oriented file data setrdquoHarbinNormal University Journal of Natural Science vol 1 no 28 2012

[16] J Zhang H Huang and J Wang ldquoManifold learning for visu-alizing and analyzing high-dimensional datardquo IEEE IntelligentSystems vol 25 no 4 pp 54ndash61 2010

[17] R Duo A Ganglia and D Chen ldquoSpread neighbor clusteringfor large data setsrdquo Computer Engineering vol 36 no 23 pp22ndash24 2010

[18] E Ca and Y Chou ldquoAlgorithm design and implementationbased on graphsrdquo Computer Engineering vol 38 no 24 pp 14ndash16 2012

[19] D Y Ha ldquoThe new clustering algorithm design applied to largedata setsrdquo Guiyang Teachers College Journal vol 28 no 1 pp67ndash71 2011

[20] H Wang ldquoClustering algorithm based on gird researchrdquo TechRep Nanjing Normal Unicameral Nanjing China 2011

[21] B Chi ldquok-means research and improvement of the clusteringalgorithmrdquo Tech Rep 4 2012

[22] A Chou and Y Yu ldquok-means clustering algorithmrdquo ComputerTechnology and Development vol 21 no 2 pp 62ndash65 2011

[23] B L Han KWang G F Ganglia et al ldquoA improved initial clus-tering center selection algorithm based on k-meansrdquo ComputerEngineering and Application vol 46 no 17 pp 151ndash152 2010

[24] W Chao H F Ma L Yang Cu et al ldquoThe parallel k-meansclustering algorithm design and research based on cloud com-puting platform Hadooprdquo Computer Science vol 10 no 38 pp166ndash167 2011

[25] R L Ferreira Cordeiro C Traina Jr A J Machado Traina JLopez U Kang and C Faloutsos ldquoClustering very large multi-dimensional datasets with MapReducerdquo in Proceedings of the17th ACM SIGKDD International Conference on Knowledge Dis-covery and Data Mining (KDD rsquo11) pp 690ndash698 New York NYUSA 2011

[26] L Zhang Y Chen and J Zhang ldquoA k-means algorithm basedon densityrdquo Application Research on Computer vol 28 no 11pp 4071ndash4085 2011

[27] M X Duan ldquoClara clustering algorithm combined with theimprovementrdquo Computer Engineer and Application vol 46 no22 pp 210ndash212 2010

[28] K H Tang L Zhong L Li and G Yang ldquoUrban tunnel cleanup the dirtydata based on clara optimization clusteringrdquo Journalof Applied Sciences vol 13 pp 1980ndash1983 2013

[29] M H Zhang ldquoAnalysis and study of data mining algorithmbased on Hadooprdquo Tech Rep 3 2012


Figure 2: The execution process of the Map phase. Input splits from HDFS are read as (k, v) pairs, passed through Map(), and then through Partition(), which outputs the intermediate (k', v') pairs.

Figure 3: The execution process of the Shuffle phase. The (k', v') pairs produced by Partition() are grouped into partitions, merged locally by Combine(), written to local storage, and handed to Reduce() as (key, value list) pairs.

This part consumes the most time during the operation and is well suited to parallel processing [23]. The MapReduce process of parallel k-means is as follows.

(1) The Process of Map. The Map phase of the MapReduce job is performed by several Map functions; one Map function is started for each input split. Each Map function calculates the distance between every data object in its split and the current cluster centers and tags the data object with the number of the cluster whose center is closest to it. Its input consists of all the data objects waiting to be clustered together with the cluster centers from the previous iteration (or the initial centers), supplied as (key, value) pairs: the key is the offset of the current data sample relative to the starting point, and the value is the string vector of the current data sample, composed of its coordinates in each dimension. After the Map phase, the result is output as (key, value) pairs containing the assigned cluster number and the recorded value.

Its execution process is shown in Figure 2
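To make the Map step concrete, the following is a minimal sketch of such a Mapper written against the Hadoop MapReduce Java API (the current org.apache.hadoop.mapreduce API, not necessarily the Hadoop 1.1.2 code used in the paper). It assumes that each record is a line of comma-separated coordinates and that the current centers are passed through the job configuration under the key "gcpk.centers"; the "1," count prefix on the output value is a design choice of this sketch so that the Combine and Reduce steps can share one record format. None of this is the authors' actual code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step: tags every record with the number of its nearest cluster center.
// Input  key/value: (byte offset of the line, comma-separated coordinates)
// Output key/value: (nearest cluster number, "1,x1,...,xn")
public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private double[][] centers;

    @Override
    protected void setup(Context context) {
        // The current centers are passed through the job configuration as
        // "x1,...,xn;y1,...,yn;..." (an assumption of this sketch).
        String[] rows = context.getConfiguration().get("gcpk.centers").split(";");
        centers = new double[rows.length][];
        for (int c = 0; c < rows.length; c++) {
            centers[c] = parse(rows[c]);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        double[] point = parse(line.toString());
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centers[c][i];
                d += diff * diff;                  // squared Euclidean distance
            }
            if (d < best) { best = d; nearest = c; }
        }
        // Prefix "1," so the Combine and Reduce steps share one record format.
        context.write(new IntWritable(nearest), new Text("1," + line.toString()));
    }

    private static double[] parse(String csv) {
        String[] parts = csv.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
    }
}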

(2) The Process of Shuffle. This phase consists of two main functions, Partition and Combine. Its input is the (key, value) pairs produced during the Map phase, and its main purpose is to combine the data locally before they enter the Reduce phase, in order to reduce the communication cost during the iteration. First, the Partition function sorts the provisional results produced during the Map phase and divides the temporary (key, value) pairs into several groups (partitions); the input key corresponds to the cluster number, while the value corresponds to the string vector composed of the dimensional coordinates. Then the Combine function merges each group, counting the number of data samples and the sum of all the data coordinates, and the outputs, in the form of (key, value list) pairs, are given to the different Reduce functions. Its execution process is shown in Figure 3.
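A Combine function in this spirit can be registered as the job's combiner; it merges the per-point records of one cluster into a single partial record "count,sum_1,...,sum_n", which is the same assumed format as in the Mapper sketch above. In this sketch the default hash partitioner plays the role of the Partition function. This is only an illustration, not the paper's implementation.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Local combine step: merges records of the form "count,x_1,...,x_n" for one
// cluster into a single "count,sum_1,...,sum_n" record, which greatly reduces
// the amount of data shuffled to the Reduce phase. The merge is associative,
// so it stays correct even if the combiner runs zero or several times.
public class KMeansCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable cluster, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double[] sums = null;
        long count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            if (sums == null) sums = new double[parts.length - 1];
            count += Long.parseLong(parts[0].trim());
            for (int i = 0; i < sums.length; i++) {
                sums[i] += Double.parseDouble(parts[i + 1].trim());
            }
        }
        StringBuilder out = new StringBuilder(Long.toString(count));
        for (double s : sums) out.append(',').append(s);
        context.write(cluster, new Text(out.toString()));
    }
}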

(3) The Process of Reduce. The Reduce function accepts the (key, value list) pairs worked out by the Combine function. Here the key corresponds to the cluster number, and the value list contains the number of data samples together with the string vectors composed of the dimensional coordinate sums. By adding up the coordinate sums of every dimension and dividing by the number of data samples, the new cluster center is worked out and passed, in the form of (key, value) pairs, into the next MapReduce round.
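A matching Reduce function, again only a sketch under the same assumed record format, merges the partial records and divides by the total count to produce the new center of each cluster.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Final reduce step: merges the partial (count, coordinate sums) records of a
// cluster and divides by the total count to obtain the new cluster center.
public class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable cluster, Iterable<Text> partials, Context context)
            throws IOException, InterruptedException {
        double[] sums = null;
        long count = 0;
        for (Text p : partials) {
            String[] parts = p.toString().split(",");
            if (sums == null) sums = new double[parts.length - 1];
            count += Long.parseLong(parts[0].trim());
            for (int i = 0; i < sums.length; i++) {
                sums[i] += Double.parseDouble(parts[i + 1].trim());
            }
        }
        StringBuilder center = new StringBuilder();
        for (int i = 0; i < sums.length; i++) {
            if (i > 0) center.append(',');
            center.append(sums[i] / count);        // new center coordinate
        }
        context.write(cluster, new Text(center.toString()));
    }
}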

4. The Good Center Parallel K-Means (GCPK) Algorithm

4.1. The Calculation Method of the Dissimilarity Degree. This experiment uses the Euclidean distance as the measurement standard to calculate the similarity degree among the data in a cluster. In this article, the average dissimilarity degree refers to the average Euclidean distance between the data within a cluster and the cluster center. The lower the average dissimilarity degree, the more similar the data within the cluster are; in other words, the clustering result is more ideal. The Euclidean distance is computed as

    d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2},                    (1)

where m denotes the dimension of the data record.
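As a small illustration, formula (1) and the average dissimilarity degree of a cluster can be computed as follows. This is a plain Java sketch with illustrative class and method names, not the authors' code.

// Formula (1): Euclidean distance between two m-dimensional records, plus the
// average dissimilarity degree of a cluster (mean distance to its center).
public final class Dissimilarity {

    public static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static double averageDissimilarity(double[][] clusterPoints, double[] center) {
        double total = 0.0;
        for (double[] p : clusterPoints) {
            total += euclidean(p, center);
        }
        return clusterPoints.length == 0 ? 0.0 : total / clusterPoints.length;
    }
}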

4.2. The Defects of the Parallel K-Means Clustering Algorithm. The parallel k-means clustering algorithm solves the problem of the long iteration time of serial k-means very well, but it does not handle the selection of the initial centers. In order to choose appropriate initial centers, reduce unnecessary iterations, decrease the running time, and improve efficiency for tunnel data, we propose an improved parallel k-means algorithm, the Good Center Parallel K-means algorithm, abbreviated GCPK. The algorithm first finds, within the dataset, points that basically divide the data evenly and are therefore appropriate as good initial centers; it then performs the parallel k-means clustering. Finally, the abnormal data are analyzed on the basis of the clustering results.

4.3. The Implementation of the GCPK Algorithm

(1) Obtaining the Initial Centers through the Relative Centroid. The k-means clustering method is a typical partition-based clustering method. Because the centers are replaced constantly during k-means clustering [24], the selection of the initial centers has a great influence on the algorithm [25]. In order to decrease the number of replacements and make the k-means clustering more efficient, it is necessary to find initial centers that are close to the true centers and roughly divide the dataset.

Because tunnel data are nonnegative, the tunnel data can be regarded as a multidimensional data graphic composed of n-dimensional records, each record being treated as a point whose coordinates are its dimension values. Find the point Q_1 = (i_1, i_2, \ldots, i_n) that is closest to the origin of the coordinates and the point Q_2 = (j_1, j_2, \ldots, j_n) that is farthest away from the origin.

Connect Q_1 and Q_2 with a line segment and divide the segment into k + 1 equal parts (this k is the same as the k in k-means); regarding each record as a particle, there are k equal-division points on the segment (not including the endpoints). The coordinates of the t-th equal-division point (0 < t < k + 1, t \in Z) are (P_{t1}, P_{t2}, \ldots, P_{tn}), where n denotes the number of dimensions and

    P_{tm} = i_m + \frac{t}{k+1} (j_m - i_m),    m = 1, 2, \ldots, n.

These k points divide the multidimensional data graphic evenly, and we take them as the initial center points.
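The construction above can be sketched in plain Java as follows. The reading of the equal-division formula is the reconstruction given above, and the class and method names are illustrative assumptions.

// GCPK-style initial centers: Q1 is the record closest to the origin, Q2 the
// farthest one; the k points dividing the segment Q1Q2 into k+1 equal parts
// are used as the initial cluster centers.
public final class GoodCenters {

    public static double[][] initialCenters(double[][] data, int k) {
        double[] q1 = data[0], q2 = data[0];
        double minNorm = norm(data[0]), maxNorm = minNorm;
        for (double[] p : data) {
            double n = norm(p);
            if (n < minNorm) { minNorm = n; q1 = p; }
            if (n > maxNorm) { maxNorm = n; q2 = p; }
        }
        int dims = q1.length;
        double[][] centers = new double[k][dims];
        for (int t = 1; t <= k; t++) {
            for (int m = 0; m < dims; m++) {
                // t-th equal-division point of the segment from Q1 to Q2
                centers[t - 1][m] = q1[m] + (double) t / (k + 1) * (q2[m] - q1[m]);
            }
        }
        return centers;
    }

    private static double norm(double[] p) {
        double s = 0.0;
        for (double v : p) s += v * v;
        return Math.sqrt(s);
    }
}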

(2) The Specific Implementation of GCPK Parallel Clustering. The clustering is organized as follows (a driver sketch is given after this list).

(1) Upload the dataset to HDFS.

(2) Upload the initial centers to HDFS.

(3) According to the original data and the centers from the previous iteration (or the initial centers), compute the current iteration with the Map function.

(4) Compute the number of points and the coordinate sums within each cluster and recompute the center of every cluster.

(5) Continue the iterative process until the maximum number of iterations is exceeded or all the clusters have converged; at that point, calculate the average dissimilarity degree and the maximum dissimilarity degree within each cluster, and also count the average dissimilarity degree and the maximum dissimilarity degree over the whole clustering.
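A driver that wires these steps into an iterative sequence of Hadoop jobs might look like the sketch below. It reuses the Mapper, Combiner, and Reducer sketched earlier; the configuration key, HDFS paths, iteration limit, and the naive string-equality convergence check are assumptions made for illustration, and the code targets the current mapreduce API rather than the paper's Hadoop 1.1.2 setup.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Iteration driver: one MapReduce job per k-means round; stops when the
// centers no longer change (naive string comparison) or a round limit is hit.
public class GcpkDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = args[0];            // HDFS path of the uploaded dataset
        String centers = args[1];          // initial centers: "x1,x2,...;y1,y2,...;..."
        int maxIterations = 20;            // illustrative limit

        for (int iter = 0; iter < maxIterations; iter++) {
            conf.set("gcpk.centers", centers);          // read by the Mapper's setup()

            Job job = Job.getInstance(conf, "gcpk-kmeans-" + iter);
            job.setJarByClass(GcpkDriver.class);
            job.setMapperClass(KMeansMapper.class);
            job.setCombinerClass(KMeansCombiner.class);
            job.setReducerClass(KMeansReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);

            Path out = new Path(input + "-centers-" + iter);
            FileInputFormat.addInputPath(job, new Path(input));
            FileOutputFormat.setOutputPath(job, out);
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Iteration " + iter + " failed");
            }

            String newCenters = readCenters(FileSystem.get(conf), out);
            if (newCenters.equals(centers)) {           // converged: centers unchanged
                break;
            }
            centers = newCenters;                       // feed into the next round
        }
    }

    // Collects the reducer output ("clusterId<TAB>c1,c2,...") into one string.
    private static String readCenters(FileSystem fs, Path dir) throws Exception {
        List<String> lines = new ArrayList<String>();
        for (FileStatus f : fs.listStatus(dir)) {
            if (!f.getPath().getName().startsWith("part-")) continue;
            BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(f.getPath())));
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line.substring(line.indexOf('\t') + 1));
            }
            r.close();
        }
        return String.join(";", lines);
    }
}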

(3) The Analysis of Abnormal Data. Download the clustering result from HDFS to the local machine and convert the binary serialization file into a text file. Analyze the result file to obtain the average dissimilarity degree and the maximum dissimilarity degree of every cluster, count the radius of every dimension, and combine these with the weight of each dimension to clean the abnormal data. Because the amount of data is huge, it is impossible to identify the abnormal data by manual inspection or by simple methods; it is necessary to use cloud computing technology to clean the abnormal data from the clustering results.

(1) Read the average dissimilarity degree D_i of every cluster and the dimension radius R_j, where i = 1, 2, \ldots, k (k is the number of clusters) and j = 1, 2, \ldots, n (n is the number of dimensions).

(2) Calculate the weight Q_j within the cluster: Q_j = t_j / n, where t_j is the number of times the data of dimension j occur and n is the total number of dimensions.

(3) Work out the reasonable range of the dissimilarity degree dis(i) within each cluster:

    \operatorname{dis}(i) = \sqrt{D(i)^2 + \sum_{j=1}^{n} Q_j R_j^2},    i = 1, 2, \ldots, k,                    (2)

where k denotes the number of all clusters.

(4) Test every data object obtained from MapReduce: compare the object's Euclidean distance to its cluster center (its dissimilarity degree) with the reasonable range of the dissimilarity degree within its cluster to judge whether the object is abnormal. If the dissimilarity degree is beyond the reasonable range, the object is abnormal and its number is recorded (a code sketch of this check follows the list).
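The check in steps (1) to (4) can be sketched as follows, reusing the Dissimilarity helper from Section 4.1. The method names and the way the per-cluster statistics are passed in are illustrative assumptions, not the paper's implementation.

import java.util.ArrayList;
import java.util.List;

// Sketch of the abnormal-data check built on formula (2): a record is flagged
// when its distance to its cluster center exceeds the reasonable dissimilarity
// range dis(i) of that cluster.
public final class AbnormalDataFilter {

    // avgDissimilarity = D(i) of one cluster; radii[j] = R_j; weights[j] = Q_j.
    public static double reasonableRange(double avgDissimilarity, double[] radii, double[] weights) {
        double sum = avgDissimilarity * avgDissimilarity;
        for (int j = 0; j < radii.length; j++) {
            sum += weights[j] * radii[j] * radii[j];
        }
        return Math.sqrt(sum);                  // formula (2)
    }

    // Returns the indices of the records whose dissimilarity degree exceeds
    // the reasonable range dis[c] of their assigned cluster c.
    public static List<Integer> findAbnormal(double[][] data, int[] assignment,
                                             double[][] centers, double[] dis) {
        List<Integer> abnormal = new ArrayList<Integer>();
        for (int idx = 0; idx < data.length; idx++) {
            int c = assignment[idx];
            double d = Dissimilarity.euclidean(data[idx], centers[c]);
            if (d > dis[c]) {
                abnormal.add(idx);              // record the abnormal data object
            }
        }
        return abnormal;
    }
}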

5. Experimental Environment and Results Analysis

5.1. Experimental Environment

5.1.1. Single-Machine Environment. This paper uses MATLAB 2010 in a single-machine environment; the specific configuration is an Intel(R) Core(TM) i3 CPU at 2.53 GHz, 4 GB of memory, and a 64-bit operating system.

Figure 4: The k values for the experiments (average dissimilarity degree versus cluster number k, with k = 3, 5, 7, 9, 10, 11, 12, 14, 16, for DataSet1, DataSet2, DataSet3, and DataSet4).

5.1.2. Distributed Environment. The distributed experiments use the open-source cloud computing platform Hadoop, version 1.1.2. The specific cloud cluster is as follows: one machine runs as the NameNode and JobTracker, while the other 19 run as DataNodes and TaskTrackers. The experimental cloud platform uses virtual machine technology; the specific software configuration is VMware Workstation Full 9.0.2, 64-bit Ubuntu 12.04.3, JDK version jdk1.7.0_21, and the Linux version of Eclipse.

5.1.3. Data Source. The data for the clustering analysis come from the tunnel monitoring data, including traffic flow, temperature, wind speed, illumination intensity, CO concentration, vehicle speed, and so on.

5.2. The Experimental Data. This experiment adopted five datasets gathered from the tunnel monitoring system of Wuhan Fruit Lake: DataSet1, DataSet2, DataSet3, DataSet4, and DataSet5. DataSet1 contains 42,993 records of seven-dimensional tunnel data and can be used for both the single-machine and the distributed experiments. DataSet2 to DataSet5 contain, respectively, 42,775; 65,993; 100,000,000; and 2,000,000,000 records of thirty-dimensional tunnel data.

5.3. Analysis of Experimental Results

Experiment 1. There are 42,993 tunnel data records whose dimension is 7, and the ideal cluster number for these data is 5. Under the single-machine condition, MATLAB was used to run the serial k-means clustering algorithm [26], the Clara clustering algorithm [27], and the RICA clustering algorithm [28]; in the distributed cluster, the parallel k-means clustering algorithm was run with Hadoop [29], and the tunnel data were also clustered with GCPK. The results of the two kinds of clustering setups are shown in Table 1.

Table 1: Dissimilarity comparison on DataSet1.

                               Serializing k-means   Clara     RICA      Parallel k-means   GCPK
Cluster (k)                    5                     5         5         5                  5
Average dissimilarity degree   232027                218868    190924    160240             160226
Maximum dissimilarity degree   1052276               1132348   965631    782311             773419
Abnormal data found            8322                  8936      8953      9323               933

Note: the average and maximum dissimilarity degrees are the mean values over the clusters.

Experiment 2. For k-means clustering, the initial value of k has a great influence on the clustering results, and traditional clustering cannot determine k for huge amounts of data. In this experiment the GCPK algorithm was applied to four datasets, DataSet1, DataSet2, DataSet3, and DataSet4, choosing different values of k for each of them; the results are shown in Figure 4.

Table 2: Clustering time.

Dataset     Cluster k    Serializing k-means    Parallel k-means    GCPK
DataSet2    10           20 min 18 s            7 min 50 s          7 min 22 s
DataSet3    10           27 min 35 s            9 min 21 s          8 min 9 s
DataSet4    12           incalculable           18 min 33 s         16 min 20 s
DataSet5    12           incalculable           29 min 9 s          23 min 15 s

Experiment 3. For the tunnel monitoring data collected in DataSet2 to DataSet5 (whose dimension is 30), clustering was performed in the distributed environment with the parallel k-means algorithm and with the improved parallel k-means algorithm (GCPK) separately; the cluster number k was set to the ideal value found through GCPK. The time cost of the clustering is shown in Table 2.

6. Conclusion and Expectation

The experimental results show that GCPK, the improved parallel k-means clustering algorithm, achieves a better clustering result than the traditional clustering algorithms and has a higher efficiency. It is very convenient for cleaning the abnormal data from the clustering results; for massive data such as tunnel data, it has shown strong practicability. At present, due to the limited data form, we are only able to perform preliminary clustering on structured documents. In the future we will do further research on protecting the file data after clustering and on processing unstructured documents.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

This research was undertaken as part of Project 61003130, funded by the National Natural Science Foundation of China, the Collaborative Innovation Center of Hubei Province for Information Technology and Service of Elementary Education, the Wuhan Technology Department Innovation Project (2013070204005), the Hengyang Technology Department (2012KG76), and the Hunan Science and Technology Plan Project (2013GK3037).

References

[1] G. J. Li, "The scientific value in the study of big data," Computer Science Communication, vol. 8, no. 9, pp. 8–15, 2012.
[2] K. Chen and W.-M. Zheng, "Cloud computing: system instances and current research," Journal of Software, vol. 20, no. 5, pp. 1337–1348, 2009.
[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[4] X. P. Jing, C. H. Li, and W. Xiang, "The effectuation of Bayes book clustering algorithm under cloud platform," Computer Application, vol. 31, no. 9, pp. 2551–2554, 2011.
[5] R. Guo and X. H. Zhao, "The research of urban traffic information service system based on database technology," Information Communication, vol. 2, pp. 105–109, 2013.
[6] Q. J. Zhang, L. Zhong, J. L. Yuan, and Q. B. Wang, "Design and realization of a city tunnel environment monitoring system based on observer pattern," in Proceedings of the 2nd International Conference on Electric Technology and Civil Engineering (ICETCE '12), no. 5, pp. 639–642, 2012.
[7] D. Ben-Arieh and D. K. Gullipalli, "Data envelopment analysis of clinics with sparse data: fuzzy clustering approach," Computers & Industrial Engineering, vol. 63, no. 1, pp. 13–21, 2012.
[8] L. Zhong, L. Li, J. L. Yuan, and H. Xia, "The storage method and system of urban tunnel monitoring data based on XML," patent 2011104589952, China.
[9] C. Liu, "No default categories for large amount of data clustering algorithm research," vol. 4, pp. 1–6, 2012.
[10] Q. Deng and Q. Chen, "Cloud computing and its key technologies," Development and Application of High Performance Computing, vol. 1, no. 2, p. 6, 2009.
[11] S. Owen and R. Anil, Mahout in Action, Manning Publications, 2012.
[12] M. Armbrust and A. Fox, "Above the clouds: a Berkeley view of cloud computing," Tech. Rep., University of California at Berkeley, Berkeley, Calif, USA, 2009.
[13] L. Dong, "Hadoop technology insider," Mechanical Industry Publication, vol. 5, pp. 31–36, 2013.
[14] R. Chao, "Research and application of Hadoop based on cloud computing," Tech. Rep., Chongqing University, 2011.
[15] S. Thu, G. Bhang, and E. Li, "The improvement of k-means algorithm based on Hadoop platform oriented file data set," Harbin Normal University Journal of Natural Science, vol. 1, no. 28, 2012.
[16] J. Zhang, H. Huang, and J. Wang, "Manifold learning for visualizing and analyzing high-dimensional data," IEEE Intelligent Systems, vol. 25, no. 4, pp. 54–61, 2010.
[17] R. Duo, A. Ganglia, and D. Chen, "Spread neighbor clustering for large data sets," Computer Engineering, vol. 36, no. 23, pp. 22–24, 2010.
[18] E. Ca and Y. Chou, "Algorithm design and implementation based on graphs," Computer Engineering, vol. 38, no. 24, pp. 14–16, 2012.
[19] D. Y. Ha, "The new clustering algorithm design applied to large data sets," Guiyang Teachers College Journal, vol. 28, no. 1, pp. 67–71, 2011.
[20] H. Wang, "Clustering algorithm based on grid research," Tech. Rep., Nanjing Normal University, Nanjing, China, 2011.
[21] B. Chi, "k-means research and improvement of the clustering algorithm," Tech. Rep. 4, 2012.
[22] A. Chou and Y. Yu, "k-means clustering algorithm," Computer Technology and Development, vol. 21, no. 2, pp. 62–65, 2011.
[23] B. L. Han, K. Wang, G. F. Ganglia, et al., "An improved initial clustering center selection algorithm based on k-means," Computer Engineering and Application, vol. 46, no. 17, pp. 151–152, 2010.
[24] W. Chao, H. F. Ma, L. Yang Cu, et al., "The parallel k-means clustering algorithm design and research based on cloud computing platform Hadoop," Computer Science, vol. 10, no. 38, pp. 166–167, 2011.
[25] R. L. Ferreira Cordeiro, C. Traina Jr., A. J. Machado Traina, J. Lopez, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 690–698, New York, NY, USA, 2011.
[26] L. Zhang, Y. Chen, and J. Zhang, "A k-means algorithm based on density," Application Research on Computer, vol. 28, no. 11, pp. 4071–4085, 2011.
[27] M. X. Duan, "Clara clustering algorithm combined with the improvement," Computer Engineer and Application, vol. 46, no. 22, pp. 210–212, 2010.
[28] K. H. Tang, L. Zhong, L. Li, and G. Yang, "Urban tunnel clean up the dirty data based on Clara optimization clustering," Journal of Applied Sciences, vol. 13, pp. 1980–1983, 2013.
[29] M. H. Zhang, "Analysis and study of data mining algorithm based on Hadoop," Tech. Rep. 3, 2012.
