
[IEEE 2011 IEEE International Conference on Image Information Processing (ICIIP) - Shimla, Himachal Pradesh, India (2011.11.3-2011.11.5)]

2011 International Conference on Image Information Processing (ICIIP 2011)

Proceedings of the 2011 International Conference on Image Information Processing (ICIIP 2011) 978-1-61284-861-7/11/$26.00 ©2011 IEEE

Random Automatic Detection of Clusters

Mamta Mittal, V. P. Singh

Computer Science & Engineering Department, Thapar University, Patiala, INDIA

Sharma R. K.*

School of Mathematics and Computer Applications, Thapar University, Patiala

Email: [email protected]

Abstract—Clustering is a way to partition a database into various groups. It is used in data mining on a very large scale. There are different clustering methods, but the focus of this paper is on partitioning based clustering. Many algorithms in the literature, including k-Means, require prior information from the outside world about the number of clusters into which the database is to be divided. Nowadays, however, a database requires algorithms that can generate clusters automatically, and moreover, at each run the database may need to be partitioned into a different number of clusters as well as into groupings of different shapes and sizes. In this paper, a new partitioning based clustering algorithm that can generate clusters automatically, without any previous knowledge on the user's side, is proposed. The clusters so generated may differ not only in number but also in shape and size.

Keywords—KDD, data mining, partitioning based clustering

I. INTRODUCTION

These days, organizations are not using their data repository systems effectively for making profits. The volume of data in their repositories is so large that it is difficult to mine. Human eyes cannot visualize the relevant information hidden in databases, yet it is very desirable for a company to analyze its databases to increase its profitability. To achieve this, companies must employ some procedure to mine their databases. For this, every organization has to go through the knowledge discovery process [7], as shown in Figure 1.1. Data mining [5] is one of the steps in this process. Data mining extracts or predicts valuable information from a large database. It is a powerful technology with great potential to help companies and data analysts focus on the most important information in their data repositories. Data mining is a way to analyze data and to classify it.

Data mining tools predict future trends and behaviors, allowing businesses to take profitable and knowledge-driven decisions. They can answer business questions that traditionally were time consuming to resolve, searching databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations.

Clustering is the process of grouping the objects of a database into meaningful subclasses such that the similarity of objects in the same group is maximized and the similarity of objects in different groups is minimized. Clustering is one of the analysis tools of data mining and has been a widely studied problem in knowledge discovery, pattern recognition and pattern classification [7]. Clustering determines the relationships between data objects in a database. It can be achieved using different methods, namely partitioning methods, hierarchical methods, density based methods, grid-based methods, probabilistic methods, graph theoretical methods and fuzzy techniques. This paper focuses on partitioning methods [9, 10] and proposes a new clustering algorithm for the automatic discovery of data clusters. In partitioning based clustering, one faces the question of whether prior information about the number of clusters into which the database is to be divided is available or not [13, 14]. Depending on this, partitioning based clustering can be further classified into two types, namely supervised and unsupervised clustering. In supervised


clustering, the number of clusters is known in advance, but in unsupervised clustering this prior information is unknown. Finding a known number of clusters is the common approach in partitioning based clustering. In this paper, a new partitioning based unsupervised clustering algorithm is proposed. Unlike supervised clustering, this algorithm does not require the number of clusters in advance.

Figure 1.1: Knowledge Discovery Process

This paper is organized into six sections. The next section reviews the k-Means clustering algorithm. In Section III, the new partitioning based clustering algorithm is proposed. In Section IV, it is shown that the proposed algorithm is faster than k-Means, and in Section V, a numerical example is given to illustrate the proposed algorithm. Section VI concludes the paper.

II. RELATED WORK

Clustering has been an active research area, and most of the research has focused on the effectiveness and scalability of clustering algorithms. Many clustering algorithms exist in the literature. The partitioning based approach, widely used in many applications, is the major focus of this paper. One commonly used partitioning based clustering algorithm is k-Means [11], where k is a known value (the number of clusters). The k-Means algorithm clusters a dataset of size n into k clusters in such a way that each data object di (1 <= i <= n) belongs to only one cluster kj (1 <= j <= k) and each cluster kj has at least one data object. Each cluster has its own representative, known as the centroid of the cluster. This algorithm is simple and easy to implement. However, there is scope for improvement. In this algorithm, the number of final clusters (k) needs to be determined beforehand, and it is sensitive to outliers [8]. This is due to the fact that a small number of outliers can substantially influence the centroid value. Outliers also degrade the quality of a good cluster, i.e., a cluster with a highly dense area. Also, the number of iteration steps needed to produce the final clusters is not known in advance. Furthermore, k-Means finds a local optimum and may actually miss the global optimum [3].
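The k-Means behaviour reviewed above can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not the authors' implementation; the function name and the choice of the first k objects as initial centroids are assumptions made for the example.

```python
def k_means(data, k, max_iter=100):
    """Minimal 1-D k-Means: returns the final centroids and clusters.

    The first k data objects serve as initial centroids (an assumed,
    common convention); iteration stops when centroids no longer change."""
    centroids = list(data[:k])
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[j].append(x)
        # Update step: recompute centroids; an empty cluster keeps its old one.
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

On well-separated data such as [1.0, 1.2, 0.8, 10.0, 10.5, 9.5] with k = 2, the centroids converge near 1.0 and 10.0; note that the result still depends on the initial centroids, which is the sensitivity discussed above.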

A lot of effort has been made to improve the scalability of the k-Means algorithm so that it can be applied to databases of different sizes in an effective (resulting in good quality clusters) and efficient (producing clusters in less time) manner. In the real world, there is a restriction on the user's knowledge of the number of clusters to be generated, especially when clustering is being done for the first time on a repository system. Once the number of clusters is known, clusters can easily be generated from the repository the next time. This paper provides a methodology for creating data clusters automatically, without knowing the number of clusters in advance, that is scalable, effective and efficient as well. The algorithm neither suffers from the problem of local optima nor is it sensitive to outliers, and it simultaneously improves the chance of finding the global optimum. It takes a known number of iterations to generate the final clusters. Clustering can be done at a faster rate when the proposed algorithm is applied. This algorithm may serve as a right candidate among partitioning based clustering algorithms.

A. Clustering Analysis of k-Means Algorithm

An important issue in clustering is how to determine the similarity between two objects, so that clusters can be obtained from database objects. Commonly, distance is used as the measure of similarity between two objects; two objects near to each other can be clustered into a single group. The distance measure can also be extended to objects in more than two dimensions. For clustering analysis, the Euclidean distance formula [12] is used because it applies to multidimensional data objects as well. The distance between object i and object j is denoted by d(i, j) and defined as:

d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)    (1)

where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects.

Suppose that a given set of n objects D = {d1, d2, ..., dn} in an n-dimensional space somehow has to be partitioned into k clusters with cluster set K = {k1, k2, ..., kk}. Each cluster has its own representative, given by the centroid set C = {c1, c2, ..., ck} corresponding to the clusters k1, k2, ..., kk respectively. The centroid cj can be calculated as

cj = (1/m) * sum_{i=1}^{m} x_ij    (2)

where m is the number of data objects in cluster kj and x_ij is the ith object belonging to cluster kj.

One can refer to [11] for further details of the k-Means algorithm.
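Equations (1) and (2) can be transcribed directly into Python; the helper names below are illustrative, not from the paper.

```python
import math

def euclidean_distance(x, y):
    """Equation (1): Euclidean distance between two n-dimensional objects."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def centroid(cluster):
    """Equation (2): component-wise mean of the m data objects in a cluster."""
    m = len(cluster)
    return tuple(sum(obj[dim] for obj in cluster) / m
                 for dim in range(len(cluster[0])))
```

For example, euclidean_distance((0, 0), (3, 4)) gives 5.0, and centroid([(0, 0), (2, 4)]) gives (1.0, 2.0).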

III. PROPOSED ALGORITHM

Some improvements over the k-Means algorithm are proposed in the present approach. As mentioned earlier, the algorithm may improve the chances of finding the global optimum. Here, a threshold value is used to determine the number of clusters produced. The proposed clustering algorithm consists of the following steps.

a) Assign a random object to the first cluster.

b) Consider the next random data object. This object is assigned either to one of the existing clusters or to a new cluster. The assignment is based on a distance criterion, namely the distance between the new object and the centroids of the existing clusters. A new value of the cluster centroid is recomputed after every addition of a new object to an existing cluster.

c) Repeat step b) until all the data objects are clustered.

Figure 3.1 presents the proposed clustering algorithm. In this algorithm, objects are iteratively merged into the existing cluster that is closest to the object. A threshold value, Tth, is used to determine whether an object will be added to an existing cluster or a new cluster will be created.
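The steps above can be sketched as follows. This is a minimal one-dimensional Python illustration, under the assumption that the caller supplies the random presentation order; the function name is ours.

```python
def proposed_clustering(objects, t_th):
    """Threshold-driven clustering: objects arrive in the given (random) order.

    t_th plays the role of the threshold Tth described in step b);
    returns the clusters and their centroids."""
    clusters = [[objects[0]]]      # step a): first object seeds the first cluster
    centroids = [objects[0]]
    for q in objects[1:]:          # step b): assign each remaining object
        # Distance criterion: find the nearest existing centroid.
        m = min(range(len(centroids)), key=lambda j: abs(q - centroids[j]))
        if abs(q - centroids[m]) <= t_th:
            clusters[m].append(q)
            # Recompute the centroid after every addition.
            centroids[m] = sum(clusters[m]) / len(clusters[m])
        else:
            clusters.append([q])   # otherwise the object starts a new cluster
            centroids.append(q)
    return clusters, centroids
```

With objects [1.0, 2.0, 50.0, 51.0] and a threshold of 5.0, this yields the two clusters [1.0, 2.0] and [50.0, 51.0]; the number of clusters emerges from the threshold rather than being fixed in advance.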

IV. ALGORITHM ANALYSIS OF PROPOSED ALGORITHM

The algorithm proposed in this paper is analyzed for its time complexity in this section. The time taken by the algorithm depends on the size of the input given to it. It is worth mentioning that k-Means and the proposed clustering algorithm take different amounts of time to cluster the same data objects. We now analyse the two algorithms for their time complexities. Figure 4.1 describes the k-Means algorithm and its running time.

Figure 3.1: Proposed clustering algorithm

//Input: A set D = {d1, d2, d3, ..., dn} of n objects to cluster and a threshold value Tth.
//Output: A set K = {k1, k2, k3, ..., kk} of k subsets of D as final clusters and a set C = {c1, c2, c3, ..., ck} of centroids of these clusters.

Algorithm: Proposed clustering algorithm (D, Tth)
1.  let k = 1
2.  Randomly choose any object from D; let it be p; k1 = {p}
3.  K = {k1}
4.  c1 = p
5.  C = {c1}
6.  Assign a constant value to Tth
7.  for l = 2 to n do
8.      Choose the next random object from D, other than the already chosen points; let it be q
9.      Determine the distance d between q and the centroid cm in C for which this distance is minimum, using (1)
10.     if d <= Tth then
11.         km = km U {q}
12.         Calculate the new mean (centroid cm) for cluster km using (2)
13.     else k = k + 1
14.         kk = {q}
15.         K = K U {kk}
16.         ck = q
17.         C = C U {ck}

Algorithm: k-Means (D, k, K, C)

Sr.No.  Statement                                                      cost   times
1.      Repeat until (no change in centroids)                          m1     l, l >= 1
2.      for i = 1 to n do                                              m2     l(n + 1)
3.      Determine the distance d(di, cj) between di and each
        centroid cj (1 <= j <= k) in C such that d(di, cj) is
        minimum, using (1)                                             m3     l.n.k
4.      Assign di to cluster kj                                        m4     l.n
5.      Calculate the new mean (centroid) for each cluster kj
        (1 <= j <= k) using (2)                                        m5     l.k


Figure 4.1: Running time of the k-Means algorithm

The running time of this algorithm is the sum of the running times of all executed statements, i.e.,

T(n) = m1.l + m2.l(n + 1) + m3.l.n.k + m4.l.n + m5.l.k

where l is the number of iterations of the outer loop. In the worst case, with l and k both growing with n, the running time is O(n^a), where 2 < a <= 3. In the best case (l = 1 and k = 1), it is O(n). In the average case it is O(n^a), where 1 < a <= 2. The proposed algorithm is now analyzed statement by statement, giving the cost and running time of each statement, as shown in Figure 4.2 below.

Figure 4.2: Running time of proposed algorithm

As such, the running time of this algorithm is the sum of the running times of all statements, i.e.,

T(n) = (m1 + m2 + m3 + m4 + m5 + m6) + m7.n + (m8 + m10)(n - 1) + m9.(sum of k) + (m11 + m12).r + (m13 + m14 + m15 + m16 + m17)(n - 1 - r)

where r is the number of objects that join an existing cluster and the sum of k runs over the current number of clusters at each of the n - 1 executions of step 9. In the worst case, the value of k increases by one at every iteration, so the sum of k equals 1 + 2 + 3 + ... + (n - 1) = n(n - 1)/2 and the running time is O(n^2). In the best case, the value of k always remains equal to one, so the sum equals 1 + 1 + ... + 1 = n - 1 and the running time is O(n). In the average case it is O(n^a), where 1 < a <= 2. As such, it is clear from this analysis that the proposed algorithm is faster than the k-Means algorithm in the worst case. The gain is smaller in the best and average cases, but the proposed algorithm outperforms k-Means in these cases as well.
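The worst-case term behind the O(n^2) bound can be checked numerically. This small sketch (function name ours) only verifies the arithmetic 1 + 2 + ... + (n - 1) = n(n - 1)/2 for the centroid comparisons of step 9 when every object founds a new cluster:

```python
def worst_case_comparisons(n):
    """Step 9 comparisons when every object founds a new cluster:
    object l (for l = 2..n) is compared against l - 1 centroids."""
    return sum(l - 1 for l in range(2, n + 1))

# The closed form n(n - 1)/2 grows quadratically, hence the O(n^2) worst case.
```

For example, worst_case_comparisons(12) gives 66, which equals 12 * 11 / 2.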

V. PERFORMANCE EVALUATION

This section compares the efficiency of the k-Means algorithm and the proposed clustering algorithm. It is worth mentioning that the k-Means clustering technique depends on the number of clusters, while the new clustering technique depends on

Algorithm: Proposed clustering algorithm (D, Tth)

Sr.No.  Statement                                                      cost   times
1.      let k = 1                                                      m1     1
2.      Randomly choose any object from D; let it be p; k1 = {p}       m2     1
3.      K = {k1}                                                       m3     1
4.      c1 = p                                                         m4     1
5.      C = {c1}                                                       m5     1
6.      Assign a constant value to Tth                                 m6     1
7.      for l = 2 to n do                                              m7     n
8.      Choose the next random object from D, other than the
        already chosen points; let it be q                             m8     n - 1
9.      Determine the distance d between q and the centroid cm
        in C for which the distance is minimum, using (1)              m9     sum of k
10.     if d <= Tth then                                               m10    n - 1
11.     km = km U {q}                                                  m11    r
12.     Calculate the new mean (centroid cm) for cluster km
        using (2)                                                      m12    r
13.     else k = k + 1                                                 m13    n - 1 - r
14.     kk = {q}                                                       m14    n - 1 - r
15.     K = K U {kk}                                                   m15    n - 1 - r
16.     ck = q                                                         m16    n - 1 - r
17.     C = C U {ck}                                                   m17    n - 1 - r


the threshold value in order to find the number of clusters. In this section, an analysis of these two algorithms has been carried out, based on their performance on data of mutual fund activity in the Indian stock market in the year 2001. The data for the analysis is taken from http://www.moneycontrol.com/india/stockmarket/foreigninstitutionalinvestors/13/00/activity/MF in terms of the net flow of investment in every month of that year; here a negative value indicates that investors pulled back their investment during that month. The data is shown in Table 5.1.

Table 5.1: Average monthly investment by foreign institutional investors in Indian stock market during 2001 in mutual funds.

Sr.No.  Month  Rs. (Crore)
d1      Jan    -924.68
d2      Feb    -1,181.11
d3      Mar    -357.18
d4      Apr    -296.74
d5      May    -475.63
d6      Jun    -108.82
d7      Jul    -446.27
d8      Aug    -373.22
d9      Sep    108.69
d10     Oct    -677.10
d11     Nov    -343.16
d12     Dec    63.92

It can be noted that both techniques involve the computation of centroids, and those centroids are used to cluster the data objects. In k-Means, the number of clusters (k) is known in advance, while in the proposed clustering algorithm the threshold value (Tth) should be known in advance. In the k-Means algorithm, for the initial k clusters we take the first k data objects as the centroids of those clusters, whereas in the proposed clustering algorithm data objects are picked randomly. We have analysed the proposed clustering algorithm on the random sequence 5, 6, 2, 4, 7, 9, 11, 8, 12, 10, 1, 3 for the objects shown in Table 5.1. Let us define the quality of a cluster as the average distance of all the data objects within the cluster from the centroid of that cluster, and the quality of clustering as the average quality of all clusters. We have compared the quality of clustering obtained by the k-Means algorithm and by the proposed algorithm. The results of this comparison are given in Table 5.2.
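Under the stated setup (the Table 5.1 values, the random sequence 5, 6, 2, 4, 7, 9, 11, 8, 12, 10, 1, 3, and Tth = 450 for the two-cluster case), the experiment can be reproduced approximately in Python; small rounding differences from the reported quality of 157.71 are to be expected, since the paper's intermediate rounding is unknown.

```python
# Table 5.1 (Rs. crore), keyed by object index d1..d12.
data = {1: -924.68, 2: -1181.11, 3: -357.18, 4: -296.74, 5: -475.63,
        6: -108.82, 7: -446.27, 8: -373.22, 9: 108.69, 10: -677.10,
        11: -343.16, 12: 63.92}
order = [5, 6, 2, 4, 7, 9, 11, 8, 12, 10, 1, 3]   # random sequence from the text
T_TH = 450                                        # threshold for two clusters

clusters = [[data[order[0]]]]
centroids = [data[order[0]]]
for i in order[1:]:
    q = data[i]
    # Nearest existing centroid; join it if within the threshold, else
    # start a new cluster (steps 9-17 of the proposed algorithm).
    m = min(range(len(centroids)), key=lambda j: abs(q - centroids[j]))
    if abs(q - centroids[m]) <= T_TH:
        clusters[m].append(q)
        centroids[m] = sum(clusters[m]) / len(clusters[m])
    else:
        clusters.append([q])
        centroids.append(q)

# Quality of a cluster: average distance of its objects from its centroid;
# quality of clustering: average quality over all clusters (lower is better).
quality = sum(sum(abs(x - c) for x in cl) / len(cl)
              for cl, c in zip(clusters, centroids)) / len(clusters)
```

With these inputs the sketch produces two clusters, and the quality of clustering comes out within a fraction of a unit of the value reported in Table 5.2.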

From Table 5.2 it can be noted that when two clusters are generated, the quality of clustering from the k-Means algorithm is 174.39, whereas from the proposed clustering algorithm it is 157.71. As such, an improvement of 9.56% is achieved.

Table 5.2: Comparison of quality of clustering in the k-Means algorithm and the proposed algorithm

Number of  Threshold (Tth) in     Quality of clustering      Improvement
clusters   proposed algorithm     k-Means    Proposed        in percentage
2          450                    174.39     157.71          9.56
3          400                    101.04     87.28           13.61
4          290                    80.66      80.66           0.0
5          240                    41.37      39.75           3.91
6          215                    32.41      22.40           30.88
7          175                    16.15      8.73            45.94
8          50                     13.69      7.66            44.04
9          40                     5.02       2.92            41.83
10         29.5                   3.05       2.27            25.57
11         25                     2.04       0.73            64.21

From this table, it is also clear that the proposed clustering algorithm exhibits an improvement for all values of k except in the case when four clusters are generated.

VI. CONCLUSION

This paper presents the results of an experimental study of the k-Means algorithm and a new clustering algorithm. The k-Means technique depends on the number of clusters, while the


proposed clustering technique depends on the threshold value. The proposed clustering algorithm requires no prior information about the number of clusters to be generated. It has been shown that the quality of clustering is improved when one uses the proposed clustering algorithm instead of the k-Means algorithm. The proposed clustering algorithm is simple and more efficient than the k-Means algorithm.

REFERENCES

[1] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, New York, 1973, pp. 162-163.
[2] Bottou, L., and Bengio, Y., Convergence properties of the k-Means algorithm, The MIT Press, Cambridge, MA, 1995, pp. 585-592.
[3] Bradley, P.S., and Fayyad, U.M., Refining initial points for k-Means clustering, Proc. of the 15th International Conference on Machine Learning (ICML98), J. Shavlik (ed.), Morgan Kaufmann, San Francisco, 1998, pp. 91-99.
[4] Deelers, S., and Auwatanamongkol, S., Enhancing k-Means algorithm with initial cluster centers derived from data partitioning along the data axis with the highest variance, PWASET, vol. 26, 2007, pp. 323-328.
[5] Duda, R.O., and Hart, P.E., Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.
[6] Fahim, A.M., Salem, A.M., Torkey, F.A., and Ramadan, M., An efficient enhanced k-Means clustering algorithm, Journal of Zhejiang University Science A, vol. 7(10), 2006, pp. 1626-1633.
[7] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[8] Hautamaeki, V., Cherednichenko, S., Kaerkkaeinen, I., Kinnunen, T., and Fraenti, P., Improving k-Means by outlier removal, SCIA 2005, LNCS 3540, 2005, pp. 978-987.
[9] Jain, A.K., and Dubes, R.C., Algorithms for Clustering Data, Prentice-Hall Inc., 1988.
[10] Kaufman, L., and Rousseeuw, P.J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[11] MacQueen, J.B., Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, Berkeley, CA, vol. 1, 1967, pp. 281-297.
[12] Merz, P., An iterated local search approach for minimum sum-of-squares clustering, IDA 2003, 2003, pp. 286-296.
[13] Pham, D.T., Dimov, S.S., and Nguyen, C.D., Selection of k in k-Means clustering, Mechanical Engineering Science, vol. 219, 2004, pp. 103-119.
[14] Ray, S., and Turi, R.H., Determination of number of clusters in k-Means clustering and application in colour image segmentation, In Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999, pp. 137-143.