Proceedings of the 2011 International Conference on Image Information Processing (ICIIP 2011)
978-1-61284-861-7/11/$26.00 ©2011 IEEE
Random Automatic Detection of Clusters
Mamta Mittal, V. P. Singh
Computer Science & Engineering Department, Thapar University, Patiala, INDIA
R. K. Sharma*
School of Mathematics and Computer Applications, Thapar University, Patiala
Email: [email protected]
Abstract—Clustering is a way to partition a database into various groups, and it is used in data mining on a very large scale. There are different clustering methods, but the focus of this paper is on partitioning based clustering. In the literature, many algorithms, including k-Means, require prior information from the outside world about the number of clusters into which the database is to be divided. Nowadays, however, a database requires algorithms that can generate clusters automatically; moreover, at each run the database may need to be partitioned into a different number of clusters as well as into groupings of different shapes and sizes. In this paper, a new partitioning based clustering algorithm is proposed that can generate clusters automatically, without any previous knowledge on the user's side. The clusters so generated may differ not only in number but also in shape and size.
Keywords--- KDD, data mining, partitioning based clustering
I. INTRODUCTION
These days, organizations are not using their data repository systems effectively for making profits. The volume of data in their repositories is so large that it is difficult to mine, and human eyes cannot visualize the relevant information hidden in these databases. It is very desirable for a company to analyze its databases to increase its profitability. To achieve this, companies must employ some procedure to mine their databases, and for this every organization has to go through the knowledge discovery process [7] shown in Figure 1.1. Data mining [5] is one of the steps in this process. Data mining extracts or predicts valuable information from large databases. It is a powerful technology with great potential to help companies and data analysts focus on the most important information in their data repositories. Data mining is a way to analyze data and to classify it.

Data mining tools predict future trends and behaviors, allowing businesses to make profitable and knowledge-driven decisions. They can answer business questions that traditionally were time consuming to resolve, searching databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations.
Clustering is the process of grouping the objects of a database into meaningful subclasses such that the similarity of objects within the same group is maximized and the similarity of objects in different groups is minimized. Clustering is one of the analysis tools of data mining, and it has been a widely studied problem in knowledge discovery, pattern recognition and pattern classification [7]. Clustering determines the relationships between data objects in the database. It can be achieved using different methods, namely partitioning methods, hierarchical methods, density based methods, grid-based methods, probabilistic methods, graph theoretical methods and fuzzy techniques. This paper focuses on partitioning methods [9, 10] and proposes a new clustering algorithm for automatic discovery of data clusters. In partitioning based clustering, one faces the question of whether prior information about the number of clusters into which the database is to be divided is available or not [13, 14]. Depending on this, partitioning based clustering can be further classified into two types, namely supervised and unsupervised clustering. In supervised clustering the number of clusters is known in advance, but in unsupervised clustering this prior information is unknown. Finding a known number of clusters is the common approach in partitioning based clustering. In this paper, a new partitioning based unsupervised clustering algorithm is proposed. Unlike supervised clustering, this algorithm does not require the number of clusters in advance.
Figure 1.1: Knowledge Discovery Process
The rest of this paper is organized as follows. The next section reviews the k-Means clustering algorithm. In Section III, the new partitioning based clustering algorithm is proposed. In Section IV it is shown that the proposed algorithm is faster than k-Means, and in Section V a numerical example is given to illustrate the proposed algorithm.
II. RELATED WORK
Clustering has been an active research area, and most of the research has focused on the effectiveness and scalability of clustering algorithms. Many clustering algorithms exist in the literature. Partitioning based algorithms, which are widely used in many applications, are the major focus of this paper. One commonly used partitioning based clustering algorithm is k-Means [11], where k is a known value (the number of clusters). The k-Means algorithm clusters a dataset of size n into k clusters in such a way that each data object d_i (1 ≤ i ≤ n) belongs to exactly one cluster k_j (1 ≤ j ≤ k) and each cluster k_j has at least one data object. Each cluster has its own representative, known as the centroid of the cluster. This algorithm is simple and easy to implement. However, there is scope for improvement. In this algorithm, the number of final clusters (k) needs to be determined beforehand, and it is sensitive to outliers [8], because a small number of outliers can substantially influence a centroid value. Outliers also degrade the quality of a good cluster, i.e., a cluster with a highly dense area. Moreover, the number of iterations needed to produce the final clusters is not known in advance, which is undesirable. Furthermore, k-Means finds a local optimum and may actually miss the global optimum [3].
A lot of effort has been made to improve the scalability of the k-Means algorithm, so that it can be applied to databases of different sizes in an effective (producing good quality clusters) and efficient (producing clusters in less time) manner. In the real world, there is a restriction on the user's knowledge of the number of clusters to be generated, especially when clustering is being done for the first time on a repository system. Once the number of clusters is known, clusters can easily be generated from the repository next time. This paper provides a methodology for creating data clusters automatically, without knowing the number of clusters in advance, that is scalable, effective and efficient as well. The algorithm neither suffers from the problem of local optima nor is it sensitive to outliers, and it simultaneously improves the chance of finding the global optimum. It takes a known number of iterations to generate the final clusters. Clustering can be done at a faster rate when the proposed algorithm is applied. This algorithm may serve as a strong candidate among partitioning based clustering algorithms.
A. Clustering Analysis of k-Means Algorithm
An important issue in clustering is how to determine the similarity between two objects, so that clusters can be formed from the objects of a database. Commonly, distance is used as the measure of similarity between two objects; two objects near to each other can be clustered into a single group. The distance measure also extends to objects in more than two dimensions. For clustering analysis, the Euclidean distance formula [12] is used here because it applies to multidimensional data objects as well. The distance between object i and object j is denoted by d(i, j) and defined as

d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}    (1)

where (x_{i1}, x_{i2}, ..., x_{in}) and (x_{j1}, x_{j2}, ..., x_{jn}) are two n-dimensional data objects.

Suppose that a given set D = {d_1, d_2, ..., d_n} of n objects in an n-dimensional space has to be partitioned into k clusters, with cluster set K = {k_1, k_2, ..., k_k}. Each cluster has its own representative, given by the centroid set C = {c_1, c_2, ..., c_k} corresponding to the clusters k_1, k_2, ..., k_k respectively. Centroid c_j can be calculated as

c_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}    (2)

where m is the number of data objects in cluster k_j and x_{ij} is the i-th object belonging to cluster k_j.
One can refer to [11] for further details of the k-Means algorithm.
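As a concrete illustration, equations (1) and (2) can be implemented in a few lines of Python. This is a minimal sketch with function names of our choosing (they are reused in the later sketches), not code from the paper.

```python
import math

def euclidean_distance(x_i, x_j):
    """Distance d(i, j) between two n-dimensional data objects, as in (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def centroid(cluster):
    """Component-wise mean of the m data objects in a cluster, as in (2)."""
    m = len(cluster)
    return tuple(sum(obj[t] for obj in cluster) / m for t in range(len(cluster[0])))

# A quick check with two 2-dimensional objects:
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(centroid([(0.0, 0.0), (3.0, 4.0)]))          # (1.5, 2.0)
```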
III. PROPOSED ALGORITHM
Some improvements over the k-Means algorithm have
been proposed in the present approach. As mentioned earlier,
the algorithm may improve the chances of finding the global
optimum. Here, a threshold value is taken to produce number
of clusters. The proposed clustering algorithm consists of the
following steps.
a) Assign a random object to the first cluster.

b) Consider the next random data object. This object is assigned either to one of the existing clusters or to a new cluster. The assignment is made on a distance criterion, namely the distance between the new object and the centroids of the existing clusters. The centroid of an existing cluster is recomputed after every addition of a new object to it.

c) Repeat step b) until all the data objects are clustered.
Figure 3.1 presents the proposed clustering algorithm. In this algorithm, each object is iteratively merged into the existing cluster that is closest to it. A threshold value, Tth, is used to determine whether an object is added to an existing cluster or a new cluster is created.
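The steps above admit a compact Python sketch, given below. This is our illustrative reading of the algorithm, not the authors' implementation: euclidean_distance and centroid come from the earlier sketch, t_th is the threshold of step b), and the optional order argument lets a caller supply a fixed random sequence such as the one used in Section V.

```python
import random

def proposed_clustering(objects, t_th, order=None):
    """Single-pass threshold clustering per steps a)-c): each object joins
    the nearest existing cluster if its distance to that cluster's centroid
    is at most t_th; otherwise it starts a new cluster."""
    if order is None:
        order = list(objects)
        random.shuffle(order)                     # steps a)/b): random object order
    clusters = [[order[0]]]                       # the first cluster gets the first object
    centroids = [order[0]]
    for q in order[1:]:                           # step c): process every remaining object
        dists = [euclidean_distance(q, c) for c in centroids]
        m = min(range(len(dists)), key=dists.__getitem__)
        if dists[m] <= t_th:                      # join the nearest cluster ...
            clusters[m].append(q)
            centroids[m] = centroid(clusters[m])  # ... and recompute its centroid
        else:                                     # ... or open a new cluster
            clusters.append([q])
            centroids.append(q)
    return clusters, centroids
```

Note that, unlike k-Means, each object is visited exactly once, so the number of iterations is known in advance, as claimed in Section II.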
IV. ALGORITHM ANALYSIS OF PROPOSED ALGORITHM

The algorithm proposed in this paper is analyzed for its time complexity in this section. The time taken by the algorithm depends on the size of the input given to it. It is worth mentioning here that k-Means and the proposed clustering algorithm take different amounts of time to cluster the same data objects. We now analyse the k-Means algorithm and the proposed algorithm for their time complexities. Figure 4.1 describes the k-Means algorithm and its running time.
Figure 3.1: Proposed clustering algorithm

// Input: A set D = {d1, d2, d3, ..., dn} of n objects to cluster and a threshold value Tth.
// Output: A set K = {k1, k2, k3, ..., kk} of k subsets of D as the final clusters, and a set C = {c1, c2, c3, ..., ck} of the centroids of these clusters.

Algorithm: Proposed clustering algorithm (D, Tth)
1.  let k = 1
2.  Randomly choose any object from D; let it be p; k1 = {p}
3.  K = {k1}
4.  c1 = p
5.  C = {c1}
6.  Assign a constant value to Tth
7.  for l = 2 to n do
8.      Choose the next random object from D, other than the already chosen points; let it be q.
9.      Determine the distance d between q and the centroid c_m (1 ≤ m ≤ k) in C such that the distance is minimum, using (1).
10.     if d ≤ Tth then
11.         k_m = k_m ∪ {q}
12.         Calculate the new mean (centroid c_m) for cluster k_m, using (2).
13.     else k = k + 1
14.         k_k = {q}
15.         K = K ∪ {k_k}
16.         c_k = q
17.         C = C ∪ {c_k}
Algorithm: k-Means (D, k, K, C)

Sr. No.  Statement                                                    Cost  Times
1.       repeat until no change in the centroids                      m1    l (l ≥ 1)
2.           for i = 1 to n do                                        m2    l(n + 1)
3.               Determine the distance d(d_i, c_j) between d_i and
                 each centroid c_j (1 ≤ j ≤ k) in C such that
                 d(d_i, c_j) is minimum, using (1).                   m3    lnk
4.               Assign d_i to cluster k_j.                           m4    ln
5.           Calculate the new mean (centroid) for each cluster
             k_j (1 ≤ j ≤ k), using (2).                              m5    lk

Here l denotes the number of iterations of the repeat loop.
Figure 4.1: Running time of k-Means algorithm
The running time of this algorithm is the sum of the running times of all the statements executed, i.e.,

T(n) = m_1 l + m_2 \sum_{t=1}^{l} (n + 1) + m_3 \sum_{t=1}^{l} \sum_{i=1}^{n} k + m_4 \sum_{t=1}^{l} \sum_{i=1}^{n} 1 + m_5 \sum_{t=1}^{l} k = m_1 l + m_2 l(n + 1) + m_3 lnk + m_4 ln + m_5 lk.

In the worst case, when k and l grow with n, the running time is O(n^c) where 2 ≤ c ≤ 3. In the best case, k = 1 and l = 1, and the running time is O(n). In the average case it is O(n^c) where 1 ≤ c ≤ 2.
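For comparison, here is a minimal Python sketch of the k-Means loop analyzed in Figure 4.1, reusing the euclidean_distance and centroid helpers from Section II and taking the first k objects as initial centroids, as done in Section V. This is the textbook algorithm [11], not the authors' exact implementation.

```python
def k_means(objects, k):
    """Standard k-Means: assignment (statements 2-4) followed by centroid
    update (statement 5), repeated until no centroid changes (statement 1)."""
    cents = list(objects[:k])                     # first k objects as initial centroids
    while True:                                   # statement 1: runs for l iterations
        clusters = [[] for _ in range(k)]
        for obj in objects:                       # statement 2: each object ...
            j = min(range(k),                     # statement 3: ... finds its nearest
                    key=lambda t: euclidean_distance(obj, cents[t]))  # centroid
            clusters[j].append(obj)               # statement 4: ... and is assigned to it
        new_cents = [centroid(c) if c else cents[j]    # statement 5: recompute centroids;
                     for j, c in enumerate(clusters)]  # an empty cluster keeps its old one
        if new_cents == cents:                    # no change in the centroids: stop
            return clusters, cents
        cents = new_cents
```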
Figure 4.2: Running time of proposed algorithm

Algorithm: Proposed clustering algorithm (D, Tth)

Sr. No.  Statement                                                    Cost  Time
1.       let k = 1                                                    m1    1
2.       Randomly choose any object from D; let it be p; k1 = {p}     m2    1
3.       K = {k1}                                                     m3    1
4.       c1 = p                                                       m4    1
5.       C = {c1}                                                     m5    1
6.       Assign a constant value to Tth                               m6    1
7.       for l = 2 to n do                                            m7    n
8.           Choose the next random object from D, other than the
             already chosen points; let it be q.                      m8    n − 1
9.           Determine the distance d between q and the centroid
             c_m (1 ≤ m ≤ k) in C such that the distance is
             minimum, using (1).                                      m9    \sum_{l=2}^{n} k_l
10.          if d ≤ Tth then                                          m10   n − 1
11.              k_m = k_m ∪ {q}                                      m11   r
12.              Calculate the new mean (centroid c_m) for
                 cluster k_m, using (2).                              m12   r
13.          else k = k + 1                                           m13   n − 1 − r
14.              k_k = {q}                                            m14   n − 1 − r
15.              K = K ∪ {k_k}                                        m15   n − 1 − r
16.              c_k = q                                              m16   n − 1 − r
17.              C = C ∪ {c_k}                                        m17   n − 1 − r

Here k_l denotes the number of clusters in existence when the l-th object is processed, and r denotes the number of objects that are merged into existing clusters.
As such, the running time of this algorithm is

T(n) = (m_1 + m_2 + m_3 + m_4 + m_5 + m_6) + m_7 n + (m_8 + m_{10})(n − 1) + m_9 \sum_{l=2}^{n} k_l + (m_{11} + m_{12}) r + (m_{13} + m_{14} + m_{15} + m_{16} + m_{17})(n − 1 − r).

In the worst case, the value of k increases by one at every step, so \sum_{l=2}^{n} k_l equals 1 + 2 + 3 + ... + (n − 1) = n(n − 1)/2 and the running time is O(n^2). In the best case, the value of k always remains equal to one, so \sum_{l=2}^{n} k_l equals 1 + 1 + ... + 1 = n − 1 and the running time is O(n). In the average case it is O(n^c) where 1 ≤ c ≤ 2. As such, it is clear from this analysis that the proposed algorithm is faster than the k-Means algorithm in the worst case. The gain is smaller in the best and average cases, but the proposed algorithm outperforms k-Means in these cases as well.
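The two extreme cases can be made concrete by counting the centroid-distance computations of statement 9: the l-th object is compared against the k_l centroids existing at that moment. The counter below is a hypothetical illustration, not part of the paper; it contrasts the worst case, where every object opens a new cluster, with the best case, where all objects fall into one cluster.

```python
def distance_count(n, new_cluster_every_time):
    """Number of centroid-distance computations performed by the proposed
    algorithm on n objects, under the two extreme clustering behaviours."""
    count, k = 0, 1
    for _ in range(2, n + 1):         # objects 2..n, as in Figure 3.1
        count += k                    # statement 9 compares against k centroids
        if new_cluster_every_time:    # worst case: k grows by one per object
            k += 1
    return count

print(distance_count(12, True))   # 66 = 11*12/2: quadratic growth, O(n^2)
print(distance_count(12, False))  # 11 = n - 1:   linear growth, O(n)
```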
V. PERFORMANCE EVALUATION
This section compares the efficiency of the k-Means algorithm and the proposed clustering algorithm. It is worth mentioning that the k-Means clustering technique depends on the number of clusters, while the new clustering technique depends on the threshold value in order to find the number of clusters. In this section, an analysis of these algorithms has been carried out. The analysis is based on the performance of the two algorithms on data of mutual fund activity in the Indian stock market in the year 2001. The data for the analysis is taken from http://www.moneycontrol.com/india/stockmarket/foreigninstitutionalinvestors/13/00/activity/MF in terms of the net flow of investment in every month of that year; a negative value indicates that investors pulled back their investment during that month. The data is shown in Table 5.1.
Table 5.1: Average monthly investment by foreign institutional investors in the Indian stock market during 2001 in mutual funds.

Sr. No.   Month   Rs. (Crore)
d1        Jan       -924.68
d2        Feb     -1,181.11
d3        Mar       -357.18
d4        Apr       -296.74
d5        May       -475.63
d6        Jun       -108.82
d7        Jul       -446.27
d8        Aug       -373.22
d9        Sep        108.69
d10       Oct       -677.10
d11       Nov       -343.16
d12       Dec         63.92
It can be noted that both techniques involve the computation of centroids, and those centroids are used to cluster the data objects. In k-Means, the number of clusters (k) is known in advance, and in the proposed clustering algorithm the threshold value (Tth) should be known in advance. In the k-Means algorithm, for the initial k clusters we take the first k data objects as the centroids of those clusters, whereas in the proposed clustering algorithm the data objects are picked randomly. We have analysed the proposed clustering algorithm on the random sequence 5, 6, 2, 4, 7, 9, 11, 8, 12, 10, 1, 3 of indices of the objects shown in Table 5.1. Let us define the quality of a cluster as the average distance from the cluster's centroid of all the data objects within that cluster, and the quality of a clustering as the average quality of all its clusters. We have compared the quality of the clusterings obtained by the k-Means algorithm and by the proposed algorithm. The results of this comparison are given in Table 5.2.
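Reading the random sequence as 1-based indices into Table 5.1, the quality metric just defined can be reproduced with the sketch below, reusing the helpers from the earlier sketches. With Tth = 450 it produces two clusters and a clustering quality close to the 157.71 reported in Table 5.2; small differences are attributable to rounding.

```python
# Table 5.1 values as one-dimensional data objects (net monthly flow, Rs. crore).
data = [(-924.68,), (-1181.11,), (-357.18,), (-296.74,), (-475.63,), (-108.82,),
        (-446.27,), (-373.22,), (108.69,), (-677.10,), (-343.16,), (63.92,)]
# The random sequence from the text, as 1-based indices into Table 5.1.
order = [data[i - 1] for i in (5, 6, 2, 4, 7, 9, 11, 8, 12, 10, 1, 3)]

def cluster_quality(cluster, cent):
    """Average distance of a cluster's objects from its centroid."""
    return sum(euclidean_distance(obj, cent) for obj in cluster) / len(cluster)

def clustering_quality(clusters, cents):
    """Average quality over all clusters (lower is better)."""
    return sum(cluster_quality(c, ct) for c, ct in zip(clusters, cents)) / len(clusters)

clusters, cents = proposed_clustering(data, t_th=450, order=order)
print(len(clusters), round(clustering_quality(clusters, cents), 2))  # 2 clusters, ~157.7
```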
From Table 5.2 it can be noted that when two clusters are generated, the quality of clustering from the k-Means algorithm is 174.39, whereas from the proposed clustering algorithm it is 157.71. As such, an improvement of (174.39 − 157.71)/174.39 × 100 ≈ 9.56% is achieved.
Table 5.2: Comparison of quality of clustering in the k-Means algorithm and the proposed algorithm

Number of   Threshold value in the    Quality of clustering       Improvement
clusters    proposed clustering       k-Means    Proposed         in percentage
            algorithm                            algorithm
2           450                       174.39     157.71            9.56
3           400                       101.04      87.28           13.61
4           290                        80.66      80.66            0.00
5           240                        41.37      39.75            3.91
6           215                        32.41      22.40           30.88
7           175                        16.15       8.73           45.94
8            50                        13.69       7.66           44.04
9            40                         5.02       2.92           41.83
10           29.5                       3.05       2.27           25.57
11           25                         2.04       0.73           64.21
From this table, it is also evident that the proposed clustering algorithm exhibits an improvement for all values of k except the case when four clusters are generated.
VI. CONCLUSION
This paper presents the results of an experimental study of the k-Means algorithm and a new clustering algorithm. The k-Means technique depends on the number of clusters, while the proposed clustering technique depends on the threshold value. The proposed clustering algorithm requires no prior information about the number of clusters to be generated. It has been shown that the quality of clustering improves when one uses the proposed clustering algorithm instead of the k-Means algorithm. The proposed clustering algorithm is simple and is more efficient than the k-Means algorithm.
REFERENCES

[1] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, New York, 1973, pp. 162-163.
[2] Bottou, L., and Bengio, Y., Convergence properties of the k-Means algorithm, The MIT Press, Cambridge, MA, 1995, pp. 585-592.
[3] Bradley, P.S., and Fayyad, U.M., Refining initial points for k-Means clustering, Proc. of the 15th International Conference on Machine Learning (ICML98), J. Shavlik (ed.), Morgan Kaufmann, San Francisco, 1998, pp. 91-99.
[4] Deelers, S., and Auwatanamongkol, S., Enhancing k-Means algorithm with initial cluster centers derived from data partitioning along the data axis with the highest variance, PWASET, vol. 26, 2007, pp. 323-328.
[5] Duda, R.O., and Hart, P.E., Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.
[6] Fahim, A.M., Salem, A.M., Torkey, F.A., and Ramadan, M., An efficient enhanced k-Means clustering algorithm, Journal of Zhejiang University Science A, vol. 7(10), 2006, pp. 1626-1633.
[7] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[8] Hautamaeki, V., Cherednichenko, S., Kaerkkaeinen, I., Kinnunen, T., and Fraenti, P., Improving k-Means by outlier removal, SCIA 2005, LNCS 3540, 2005, pp. 978-987.
[9] Jain, A.K., and Dubes, R.C., Algorithms for Clustering Data, Prentice-Hall Inc., 1988.
[10] Kaufman, L., and Rousseeuw, P.J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[11] MacQueen, J.B., Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, Berkeley, CA, vol. 1, 1967, pp. 281-297.
[12] Merz, P., An iterated local search approach for minimum sum-of-squares clustering, IDA 2003, 2003, pp. 286-296.
[13] Pham, D.T., Dimov, S.S., and Nguyen, C.D., Selection of k in k-Means clustering, Mechanical Engineering Science, vol. 219, 2004, pp. 103-119.
[14] Ray, S., and Turi, R.H., Determination of number of clusters in k-Means clustering and application in color image segmentation, Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999, pp. 137-143.