© 2013, IJARCSSE All Rights Reserved Page | 866
Volume 3, Issue 12, December 2013 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com
Comparison of the Various Clustering and Classification Algorithms of WEKA Tools
Sonam Narwal, M.Tech (S/W Engg.) Scholar, UIET, M.D.U., Rohtak, Haryana
Mr. Kamaldeep Mintwal, Assistant Professor, UIET, M.D.U., Rohtak, Haryana
Abstract - Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data
pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from
huge data banks. The development of data-mining applications such as classification and clustering has shown the need
for machine learning algorithms to be applied to large scale data. In this paper we present the comparison of different
classification and clustering techniques using Waikato Environment for Knowledge Analysis or in short, WEKA. The
algorithms tested are the DBSCAN, EM and k-means clustering algorithms, and the J48, ID3 and Bayes network classifier classification algorithms.
Keywords — Machine Learning, Data Mining, WEKA, Classification, Clustering.
I. INTRODUCTION
Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data
items. Data mining often involves the analysis of data stored in a data warehouse. Three of the major data mining techniques
are regression, classification and clustering. In this research paper we work with clustering and classification, since these are the most important processes when dealing with a very large database. The WEKA tool is used for both clustering and classification.
Clustering is an initial and fundamental step in data analysis. It is an unsupervised classification of patterns into groups, or clusters. Intuitively, patterns within a valid cluster are more similar to each other than to patterns belonging to other clusters [1]. Clustering is useful in several fields such as pattern analysis, machine learning and pattern classification, among many others.
Clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more
similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of exploratory data
mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, and bioinformatics.
II. WEKA
WEKA is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques
[17]and their application to real-world data mining problems. It is a collection of machine learning algorithms for data
mining tasks. The algorithms are applied directly to a dataset.
Fig. 1 Front View of WEKA Tools
WEKA implements algorithms for data preprocessing, classification, regression, clustering and association rules; it also includes visualization tools. New machine learning schemes can also be developed with this package.
WEKA is open-source software issued under the GNU General Public License [10]. The data file normally used by WEKA is in the ARFF file format, which consists of special tags to indicate different things in the data file (foremost: attribute names, attribute types, attribute values and the data). The main interface in WEKA is the Explorer. It has a set of panels, each of which can be used to perform a certain task. Once a dataset has been loaded, one of the other panels in the Explorer can be used to perform further analysis. Simulation is done using the clustering and classification algorithms of WEKA to compare the algorithms; for this purpose we take data from the BRO (Border Roads Organisation's GREF Centre) medical data repositories. Working with WEKA does not require deep knowledge of data mining, which is one reason it is such a popular data mining tool. WEKA also provides a graphical user interface to the user, along with many other facilities [3,6].
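As a small illustration of the ARFF format described above (with hypothetical attribute names and values, not taken from the BRO data set), a minimal file might look like this:

```
@relation patient-data

@attribute age numeric
@attribute sex {male,female}
@attribute diagnosis {positive,negative}

@data
45,male,positive
32,female,negative
```

The `@attribute` tags declare the attribute names and types (numeric, or a set of nominal values), and the `@data` section holds one comma-separated instance per line.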
The GUI Chooser consists of four buttons, one for each of the four major WEKA applications, and four menus. The buttons
can be used to start the following applications:
Explorer: An environment for exploring data with WEKA.
Experimenter: An environment for performing experiments and conducting statistical tests between learning
schemes.
Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a drag-and-
drop interface. One advantage is that it supports incremental learning.
Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands for
operating systems that do not provide their own command line interface.
III. METHODOLOGY
The methodology is very simple. The past project data has been taken from the repositories and loaded into WEKA. In WEKA, different clustering and classification algorithms are applied, and useful results are predicted that will be very helpful for new users and new researchers.
IV. PERFORMING CLUSTERING IN WEKA
To perform cluster analysis in WEKA, the data set is loaded into WEKA as shown in the figure. WEKA requires the data set to be in CSV or ARFF file format; if the data set is not in ARFF format, it needs to be converted.
Fig. 2 Load Data Set into WEKA
After that we have the many options shown in the figure. Since we are performing clustering [9], we click on the Cluster button. We then choose which algorithm to apply to the data, as shown in figure 3, and click the OK button.
Fig. 3 Various clustering algorithms in WEKA.
V. DBSCAN CLUSTERING ALGORITHM
DBSCAN (density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN [3] is one of the most common clustering algorithms and also one of the most cited in the scientific literature.
OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius. The analysis of DBSCAN [11] in WEKA is shown in the figure.
Fig. 4 DBSCAN Algorithm
Advantages
1. DBSCAN does not require you to know the number of clusters in the data a priori, as opposed to k-means.
2. DBSCAN can find arbitrarily shaped clusters. It can even find clusters completely surrounded by (but not connected
to) a different cluster. Due to the MinPts parameter, the so-called single-link [13] effect (different clusters being
connected by a thin line of points) is reduced.
3. DBSCAN has a notion of noise.
4. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (Only points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
Disadvantages
1. The quality of the clustering DBSCAN produces [7] is only as good as its distance measure in the function regionQuery(P, ε). The most common distance metric used is the Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it hard to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on the Euclidean distance.
2. DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.
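To make the procedure concrete, here is a minimal sketch of DBSCAN in Python (illustrative only, not WEKA's implementation; the function names and the brute-force neighbourhood search are our own simplifications):

```python
import math

def region_query(points, p_idx, eps):
    """Return indices of all points within distance eps of points[p_idx]."""
    return [i for i, q in enumerate(points)
            if math.dist(points[p_idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...); noise stays None."""
    labels = [None] * len(points)
    visited = [False] * len(points)
    cluster = -1
    for i in range(len(points)):
        if visited[i]:
            continue
        visited[i] = True
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            continue  # not a core point; stays noise unless a cluster claims it
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbours = region_query(points, j, eps)
                if len(j_neighbours) >= min_pts:
                    seeds.extend(j_neighbours)  # j is also a core point
            if labels[j] is None:
                labels[j] = cluster
    return labels
```

With ε = 2 and minPts = 3, two dense groups of points receive two different labels while an isolated point is left as None (noise), matching the advantages listed above: the number of clusters is not given in advance, and noise is recognised explicitly.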
The result of DBSCAN is shown in the form of a graph:
Fig. 5 Result of the DBSCAN algorithm
VI. EM ALGORITHM
The EM algorithm [2] is also an important data mining algorithm. It is used when we are not satisfied with the results of the k-means method. An expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM [10] iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. The result of the cluster analysis is written to a band named class indices. The values in this band indicate the class indices, where a value of '0' refers to the first cluster, a value of '1' refers to the second cluster, etc.
The class indices are sorted according to the prior probability associated with each cluster, i.e. a class index of '0' refers to the cluster with the highest probability.
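The alternation of E and M steps described above can be sketched for a one-dimensional Gaussian mixture (an illustrative Python sketch, not WEKA's EM clusterer, which is more general; the deterministic initialisation over the data range is our own simplification):

```python
import math

def em_gmm_1d(data, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture with EM.
    Returns (weights, means, variances)."""
    lo, hi = min(data), max(data)
    # Spread the initial means over the data range (simple deterministic choice)
    means = [lo + (hi - lo) * j / (k - 1) if k > 1 else (lo + hi) / 2
             for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    n = len(data)
    for _ in range(iters):
        # E step: responsibility of each component for each point,
        # proportional to weight * Gaussian density
        resp = []
        for x in data:
            dens = [weights[j] / math.sqrt(2 * math.pi * variances[j])
                    * math.exp(-(x - means[j]) ** 2 / (2 * variances[j]))
                    for j in range(k)]
            s = sum(dens) or 1e-300  # guard against total underflow
            resp.append([d / s for d in dens])
        # M step: re-estimate weight, mean and variance of each component
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(1e-6,
                               sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj)
    return weights, means, variances
```

On data drawn from two well-separated groups, the two fitted means converge to the group centres and the mixture weights to the group proportions.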
Fig. 6 EM algorithm
Figure 6 shows the EM algorithm in WEKA, and the next figure shows the result of the EM algorithm in the form of a graph, in which the various clusters can be seen in different colours.
Advantages
1. Gives extremely useful results for real-world data sets.
2. Use this algorithm when you want to perform a cluster analysis of a small scene or region of interest and are not satisfied with the results obtained from the k-means algorithm.
Disadvantage
1. The algorithm is highly complex in nature.
Fig. 7 Result of EM Algorithm
VII. K-MEANS CLUSTERING ALGORITHM
In data mining, k-means clustering [5] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results, so the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more.
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group
centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which
the metric to be minimized can be calculated.
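The four steps above can be sketched in Python as follows (illustrative only, not WEKA's SimpleKMeans; taking the first k points as initial centroids is our own simplification of step 1):

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: returns (centroids, labels)."""
    # Step 1: place K initial centroids (naively: the first k points)
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        # Step 2: assign each point to the group with the closest centroid
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[j])))
                  for p in points]
        # Step 3: recalculate each centroid as the mean of its assigned points
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step 4: repeat until the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated groups of points with k = 2, the loop converges in a few iterations and the labels separate the two groups.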
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum [12]. These are usually similar to the expectation–maximization algorithm for mixtures of Gaussian distributions, in the iterative refinement approach employed by both algorithms. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation–maximization mechanism allows clusters to have different shapes.
Fig. 8 K-means clustering algorithm
Fig. 9 Result of k-means clustering
This figure shows the result of the k-means clustering method. After that we save the result, which is stored in the ARFF file format. We can also open this file in MS Excel and sort the data according to clusters.
Advantages to Using this Technique
1. With a large number of variables, K-Means may be computationally faster than hierarchical clustering [8]
(if K is small).
2. K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages to Using this Technique
1. Difficulty in comparing the quality of the clusters produced (e.g. different initial partitions or values of K affect the outcome).
2. Fixed number of clusters can make it difficult to predict what K should be.
3. Does not work well with non-globular clusters.
Different initial partitions can result in different final clusters. It is helpful to rerun the program using the same as
well as different K values, to compare the results achieved.
VIII. CLASSIFICATION
Classification is the process of grouping together documents or data that have similar properties or are related. Our understanding of the data and documents becomes greater and easier once they are classified. We can also infer logic based on the classification. Most of all, it allows new data to be sorted easily and makes retrieval faster, with better results.
The Dewey Decimal Classification is the system most used in libraries. It is hierarchical: there are ten parent classes, which are divided into ten divisions each, which in turn are divided into ten sections. Each book is assigned a number according to its class, division and section. The Dewey Decimal Classification is very successful in libraries, but
Fig. 10 Classification of Books according to Subject
unfortunately it cannot be implemented in information retrieval. Somebody would need to have a central catalogue of all the documents on the web, and whenever a new document was added the central committee would have to look at it, classify it, assign a number and publish it on the web. This is in strong violation of the way the internet works. An authority controlling the contents of the web would restrict the amount of data that can be added to it. We need a web that allows everyone to upload their content, together with a machine learning technique that finds these new data and classifies them as they come. We use the WEKA data mining tool for this purpose; it provides a better user interface than the other data mining tools.
IX. PERFORMING CLASSIFICATION IN WEKA
To perform classification in WEKA, the data set is loaded into WEKA as shown in the figure. WEKA requires the data set to be in CSV or ARFF file format; if the data set is not in ARFF format, it needs to be converted.
After that we have the many options shown in the figure. Since we are performing classification, we click on the Classify button. We then choose which algorithm to apply to the data, as shown in figure 11, and click the OK button.
Fig. 11 Various Classification Algorithms in WEKA
X. J48 CLASSIFICATION ALGORITHM
J48 is a slightly modified C4.5 in WEKA. The C4.5 algorithm generates a classification decision tree for the given data set by recursive partitioning of the data. The decision tree is grown using a depth-first strategy. The algorithm considers all the possible tests that can split the data set and selects the test that gives the best information gain. For each discrete attribute, one test with as many outcomes as the number of distinct values of the attribute is considered. For each continuous attribute, binary tests involving every distinct value of the attribute are considered. In order to gather the entropy gain of all these binary tests efficiently, the training data set belonging to the node in consideration is sorted on the values of the continuous attribute, and the entropy gains of the binary cuts based on each distinct value are calculated in one scan of the sorted data. This process is repeated for each continuous attribute. For a deeper introduction to this method, readers can refer to (Mitchell 1997; Quinlan 1986).
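The single-scan evaluation of binary cuts on a sorted continuous attribute can be sketched as follows (an illustrative Python sketch, not the actual C4.5/J48 code; the function names are our own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Scan the sorted values once and return (threshold, info_gain) for the
    best binary split 'value <= t' vs 'value > t'."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best_gain, best_t = -1.0, None
    left = []
    right = [l for _, l in pairs]
    for i in range(n - 1):
        # Move one example from the right partition to the left
        left.append(pairs[i][1])
        right.pop(0)  # O(n) pop; fine for a sketch
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no cut possible between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

For a perfectly separable attribute, the best cut sits midway between the two classes and the information gain equals the full entropy of the node.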
Fig. 12 J48 Classification Algorithm
Fig. 13: Result of J48 Classification Algorithm
XI. ID3 CLASSIFICATION ALGORITHM
The ID3 algorithm is an example of symbolic learning and rule induction. It is also a supervised learner, which means it looks at examples, such as a training data set, to make its decisions. It was developed by J. Ross Quinlan back in 1979. It builds a decision tree based on mathematical calculations. A decision tree classifies data using its attributes; the tree is drawn upside down, with decision nodes and leaf nodes.
The ID3 algorithm is a supervised learner: it needs training data sets to make decisions. The training set lists the attributes and their possible values. ID3 does not deal with continuous, numeric data, which means we have to discretize it: an attribute such as age, which can take values from 1 to 100, is instead listed as young or old.
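ID3's greedy choice of the attribute with the highest information gain can be sketched as follows (an illustrative Python sketch, not WEKA's Id3 implementation; the helper names and the age/sex example data in the usage are our own):

```python
import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy of the target attribute over a list of example dicts."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    """Information gain of splitting the examples on a discrete attribute."""
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == v]
        remainder += (len(subset) / n) * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3_pick(rows, attrs, target):
    """ID3's greedy step: choose the attribute with the highest gain."""
    return max(attrs, key=lambda a: info_gain(rows, a, target))
```

For instance, on a toy set where a discretized age attribute (young/old) perfectly predicts the class while sex does not, `id3_pick` selects age; the weakness discussed below is that the runner-up attribute is then ignored entirely.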
Fig. 14 ID3 Classification Algorithm
Weaknesses of ID3 Algorithm
ID3 uses training data sets to make decisions, which means it relies entirely on the training data. The training data is input by the programmer; whatever is in the training data is its base knowledge, and any adulteration of the training data will result in wrong classification. It cannot handle continuous data such as numeric values, so the values of the attributes need to be discrete. It also considers only the single attribute with the highest gain; it does not consider other attributes with less gain, and it does not backtrack to check its nodes, so it is also called a greedy algorithm. Due to this, it results in shorter trees. Sometimes we might need to consider two attributes at once as a combination, but this is not facilitated in ID3. For example, in a bank loan application we might need to consider attributes such as age and earnings together: young applicants with lower earnings can potentially have more chances of promotion and better pay, which will result in a higher credit rating.
XII. BAYES NETWORK CLASSIFIER
Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. This classifier learns from training data the conditional probability of each attribute Ai given the class label C [14,15]. Classification is then done by applying Bayes' rule to compute the probability of C given the particular instance of A1…An, and then predicting the class with the highest posterior probability.
Fig. 15 Naïve Bayes Classification Algorithm
The goal of classification is to correctly predict the value of a designated discrete class variable given a vector of predictors
or attributes [16]. In particular, the naive Bayes classifier is a Bayesian network where the class has no parents and each
attribute has the class as its sole parent [15,16].
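The naive Bayes computation described above can be sketched as follows (an illustrative Python sketch with Laplace smoothing, not WEKA's BayesNet implementation; the function names and the toy outlook/play data in the test are our own):

```python
from collections import Counter, defaultdict

def train_nb(rows, target):
    """Estimate P(C) and P(A_i = v | C) by counting, from a list of dicts."""
    class_counts = Counter(r[target] for r in rows)
    cond = defaultdict(Counter)   # (attribute, class) -> value counts
    values = defaultdict(set)     # attribute -> set of observed values
    for r in rows:
        for a, v in r.items():
            if a != target:
                cond[(a, r[target])][v] += 1
                values[a].add(v)
    return class_counts, cond, values, len(rows)

def predict_nb(model, instance):
    """Return the class with the highest posterior P(C | A_1..A_n)."""
    class_counts, cond, values, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                   # prior P(C)
        for a, v in instance.items():
            # Laplace-smoothed likelihood P(A = v | C)
            score *= (cond[(a, c)][v] + 1) / (cc + len(values[a]))
        if score > best_score:
            best, best_score = c, score
    return best
```

Each attribute contributes one likelihood factor, reflecting the naive Bayes structure described above, in which the class is the sole parent of every attribute.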
XIII. RESULTS AND CONCLUSION
WEKA is a data mining tool. It is one of the simplest tools for classifying data of various types, and one of the first to provide a graphical user interface to the user. The main aim of this paper is to provide a detailed introduction to WEKA's clustering and classification algorithms. To perform the clustering we used the PROMISE data repository, which provides past project data for analysis. With the help of figures we show the working of the various algorithms used in WEKA, along with the advantages and disadvantages of each algorithm. Every algorithm has its own importance, and we choose among them based on the behaviour of the data; but on the basis of this research we found that the k-means clustering algorithm is the simplest algorithm compared to the others. In classification, J48 shows the best performance considering both accuracy and speed. Deep knowledge of the algorithms is not required to work in WEKA, which is why WEKA is a very suitable tool for data mining applications. This paper has shown the clustering and classification operations in WEKA.
REFERENCES
[1] Narendra Sharma, Aman Bajpai, Mr. Ratnesh Litoriya, "Comparison the various clustering algorithms of weka tools", International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 5, May 2012.
[2] A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38.
[3] Slava Kisilevich, Florian Mansmann, Daniel Keim, "P-DBSCAN: A density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos", University of Konstanz.
[4] Fei Shao, Yanjiao Cao, "A New Real-time Clustering Algorithm", Department of Computer Science and Technology, Chongqing University of Technology, Chongqing 400050, China.
[5] Jinxin Gao, David B. Hitchcock, "James-Stein Shrinkage to Improve K-means Cluster Analysis", University of South Carolina, Department of Statistics, November 30, 2009.
[6] V. Filkov and S. Skiena, "Integrating microarray data by consensus clustering", International Journal on Artificial Intelligence Tools, 13(4):863-880, 2004.
[7] N. Ailon, M. Charikar, and A. Newman, "Aggregating inconsistent information: ranking and clustering", In Proceedings of the thirty-seventh annual ACM Symposium on Theory of Computing, pages 684-693, 2005.
[8] E. B. Fowlkes and C. L. Mallows, "A method for comparing two hierarchical clusterings", Journal of the American Statistical Association, 78:553-584, 1983.
[9] M. and Heckerman, D. (February 1998), "An experimental comparison of several clustering and initialization methods", Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA.
[10] Celeux, G. and Govaert, G. (1992), "A classification EM algorithm for clustering and two stochastic versions", Computational Statistics and Data Analysis, 14:315-332.
[11] Microsoft Academic Search: most cited data mining articles: DBSCAN is at rank 24, accessed on 4/18/2010.
[12] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2:283-304, 1998.
[13] R. Sibson (1973), "SLINK: an optimally efficient algorithm for the single-link cluster method", The Computer Journal (British Computer Society), 16(1):30-34.
[14] Bouckaert, R. R. (1994), "Properties of Bayesian network Learning Algorithms", In R. Lopez De Mantaras & D. Poole (Eds.), Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 102-109), San Francisco, CA.
[15] Buntine, W. (1991), "Theory refinement on Bayesian networks", In B. D. D'Ambrosio, P. Smets, & P. P. Bonissone (Eds.), Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (pp. 52-60), San Francisco, CA.
[16] Daniel Grossman and Pedro Domingos (2004), "Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood", In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
[17] WEKA at http://www.cs.waikato.ac.nz/~ml/weka.