© 2013, IJARCSSE All Rights Reserved Page | 866
Volume 3, Issue 12, December 2013 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com
Comparison of the Various Clustering and Classification Algorithms of WEKA Tools
Sonam Narwal, M.Tech (S/W Engg.) Scholar, UIET, M.D.U., Rohtak, Haryana
Mr. Kamaldeep Mintwal, Assistant Professor, UIET, M.D.U., Rohtak, Haryana
Abstract - Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data
pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from
huge data banks. The development of data-mining applications such as classification and clustering has shown the need
for machine learning algorithms to be applied to large scale data. In this paper we present the comparison of different
classification and clustering techniques using Waikato Environment for Knowledge Analysis or in short, WEKA. The
algorithms tested are the DBSCAN, EM and k-means clustering algorithms, and the J48, ID3 and Bayes network classifier classification algorithms.
Keywords — Machine Learning, Data Mining, WEKA, Classification, Clustering.
I. INTRODUCTION
Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data
items. Data mining often involves the analysis of data stored in a data warehouse. Three of the major data mining techniques
are regression, classification and clustering. In this research paper we work with clustering and classification, since these are the most important processes when dealing with a very large database. The WEKA tool is used for both clustering and classification.
Clustering is an initial and fundamental step in data analysis. It is an unsupervised classification of patterns into groups, or clusters. Intuitively, patterns within a valid cluster are more similar to each other than to patterns belonging to other clusters [1]. Clustering is useful in several fields such as pattern analysis, machine learning and pattern classification, among many others.
Clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more
similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of exploratory data
mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, and bioinformatics.
II. WEKA
WEKA is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques
[17]and their application to real-world data mining problems. It is a collection of machine learning algorithms for data
mining tasks. The algorithms are applied directly to a dataset.
Fig. 1 Front View of WEKA Tools
WEKA implements algorithms for data preprocessing, classification, regression, clustering and association rules; it also includes visualization tools. New machine learning schemes can also be developed with this package.
WEKA is open-source software issued under the GNU General Public License [10]. The data file normally used by WEKA is in the ARFF file format, which consists of special tags to indicate different things in the data file (foremost: attribute names, attribute types, attribute values and the data). The main interface in WEKA is the Explorer. It has a set of panels, each of which can be used to perform a certain task. Once a dataset has been loaded, one of the other panels in the Explorer can be used to perform further analysis. Simulation is done using the clustering and classification algorithms of WEKA to compare the algorithms; for this purpose we take data from the BRO (Border Roads Organisation's GREF Centre) medical data repositories. Working with WEKA does not require deep knowledge of data mining, which is one reason it is such a popular data mining tool. WEKA also provides a graphical user interface to the user, along with many other facilities [3,6].
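As a small illustration of the ARFF format described above (with hypothetical attribute names and values, not taken from the BRO data set), a minimal file might look like this:

```
@relation patient-data

@attribute age numeric
@attribute sex {male,female}
@attribute diagnosis {positive,negative}

@data
45,male,positive
32,female,negative
```

The `@attribute` tags declare the attribute names and types (numeric, or a set of nominal values), and the `@data` section holds one comma-separated instance per line.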
The GUI Chooser consists of four buttons, one for each of the four major WEKA applications, and four menus. The buttons
can be used to start the following applications:
Explorer: An environment for exploring data with WEKA.
Experimenter: An environment for performing experiments and conducting statistical tests between learning
schemes.
Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a drag-and-
drop interface. One advantage is that it supports incremental learning.
Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands for
operating systems that do not provide their own command line interface.
III. METHODOLOGY
The methodology is very simple. The past project data has been taken from the repositories and loaded into WEKA. In WEKA, different clustering and classification algorithms are applied, and useful results are predicted that will be very helpful for new users and new researchers.
IV. PERFORMING CLUSTERING IN WEKA
To perform cluster analysis in WEKA, the data set is loaded into WEKA as shown in the figure. WEKA requires the data set to be in CSV or ARFF file format; if the data set is not in ARFF format, it needs to be converted.
Fig. 2 Load Data Set into WEKA
After that we have the many options shown in the figure. Since we are performing clustering [9], we click on the Cluster button. We then choose which algorithm to apply to the data, as shown in figure 3, and click the OK button.
Fig. 3 Various clustering algorithms in WEKA.
V. DBSCAN CLUSTERING ALGORITHM
DBSCAN (density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN [3] is one of the most common clustering algorithms and also one of the most cited in the scientific literature.
OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius. The analysis of DBSCAN [11] in WEKA is shown in the figure.
Fig. 4 DBSCAN Algorithm
Advantages
1. DBSCAN does not require you to know the number of clusters in the data a priori, as opposed to k-means.
2. DBSCAN can find arbitrarily shaped clusters. It can even find clusters completely surrounded by (but not connected
to) a different cluster. Due to the MinPts parameter, the so-called single-link [13] effect (different clusters being
connected by a thin line of points) is reduced.
3. DBSCAN has a notion of noise.
4. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (Only points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
Disadvantages
1. The quality of the clustering DBSCAN produces [7] is only as good as its distance measure in the function regionQuery(P, ε). The most common distance metric used is the Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it hard to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on the Euclidean distance.
2. DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.
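To make the procedure concrete, here is a minimal sketch of DBSCAN in Python (illustrative only, not WEKA's implementation; the function names and the brute-force neighbourhood search are our own simplifications):

```python
import math

def region_query(points, p_idx, eps):
    """Return indices of all points within distance eps of points[p_idx]."""
    return [i for i, q in enumerate(points)
            if math.dist(points[p_idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...); noise stays None."""
    labels = [None] * len(points)
    visited = [False] * len(points)
    cluster = -1
    for i in range(len(points)):
        if visited[i]:
            continue
        visited[i] = True
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            continue  # not a core point; stays noise unless a cluster claims it
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbours = region_query(points, j, eps)
                if len(j_neighbours) >= min_pts:
                    seeds.extend(j_neighbours)  # j is also a core point
            if labels[j] is None:
                labels[j] = cluster
    return labels
```

With ε = 2 and minPts = 3, two dense groups of points receive two different labels while an isolated point is left as None (noise), matching the advantages listed above: the number of clusters is not given in advance, and noise is recognised explicitly.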
The result of DBSCAN is shown in the form of a graph:
Fig. 5 Result of the DBSCAN algorithm
VI. EM ALGORITHM
The EM algorithm [2] is also an important data mining algorithm. It is used when we are not satisfied with the results of the k-means method. An expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM [10] iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. The result of the cluster analysis is written to a band named class indices. The values in this band indicate the class indices, where a value of '0' refers to the first cluster, a value of '1' refers to the second cluster, etc.
The class indices are sorted according to the prior probability associated with each cluster, i.e. a class index of '0' refers to the cluster with the highest probability.
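The alternation of E and M steps described above can be sketched for a one-dimensional Gaussian mixture (an illustrative Python sketch, not WEKA's EM clusterer, which is more general; the deterministic initialisation over the data range is our own simplification):

```python
import math

def em_gmm_1d(data, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture with EM.
    Returns (weights, means, variances)."""
    lo, hi = min(data), max(data)
    # Spread the initial means over the data range (simple deterministic choice)
    means = [lo + (hi - lo) * j / (k - 1) if k > 1 else (lo + hi) / 2
             for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    n = len(data)
    for _ in range(iters):
        # E step: responsibility of each component for each point,
        # proportional to weight * Gaussian density
        resp = []
        for x in data:
            dens = [weights[j] / math.sqrt(2 * math.pi * variances[j])
                    * math.exp(-(x - means[j]) ** 2 / (2 * variances[j]))
                    for j in range(k)]
            s = sum(dens) or 1e-300  # guard against total underflow
            resp.append([d / s for d in dens])
        # M step: re-estimate weight, mean and variance of each component
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(1e-6,
                               sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj)
    return weights, means, variances
```

On data drawn from two well-separated groups, the two fitted means converge to the group centres and the mixture weights to the group proportions.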
Fig. 6 EM algorithm
Figure 6 shows the EM algorithm in WEKA, and the next figure shows the result of the EM algorithm in the form of a graph, in which the various clusters can be seen in different colours.
Advantages
1. Gives extremely useful results for real-world data sets.
2. Use this algorithm when you want to perform a cluster analysis of a small scene or region of interest and are not satisfied with the results obtained from the k-means algorithm.
Disadvantage
1. The algorithm is highly complex in nature.
Fig. 7 Result of EM Algorithm
VII. K-MEANS CLUSTERING ALGORITHM
In data mining, k-means clustering [5] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results, so the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more.
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group
centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which
the metric to be minimized can be calculated.
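The four steps above can be sketched in Python as follows (illustrative only, not WEKA's SimpleKMeans; taking the first k points as initial centroids is our own simplification of step 1):

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: returns (centroids, labels)."""
    # Step 1: place K initial centroids (naively: the first k points)
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        # Step 2: assign each point to the group with the closest centroid
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[j])))
                  for p in points]
        # Step 3: recalculate each centroid as the mean of its assigned points
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step 4: repeat until the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated groups of points with k = 2, the loop converges in a few iterations and the labels separate the two groups.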
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum [12]. These are usually similar to the expectation–maximization algorithm for mixtures of Gaussian distributions, in the iterative refinement approach employed by both algorithms. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation–maximization mechanism allows clusters to have different shapes.
Fig. 8 K-means clustering algorithm
Fig. 9 Result of k-means clustering
This figure shows the result of the k-means clustering method. After that we save the result, which is stored in the ARFF file format. We can also open this file in MS Excel and sort the data according to clusters.
Advantages to Using this Technique
1. With a large number of variables, K-Means may be computationally faster than hierarchical clustering [8]
(if K is small).
2. K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages to Using this Technique
1. Difficulty in comparing the quality of the clusters produced (e.g. different initial partitions or values of K affect the outcome).
2. Fixed number of clusters can make it difficult to predict what K should be.
3. Does not work well with non-globular clusters.
Different initial partitions can result in different final clusters. It is helpful to rerun the program using the same as
well as different K values, to compare the results achieved.
VIII. CLASSIFICATION
Classification is the process of grouping together documents or data that have similar properties or are related. Our understanding of the data and documents becomes greater and easier once they are classified. We can also infer logic based on the classification. Most of all, it allows new data to be sorted easily and makes retrieval faster, with better results.
The Dewey Decimal Classification is the system most used in libraries. It is hierarchical: there are ten parent classes, which are divided into ten divisions each, which in turn are divided into ten sections. Each book is assigned a number according to its class, division and section. The Dewey Decimal Classification is very successful in libraries, but
Fig. 10 Classification of Books according to Subject
unfortunately it cannot be implemented in information retrieval. Somebody would need to have a central catalogue of all the documents on the web, and whenever a new document was added the central committee would have to look at it, classify it, assign a number and publish it on the web. This is in strong violation of the way the internet works. An authority controlling the contents of the web would restrict the amount of data that can be added to it. We need a web that allows everyone to upload their content, together with a machine learning technique that finds these new data and classifies them as they come. We use the WEKA data mining tool for this purpose; it provides a better user interface than the other data mining tools.
IX. PERFORMING CLASSIFICATION IN WEKA
To perform classification in WEKA, the data set is loaded into WEKA as shown in the figure. WEKA requires the data set to be in CSV or ARFF file format; if the data set is not in ARFF format, it needs to be converted.
After that we have the many options shown in the figure. Since we are performing classification, we click on the Classify button. We then choose which algorithm to apply to the data, as shown in figure 11, and click the OK button.
Fig. 11 Various Classification Algorithms in WEKA
X. J48 CLASSIFICATION ALGORITHM
J48 is a slightly modified C4.5 in WEKA. The C4.5 algorithm generates a classification decision tree for the given data set by recursive partitioning of the data. The decision tree is grown using a depth-first strategy. The algorithm considers all the possible tests that can split the data set and selects the test that gives the best information gain. For each discrete attribute, one test with as many outcomes as the number of distinct values of the attribute is considered. For each continuous attribute, binary tests involving every distinct value of the attribute are considered. In order to gather the entropy gain of all these binary tests efficiently, the training data set belonging to the node in consideration is sorted on the values of the continuous attribute, and the entropy gains of the binary cuts based on each distinct value are calculated in one scan of the sorted data. This process is repeated for each continuous attribute. For a deeper introduction to this method, readers can refer to (Mitchell 1997; Quinlan 1986).
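The single-scan evaluation of binary cuts on a sorted continuous attribute can be sketched as follows (an illustrative Python sketch, not the actual C4.5/J48 code; the function names are our own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Scan the sorted values once and return (threshold, info_gain) for the
    best binary split 'value <= t' vs 'value > t'."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best_gain, best_t = -1.0, None
    left = []
    right = [l for _, l in pairs]
    for i in range(n - 1):
        # Move one example from the right partition to the left
        left.append(pairs[i][1])
        right.pop(0)  # O(n) pop; fine for a sketch
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no cut possible between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

For a perfectly separable attribute, the best cut sits midway between the two classes and the information gain equals the full entropy of the node.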
Fig. 12 J48 Classification Algorithm
Fig. 13: Result of J48 Classification Algorithm
XI. ID3 CLASSIFICATION ALGORITHM
The ID3 algorithm is an example of symbolic learning and rule induction. It is also a supervised learner, which means it looks at examples, such as a training data set, to make its decisions. It was developed by J. Ross Quinlan back in 1979. It builds a decision tree based on mathematical calculations. A decision tree classifies data using its attributes; the tree is drawn upside down, with decision nodes and leaf nodes.
The ID3 algorithm is a supervised learner: it needs training data sets to make decisions. The training set lists the attributes and their possible values. ID3 does not deal with continuous, numeric data, which means we have to discretize it: an attribute such as age, which can take values from 1 to 100, is instead listed as young or old.
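ID3's greedy choice of the attribute with the highest information gain can be sketched as follows (an illustrative Python sketch, not WEKA's Id3 implementation; the helper names and the age/sex example data in the usage are our own):

```python
import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy of the target attribute over a list of example dicts."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    """Information gain of splitting the examples on a discrete attribute."""
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == v]
        remainder += (len(subset) / n) * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3_pick(rows, attrs, target):
    """ID3's greedy step: choose the attribute with the highest gain."""
    return max(attrs, key=lambda a: info_gain(rows, a, target))
```

For instance, on a toy set where a discretized age attribute (young/old) perfectly predicts the class while sex does not, `id3_pick` selects age; the weakness discussed below is that the runner-up attribute is then ignored entirely.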
Fig. 14 ID3 Classification Algorithm
Weaknesses of ID3 Algorithm
ID3 uses training data sets to make decisions, which means it relies entirely on the training data. The training data is input by the programmer; whatever is in the training data is its base knowledge, and any adulteration of the training data will result in wrong classification. It cannot handle continuous data such as numeric values, so the values of the attributes need to be discrete. It also considers only the single attribute with the highest gain; it does not consider other attributes with less gain, and it does not backtrack to check its nodes, so it is also called a greedy algorithm. Due to this, it results in shorter trees. Sometimes we might need to consider two attributes at once as a combination, but this is not facilitated in ID3. For example, in a bank loan application we might need to consider attributes such as age and earnings together: young applicants with lower earnings can potentially have more chances of promotion and better pay, which will result in a higher credit rating.
XII. BAYES NETWORK CLASSIFIER
Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. This classifier learns from training data the conditional probability of each attribute Ai given the class label C [14,15]. Classification is then done by applying Bayes' rule to compute the probability of C given the particular instance of A1…An, and then predicting the class with the highest posterior probability.
Fig. 15 Naïve Bayes Classification Algorithm
The goal of classification is to correctly predict the value of a designated discrete class variable given a vector of predictors
or attributes [16]. In particular, the naive Bayes classifier is a Bayesian network where the class has no parents and each
attribute has the class as its sole parent [15,16].
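The naive Bayes computation described above can be sketched as follows (an illustrative Python sketch with Laplace smoothing, not WEKA's BayesNet implementation; the function names and the toy outlook/play data in the test are our own):

```python
from collections import Counter, defaultdict

def train_nb(rows, target):
    """Estimate P(C) and P(A_i = v | C) by counting, from a list of dicts."""
    class_counts = Counter(r[target] for r in rows)
    cond = defaultdict(Counter)   # (attribute, class) -> value counts
    values = defaultdict(set)     # attribute -> set of observed values
    for r in rows:
        for a, v in r.items():
            if a != target:
                cond[(a, r[target])][v] += 1
                values[a].add(v)
    return class_counts, cond, values, len(rows)

def predict_nb(model, instance):
    """Return the class with the highest posterior P(C | A_1..A_n)."""
    class_counts, cond, values, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                   # prior P(C)
        for a, v in instance.items():
            # Laplace-smoothed likelihood P(A = v | C)
            score *= (cond[(a, c)][v] + 1) / (cc + len(values[a]))
        if score > best_score:
            best, best_score = c, score
    return best
```

Each attribute contributes one likelihood factor, reflecting the naive Bayes structure described above, in which the class is the sole parent of every attribute.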
XIII. RESULTS AND CONCLUSION
WEKA is a data mining tool. It is one of the simplest tools for classifying data of various types, and one of the first to provide a graphical user interface to the user. The main aim of this paper is to provide a detailed introduction to WEKA's clustering and classification algorithms. To perform the clustering we used the PROMISE data repository, which provides past project data for analysis. With the help of figures we show the working of the various algorithms used in WEKA, along with the advantages and disadvantages of each algorithm. Every algorithm has its own importance, and we choose among them based on the behaviour of the data; but on the basis of this research we found that the k-means clustering algorithm is the simplest algorithm compared to the others. In classification, J48 shows the best performance considering both accuracy and speed. Deep knowledge of the algorithms is not required to work in WEKA, which is why WEKA is a very suitable tool for data mining applications. This paper has shown the clustering and classification operations in WEKA.
REFERENCES
[1] Narendra Sharma, Aman Bajpai, Mr. Ratnesh Litoriya, "Comparison the various clustering algorithms of weka tools", International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 5, May 2012.
[2] A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38.
[3] Slava Kisilevich, Florian Mansmann, Daniel Keim, "P-DBSCAN: A density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos", University of Konstanz.
[4] Fei Shao, Yanjiao Cao, "A New Real-time Clustering Algorithm", Department of Computer Science and Technology, Chongqing University of Technology, Chongqing 400050, China.
[5] Jinxin Gao, David B. Hitchcock, "James-Stein Shrinkage to Improve K-means Cluster Analysis", University of South Carolina, Department of Statistics, November 30, 2009.
[6] V. Filkov and S. Skiena, "Integrating microarray data by consensus clustering", International Journal on Artificial Intelligence Tools, 13(4):863-880, 2004.
[7] N. Ailon, M. Charikar, and A. Newman, "Aggregating inconsistent information: ranking and clustering", In Proceedings of the thirty-seventh annual ACM Symposium on Theory of Computing, pages 684-693, 2005.
[8] E. B. Fowlkes and C. L. Mallows, "A method for comparing two hierarchical clusterings", Journal of the American Statistical Association, 78:553-584, 1983.
[9] M. and Heckerman, D. (February 1998), "An experimental comparison of several clustering and initialization methods", Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA.
[10] Celeux, G. and Govaert, G. (1992), "A classification EM algorithm for clustering and two stochastic versions", Computational Statistics and Data Analysis, 14:315-332.
[11] Microsoft Academic Search: most cited data mining articles: DBSCAN is at rank 24, accessed on 4/18/2010.
[12] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2:283-304, 1998.
[13] R. Sibson (1973), "SLINK: an optimally efficient algorithm for the single-link cluster method", The Computer Journal (British Computer Society), 16(1):30-34.
[14] Bouckaert, R. R. (1994), "Properties of Bayesian network Learning Algorithms", In R. Lopez De Mantaras & D. Poole (Eds.), Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 102-109), San Francisco, CA.
[15] Buntine, W. (1991), "Theory refinement on Bayesian networks", In B. D. D'Ambrosio, P. Smets, & P. P. Bonissone (Eds.), Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (pp. 52-60), San Francisco, CA.
[16] Daniel Grossman and Pedro Domingos (2004), "Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood", In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
[17] WEKA at http://www.cs.waikato.ac.nz/~ml/weka.