
Indian Journal of Engineering & Materials Sciences Vol. 8, December 2001, pp. 327-340

Development of soft computing models for data mining

S N Sivanandam, A Shanmugam, S Sumathi & K Usha Department of Electrical and Electronics Engineering, PSG College of Technology, Coimbatore 641 004, India

Received 28 June 2000; accepted 3 September 2001

The increasing amount and complexity of today's data available in science, business, industry and many other areas creates an urgent need to accelerate the discovery of knowledge in large databases. Such data can provide a rich resource for knowledge discovery and decision support. To understand, analyze and eventually use this data, a multidisciplinary approach called data mining has been proposed. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Pattern classification is one particular category of data mining, which enables the discovery of knowledge from very large databases (VLDB). In this paper, mining the database through pattern classification has been done by utilizing two important mining tools called the K-Nearest Neighbour algorithm and decision trees. The K-Nearest Neighbour (K-NN) is a popularly used conventional statistical approach for data mining. K-NN is a technique that classifies each record in a data set based on a combination of the classes of the K records most similar to it in a historical data set. The fuzzy version of K-NN, and crisp and fuzzy versions of nearest prototype classifiers, have also been proposed. The decision tree is one of the best machine learning approaches for data mining. A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Briefly, decision trees are tree-shaped structures that represent sets of decisions. These decisions generate rules for classification of a data set. Classification and Regression Tree (CART) and ID3 are the two decision tree methods used in this paper. The classification rules have been extracted in the form of IF-THEN rules. The performance analysis of K-NN methods and tree-based classifiers has been done. The proposed methods have been tested on three applications: Landsat imagery, letter image recognition and optical recognition of handwritten digits data. The simulation algorithms have been implemented using C++ under the UNIX platform.

With the wide use of advanced database technology developed during the past decades1-4, it is not difficult to efficiently store huge volumes of data in computers and retrieve them whenever needed. Although the stored data are a valuable asset of any application, sooner or later more people face the problem of being data rich but knowledge poor. This situation aroused the recent surge of research interest in the area of data mining. Pattern classification is a well-recognized data mining operation used extensively for decision-making. Classification is a process of finding the common properties among different entities and classifying them into classes.

The methods popularly used by researchers in the area of machine learning include neural networks and decision trees. Decision trees are predictive models that are extensively used for pattern classification and decision-making. Decision trees provide the easily understandable and accurate decision making needed for real-life applications. The decision tree's intelligence is obtained in the form of production rules, which enables the tree to have lower classification time and better classification capability. The Nearest Neighbour classifier, a widely used conventional statistical method for classification, has also been used. In this paper, the K-NN and Nearest Prototype classifiers, in crisp and fuzzy versions, are proposed for data mining.

The data mining process consists of three major steps:

(i) Data preparation: Data is selected, cleaned and preprocessed under the guidance and knowledge of domain experts, who capture and integrate both the internal and external data into a comprehensive view that encompasses the whole organization.

(ii) Data mining algorithm: A data mining algorithm is applied to the integrated data to make it easy to identify any valuable information.

(iii) Data analysis phase: The data mining output is evaluated to see whether domain knowledge has been discovered, in the form of rules extracted from the network. The overall data mining process is shown in Fig. 1.

Fig. 1--Overall data mining process (data preparation, data mining algorithm, data analysis)


Data mining techniques
A variety of data mining techniques and algorithms are available, each with its own strengths and weaknesses. The data mining tools available include: (i) statistical approaches, (ii) machine learning approaches, (iii) neural networks, (iv) rule induction, (v) database systems, (vi) rough sets and (vii) data visualization.

The K-NN method, a modern statistical approach, and decision trees, the most commonly used machine learning approach, are used here as data mining tools.

Data mining categories3,4
The goal of data mining or knowledge discovery is to determine explicit hidden relationships, patterns or correlations from data stored in a database. There are five categories in the data mining process, viz., (i) summarization, (ii) classification, (iii) clustering, (iv) association and (v) trend analysis.

In this paper, the classification category, one of the important problems in data mining, is dealt with.

Pattern Classification using K-Nearest Neighbour Methods5-7
Classification of objects is an important area of research and application in a variety of fields. The K-Nearest Neighbour (NN) decision rule has often been used in pattern recognition and classification problems. The basic idea behind K-NN is very straightforward. For training, all input and output pairs in the training set are stored in a database. When a classification is needed for a new input pattern, the answer is based on the K nearest training patterns in the database. It is the simplest method used for pattern classification, classifying the unknown record based on the K nearest records from the historical database. K-NN requires no training time other than the time required for preprocessing and storing the entire training set. It is very memory intensive since the entire training set is stored. Classification is slow since the distance between the input pattern and all patterns in the training set must be computed. The flow chart (Fig. 2) shows the simple classification procedure used by the K-NN method.

Common training parameters for K-NN
(i) Number of nearest neighbours: This is the number of nearest neighbours, K, used to classify the input patterns.
(ii) Input compression: Since K-NN is very storage intensive, data patterns can be compressed as a preprocessing step before classification. Typically, using input compression will result in slightly worse performance (as resolution in the input data is lost). Sometimes using compression will improve performance because it performs an automatic normalization of the data, which can equalize the effect of each input in the Euclidean distance measure. The K-NN algorithm should be used without any compression unless there is a memory problem.

The fuzzy version of the K-NN algorithm is also used to improve the classification performance. In addition, the crisp and fuzzy nearest prototype classifier techniques are also proposed. In the nearest prototype technique, a typical pattern of each class is chosen, and the unknown vector is assigned to the class of its closest prototype.

Crisp K-NN pattern classifier5,6
The crisp K-NN classification rule assigns an input sample vector y, which is of unknown classification, to the class of its nearest neighbour. This idea has been extended to the K nearest neighbours, with the vector y being assigned to the class that is represented by a majority amongst the K nearest neighbours. If a tie exists, the sample vector is assigned to the class, of those classes that tied, for which the sum of distances from the sample to each neighbour in the class is a minimum. Due to space limitations the crisp K-NN algorithm5 is not listed here.
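Since the crisp K-NN algorithm is not listed, a minimal C++ sketch of the rule just described is given below. It is an illustrative reconstruction, not the authors' original implementation: the Pattern structure and function names are assumed here, and the tie-breaking by summed distances mentioned above is omitted for brevity.

// Minimal sketch of the crisp K-NN rule: majority vote among the K nearest
// training patterns under the Euclidean distance.
#include <vector>
#include <cmath>
#include <algorithm>
#include <map>
#include <cstddef>

struct Pattern {
    std::vector<double> features;
    int label;
};

// Euclidean distance between two feature vectors of equal length.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Classify 'query' by a majority vote among its K nearest neighbours.
int classifyCrispKNN(const std::vector<Pattern>& training,
                     const std::vector<double>& query, std::size_t K) {
    // Pair each training pattern with its distance to the query.
    std::vector<std::pair<double, int>> dist;   // (distance, label)
    for (const Pattern& p : training)
        dist.push_back({euclidean(p.features, query), p.label});

    // Keep only the K smallest distances.
    std::size_t k = std::min(K, dist.size());
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());

    // Count votes (the confidence for class i is votes[i] / K).
    std::map<int, int> votes;
    for (std::size_t j = 0; j < k; ++j)
        ++votes[dist[j].second];

    int best = -1, bestCount = -1;
    for (const auto& v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}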

Fuzzy K-NN pattern classifier5,6
The theory of fuzzy sets is introduced into the K-Nearest Neighbour technique to develop a fuzzy version of the algorithm.

Fig. 2--Flow chart for the K-Nearest Neighbour algorithm. Training: store all input/output pairs in the training set. Testing: for each pattern in the test set, search for the K nearest patterns to the input pattern using a Euclidean distance measure; for classification, compute the confidence for each class as C_i/K, where C_i is the number of patterns among the K nearest patterns belonging to class i; the classification for the input pattern is the class with the highest confidence.


This decision rule provides a simple non-parametric procedure for the assignment of a class label to the input pattern, based on the membership values in each class computed using the class labels represented by the K closest neighbours of the vector.

Reasons for introducing fuzzy into the K-NN algorithm
(i) Each of the sample vectors is considered equally important in the assignment of the class label to the input vector. This frequently causes difficulty where the sample sets overlap.
(ii) Once an input vector is assigned to a class, there is no indication of its strength of membership in that class.

It is these two problems in the K-NN algorithm that have led to the incorporation of fuzzy set theory into the K-NN algorithm.

A fuzzy K-NN algorithm is developed utilizing fuzzy memberships of the sample sets and thus producing a fuzzy classification rule. Two different methods of assigning fuzzy memberships to the training sets are used.

Classification using fuzzy K-NN
The fuzzy K-NN algorithm assigns class membership to a sample vector rather than assigning the vector to a particular class. The advantage is that the algorithm makes no arbitrary assignments. In addition, the vector's membership values provide a level of assurance to accompany the resultant classification. The basis of the algorithm is to assign membership as a function of the vector's distance from its K nearest neighbours and those neighbours' memberships in the possible classes. The fuzzy algorithm is similar to the crisp version in the sense that it must also search the labelled sample set for the K nearest neighbours. Two methods of assigning memberships to the labelled samples are:

Membership functions:

#1. $\mu_i(x) = 1$ for $x \in$ class $i$; $\mu_i(x) = 0$ for $x \notin$ class $i$.

#2. $\mu_i(x) = x^2/(x^2+1)$ for $x \ge 0$; $\mu_i(x) = 0$ for $x < 0$.

Here the first membership function assigns the samples complete membership in their known class and non-membership in all other classes. The second one assigns the samples membership based on the distance from their class mean, and the resulting memberships are then used in the classifier. The fuzzy K-NN algorithm5 is used here.

$$\mu_i(x) = \frac{\sum_{j=1}^{K} \mu_{ij}\,\bigl(1/\|x - x_j\|^{2/(m-1)}\bigr)}{\sum_{j=1}^{K} \bigl(1/\|x - x_j\|^{2/(m-1)}\bigr)} \qquad \ldots (1)$$

As seen in Eq. (1), the assigned memberships of x are influenced by the inverse of the distances from the nearest neighbours and their class memberships. The inverse distance serves to weight a vector's membership more if it is closer and less if it is farther from the vector under consideration. The labelled samples can be assigned class memberships in several ways.

Here the variable 'm' plays an important role in assigning membership values in each class for the unclassified pattern. The variable 'm' determines how heavily the distance is weighted when calculating each neighbour's contribution to the membership value. If 'm' is two, then the contribution of each neighbouring point is weighted by the reciprocal of its distance from the point being classified. As 'm' increases, the neighbours are more evenly weighted, and their relative distances from the point being classified have less effect. As 'm' approaches one, the closer neighbours are weighted far more heavily than those farther away, which has the effect of reducing the number of points that contribute to the membership value of the point being classified. The fuzzy algorithm dominates the crisp version in terms of lower error rates, and the resulting memberships give a confidence measure of the classification.
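A small C++ sketch of the membership assignment of Eq. (1) follows. It is illustrative only; the container layout and the small constant guarding against division by zero are assumptions, not part of the original algorithm.

// Sketch of the fuzzy K-NN membership assignment of Eq. (1).
// 'neighbourDist[j]' is the distance of the j-th of the K nearest neighbours
// from the unclassified vector x, and 'neighbourMembership[j][i]' is that
// neighbour's membership in class i (e.g. from membership function #1 or #2).
#include <vector>
#include <cmath>
#include <cstddef>

std::vector<double> fuzzyKnnMemberships(
        const std::vector<double>& neighbourDist,                     // size K
        const std::vector<std::vector<double>>& neighbourMembership,  // K x C
        double m /* weighting factor, m > 1 */) {
    const std::size_t K = neighbourDist.size();
    const std::size_t C = neighbourMembership.front().size();
    const double expo = 2.0 / (m - 1.0);

    std::vector<double> mu(C, 0.0);
    double denom = 0.0;
    for (std::size_t j = 0; j < K; ++j) {
        // Inverse-distance weight 1 / ||x - x_j||^(2/(m-1)); the small
        // constant avoids division by zero for coincident patterns.
        double w = 1.0 / std::pow(neighbourDist[j] + 1e-12, expo);
        denom += w;
        for (std::size_t i = 0; i < C; ++i)
            mu[i] += neighbourMembership[j][i] * w;
    }
    for (std::size_t i = 0; i < C; ++i)
        mu[i] /= denom;                     // Eq. (1)
    return mu;                              // classify as argmax_i mu[i]
}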

Nearest prototype classifiers (NPC)5,7
These classifiers bear a marked resemblance to the one-nearest neighbour (1-NN) classifier. Actually, the only difference is that for the nearest prototype classifier the labelled samples are a set of class prototypes, whereas in the nearest neighbour classifier we use a set of labelled samples that are not necessarily prototypical. The prototypes used for these routines are taken as the class means of the labelled sample set.

Crisp nearest prototype classifier (NPC)
The crisp NPC classifies a new pattern into the class of the nearest prototype, by calculating the distance between the new pattern and the c class prototypes and finding the prototype nearest to


the pattern. Due to space limitations the crisp NPC algorithm5 is not listed here.
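A minimal C++ sketch of the crisp nearest prototype rule, under the assumption stated above that each class prototype is the class mean of the labelled samples, is given below; the function and structure names are illustrative.

// Sketch of the crisp nearest prototype classifier: assign the new pattern
// to the class whose prototype (class mean) is closest.
#include <vector>
#include <limits>
#include <cstddef>

// Squared Euclidean distance (its minimiser is the same as for the distance).
double sqDist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}

// 'prototypes[c]' is the mean vector of class c (c = 0 .. numClasses-1).
int classifyCrispNPC(const std::vector<std::vector<double>>& prototypes,
                     const std::vector<double>& pattern) {
    int best = -1;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < prototypes.size(); ++c) {
        double d = sqDist(prototypes[c], pattern);
        if (d < bestDist) { bestDist = d; best = static_cast<int>(c); }
    }
    return best;
}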

Fuzzy nearest prototype classifier (NPC)
This is similar to the crisp nearest prototype classifier; the only difference is that here membership values are assigned to the prototypes. Depending on the nearest prototype, the new input pattern is assigned membership values in all the classes and is classified into the class of maximum membership. The fuzzy NPC algorithm5 is used.

Membership function:

$$\mu_i(x) = \frac{\mu_{ij}\,\bigl(1/\|x - z_j\|^{2/(m-1)}\bigr)}{\sum_{j=1}^{K} \bigl(1/\|x - z_j\|^{2/(m-1)}\bigr)} \qquad \ldots (2)$$

In both the K-NN and nearest prototype techniques, the fuzzy version dominates its crisp counterpart with lower error rates; this is demonstrated experimentally in the following sections.

Induction of Decision Trees8-16
Decision trees are popular for pattern recognition because the models they produce are easier to understand. Decision tree based models have a simple top-down tree structure where decisions are made at each node. These decisions generate rules for the classification of a data set. A decision tree model is built starting with a root node. Training data is partitioned to the children nodes using a splitting rule. The nodes at the bottom of the resulting tree provide the final classification.

In classification, the training data contains sample vectors that have one or several measurement variables (or features) and one variable that determines the class of the sample. A splitting rule can be of the form: if A < c then s belongs to L, otherwise to R. Here A is the selected variable, c is a constant, s is the data sample and L and R are the left and right branches of the node. Here splitting is done using one variable, and a node has two branches and thus two children. A node can also have more branches, and the splitting can be done based on several variables. The tree is constructed until the purity of the data in each leaf node is at a predefined level, or until leaf nodes contain a predefined minimum number of data samples. Each leaf node is then labelled with a class. Usually the class of the node is determined based on the majority rule: the node is labelled with the class to which the majority of the training data belong.

Key issues for any decision tree
The criteria used to build the tree: This is the first step in the process of classification. Any algorithm seeks to create a tree that works as perfectly as possible on all the data that is available, determining which variables to use and what splits to use at each point in the tree to divide the entire data available at a node. The best split for each node is searched based on a "purity" function calculated from the data. The most frequently used purity functions are entropy and the gini-index. The data portion that falls into each child node is partitioned again in an effort to maximize the purity function.
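As an illustration of such a purity calculation, the following C++ sketch evaluates the Gini index of a candidate binary split; this is one possible realization of the purity function mentioned above, not the exact criterion used in the paper.

// Gini index of a set of class labels and the weighted Gini index of a
// binary split into 'left' and 'right'; the lowest value is the purest split.
#include <vector>
#include <map>

double gini(const std::vector<int>& labels) {
    if (labels.empty()) return 0.0;
    std::map<int, int> counts;
    for (int c : labels) ++counts[c];
    double g = 1.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        g -= p * p;                           // Gini = 1 - sum_i p_i^2
    }
    return g;
}

// Weighted Gini index of a candidate split "attribute < threshold".
double splitGini(const std::vector<int>& left, const std::vector<int>& right) {
    double n = static_cast<double>(left.size() + right.size());
    return (left.size() / n) * gini(left) + (right.size() / n) * gini(right);
}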

The criteria for stopping the growth of the tree, i.e., when the branching at a given node stops: Construction of the tree can be stopped using some condition on the purity function, or a fully constructed tree can be pruned afterwards.

The criteria to prune the tree for maximum classification effectiveness, i.e., which branches of the tree should be removed: The branches which do not contribute much to classification are deleted from the full tree to reduce its size.

Binary decision trees
Binary decision trees are architectures for classification that draw decision region boundaries through constructing a binary tree. Classification of an input vector is done by traversing the tree beginning at the root node and ending at a leaf node. Each node of the tree computes an inequality based on a single variable. If the inequality is satisfied, the left child is the next node traversed, otherwise the right node is the next. Each leaf is assigned to a particular class, and input vectors are classified as the class associated with the leaf at which the trajectory ends. The popular algorithm to construct binary decision trees is the Classification and Regression Tree (CART).

Classification and regression trees and Iterative Dichotomiser 3 (ID3)10-15
The top-down induction of decision trees is a popular approach in which classification starts from a root node and proceeds to generate sub-trees until leaf nodes are created. A decision tree is a representation of a decision procedure for determining the class of a given instance. This approach uses attribute based descriptions and the learned concepts are represented


by decision trees. It is possible to categorize conjunctive and disjunctive descriptions of concepts with 'if-then' rules, which can be lifted from the trees. These rules often offer a more flexible representation than the decision trees themselves. The Classification and Regression Tree (CART) is a data exploration and prediction algorithm with the structure of a simple binary tree. Classification and regression trees are binary prediction trees with a structure which is easy to understand, interpret and use. CART splits on a single variable at each node. Iterative Dichotomiser (ID) 3 is an improved version over CART, which generates a complex tree and has, for each node, a number of branches equal to the range of the particular attribute that seems to be the best for splitting at that node. For both trees a pruning algorithm is used to prune the tree back, reducing its size while maintaining the classification accuracy. The tree building process of CART with the splitting criterion8,13 is used in this paper.

Iterative Dichotomiser (ID3)14,15
Iterative Dichotomiser is a simple and widely used symbolic algorithm for learning from a set of training instances. ID3 is the basis of several commercial rule induction systems. ID3 has been augmented with techniques for handling numerically valued features, noisy data and missing information. The ID3 algorithm creates from a given data set an efficient description of a classifier by means of a decision tree. At each step a new node is added to the decision tree by partitioning the training examples based on their value along a single most informative attribute. Each resulting partition is processed recursively, unless it contains examples of only a single category. The information gain criterion that determines the "splitting" attribute acts as a hill-climbing heuristic, which tends to minimize the size of the tree. When the data set is consistent (i.e., no contradictions between data objects), the resulting decision tree describes the data exactly. This means that if the original data set is examined via the resulting tree, the results from the tree will be exactly those that are expected. In addition, the tree can be used for the prediction of new data objects. The fundamental assumption is that the data set given to the algorithm is representative of the total data set.

The structure of an ID3 tree is basically a non-binary tree. The branches of the tree are equal in number to the range of values of the patterns for the particular attribute selected at that node. Essentially, a decision tree defines a set of paths from the root node to the leaf nodes. Which path to take is determined by the descriptors on the non-leaf nodes. The descriptor describes which branch to follow. When one reaches a leaf node, there are no more questions and the result is given. This is the output of the tree.

Training phase
The production of the decision tree from the data set is called the training phase. During this phase the original data set is examined, the best descriptor is found and then the set is divided into subsets. ID3 picks predictors and their splitting values on the basis of the gain in information that the split(s) provide. The descriptor is then made into a node of the decision tree. The same procedure is performed on the remaining subsets (dividing the set even more and creating more descriptor nodes) until all the elements in a set have the same classifier value (unless no descriptor can tell the elements apart, i.e., the original data set has contradictions).

Key concepts involved in building the tree
Two major concepts are involved within the ID3 method: (i) entropy and (ii) the decision tree.

Entropy--The entropy concept is used to find the most significant parameter in characterizing the classifier. The concept of entropy is used to order the list of descriptors with respect to the data set and the classifier. Entropy provides a definition of the most significant descriptor and is one of the major concepts within the ID3 method.
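A short C++ sketch of the entropy and information gain computation that such a descriptor selection could use is shown below; it is an illustrative reconstruction, since the exact bookkeeping of the authors' implementation is not described in the paper.

// Entropy of a label set and information gain of partitioning the examples
// by a candidate attribute (one partition per attribute value).
#include <vector>
#include <map>
#include <cmath>

double entropy(const std::vector<int>& labels) {
    std::map<int, int> counts;
    for (int c : labels) ++counts[c];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        h -= p * std::log2(p);                // H = -sum_i p_i log2 p_i
    }
    return h;
}

// 'partitions' holds the class labels of the examples falling into each
// branch; the attribute with the largest gain is selected for the node.
double informationGain(const std::vector<int>& parentLabels,
                       const std::vector<std::vector<int>>& partitions) {
    double gain = entropy(parentLabels);
    for (const auto& part : partitions)
        gain -= (static_cast<double>(part.size()) / parentLabels.size())
                * entropy(part);
    return gain;
}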

Decision tree--The decision tree built using the entropy concept to select the best descriptor will have a number of branches equal to the range of the values of the descriptor selected at that particular node. The basic structure of ID3 is iterative. At every new node created by the tree a descriptor is selected and again the branches are formed, so the process is iterative and builds a large tree; a node is labelled as a leaf when all the patterns arriving at that node come under the same category. Here the tree stops growing.

Reduced error pruning9,13
Obtaining the optimum tree size is important for good generalization. If the tree is very large, it tends to memorize the training data and thus generalizes poorly. The size of the tree depends on the number of training data and the degree of generalization required. The reduced error pruning


method is a simple and direct method. Assume a separate test set is used, each case of which is classified by the original tree (T). For every non-leaf sub-tree S of T, the change in misclassifications over the test set that would occur if S were replaced by the best possible leaf is examined. If the new tree would give an equal or fewer number of errors and S contains no sub-tree with the same property, S is replaced by the leaf. The process continues until any further replacement would increase the number of errors over the test set. Its rationale is clear, since the final tree is the most accurate sub-tree of the original tree with respect to the test set and is the smallest tree with that accuracy. The disadvantage of this method is that parts of the original tree corresponding to rare special cases not represented in the test set may be excised. This method is used to prune both the CART and ID3 trees to get a right-sized tree.
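The following C++ sketch outlines one bottom-up pass of reduced error pruning consistent with the description above; the binary Node layout and the single-pass organization are assumptions made for brevity, not the authors' data structures.

// A subtree is replaced by its best possible leaf whenever doing so does
// not increase the number of errors on a held-aside test set.
#include <vector>

struct Node {
    int attribute = -1;          // attribute tested at this node
    double threshold = 0.0;      // go left if x[attribute] <= threshold
    int majorityClass = 0;       // class of the "best possible leaf" here
    Node* left = nullptr;
    Node* right = nullptr;
    bool isLeaf() const { return left == nullptr && right == nullptr; }
};

struct Case { std::vector<double> x; int label; };

int classify(const Node* n, const std::vector<double>& x) {
    while (!n->isLeaf())
        n = (x[n->attribute] <= n->threshold) ? n->left : n->right;
    return n->majorityClass;
}

// Prune bottom-up using the test cases that reach this node.
void pruneSubtree(Node* n, const std::vector<Case>& reaching) {
    if (n->isLeaf()) return;
    std::vector<Case> goLeft, goRight;
    for (const Case& c : reaching)
        (c.x[n->attribute] <= n->threshold ? goLeft : goRight).push_back(c);
    pruneSubtree(n->left, goLeft);
    pruneSubtree(n->right, goRight);

    int keepErrors = 0, leafErrors = 0;
    for (const Case& c : reaching) {
        if (classify(n, c.x) != c.label) ++keepErrors;   // keep the subtree
        if (n->majorityClass != c.label) ++leafErrors;   // replace by a leaf
    }
    if (leafErrors <= keepErrors)            // equal or fewer errors: replace
        n->left = n->right = nullptr;        // (children leaked here for brevity)
}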

ID3 has been improved to handle high-cardinality predictors. A high-cardinality predictor is one that has many different possible values and hence many different ways of performing a split. With the decision tree structure used by CART, the metric for choosing splits might also erroneously allow high-cardinality splits to occur. To avoid this, a pruning process is used to test the tree against held-aside data, so it is likely that an erroneous split would be eliminated during this phase.

Data Representation Schemes and Rule Extraction16-18
Data representation can also be called data preprocessing. Data preparation is done to minimize the influence of absolute scale. All inputs to a network should be scaled and normalized so that they correspond to roughly the same range of values. The goal of data preprocessing is to reduce the non-linearity when its character is known and let the network resolve the hidden non-linearities that are not understood.

Data scaling
Data scaling is a type of analog preprocessing method. It has the advantage of mapping the desired range of a variable (between its minimum and maximum values) to the complete working range of the network input. This linear scaling follows the procedure below:

$y = mx + c$

where $m$ is the slope and $c$ is the y-intercept. If the values are to be scaled into the range 0.1-0.9, then $y = 0.1$ when $x = x_{min}$ and $y = 0.9$ when $x = x_{max}$. For this assumed range,

$m = 0.8/\Delta$ and $c = 0.9 - 0.8\,x_{max}/\Delta$,

so that $y = (0.8/\Delta)\,x + (0.9 - 0.8\,x_{max}/\Delta)$, where $\Delta = x_{max} - x_{min}$.
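A small C++ helper implementing this scaling into the range 0.1-0.9 might look as follows; it is illustrative, the function name is assumed, and xmax > xmin is required.

// Linear scaling of a variable from [xmin, xmax] to [0.1, 0.9].
#include <vector>

std::vector<double> scaleTo01_09(const std::vector<double>& x,
                                 double xmin, double xmax) {
    const double delta = xmax - xmin;                 // delta = xmax - xmin
    const double m = 0.8 / delta;                     // slope
    const double c = 0.9 - 0.8 * xmax / delta;        // intercept
    std::vector<double> y;
    for (double v : x) y.push_back(m * v + c);        // y = m*x + c
    return y;
}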

Thermometer coding
Thermometer coding is used to code the data in digital form, i.e., binary form, in which the number of bits used is one less than the range of the pattern values, and each value is coded by setting that many bits to one from the right-hand side of the complete code vector.
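An illustrative C++ sketch of thermometer coding, assuming integer pattern values in the range 0 to maxValue, is given below.

// Thermometer code: maxValue bits (one less than the number of possible
// values), with the v right-most bits set to one, e.g. maxValue = 4, v = 2
// gives 0 0 1 1.
#include <vector>

std::vector<int> thermometerCode(int v, int maxValue) {
    std::vector<int> bits(maxValue, 0);          // most significant bit first
    for (int i = 0; i < v; ++i)
        bits[maxValue - 1 - i] = 1;              // fill ones from the right
    return bits;
}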

Rule generation procedure for K-NN method
Rule extraction from the K-Nearest Neighbour method is simple and generates rules which can then be tested for validation against the original data set. Here, depending on the nearest neighbours found for a pattern and the value of each attribute for those nearest neighbours, the rules are given. For a particular class, the range of each attribute is computed using the patterns which are classified as that particular class.
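A C++ sketch of this range computation is given below; the containers and names are illustrative assumptions.

// For each class, collect the minimum and maximum of every attribute over
// the patterns assigned to that class; the (min, max) pairs become the
// "lower <= Ai <= upper" conditions of an IF-THEN rule for that class.
#include <vector>
#include <map>
#include <algorithm>
#include <cstddef>

struct ClassRange { std::vector<double> lo, hi; };

std::map<int, ClassRange> attributeRangesPerClass(
        const std::vector<std::vector<double>>& patterns,
        const std::vector<int>& assignedClass) {
    std::map<int, ClassRange> ranges;
    for (std::size_t p = 0; p < patterns.size(); ++p) {
        const std::vector<double>& x = patterns[p];
        ClassRange& r = ranges[assignedClass[p]];
        if (r.lo.empty()) { r.lo = x; r.hi = x; continue; }
        for (std::size_t a = 0; a < x.size(); ++a) {
            r.lo[a] = std::min(r.lo[a], x[a]);
            r.hi[a] = std::max(r.hi[a], x[a]);
        }
    }
    return ranges;   // rule: IF r.lo[i] <= Ai <= r.hi[i] for all i THEN class
}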

Rule generation from decision trees
Many inductive knowledge acquisition algorithms generate classifiers in the form of decision trees. Transforming such trees to small sets of production rules is a common formalism for expressing knowledge in expert systems. The method makes use of the training set of cases from which the decision tree was generated, first to generalize and assess the reliability of individual rules extracted from the tree, and subsequently to refine the collection of rules as a whole. The decision tree is expressed as a succinct collection of production rules of the form: IF left-hand side THEN class.

Reasons for transforming the decision trees to production rules
• Production rules are a widely used and well-understood vehicle for representing knowledge in expert systems. A decision tree can be difficult for a human expert to understand and modify, whereas the extreme modularity of production rules makes them relatively transparent.


• Transforming the decision trees to production rules improves classification performance by eliminating tests in the decision tree attributable to peculiarities of the training set, and makes it possible to combine different decision trees for the same task.

Extracting individual rules
Classification of a new pattern using the decision tree is effected by following a path from the root of the tree to one of the leaves. This path from the root of the tree to a leaf establishes conditions, in terms of specified outcomes for the tests along the path, that must be satisfied by any case classified by that leaf. Every leaf of a decision tree thus corresponds to a primitive production rule of the form: IF $X_1 \wedge X_2 \wedge X_3 \wedge \ldots \wedge X_n$ THEN class C, where the $X_i$ are conditions and C is the class of the leaf.
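An illustrative C++ sketch of this path-following rule extraction for a binary tree is given below; the TreeNode layout is an assumption, and attribute tests are written as "Ai <= threshold".

// Collect one IF ... THEN ... production rule per leaf by accumulating the
// tests met along each root-to-leaf path.
#include <string>
#include <vector>
#include <sstream>
#include <cstddef>

struct TreeNode {
    int attribute = -1;          // -1 marks a leaf
    double threshold = 0.0;      // test: A[attribute] <= threshold ?
    int leafClass = 0;
    TreeNode* left = nullptr;    // branch taken when the test is satisfied
    TreeNode* right = nullptr;
};

void extractRules(const TreeNode* n, std::vector<std::string> conditions,
                  std::vector<std::string>& rules) {
    if (n->attribute == -1) {                       // reached a leaf
        std::ostringstream rule;
        rule << "IF ";
        for (std::size_t i = 0; i < conditions.size(); ++i)
            rule << (i ? " AND " : "") << conditions[i];
        rule << " THEN class " << n->leafClass;
        rules.push_back(rule.str());
        return;
    }
    std::ostringstream le, gt;
    le << "A" << n->attribute << " <= " << n->threshold;
    gt << "A" << n->attribute << " > "  << n->threshold;

    std::vector<std::string> leftCond = conditions, rightCond = conditions;
    leftCond.push_back(le.str());
    rightCond.push_back(gt.str());
    extractRules(n->left, leftCond, rules);
    extractRules(n->right, rightCond, rules);
}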

Simulation Results
The results obtained in the testing, pruning and rule generation stages for the three applications considered are presented.

Dataset descriptions19: The data are collected from the UCI machine learning database site.

Title of database: Optical Recognition of Handwritten Digits Data19
Number of instances: Training: 3823; Testing: 1797
Number of attributes: 64 input + 1 class attribute

Title of database: Landsat MSS Imagery Data19
Number of examples: Training set: 4435; Test set: 2000
Number of classes: 6
Number of attributes: 36 (= 4 spectral bands x 9 pixels in neighbourhood)

Title of database: Letter Image Recognition Data19
Number of instances: 20,000
Number of attributes: 17 (letter category and 16 numeric features)

K-Nearest Neighbour method results
The crisp and fuzzy versions of the K-Nearest Neighbour method and the Nearest Prototype classifiers have been tested using two different applications: optical recognition of handwritten digits data and letter image recognition data.

Application 1 (Optical Recognition of Handwritten Digits Data)--The optical recognition of handwritten digits data contains 5620 samples with 10 classes. There are in total 64 attributes defining each sample. In order to evaluate the optimum performance of the classifier models, 35% of the data samples are used for training.

Results obtained for crisp K-NN
Table 1 shows the variation in overall percentage of accuracy as the number of training samples increases. For 30% of the training samples, the classification accuracy is 97.66% and the testing time is 1987 s.

Table 1--Train/test results
Train/test   Overall % accuracy   Testing time (s)
200/5420     89.50    2764
500/5120     94.92    2673
1000/4620    96.47    2564
2000/3620    97.66    1987
3000/2620    98.02    1856
4000/1620    99.31    525
5000/620     100.0    272

Table 2--Results obtained for crisp K-NN for different input data (overall % accuracy)
K    Raw i/p   Analog i/p   Digital i/p
1    97.67   97.68   96.20
2    97.68   97.69   96.85
3    97.91   97.91   96.85
4    97.92   97.88   96.88
5    97.68   97.89   96.91
6    97.80   97.82   96.98
7    97.70   97.81   96.81
8    97.60   97.66   96.65
9    97.40   97.78   96.75
10   97.59   97.66   96.57
11   97.41   97.88   96.65
12   97.46   97.69   96.55

Fig. 3--Variation of number of training samples with % accuracy



Fig. 3 shows the variation in the accuracy as the number of training samples increases. It can be observed that, in the case of K-NN, testing new patterns is not possible without training, because in K-NN training is nothing but loading the patterns in order to find the distance between the new patterns and the loaded patterns for classification. So, at least a minimum number of samples are needed for training.

Table 2 shows the overall classification accuracy obtained for crisp K-NN using raw input data, analog input data and digital input data, for different values of K ranging from 1 to 12. Here analog data is the scaled data and digital data is the raw data coded using thermometer coding.

Results obtained for fuzzy K-NN with membership function #1
Table 3 shows the classification accuracy for different types of input data and weighting factor values, viz., m=2, 1.5 and 3, for different values of K (nearest neighbours) ranging from 1-12.

Results obtained using fuzzy membership function #2
Table 4 shows the variation in overall percentage of classification accuracy using membership function #2 for different types of input data and weighting factor values 'm' as the K value ranges from 1-12.

Table 3--Results obtained with membership function #1 for different 'm' values (overall % accuracy)
K    Raw data (m=2, m=1.5, m=3)     Analog data (m=2, m=1.5, m=3)    Digital data (m=2, m=1.5, m=3)
1    98.34  87.70  87.79    98.36  87.71  87.79    97.88  85.98  86.23
2    98.41  88.10  87.98    98.36  88.07  87.79    98.87  86.23  86.13
3    98.10  87.71  88.10    98.10  87.70  88.07    97.66  86.09  85.98
4    98.27  87.70  87.98    98.28  87.90  87.98    97.66  85.98  86.13
5    98.29  87.75  87.79    98.28  87.75  87.98    98.22  86.45  85.78
6    98.40  87.88  87.96    98.29  87.71  87.79    98.01  86.23  86.09
7    98.30  87.96  87.92    98.40  87.96  87.96    97.99  86.23  85.98
8    98.23  88.07  87.78    98.35  88.10  87.92    97.87  86.13  86.13
9    98.23  88.10  87.65    98.44  88.10  87.65    97.40  86.13  85.78
10   98.41  88.10  87.79    98.42  88.07  87.65    97.54  85.78  86.09
11   98.40  87.71  87.65    97.25  87.96  87.79    97.41  85.98  86.23
12   98.35  87.70  87.23    97.95  87.71  87.23    97.46  85.78  86.09

Table 4--Results obtained with membership function #2 for different 'm' values (overall % accuracy)
K    Raw data (m=2, m=1.5, m=3)     Analog data (m=2, m=1.5, m=3)    Digital data (m=2, m=1.5, m=3)
1    98.10  87.65  87.98    98.23  87.65  87.79    97.88  85.98  86.09
2    98.23  87.79  87.79    98.23  88.07  87.79    97.87  86.09  86.09
3    97.89  88.10  87.96    98.10  88.07  88.07    97.66  86.13  85.98
4    98.10  87.78  88.07    98.10  87.92  88.10    97.66  86.23  86.09
5    98.23  87.96  87.79    98.23  87.92  87.96    98.22  86.23  85.65
6    98.10  88.07  87.65    98.23  87.96  87.79    98.01  86.09  86.13
7    98.23  87.92  87.96    98.34  87.96  87.92    97.99  86.23  86.23
8    97.89  88.10  87.79    98.02  88.10  87.96    97.87  86.13  86.09
9    98.10  88.07  87.65    98.34  88.10  87.79    97.40  86.09  85.78
10   98.34  88.10  87.79    98.02  88.10  87.65    97.54  86.23  86.13
11   98.23  88.07  87.65    97.25  87.96  87.79    97.23  86.09  86.23
12   98.34  87.79  87.34    98.40  87.71  87.34    97.34  85.78  86.09


Fig. 4--Comparison between crisp and fuzzy methods for raw input

Fig. 5--Comparison between crisp and fuzzy methods for digital input

Fig. 6--Comparison between crisp and fuzzy methods for analog input

Figs 4-6 show the variation in overall classification accuracy for the crisp, fuzzy membership #1 and fuzzy membership #2 methods as K changes from 1-12 for raw, digital and analog input respectively. Fig. 7 shows the variation in accuracy with 'm'.

Results obtained using nearest prototype classifier
Table 5 shows the results obtained for the nearest prototype classifier, where the prototypes are calculated considering either the complete vectors or the selected features of the vectors as the input patterns. The selected features for the optical recognition of handwritten digits data are: 2-8, 10-24, 26-39, 41-63.

Application 2 (Letter Image Recognition Data)

Results obtained using crisp K-NN
Table 6 shows the results obtained using the crisp K-NN method for different types of input data as the K value ranges from 1-12.

Table 6--Results obtained using crisp K-NN for different input data (overall % accuracy)
K    Raw i/p   Analog i/p   Digital i/p
1    93.34   93.12   93.23
2    93.23   93.20   93.01
3    93.33   93.27   93.34
4    93.34   93.43   93.23
5    92.96   92.83   93.23
6    92.90   92.90   92.96
7    92.38   92.46   92.89
8    92.34   92.31   92.25
9    91.83   91.83   91.96
10   91.96   91.84   91.83
11   91.83   91.40   91.45
12   91.83   91.36   91.45

Fig. 7--Variation in % accuracy with weighting factor 'm'

Table 5--Variation in classification accuracy with complete vector and selected features of the vector
Type of input data   Crisp % accuracy (Com*, Sel**)   Fuzzy #1 % accuracy (Com*, Sel**)   Fuzzy #2 % accuracy (Com*, Sel**)
Raw       88.89  87.07    90.13  88.10    90.05  88.07
Analog    88.34  87.13    90.13  88.10    89.98  88.10
Digital   87.86  86.65    88.23  87.34    88.23  87.23
Com*--complete vector; Sel**--selected features of vector



Results obtained for fuzzy K-NN using fuzzy membership function #1
Table 7 shows the variation in overall percentage of classification accuracy using fuzzy membership function #1 with weighting factor m=2, 1.5 and 3 for different types of input data.

Results obtained using fuzzy membership function #2
Table 8 shows the variation in overall percentage of classification accuracy using fuzzy membership function #2 with weighting factor m=2, 1.5 and 3 for different types of input data.

Table 7--Results obtained with membership function #1 for different 'm' values (overall % accuracy)
K    Raw data (m=2, m=1.5, m=3)     Analog data (m=2, m=1.5, m=3)    Digital data (m=2, m=1.5, m=3)
1    94.12  87.65  88.05    94.12  87.65  88.13    94.02  85.98  87.34
2    94.02  87.79  89.27    94.12  88.07  89.34    93.89  86.09  87.23
3    93.89  88.10  90.05    94.02  88.07  90.13    93.96  86.13  88.13
4    94.02  87.78  90.89    93.96  87.92  90.89    93.89  86.23  89.34
5    93.96  87.96  91.27    94.02  87.92  91.27    93.34  86.23  89.23
6    93.89  88.07  91.83    93.89  87.96  91.83    93.23  86.09  88.23
7    93.46  87.92  90.45    93.96  87.96  90.45    92.96  86.23  89.27
8    93.34  88.10  90.27    93.89  88.10  90.27    92.90  86.13  88.13
9    93.34  88.07  90.05    93.46  88.10  90.05    91.83  86.09  87.23
10   93.23  88.10  89.27    93.34  88.10  89.27    91.83  86.23  87.23
11   92.96  88.07  89.87    93.25  87.96  89.34    92.90  86.09  87.34
12   92.90  87.79  88.05    92.96  87.71  89.10    91.83  85.78  87.01

Table 8--Results obtained with membership function #2 for different 'm' values (overall % accuracy)
K    Raw data (m=2, m=1.5, m=3)     Analog data (m=2, m=1.5, m=3)    Digital data (m=2, m=1.5, m=3)
1    94.02  87.23  89.27    94.12  87.89  88.05    93.89  87.13  87.23
2    94.12  87.89  88.13    94.02  88.99  89.27    94.02  87.23  87.13
3    93.89  88.99  90.05    93.96  88.90  90.13    93.34  88.13  88.05
4    94.02  90.13  90.13    94.12  90.27  90.05    93.23  88.34  88.13
5    93.89  91.13  91.27    94.02  91.59  90.45    92.90  88.23  89.34
6    94.12  91.59  91.34    93.89  90.65  91.83    92.96  88.13  88.23
7    93.46  90.05  91.83    93.34  90.05  90.45    92.90  89.27  88.05
8    93.34  89.90  90.05    93.89  90.33  90.27    92.96  88.23  88.05
9    93.23  89.27  90.27    93.46  90.05  90.05    91.83  88.13  87.13
10   93.34  89.34  89.34    93.23  89.27  89.27    91.83  87.13  87.23
11   92.90  87.34  90.05    93.23  88.23  89.87    91.78  87.13  87.34
12   92.90  88.23  88.05    92.96  87.89  89.10    91.83  87.05  87.98

Table 9--Variation in classification accuracy with complete vector and selected features of the vector
Type of input data   Crisp % accuracy (Com*, Sel**)   Fuzzy #1 % accuracy (Com*, Sel**)   Fuzzy #2 % accuracy (Com*, Sel**)
Raw       86.95  87.01    87.34  87.23    87.23  87.13
Analog    86.95  87.01    87.23  87.34    87.23  87.34
Digital   85.78  86.23    86.89  86.79    86.89  86.79
Com*--complete vector; Sel**--selected features of vector


Fig. 8--Comparison between crisp and fuzzy methods for raw input

"- w ~----------------~ o >-. ~ +--=-::,,:~.......::::::;>......:;;;::::=---j ~ u ___ ..... _ .... """" ---........ - - . Crisp

e 92+---------~-~~~~ ~ § 90 +-------------------1

--Fuzzy # 1

(; « 88 +-,......,.-,-...,......,--,-..,...-.-...,.....,......,--1 ----- Fuzz y # 2

K

Fig. 9--Comparison between crisp and fuzzy methods fo r analog input

Fig. 10--Comparison between crisp and fuzzy methods for digital input

Figs 8-10 show the variation in overall % accuracy for the different methods with raw, analog and digital data as the K value ranges from 1-12. Fig. 11 shows the variation in accuracy with 'm'.

Results obtained using nearest prototype classifier
Table 9 shows the results obtained for the nearest prototype classifier, where the prototypes are calculated considering either the complete vectors or the selected features of the vectors as the input patterns. The selected features are 6-13.

Fig. 11--Variation in % accuracy with weighting factor 'm'

Table 10--Train/test results
Train/test    Overall % accuracy   Testing time (s)
3000/17000    86.45    5786
5000/15000    90.84    5345
7000/13000    92.54    4897
9000/11000    94.14    4132
11000/9000    95.89    3546
13000/7000    98.76    3189
15000/5000    99.89    2876
17000/3000    100.0    2487

Table 11--Sample rules for optical recognition of handwritten digits data set

If 0<=A2<=2 and 0<=A7<=3 and A8=0 and A9=0 and 0<=A10<=11 and 2<=A12<=16 and 2<=A13<=16 and 0<=A15<=10 and A16=0 and A17=0 and 0<=A18<=12 and 0<=A23<=12 and A24=0 and 1<=A31<=12 and 0<=A34<=12 and 0<=A36<=14 and 0<=A42<=11 and 6<=A43<=16 and A48=0 and A49=0 and 3<=A54<=16 and A56=0 and 0<=A58<=3 and 3<=A60<=16 and 3<=A61<=16 and 0<=A63<=7 and A64=0 Then digit '0'

If 0<=A2<=7 and 0<=A7<=13 and A8=0 and A9=0 and 4<=A12<=16 and A16=0 and A17=0 and 0<=A23<=12 and A24=0 and 0<=A26<=12 and 0<=A31<=9 and 0<=A34<=5 and 0<=A42<=9 and 0<=A47<=11 and 0<=A49<=5 and 4<=A52<=16 and 0<=A56<=8 and 0<=A58<=10 and 0<=A64<=14 Then digit '2'

If 0<=A2<=2 and 0<=A3<=14 and 0<=A7<=11 and A8=0 and A9=0 and 0<=A10<=10 and A16=0 and 0<=A17<=5 and 0<=A23<=12 and A24=0 and 2<=A29<=16 and 0<=A31<=8 and 0<=A34<=12 and 0<=A39<=10 and 0<=A42<=9 and 0<=A47<=13 and A48=0 and A49=0 and 0<=A50<=10 and 0<=A56<=8 and 0<=A58<=2 Then digit '1'

If 0<=A2<=8 and 3<=A4<=16 and A8=0 and A9=0 and 0<=A16<=7 and A17=0 and 0<=A23<=10 and A24=0 and 0<=A26<=8 and 0<=A31<=7 and 0<=A34<=7 and 0<=A39<=14 and 0<=A42<=5 and 0<=A43<=9 and 0<=A48<=2 and 0<=A56<=3 and 0<=A58<=10 and 3<=A60<=16 and 0<=A64<=2 Then digit '3'


Table 12--Sample rules for letter image recognition data set

Rule 1: If 1<=A1<=9 and 2<=A3<=6 and 0<=A4<=6 and 0<=A5<=7 and 0<=A7<=6 and 1<=A8<=8 and 0<=A9<=6 and 3<=A10<=10 and 0<=A11<=8 and 1<=A13<=9 and 2<=A14<=8 and 0<=A15<=8 and 1<=A16<=8 Then Letter A

Rule 2: If 0<=A1<=11 and 1<=A3<=8 and 2<=A4<=6 and 0<=A5<=8 and 6<=A6<=10 and 4<=A7<=9 and 3<=A8<=9 and 3<=A9<=7 and 5<=A10<=11 and 3<=A11<=8 and 5<=A12<=9 and 2<=A13<=8 and 4<=A14<=11 and 3<=A15<=12 and 2<=A16<=10 Then Letter B

Rule 3: If 1<=A1<=10 and 1<=A3<=9 and 0<=A4<=9 and 0<=A5<=8 and 2<=A6<=9 and 5<=A7<=11 and 3<=A8<=9 and 3<=A9<=8 and 4<=A10<=11 and 4<=A11<=8 and 6<=A12<=13 and 2<=A13<=8 and 6<=A14<=10 and 2<=A15<=8 and 5<=A16<=12 Then Letter C

Rule 4: If 0<=A1<=10 and 1<=A3<=9 and 0<=A4<=9 and 0<=A5<=11 and 5<=A6<=13 and 2<=A7<=10 and 2<=A8<=11 and 4<=A9<=11 and 4<=A10<=12 and 1<=A11<=7 and 2<=A12<=9 and 2<=A13<=8 and 4<=A14<=12 and 1<=A15<=12 and 3<=A16<=10 Then Letter D

Table 13--Train/test results
Train/test   Overall % accuracy   Number of nodes formed   Training time (s)   Testing time (s)   Number of nodes pruned
1000/5435    70.01   477    1020   60   65
2000/4435    93.33   1158   1365   54   46
3000/3435    94.10   1300   1476   48   52
4000/2435    94.40   1413   1527   43   38
5000/1435    95.60   1489   1610   32   43
6000/435     95.89   1502   1697   31   48

Table 14--Train/test results
Train/test    Overall % accuracy   Number of nodes formed   Training time (s)   Testing time (s)   Number of nodes pruned
1000/19000    98.10    607    398    71   37
3000/17000    96.08    1517   599    65   29
5000/15000    92.37    2209   787    57   28
8000/12000    91.21    2986   1089   47   34
10000/10000   ~n.85    3469   1270   38   51
11000/9000    86.85    3791   1401   33   40

Table 15--Sample rules obtained using CART for Landsat imagery data

If A16>73 && A17>97 && A18<=115 && A19>90 && A20<=83 && A21>101 && A22>112 && A23>92 && A24<=757 Then grey soil

If A16<=73 && A17<=69 && A18>94 && A19>108 Then class 2

If A16>73 && A17>97 && A18>115 && A19>98 && A20>83 && A21>113 && A22>123 Then class 3

Table 16--Sample rules obtained using CART for letter recognition data

If A1>1 && A2<=10 && A3<=4 && A4<=4 && A5>1 && A6>7 && A7<=7 && A8<=5 && A9<=3 Then the letter is A

If A1>1 && A2<=10 && A3>4 && A4<=7 && A5<=5 && A6<=7 && A7>9 && A8>5 && A9>5 && A10<=10 && A11>9 && A12>3 Then the letter is P

If A1>1 && A2>10 && A3<=6 && A4>7 && A5>3 && A6>7 && A7>8 && A8>5 && A9>5 && A10>6 Then the letter is B

If A1>1 && A2<=10 && A3>4 && A4<=7 && A5<=5 && A6<=7 && A7<=9 && A8<=5 && A9<=7 && A10>9 && A11>10 && A12>4 && A13<=8 Then the letter is M

If A1<=1 && A2<=1 && A3>1 && A4>1 Then the letter is J

If A1>1 && A2<=10 && A3>4 && A4<=7 && A5<=5 && A6>7 && A7>7 && A8<=5 && A9<=7 && A10<=9 && A11<=10 && A12<=8 && A13>4 && A14>10 && A15<=3 Then the letter is W


Table 10 shows the variation in accuracy with the change in training and testing samples, keeping the K value as 3, i.e., 3 nearest neighbours.

Rules obtained using K-NN method
The sample rules extracted for the optical handwritten digits data and the letter image recognition data are listed in Tables 11 and 12 respectively.

Results obtained using classification and regression trees
The CART is tested with the Landsat imagery and letter image recognition data collected from the UCI machine learning site. The training and pruning phase results for these benchmark problems are presented in Tables 13 and 14.

Figs 12 and 13 show the variation in classification accuracy with the change in the number of training samples for the Landsat and letter recognition data respectively.

The sample rules extracted for the two benchmark data sets are listed in Tables 15 and 16.

Results obtained using ID3
Table 17 shows the variation in the overall percentage of accuracy with the number of training samples, along with the number of nodes formed and the number of nodes pruned after the pruning process. The application used is the letter image recognition data.

Sample rules extracted from ID3
The sample rules extracted from the ID3 algorithm for the letter recognition data are listed in Table 18.

Fig. 12--Variation of % accuracy with training samples

Fig. 13--Variation of % accuracy with training samples

Fig. 14--Comparison of K-NN, CART and ID3 methods

Table 17--Train/test results
Train/test    Overall % accuracy   Training time (s)   Testing time (s)   Number of nodes formed   Number of nodes pruned
3000/17000    75.01   42    272   1532   89
5000/15000    81.93   76    267   3124   109
7000/13000    90.28   89    261   3671   116
9000/11000    92.90   97    251   4520   119
11000/9000    94.41   113   247   5526   137
13000/7000    85.75   123   240   6527   142

Table 18--Sample rules extracted from ID3

Rule 1: If A13=3 && A15=2 && A11=12 Then Letter V
Rule 2: If A13=1 && A12=4 && A9=5 && A4=4 Then Letter I
Rule 3: If A13=0 && A10=9 && A7=7 Then Letter Z
Rule 4: If A13=2 && A11=6 && A15=2 && A2=1 && A1=1 Then Letter O
Rule 5: If A13=2 && A1=2 && A9=0 Then Letter A
Rule 6: If A13=4 && A15=6 && A16=7 && A4=6 Then Letter B
Rule 7: If A13=2 && A11=7 && A10=9 && A12=10 Then Letter G
Rule 8: If A13=3 && A15=5 && A1=10 && A1=7 Then Letter S


After comparing all the classifier methods used for pattern classification, it is noted that the ID3 algorithm works well and its training time is also less. For the K-NN method the training time is zero and the testing time is more. The fuzzy version of K-NN performs well compared to the crisp version. Fig. 14 shows the performance comparison of all methods.

Conclusions
This paper deals mainly with the K-Nearest Neighbour method and decision tree approaches for pattern classification problems. The patterns belonging to different classes are classified and finally presented in the form of rules. The fuzzy logic concept is added to improve the pattern classification task of the K-NN method. Three applications have been considered for simulation. Crisp and fuzzy versions of the K-NN method and crisp and fuzzy versions of the Nearest Prototype method are the four different approaches used. The two decision tree approaches considered are Classification and Regression Trees (CART) and Iterative Dichotomiser (ID) 3.

The fuzzy K-NN method is proved to be effective when compared to the crisp version. The training time for the K-NN method is zero. Rules have been extracted from the method depending on the nearest neighbours and validated against the original data. The CART method performs well for large data sets also, and the number of nodes formed is reduced after the pruning. The rules have been extracted from the tree in the form of If-Then rules as the patterns traverse through the tree and reach the leaf nodes upon classification. The ID3 algorithm performs better compared to CART. ID3 builds a complex tree, and its classification time is less compared to CART. The pruning algorithm used is the same for ID3 and CART. Rules have been extracted from ID3 and checked for validation against the original data. Hence, in this paper, different patterns belonging to diversified classes are classified using machine learning and statistical methods and then the rules are generated. An in-depth analysis of the database is done with the variation of the various constant parameters, the number of nodes and the total training samples.

References
1 Kennedy R L, Solving data mining problems through pattern recognition (PHI, India), 1997.
2 Fu Yongjian, Data mining: tasks, techniques and applications, IEEE Potentials, (1997) 18-20.
3 Fayyad U M, Data mining and knowledge discovery: making sense out of data, IEEE Expert, (1996) 20-25.
4 Agrawal R, Imielinski T & Swami A, IEEE Trans Knowledge Data Eng, 5(6) (1993) 914-925.
5 Keller J M, Gray M R & Givens J A, IEEE Trans Systems Man Cybernet, 15(4) (1985) 580-585.
6 Short R D, IEEE Trans Inform Theory, 27(5) (1981) 622-627.
7 Kuncheva L I & Bezdek J C, IEEE Trans Systems Man Cybernet Part C: Appl Rev, 28(1) (1998) 160-169.
8 Quinlan J R, Machine Learning, 1 (1986) 81-106.
9 Quinlan J R, Int J Man-Machine Stud, 27 (1987) 221-234.
10 Crawford S L, Int J Man-Machine Stud, 31 (1989) 197-217.
11 Gelfand S B, Ravishankar C S & Delp E J, IEEE Trans Pattern Anal Mach Intell, 13(2) (1991) 163-174.
12 Esposito F, Malerba D & Semeraro G, IEEE Trans Pattern Anal Mach Intell, 19(5) (1997) 476-491.
13 Chou P A, IEEE Trans Pattern Anal Mach Intell, 13(4) (1991) 340-344.
14 Colin A, Dr. Dobb's Journal, (1996) 107-124.
15 Monson L, Dr. Dobb's Journal, (1997) 117-131.
16 Quinlan J R, Knowledge Acquisition, (1990) 304-307.
17 Lu H, Setiono R & Liu H, Proc 21st VLDB Conf, Zurich, Switzerland, 1995, pp 478-489.
18 Tsoukalas L H & Uhrig R E, Fuzzy and neural approaches in engineering (John Wiley & Sons Inc., Singapore/New York), 1997.
19 UCI Repository of Machine Learning Databases [Machine Readable Database Repository], ftp://ftp.ics.uci.edu/pub/machine-learning-databases.