

Geometric Comparison of Classifications and Rule Sets*

Trevor J. Monk, R. Scott Mitchell, Lloyd A. Smith and Geoffrey Holmes

Department of Computer Science, University of Waikato

Hamilton, New Zealand
tmonk@cs.waikato.ac.nz, rsm1@cs.waikato.ac.nz, las@waikato.ac.nz, geoff@waikato.ac.nz

Abstract

We present a technique for evaluating classifications by geometric comparison of rule sets. Rules are represented as objects in an n-dimensional hyperspace. The similarity of classes is computed from the overlap of the geometric class descriptions. The system produces a correlation matrix that indicates the degree of similarity between each pair of classes. The technique can be applied to classifications generated by different algorithms, with different numbers of classes and different attribute sets. Experimental results from a case study in a medical domain are included.

Machine Learning, Classification, Rules, Geometric Comparison

1. Introduction

Inductive learning algorithms fall into two broad categories based on the learning strategy that is used. In supervised learning (learning from examples) the system is provided with examples, each of which belongs to one of a finite number of classes. The task of the learning algorithm is to induce descriptions of each class that will correctly classify both the training examples and the unseen test cases. Unsupervised learning, on the other hand, does not require pre-classified examples. An unsupervised algorithm will attempt to discover its own classes in the examples by clustering the data. This learning mechanism is often referred to as 'learning by observation and discovery.'

One of the most important criteria for evaluating a learning scheme is the quality of the class descriptions it produces. In general, many descriptions can be found that cover the examples equally well, but most perform badly on unseen cases. Techniques are required for evaluating the quality of classifications, either with respect to a classification or simply relative to each other.

This paper presents a new method for comparing classifications, using a geometric representation of class descriptions. The similarity of classes is determined from their overlap

* This project was funded by the New Zealand Foundation for Research in Science and Technology.

in an n-dimensional hyperspace. The technique has a number of advantages over existing statistical or instance-based methods of comparison:

- Descriptions are produced that indicate how two classifications are different, rather than simply how much they differ.

- The algorithm bases its evaluation on descriptions of the classifications (expressed as production rules), not on the instances in the training set. This approach will tend to smooth out irregular or anomalous data that might otherwise give misleading results.

- Classifications using differing numbers or groups of attributes can be compared, provided there is some overlap between the attribute sets.

- The technique will work with any clustering scheme whose output can be expressed as a set of production rules.

In this study, the geometric comparison technique is used to evaluate the performance of a clustering algorithm (AUTOCLASS) in a medical domain. Two other techniques are also used to compare the automatically generated classifications against a clinical classification produced by experts.

KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 395

From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.


1.1. Comparing Classifications and Rule Sets

Regarding evaluation of unsupervised 'clustering' type methods, Michalski & Stepp (1983) state that:

"The problem of how to judge the quality of a clustering is difficult, and there seems to be no universal answer to it. One can, however, indicate two major criteria. The first is that the descriptions formulated for clusters (classes) should be 'simple', so that it is easy to assign objects to classes and to differentiate between the classes. This criterion alone, however, could lead to trivial and arbitrary classifications. The second criterion is that class descriptions should 'fit well' the actual data. To achieve a very precise 'fit', however, a description may have to be complex. Consequently, the demands for simplicity and good fit are conflicting, and the solution is to find a balance between the two."

The CLUSTER/2 algorithm used a combined measure of cluster quality based on a number of elementary criteria including the 'simplicity of description' and 'goodness of fit' mentioned above (Michalski & Stepp, 1983).

Hanson & Bauer (1987) use an information-theoretic measure of cluster quality in their WITT system. This cohesion metric evaluates clusters in terms of their within-class and between-class similarities, using the training examples that have been assigned to each class. Other measures are based on the class descriptions in the form of decision trees, rules or a 'concept hierarchy' as used by UNIMEM (Lebowitz, 1987) or COBWEB (Fisher, 1987). Some clustering systems produce class descriptions as part of their normal operation while others, such as AUTOCLASS (Cheeseman et al., 1988), merely assign examples to classes. In this case, a supervised learning algorithm such as C4.5 (Quinlan, 1992) can be used to induce descriptions for the classification. This is the method used in this study for evaluating AUTOCLASS clusterings.

Mingers (1989) uses the two criteria size and accuracy for evaluating a decision tree (or an equivalent set of rules). Following the principle of Occam's Razor, it is generally accepted that the fewer terms in a model the better; therefore, in general, a small tree or rule set will perform better on test data.

Accuracy is a measure of the predictive ability of the class description when classifying unseen test cases. It is usually measured by the error rate--the proportion of incorrect predictions on the test set (Mingers, 1989). Accuracy is often used to measure classification quality, but it is known to have several defects (Mingers, 1989; Kononenko & Bratko, 1991). An information-based measure of classifier performance developed by Kononenko & Bratko (1991) eliminates these problems and provides a more useful measure of quality in a variety of domains.
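As a minimal sketch of the error-rate measure described above (the labels and values are invented for illustration, not taken from the paper's experiments):

```python
def error_rate(predicted, actual):
    """Error rate: the proportion of incorrect predictions on a test set."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and truth lists must align")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Hypothetical predictions for five unseen cases.
predicted = ["Normal", "Chemical", "Normal", "Overt", "Normal"]
actual    = ["Normal", "Chemical", "Overt",  "Overt", "Normal"]
rate = error_rate(predicted, actual)   # one wrong prediction out of five
```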


The remainder of this paper is organised as follows. The next section briefly describes WEKA, a machine learning workbench currently under development at the University of Waikato. This includes an overview of the AUTOCLASS and C4.5 algorithms used in our experiment. Section 3 describes our experimental methodology and the algorithm used by the geometric rule set comparison system. Results of the experiment, using a diabetes data set, are presented in Section 4. These results are discussed in Section 5, including some analysis of the performance of the geometric comparison algorithm. Section 6 contains some concluding remarks and ideas for further research in this area.

2. The WEKA Workbench

WEKA¹, the Waikato Environment for Knowledge Analysis, is a machine learning workbench currently under development at the University of Waikato (McQueen et al., 1994). The purpose of the workbench is to give users access to many machine learning algorithms, and to apply these to real-world data.

WEKA provides a uniform interactive interface to a variety of tools, including machine learning schemes, data manipulation programs, and the LISP-STAT statistics and graphics package (Tierney, 1990). Data sets to be manipulated by the workbench use the ARFF (Attribute-Relation File Format) intermediate file format. An ARFF file records information about a relation such as its name, attribute names, types and values, and instances (examples). The WEKA interface is implemented using the TK X-Window widget set under the TCL scripting language (Ousterhout,

¹ The name is taken from the weka, a small, inquisitive native New Zealand bird related to the well-known Kiwi.


Figure 1. The WEKA Workbench

1993). ARFF filters and data manipulation programs are written in C. WEKA runs under UNIX on Sun workstations. Figure 1 shows an example display presented by the workbench.

2.1. AutoClass

AUTOCLASS is an unsupervised induction algorithm that automatically discovers classes in a database using a Bayesian statistical technique. The Bayesian approach has several advantages over other methods (Cheeseman et al., 1988). The number of classes is determined automatically; examples are assigned with a probability to each class rather than absolutely to a single class; all attributes are potentially significant to the classification; and the example data can be real or discrete.

An AUTOCLASS run proceeds entirely without supervision from the user. The program continuously generates classifications until a user-specified time has elapsed. The best classification found is saved at this point. A variety of reports can be produced from saved classifications. A WEKA filter has been written that extracts the most probable class for each instance, and outputs this information in a form suitable for inclusion in an ARFF file. This allows the AUTOCLASS classification to be used as input to other programs, such as rule and decision tree inducers.

2.2. C4.5

C4.5 (Quinlan, 1992) is a powerful tool for inducing decision trees and production rules from a set of examples. Much of C4.5 is derived from Quinlan's earlier induction system, ID3 (Quinlan, 1986). The basic ID3 algorithm has been extensively described, tested and modified since its invention (Mingers, 1989; Utgoff, 1989) and will not be discussed in detail here. However, C4.5 adds a number of enhancements to ID3, which are worth examining.

C4.5 uses a new 'gain ratio' criterion to determine how to split the examples at each node of the decision tree. This removes ID3's strong bias towards tests with many outcomes (Quinlan, 1992). Additionally, C4.5 allows splits to be made on the values of continuous (real and integer) attributes as well as enumerations.

Decision trees induced by ID3 are often very complex, with a tendency to 'over-fit' the data (Quinlan, 1992). C4.5 provides a solution to this problem in the form of pruned decision trees or production rules. These are derived from the original decision tree, and lead to structures that generally cover the training set less thoroughly but perform better on unseen cases. Pruned trees and rules are roughly equivalent in terms of their classification accuracy; the advantage of a rule representation is that it is more comprehensible to people than a decision tree (Cendrowska, 1987; Quinlan, 1992).

The first stage in rule generation is to turn the initial decision tree 'inside-out' and generate a rule corresponding to each leaf. The resulting rules are then generalised to remove conditions that do not contribute to the accuracy of the classification. A side-effect of this process is that the rules are no longer exhaustive or mutually exclusive (Quinlan, 1992). C4.5 copes with this by using 'ripple-down rules' (Compton et al., 1992). The rules are ordered, and any example is then classified by the first rule that covers it. In fact, only the classes are ranked, with the twin advantages that the final rule set is more intelligible and the order of rules within each class becomes irrelevant (Quinlan, 1992). C4.5 also defines a default class that is used to classify examples not covered by any of the rules.
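The ordered-rule behaviour described above can be sketched as follows; the rule conditions and class names below are invented for illustration, not taken from the paper's induced rule sets:

```python
def classify(instance, ordered_rules, default_class):
    """Return the class of the first rule whose conditions all hold;
    fall back to the default class when no rule covers the instance."""
    for conditions, cls in ordered_rules:
        if all(cond(instance) for cond in conditions):
            return cls
    return default_class

# Hypothetical two-rule list in priority order.
rules = [
    ([lambda x: x["Glucose"] > 741], "Overt"),
    ([lambda x: x["Glucose"] < 418, lambda x: x["SSPG"] < 145], "Normal"),
]
cls = classify({"Glucose": 300, "SSPG": 100}, rules, "Chemical")
```

Only the first matching rule fires, so overlap between rules is harmless, exactly as with C4.5's ripple-down ordering.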

3. Methodology

3.1. The Data Set

Reaven and Miller (1979) examined the relationship between Chemical and Overt diabetes in 145 non-obese subjects. The data set used in this study involves six attributes: patient age, patient relative weight, fasting plasma Glucose, Glucose, Insulin and steady state plasma Glucose (SSPG). The data set also involves two classifications, labeled CClass and EClass--presumably representing 'clinical classification' and 'electronic classification'. Each classification describes three classes: Overt diabetics, those requiring Insulin injections; Chemical diabetics, whose condition may be controlled by diet; and a Normal group, those without any form of diabetes. The same data set is used in the present study with the omission of patient age; Reaven and Miller found the three attributes Glucose, Insulin, and SSPG to be more significant than any of the others.

Figure 2. Scatterplot matrix of diabetes data (solid dots: Normal class; open circles: Chemical class; crosses: Overt class)

Both the clinical and electronic classifications assume the presence of three classes. The scatter plot below (Figure 2) shows the data set and its clinical classification. Each of the six small plots shows a different pair of variables (the plots at the lower right are simply reflections of those at the upper left). For example, the plot in the upper left corner shows Glucose vs. SSPG. The clinical classification appears to be highly related to the Glucose measurement: in fact, the three classes appear to be divided by the Glucose measurement alone. This is particularly well illustrated in the plot of Glucose vs. SSPG in the upper left plot.

The Electronic classification (EClass) was generated using a clustering algorithm by Friedman and Rubin (1967). Each class is described by the mean of the three variables of the instances in that class (Glucose, Insulin and SSPG), giving a single point in 3-space. Each instance is assigned to the class whose mean lies closest to it (in terms of the minimum Euclidean distance). After each instance has been assigned to a class, the class means are recalculated. The process of assigning instances to classes and recalculating class means repeats until no patients are reassigned. The algorithm assumes prior knowledge of the class means and number of classes, to define the initial classes. Reaven and Miller used the results of a previous study (Reaven et al., 1976) to determine suitable starting means.

3.3. Classification using AutoClass

One objective of this study was to determine if the patients in the data set naturally fall into groups, without prior knowledge of the number of groups or the attributes of the groups. We also wished to determine the effectiveness of AUTOCLASS's classification of the diabetes data, and compare it to the clinical classification.

It is difficult to determine which attributes from the data set are reasonable ones to generate a classification. Some may be irrelevant, and a classification generated using them may produce insignificant classes. Reaven and Miller considered only Glucose, Insulin and SSPG to be relevant attributes. They found that fasting plasma Glucose exhibited a high degree of linear association with Glucose (r = 0.96), indicating that these two variables are



Classification   Attributes
AClass           Glucose, Insulin, SSPG
AClass1          Relative weight, Fasting Plasma Glucose, Glucose, Insulin, SSPG
AClass2          Glucose, SSPG
AClass3          Glucose, Insulin

Table 1. AUTOCLASS classifications

Classification   Error Rate
CClass           7.2%
EClass           5.6%
AClass           5.6%
AClass1          7.2%
AClass2          3.2%
AClass3          4.8%

Table 2. Predicted error rate of rule sets

essentially equivalent. Four classifications were made in the present study, using AUTOCLASS. Each used different selected attributes, as shown in Table 1. Since AUTOCLASS completes its classification after a specified time has elapsed, an arbitrary execution time of one hour was chosen. Classes did not appear to change significantly with longer execution times. However, a comprehensive study of the effects of differing execution time has not been performed.

Although different in detail, all the classifications divide the data set into classes that bear an obvious similarity to the clinical classification. Thus we were able to assign the names 'Normal', 'Chemical' and 'Overt' to the generated classes. This is unlikely to be easy to achieve in general. C4.5 was used to generate rule sets describing all six classifications. The attributes used to induce the rule sets were in all cases the same as those used to generate the initial classification. Cross-validation checks were performed on the rule sets to derive a reliable estimate of their accuracy. The predicted error rates of each of the rule sets on unseen cases are shown in Table 2. Three techniques were used to compare the generated classifications with the clinical classification: classification differences by instance, classification differences by comparison of means, and classification differences by comparison of rules.

3.3.1. Comparing individual instances

The automatic classification of each instance was compared to the clinical classification, and the differences were tallied. A difference in the classification of an instance from the clinical classification is assumed to be a misclassification. The percentage misclassification gives some indication of the 'goodness' of a classification. Some misclassifications are more important than others, and a single error statistic does not illustrate this. A large number of Normal patients misclassified as Chemical diabetics may not be important, since they are in no danger of dying from this classification error; however, if Overt patients are misclassified as Normal then death could result. The Friedman and Rubin automatic classification (EClass) has 20 misclassifications, giving an error statistic of 13.8%. This was deemed acceptable by Reaven and Miller.

3.3.2. Class comparison using two-sample comparison of means

Each class may be described by the means of the attributes of the instances it contains. This is similar to the generation of a classification using Friedman and Rubin's clustering algorithm, where the means characterize the classes. Each class is described by the means and standard deviations of the three main attributes: Glucose, Insulin, and SSPG. For example, the Glucose mean for the Normal class of the electronic classification (EClass) will be compared with the Glucose mean for the Normal class of the clinical classification (CClass).

We used a two-way t-test (Nie et al., 1975) to compare the class means. The null hypothesis was that there is no difference between the two means. The level of similarity between two classifications is given by the number of rejected t-tests. All tests were performed at the 95% level of significance. For example, the SSPG mean of the Chemical class for EClass is significantly different from the SSPG mean of the Chemical class for CClass. Assuming that the means of the classes in one classification must be the same as the means of the classes in another classification for the two classes to be considered equivalent, then we cannot say that the Chemical class for CClass is equivalent to the Chemical class generated by Friedman and Rubin's classification algorithm (EClass).

3.3.3. Comparing rules for classification comparison

Neither technique described above allows us to accurately compare different classifications. Comparing classifications by examining misclassified instances is dependent on



the two individual sets of data being compared. Examining misclassifications may not provide an accurate estimate of the classification error rate for unseen data.

Assuming that a rule set accurately describes the classification of a set of data, then by comparing two sets of rules we are effectively comparing the two classifications. The technique presented in this paper for comparing rules produces a new set of rules describing the differences between the two classifications. An analysis of this kind allows machine learning researchers to ask the question "Why are these two classifications different?" Previously it has only been possible to ask "How different are these two classifications?"

3.4. Multidimensional Geometric Rule Comparison

This technique represents each rule as a geometric object in n-space. A rule set produced using C4.5 is represented as a set of such objects. As a geometric object, a rule forms a boundary within which all instances are classified according to that rule. The domain coverage of a set of rules is the proportion of the entire domain which they cover. Domain coverage is calculated by determining the hypervolume (n-dimensional volume) of a set of rules as a proportion of the hypervolume of the entire domain. The 'ripple down' rules of C4.5 must be made mutually exclusive to ensure that no part of the domain is counted more than once. The size of the overlap between two sets of rules provides an indication of their similarity. The non-overlapping portions of the two rule sets are converted into a set of rules describing the differences between the two rule sets.
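The hypervolume and domain-coverage computation can be sketched as below, assuming the rules have already been made mutually exclusive; the dictionary representation of a rule as an axis-aligned box is our own, not the paper's:

```python
def hypervolume(box):
    """n-dimensional volume of an axis-aligned box {attribute: (low, high)}."""
    v = 1.0
    for low, high in box.values():
        v *= (high - low)
    return v

def domain_coverage(boxes, domain):
    """Fraction of the attribute domain covered by mutually exclusive boxes."""
    return sum(hypervolume(b) for b in boxes) / hypervolume(domain)

# Attribute limits from the paper's diabetes example.
domain = {"Glucose": (0, 1600), "SSPG": (0, 500)}
rule = {"Glucose": (0, 418), "SSPG": (0, 145)}   # Glucose < 418 ∧ SSPG < 145
coverage = domain_coverage([rule], domain)
```

Summing volumes is only valid because the boxes do not overlap; this is exactly why the ripple-down rules must first be made mutually exclusive.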

3.4.1. Geometric Representation of Rules

A production rule can be considered to delimit a region of an n-dimensional space, where n is the 'dimension' of the rule--the number of distinct attributes used in the terms on the left hand side of the rule. The 'volume' of a rule is then simply the volume of the region it encloses. Any instance lying inside this region will be classified according to the right hand side of the rule. There is a problem, however, with rules that specify only a single bound for some attributes. For example, the two-dimensional rule

Glucose < 418 ∧ SSPG < 145 ⇒ NORM

does not give lower bounds for either of the attributes. The volume of this rule is effectively infinite, making it very hard to compare against anything else. Unfortunately, most of the rules induced from our classifications take this form, as C4.5 works by splitting attributes rather than explicitly defining regions of space. In this study, our solution to this problem has been to define absolute upper and lower limits for each attribute. Rules that do not specify a particular bound are then assumed to use the appropriate absolute limit. We have used the maximum and minimum values in the data set for each attribute as our absolute limits. The limits for the Glucose attribute are zero and 1600; for SSPG they are zero and 500. The rule above can then be re-written as:

Glucose > 0 ∧ Glucose < 418 ∧ SSPG > 0 ∧ SSPG < 145 ⇒ NORM.

The geometric representation of the above rule is shown in Figure 3. The solid dots represent the Normal examples from the CClass classification.
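The rewriting step above can be sketched as follows; representing a one-sided condition with `None` for the missing bound is our own convention:

```python
# Absolute limits taken from the data set, as described in the text.
LIMITS = {"Glucose": (0, 1600), "SSPG": (0, 500)}

def complete_bounds(rule, limits):
    """Fill any bound the rule leaves open with the absolute attribute
    limit, so every rule encloses a finite region."""
    full = {}
    for attr, (low, high) in limits.items():
        r_low, r_high = rule.get(attr, (None, None))
        full[attr] = (r_low if r_low is not None else low,
                      r_high if r_high is not None else high)
    return full

# Glucose < 418 ∧ SSPG < 145 ⇒ NORM, with both lower bounds unspecified.
rule = {"Glucose": (None, 418), "SSPG": (None, 145)}
boxed = complete_bounds(rule, LIMITS)
```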

Figure 3. Geometric representation of a rule

An entire set of rules describing a data set can be represented as a collection of geometric objects of dimension m, where m is the maximum dimension of any rule in the set. Any rule with dimension less than the maximum in the set is promoted to the maximum dimension. For example, if another rule,

Glucose > 741 ⇒ OVERT

were added to the first, the dimension of this rule would have to be increased to two, the current maximum in the set. This is because two squares may be easily compared, but a line and a square, or a square and a cube,



may not. Promotion to a higher dimension is achieved by adding another attribute to a rule, but not restricting the range of that attribute. The rule

Glucose > 741 ⇒ OVERT

in one dimension is equivalent to

Glucose > 741 ∧ SSPG > 0 ∧ SSPG < 500 ⇒ OVERT

in two dimensions. The range of SSPG is not a factor in determining if a patient is Overt, since no patient lies outside the range of SSPG specified in this rule.
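Dimension promotion can be sketched as adding each missing attribute with its full, unrestricted range (box representation is ours):

```python
# Absolute attribute limits from the data set.
LIMITS = {"Glucose": (0, 1600), "SSPG": (0, 500)}

def promote(rule, limits):
    """Add every attribute the rule does not mention, unrestricted, so
    all rules in a set share the same (maximum) dimension."""
    out = dict(rule)
    for attr, full_range in limits.items():
        out.setdefault(attr, full_range)   # unrestricted in this dimension
    return out

rule_1d = {"Glucose": (741, 1600)}         # Glucose > 741 ⇒ OVERT
rule_2d = promote(rule_1d, LIMITS)         # now comparable with 2-D rules
```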

Consider a comparison between the Normal classes of EClass and AClass. The input rule sets are as follows:

EClass:

Glucose < 376 ⇒ NORM
Glucose < 557 ∧ SSPG < 204 ⇒ NORM

AClass:

Glucose < 503 ∧ SSPG < 145 ⇒ NORM
Glucose < 336 ⇒ NORM
Glucose < 418 ∧ SSPG < 165 ⇒ NORM

Figure 4 shows the geometric representation of the two rule sets. The solid boxes represent the EClass rules; the dotted ones represent the AClass rules.

Figure 4. Geometric input rules

3.4.2. The Cutting Function

Making rules mutually exclusive, and determining their similarities and differences, all involve the cutting function. Consider the two-dimensional example shown in Figure 5.

Figure 5. Before cutting

Rule A overlaps Rule B in every dimension. Rule A is the cutting rule, and Rule B is the rule being cut.

Each dimension in turn is examined; the first being the x dimension. The minimum bound (left hand edge) of rule A does not overlap rule B, so it need not be considered. The maximum bound (right hand edge) of rule A cuts rule B into two segments. Rule B becomes the section of rule B which was overlapped by A in the x dimension. A new rule, B1, is created which is the section of rule B not overlapped by Rule A in the x dimension (Figure 6).

Figure 6. After cutting the x dimension

Considering dimension 2, the y dimension: all references to rule B refer to the newly created rule B at the last cut. The minimum bound of rule A (the bottom edge) does not overlap rule B, so no cut is made. The maximum bound of rule A (the top edge) overlaps rule B, creating a new rule B2 which is the



section of rule B not overlapped by rule A. Rule B becomes the section of rule B which is overlapped by rule A (Figure 7). The remaining portion of the original rule B after all dimensions have been cut is the overlap between the two rules.

Figure 7. After cutting the y dimension

The new rules A, B, B1, and B2 are all mutually exclusive. Rules B1 and B2 describe the part of the domain covered by the original rule B, and not by rule A. The cutting function generalises to n dimensions, since each dimension is cut independently of every other dimension.

Rule A is assumed to have a higher 'priority' than rule B. Any instances which lie in the overlap between rule A and rule B will remain within rule A after the cut.
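The dimension-by-dimension procedure above can be sketched as follows. Rules are axis-aligned boxes; the numeric coordinates below are invented to mimic Figures 5-7, which give no actual values:

```python
def cut(cutter, rule):
    """Cut `rule` by the higher-priority `cutter`, one dimension at a time.

    Returns (overlap, pieces): `overlap` is the part of `rule` covered by
    the cutter (None if the rules are disjoint); `pieces` are the mutually
    exclusive parts of `rule` that the cutter does not cover.
    """
    pieces = []
    remainder = dict(rule)            # shrinks toward the overlap
    for attr in rule:
        lo, hi = remainder[attr]
        c_lo, c_hi = cutter[attr]
        if c_lo >= hi or c_hi <= lo:  # no overlap in this dimension:
            return None, [dict(rule)]  # the rules are disjoint
        if c_lo > lo:                 # segment below the cutter's minimum bound
            piece = dict(remainder)
            piece[attr] = (lo, c_lo)
            pieces.append(piece)
            lo = c_lo
        if c_hi < hi:                 # segment above the cutter's maximum bound
            piece = dict(remainder)
            piece[attr] = (c_hi, hi)
            pieces.append(piece)
            hi = c_hi
        remainder[attr] = (lo, hi)    # keep only the overlapped section
    return remainder, pieces

# Invented coordinates: A overlaps B's lower-left corner, as in Figure 5.
A = {"x": (0, 5), "y": (0, 5)}
B = {"x": (2, 8), "y": (2, 8)}
overlap, pieces = cut(A, B)
```

Cutting the x dimension produces B1 (the part of B to the right of A); cutting the y dimension produces B2 (the part above A); the remainder is the overlap, matching Figures 6 and 7.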

3.4.3. Generating mutually exclusive rules

The 'ripple down' rules of C4.5 may be thought of as a priority ordering of rules: those that appear first are more important than those that follow. Any instances that fall into the overlap of two rules would be correctly classified by the higher priority rule, that is, the one which appears first in the list.

Obviously higher priority rules must not be cut by lower priority rules; the result would no longer correctly classify instances. In an ordered list of rules from highest to lowest priority, such as those output from C4.5, the rule at the head of the list is used as a cutter for all those following it. Each cut rule is replaced by the segments not overlapped by the cutter: B is replaced by B1 and B2 in the example above. Once all the rules below the cutter have been cut, the rule following the cutter is chosen as the new cutter. When no cutters remain, the result is a list of mutually exclusive rules which classify all instances exactly as before.
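The priority-ordered conversion can be sketched as follows. This is a self-contained illustration rather than the published code (it carries its own compact cutting helper), and the names are ours: rules are (bounds, class) pairs with (min, max) bounds per dimension, listed highest priority first.

```python
def cut_rule(cutter, rule):
    """Return the fragments of `rule` not overlapped by `cutter`."""
    for (clo, chi), (rlo, rhi) in zip(cutter, rule):
        if chi <= rlo or rhi <= clo:        # disjoint: rule survives whole
            return [rule]
    fragments, rest = [], [list(bounds) for bounds in rule]
    for d, (clo, chi) in enumerate(cutter):
        rlo, rhi = rest[d]
        if clo > rlo:                       # piece below the cutter's minimum
            frag = [tuple(bounds) for bounds in rest]
            frag[d] = (rlo, clo)
            fragments.append(frag)
            rest[d][0] = clo
        if chi < rhi:                       # piece above the cutter's maximum
            frag = [tuple(bounds) for bounds in rest]
            frag[d] = (chi, rhi)
            fragments.append(frag)
            rest[d][1] = chi
    return fragments                        # the overlap stays with the cutter

def make_exclusive(rules):
    """Convert a priority-ordered rule list into mutually exclusive rules."""
    out = []
    for bounds, cls in rules:
        pieces = [bounds]
        for higher, _ in out:               # every higher-priority rule cuts
            pieces = [f for p in pieces for f in cut_rule(higher, p)]
        out.extend((p, cls) for p in pieces)
    return out
```

As in the example above, cutting B by A replaces B with its two fragments while A is kept whole, so instances in the overlap stay with the higher-priority rule.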

The mutually exclusive rules for the example comparison between EClass and AClass are shown below (Figure 8). As before, the solid boxes represent the EClass rules and the dotted boxes represent the AClass rules. The rules within each classification do not overlap.

Figure 8. Mutually exclusive rules

3.4.4. Rule Set Differences

The objective of the geometric algorithm is to compare two sets of rules and determine their differences. Each class in the first set of rules is compared with each class in the second, generating a new rule set describing the differences between the two classes. Consider two rule subsets, A and B, describing two classes within different rule sets. Each rule from A is used to cut each rule from B, and the resulting set of rules describes the part of the domain covered by rule set B and not by rule set A. The same process is used in reverse (set B cutting set A) to describe the part of the domain covered by set A that is not covered by set B.

The difference matrix shows the rule set coverage of the differences between rules describing two classes. This is not an accurate statistic, since bigger rules cover more of the domain and their differences will therefore appear more significant than those of smaller rules, even though their relative differences may be similar.
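Because the rules within each set are mutually exclusive, the coverage statistic behind the difference matrix can be computed from hypervolumes alone, without materialising the cut fragments. A sketch under that assumption (the names are ours, not the authors' code):

```python
from math import prod

def overlap_volume(a, b):
    """Hypervolume of the intersection of two rules (0 if disjoint)."""
    return prod(max(0, min(ahi, bhi) - max(alo, blo))
                for (alo, ahi), (blo, bhi) in zip(a, b))

def difference_volume(set_a, set_b):
    """Hypervolume covered by rule set B but not by rule set A.

    Valid only when the rules within set A are mutually exclusive, so
    their overlaps with any rule of B can simply be summed.
    """
    total_b = sum(prod(hi - lo for lo, hi in b) for b in set_b)
    covered = sum(overlap_volume(a, b) for a in set_a for b in set_b)
    return total_b - covered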



For example, the rules below describe the part of the domain covered by EClass (class Norm) but not by AClass (class Norm). Figure 9 shows the geometric representation of these rules in relation to the data set.

Glucose > 336 ∧ Glucose < 376 ∧ SSPG > 165 ⇒ NORM

Glucose > 503 ∧ Glucose < 557 ∧ SSPG < 204 ⇒ NORM

Glucose > 418 ∧ Glucose < 503 ∧ SSPG > 145 ∧ SSPG < 204 ⇒ NORM

Glucose > 376 ∧ Glucose < 418 ∧ SSPG > 165 ∧ SSPG < 204 ⇒ NORM

Figure 9. Output rules

The EClass rules entirely encompass those of AClass, so there are no rules describing the part of the domain covered by AClass and not by EClass.

3.4.5. Correlation Matrix

The matrix provides a similarity measure, or correlation estimate, between two sets of rules. It describes the amount by which the two rule sets overlap. Each class from the first set of rules is compared to each class from the second. Consider again two subsets of rules, A and B, each describing a class within two different sets of rules. Set A describes the ideal classification; set B describes the classification to be compared against the ideal. As with the difference matrix, each rule from set A is compared to (cuts) each rule in set B. The hypervolume of the overlap between each pair of rules is taken as a percentage of the hypervolume of the rule from set A. This statistic indicates the proportion of the rule in set A overlapped by the one in set B. The sum of the proportions indicates the similarity between the two rule sets. Comparing a set of rules to itself will produce a correlation matrix with 100% along the main diagonal and 0% everywhere else, indicating that each class completely overlaps itself and that the classes are mutually exclusive (since all the rules are mutually exclusive).

4. Results

This section presents the results produced by the three classification comparison methods described earlier: instance comparison, two-sample comparison of means and geometric rule comparison. For each method, the five automatically generated classifications (EClass, AClass, AClass1, AClass2 and AClass3) are compared against the original clinical classification, CClass.

Table 3 shows the results of the instance comparison test. The numbers in each column indicate the number of examples classified differently to CClass. These values are also expressed as a percentage of the total training set of 145 instances. The rows of this table break the misclassifications down further to show what types of misclassification are occurring. For example, the row 'Norm⇒Chem' represents instances classified as Normal by CClass, but as Chemical by the automatic classifications.

Results of the two-sample t-test comparison are given in Table 4. Here the entries in the table indicate the attributes for which the t-test failed, for each class in the five classifications. A failed test is one where the null hypothesis was rejected, i.e. the mean of the attribute is significantly different from the corresponding CClass mean, at the 95% significance level. The bottom row of this table shows the total number of rejections for each classification.

Tables 5-9 show the correlation matrix output produced by the geometric rule comparison system. The table entries indicate the amount of overlap between two classes as a proportion of the class listed across the top of the table, in this case CClass.
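The overlap-proportion computation behind the correlation matrices can be sketched as follows. This is an illustrative sketch under the paper's assumptions, not the authors' implementation; the volume-weighted sum is our reading of "the sum of the proportions".

```python
from math import prod

def hypervolume(rule):
    """Volume of an axis-aligned hyperrectangle given as (min, max) pairs."""
    return prod(hi - lo for lo, hi in rule)

def overlap_volume(a, b):
    """Volume of the intersection of two rules (0 if disjoint)."""
    return prod(max(0, min(ahi, bhi) - max(alo, blo))
                for (alo, ahi), (blo, bhi) in zip(a, b))

def correlation(class_a, class_b):
    """Percentage of class A's volume overlapped by class B's rules.

    Assumes the rules within each class are mutually exclusive, as
    guaranteed by the cutting procedure of section 3.4.3.
    """
    total = sum(hypervolume(a) for a in class_a)
    covered = sum(overlap_volume(a, b) for a in class_a for b in class_b)
    return 100.0 * covered / total
```

Comparing a class with itself returns 100%, matching the self-comparison behaviour described in the text.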

5. Discussion

The correlation matrices produced by thegeometric rule comparison system can be used



Difference     Classification
Type           EClass       AClass       AClass1      AClass2      AClass3
Norm⇒Chem      3 (2.1%)     10 (6.9%)    9 (6.2%)     1 (0.7%)     4 (2.8%)
               3 (2.1%)     6 (4.1%)     0 (0.0%)     13 (8.9%)    5 (3.4%)
               8 (5.5%)     1 (0.7%)     12 (8.3%)    2 (1.4%)
Chem⇒Norm      4 (2.8%)     3 (2.1%)     19 (13.1%)
Overt⇒Norm     5 (3.4%)     8 (5.5%)
Overt⇒Chem     2 (1.4%)
Total          20 (13.8%)   21 (14.5%)   21 (14.5%)   25 (17.2%)   31 (21.4%)

Table 3. Results of classification instance comparison

               Classification
Class          EClass    AClass    AClass1    AClass2    AClass3
Normal         -         -         Glucose    Glucose    Glucose, SSPG
Chemical       SSPG      -         SSPG       SSPG       Insulin, SSPG
Overt          -         -         -          -          Insulin
Total          1         0         2          2          5

Table 4. Results of t-tests on classification means

               CClass
EClass         Normal    Chemical    Overt
Normal         94.05%    31.33%      0.00%
Chemical       5.95%     68.67%      14.19%
Overt          0.00%     0.00%       85.81%

Table 5. EClass vs. CClass correlation

               CClass
AClass1        Normal    Chemical    Overt
Normal         94.89%    41.68%      1.64%
Chemical       5.11%     58.32%      2.76%
Overt          0.00%     0.00%       95.60%

Table 7. AClass1 vs. CClass correlation

               CClass
AClass         Normal    Chemical    Overt
Normal         86.86%    13.62%      0.00%
Chemical       13.14%    86.38%      14.19%
Overt          0.00%     0.00%       85.81%

Table 6. AClass vs. CClass correlation

               CClass
AClass2        Normal    Chemical    Overt
Normal         87.68%    37.20%      37.20%
Chemical       12.32%    62.80%      8.91%
Overt          0.00%     0.00%       53.89%

Table 8. AClass2 vs. CClass correlation

               CClass
AClass3        Normal    Chemical    Overt
Normal         35.25%    35.25%      8.29%
Chemical       64.75%    64.75%      20.76%
Overt          0.00%     0.00%       70.94%

Table 9. AClass3 vs. CClass correlation

to determine which classification rule set compares most closely to the clinical classification. Choosing the best rule set to describe the classification also depends on the particular misclassifications made by each rule set. We indicated previously that misclassifying Overt patients as Normal can result in death. Obviously it is more important to minimise these types of misclassification than those from Normal to Chemical, for example, where the effect of misclassification is not so important. The correlation matrices indicate that AClass and EClass have 0% misclassification of Overt patients to Normal. AClass2 has a significant misclassification error of 37.2%. AClass3 has low correlation between similar classes, and significant overlap between the Normal and Chemical classes: 35.25% of Chemical diabetics are misclassified as Normal and 64.75% of Normal patients are misclassified as Chemical.

The Normal and Overt classes of AClass1 are very similar to the equivalent CClass classes, but the Chemical classes are not highly correlated between these two classifications. If misclassification of Chemical diabetics to Normal were considered unimportant, AClass1 would obviously be the best classification. However, AClass has good correlation between similar classes (all above 85%), and would be used if an acceptable error rate over all classes were desired.

It is desirable to have a single metric that describes the similarity between two rule sets. This could perhaps be calculated as a weighted average of the correlation percentages for equivalent classes. In the tables above, equivalent classes lie on the main diagonal of the correlation matrix. If no misclassification is considered more important than any other, a weighting of 1.0 would be used for each class. Table 10 shows this metric, calculated using a weight of 1.0, for each rule set compared to the CClass rule set. This table indicates that AClass is the classification most similar to CClass, and that AClass3 is the least similar. This supports the 'intuitive' reading of the correlation matrices.
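The single metric described above can be sketched directly (an illustrative snippet, not the authors' code):

```python
def similarity(matrix, weights=None):
    """Weighted average of the diagonal of a correlation matrix.

    `matrix[i][j]` is the overlap percentage between class i of one
    classification and class j of the other; equivalent classes lie on
    the main diagonal. Unit weights treat all misclassifications alike.
    """
    n = len(matrix)
    if weights is None:
        weights = [1.0] * n
    diagonal = [matrix[i][i] for i in range(n)]
    return sum(w * d for w, d in zip(weights, diagonal)) / sum(weights)
```

Applied with unit weights to the diagonal entries reported for EClass (94.05%, 68.67%, 85.81%), this reproduces the 82.84% EClass entry of Table 10.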

Classification    Similarity

EClass            82.84%
AClass            86.35%
AClass1           82.94%
AClass2           68.12%
AClass3           56.98%

The table of instance misclassifications can be used in the same way as the correlation matrix. The classification error for each type of misclassification corresponds roughly to those in the correlation matrix. For example, there are 19 misclassifications from Chemical to Normal for the AClass3 classification, giving an error rate of 13.1% (one of the highest rates in Table 3). The same misclassification in the correlation matrices also shows one of the highest rates. We believe the similarity statistics in the correlation matrices are more indicative of potential misclassification error than the estimates of Table 3. The instance misclassification errors are only representative of the current data set. For large data sets this error may be close to the population error; however, small data sets are not representative of the population. The geometric rules are more indicative of classification errors because the rules are representative of the population domain; that is, it is assumed the rules will be used to classify unseen cases. Misclassifications indicated in Table 3 are not always accounted for by the rules: C4.5 reclassifies some instances when the rules are constructed. The single misclassification from the Overt class to the Normal class by EClass, for example, is not represented in Table 5.

An advantage of the instance misclassifications is that the distribution of the data is inherent in the misclassifications themselves. The geometric representation of the classifications has no knowledge of the underlying distribution of the data; the similarity estimates assume a uniform distribution of each attribute across the domain.

The t-test comparison of classes is very imprecise. It imparts very little information about the quality of a classification compared with the clinical classification. AClass is indicated as a similar classification, since the means of all the variables for each class are equivalent. AClass3 obviously compares poorly to the clinical classification, since the means are significantly different in all three classes: two out of three means are different in both the Normal and Chemical classes. The t-test results would indicate that AClass1 and AClass2 are comparable classifications. This is an obvious disagreement with the analysis of the geometric correlation matrices. The similarity metrics for AClass1 and AClass2 alone differ by 14.82%, and 37.2% of Overt diabetics are classified as Normal by AClass2, compared with 1.64% by AClass1.

6. Conclusions

This paper has presented a new method for evaluating the quality of classifications, based on a geometric representation of class descriptions. Rule sets are produced that describe the differences between pairs of classes. Correlation matrices are used to determine the relative degree of similarity between classifications. The method has been applied to classifications generated by AUTOCLASS in a medical domain, and its evaluations compared to those of simple instance comparison and statistical methods.

The results obtained so far are encouraging. The evaluations produced by the geometric algorithm appear to correlate reasonably well with the simple instance comparison. We believe that the geometric evaluation is more useful because it reflects the performance of the classification in the real world, on unseen data. Instance-based or statistical methods cannot reproduce this. However, several aspects of the technique require further experimentation and development.

The algorithm currently assumes a uniform distribution for all attributes, which in general is not valid. We plan to incorporate a mechanism for specifying the distribution of attributes to the algorithm, or at least to use 'standard' configurations such as a normal distribution.

The system is at present limited to comparing ripple-down rule sets. Ideally it would also be able to handle rule sets in which every rule has equal priority. There is a difficulty in this case with overlapping rules from different classes: unlike ripple-down rules, there is no easy way to decide which rule, if any, should take priority.

Finally, we plan to extend the system to handle enumerated as well as continuous attributes, and to integrate it fully with the WEKA workbench.

Acknowledgements

We wish to thank Ray Littler and Ian Witten for their inspiration, comments and suggestions. Credit goes to Craig Nevill-Manning for his last-minute editing. Special thanks to Andrew Donkin for his excellent work on the WEKA workbench.

References

Cendrowska, J., 1987. "PRISM: An algorithm for inducing modular rules". International Journal of Man-Machine Studies 27, 349-370.

Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W. & Freeman, D., 1988. "AUTOCLASS: A Bayesian Classification System". In Proceedings of the Fifth International Conference on Machine Learning, 54-64. Morgan Kaufmann Publishers, San Mateo, California.

Compton, P., Edwards, G., Srinivasan, A., Malor, R., Preston, P., Kang, B. & Lazarus, L., 1992. "Ripple down rules: turning knowledge acquisition into knowledge maintenance". Artificial Intelligence in Medicine 4, 47-59.

Fisher, D., 1987. "Knowledge Acquisition Via Incremental Conceptual Clustering". Machine Learning 2, 139-172.

Friedman, H. & Rubin, J., 1967. "On some invariant criteria for grouping data". Journal of the American Statistical Association 62, 1159-1178.

Hanson, S. & Bauer, M., 1989. "Conceptual Clustering, Categorization, and Polymorphy". Machine Learning 3, 343-372.

Kononenko, I. & Bratko, I., 1991. "Information-Based Evaluation Criterion for Classifier's Performance". Machine Learning 6, 67-80.

Lebowitz, M., 1987. "Experiments with Incremental Concept Formation: UNIMEM". Machine Learning 2, 103-138.

McQueen, R., Neal, D., DeWar, R. & Garner, S., 1994. "Preparing and processing relational data through the WEKA machine learning workbench". Internal report, Department of Computer Science, University of Waikato, Hamilton, New Zealand.

Michalski, R. & Stepp, R., 1983. "Learning from Observation: Conceptual Clustering". In Michalski, R., Carbonell, J. & Mitchell, T. (eds.), Machine Learning: An Artificial Intelligence Approach, 331-363. Tioga Publishing Company, Palo Alto, California.

Mingers, J., 1989. "An Empirical Comparison of Pruning Methods for Decision Tree Induction". Machine Learning 4, 227-243.

Nie, N., Hull, H., Jenkins, J., Steinbrenner, K. & Bent, D., 1975. SPSS: Statistical Package for the Social Sciences, 2nd ed. McGraw-Hill Book Company, New York.

Ousterhout, J., 1993. An Introduction to Tcl and Tk. Addison-Wesley Publishing Company, Inc., Reading, Massachusetts.

Quinlan, J., 1986. "Induction of Decision Trees". Machine Learning 1, 81-106.

Quinlan, J., 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California.

Reaven, G., Bernstein, R., Davis, B. & Olefsky, J., 1976. "Non-ketotic diabetes mellitus: insulin deficiency or insulin resistance?". American Journal of Medicine 60, 80-88.

Reaven, G. & Miller, R., 1979. "An Attempt to Define the Nature of Chemical Diabetes Using a Multidimensional Analysis". Diabetologia 16, 17-24.

Tierney, L., 1990. LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. John Wiley & Sons, New York.

Utgoff, P., 1989. "Incremental Induction of Decision Trees". Machine Learning 4, 161-186.
