498 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICSPART C: APPLICATIONS AND REVIEWS, VOL. 35, NO. 4, NOVEMBER 2005
C-Fuzzy Decision Trees
Witold Pedrycz, Fellow, IEEE, and Zenon A. Sosnowski, Member, IEEE
Abstract - This paper introduces a concept and design of decision trees based on information granules - multivariable entities characterized by high homogeneity (low variability). As such granules are developed via fuzzy clustering and play a pivotal role in the growth of the decision trees, they will be referred to as C-fuzzy decision trees. In contrast with standard decision trees, in which one variable (feature) is considered at a time, this form of decision trees involves all variables that are considered at each node of the tree. Obviously, this gives rise to a completely new geometry of the partition of the feature space that is quite different from the guillotine cuts implemented by standard decision trees. The growth of the C-decision tree is realized by expanding the node of the tree characterized by the highest variability of the information granule residing there. This paper shows how the tree is grown depending on some additional node expansion criteria such as cardinality (number of data) at a given node and a level of structural dependencies (structurability) of the data existing there. A series of experiments is reported using both synthetic and machine learning data sets. The results are compared with those produced by the standard version of the decision tree (namely, C4.5).

Index Terms - Decision trees, depth-and-breadth tree expansion, experimental studies, fuzzy clustering, node variability, tree growing.
I. INTRODUCTION
DECISION trees [12], [13] are the commonly used archi-
tectures of machine learning and classification systems.
They come with a comprehensive list of various training andpruning schemes, a diversity of discretization (quantization)
algorithms, and a series of detailed learning refinements [1],
[3]-[7], [10], [11], [15], [16]. In spite of such variety of the
underlying development activities, one can easily witness
several fundamental properties that cut quite evidently across
the entire spectrum of the decision trees. First, the trees operate
on discrete attributes that assume a finite (usually quite small)
number of values. Second, in the design procedure, one attribute
is chosen at a time. More specifically, one selects the most discriminative attribute and expands (grows) the tree by adding the node whose attribute values are located at the branches originating from this node. The discriminatory power of the attribute (which stands behind its selection out of the collection
Manuscript received February 10, 2004; revised June 9, 2004 and September 9, 2004. This work was supported in part by the Canada Research Chair (CRC) Program of the Natural Science and Engineering Research Council of Canada (NSERC). The work of Z. A. Sosnowski was supported in part by the Technical University of Bialystok under Grant W/WI/8/02.
W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada, and also with the Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland (e-mail: [email protected]).
Z. A. Sosnowski is with the Department of Computer Science, Technical University of Bialystok, Bialystok 15-351, Poland (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSMCC.2004.843205
of the attributes existing in the problem at hand) is quantified by means of some criterion such as entropy, Gini index, etc.
[13], [5]. Third, decision trees in their generic version are pre-
dominantly applied to discrete class problems (the continuous
prediction problems are handled by regression trees). Interest-
ingly, these three fundamental features somewhat restrict the
nature of the trees and identify a range of applications that
are pertinent in this setting. When dealing with continuous
attributes, it is evident that the discretization is a must. As
such, it directly impacts the performance of the tree. One may
argue that the discretization requires some optimization that by
being guided by the classification error can be realized once the
development of the tree has been finalized. In this sense, the second phase (namely, a way in which the tree has been grown
and a sequence of the attributes selected) is inherently affected
by the discretization mechanism. In a nutshell, it means that
these two design steps cannot be disjointed. The growth of
the tree relying on a choice of a single attribute can also be treated as some conceptual drawback. While being quite simple and transparent, it could well be that considering two or more attributes as an indivisible group of variables occurring as the discrimination condition located at some node of the tree may lead to a better tree.
Having these shortcomings clearly identified, the objective
of this study is to develop a new class of decision trees that at-
tempts to alleviate these problems. The underlying conjecture is that data can be perceived as a collection of information gran-
ules [11]. Thus, the tree becomes spanned over these granules,
treated now as fundamental building blocks. In turn, informa-
tion granules and information granulation are almost a synonym
of clusters and clustering [2], [8]. Subscribing to the notion
of fuzzy clusters (and fuzzy clustering), the authors intend to
capture the continuous nature of the classes so that there is no
restriction of the use of these constructs to the discrete prob-
lems. Furthermore, fuzzy granulation helps link the discretiza-
tion problem with the formation of the tree in a direct and inti-
mate manner. As it becomes evident that fuzzy clusters are the
central concept behind the generalized tree, they will be referred to as cluster-oriented decision trees or C-decision trees, for
short.
The material of this study is organized in the following
manner. Section II provides an introduction of the architec-
ture of the tree by discussing the underlying processes of its
in-depth and in-breadth growth. Section III brings more details
on the development of the C-trees where we concentrate on the
functional and algorithmic details (fuzzy clustering and various
criteria of node splitting leading to the specific pattern of tree
growing). Next, Section IV elaborates on the use of the trees in
the classification or prediction mode. A series of comparative
numeric studies is presented in Section V.
1094-6977/$20.00 © 2005 IEEE
The study adheres to the standard notation and notions used in the literature of machine learning and fuzzy sets. The way of evaluating the performance of the tree is standard, as a fivefold cross-validation is used. More specifically, in each pass, an 80-20 split of the data is generated into the training and testing set, respectively, and the experiments are repeated for five different splits for training and testing data (rotation method), which helps us gain high confidence about the results. As to the format of the data set, it comes as a family of input-output pairs (x_k, y_k), k = 1, 2, ..., N, where x_k ∈ R^n and y_k ∈ R. Note that when we restrict the range of values assumed by y_k to some finite set (say, integers), then we encounter a standard classification problem, while, in general, admitting continuous values assumed by the output, we are concerned with a (continuous) prediction problem.
II. OVERALL ARCHITECTURE OF THE
CLUSTER-BASED DECISION TREE
The architecture of the cluster-based decision tree develops
around fuzzy clusters that are treated as generic building blocks
of the tree. The training data set X is clustered into c clusters so that the data points (patterns) that are similar are put together. These clusters are completely characterized by their prototypes (centroids). We start with them positioned at the top nodes of the tree structure. The way of building the clusters implies a specific way in which we allocate elements of X to each of them. In other words, each cluster comes with a subset of X, namely X_i, i = 1, 2, ..., c. The process of growing the tree is guided by a certain heterogeneity criterion that quantifies the diversity of the data (with respect to the output variable y) falling under the given cluster (node). Denote the values of the heterogeneity criterion by V_1, V_2, ..., V_c, respectively (see also Fig. 1). We then choose the node with the highest value of the criterion and treat it as a candidate for further refinement. Let i_0 be the node for which the criterion assumes a maximal value, that is, V_{i_0} = max_i V_i. The i_0-th node is refined by splitting it into c clusters as visualized in Fig. 1.
Again the resulting nodes (children) of node i_0 come with
their own sets of data. The process is repeated by selecting the
most heterogeneous node out of all final nodes (see Fig. 2). The
growth of the tree is carried out by expanding the nodes and
building their consecutive levels that capture more details of the structure. It is noticeable that the node expansion leads to the
increase in either the depth or width (breadth) of the tree. The
pattern of the growth is very much implied by the characteristics
of the data as well as influenced by the number of the clusters.
Some typical patterns of growth are illustrated in Fig. 2. Con-
sidering a way in which the tree expands, it is easy to notice that
each node of the tree has exactly zero or c children.
By looking at the way of forming the nodes of the tree and
their successive splitting (refinement), we can easily observe
an interesting analogy between this approach and well-known
hierarchical divisive algorithms. Conceptually, they share the
same principle; however, the two differ in a number of technical aspects.
Fig. 1. Growing a decision tree by expanding nodes (which are viewed asclusters located at its nodes). Shadowed nodes are those with maximal valuesof the diversity criterion and thus being subject to the split operation.
Fig. 2. Typical growth patterns of the cluster-based trees: (a) depth intensiveand (b) breadth intensive.
For the completeness of the construct, each node is character-
ized by the following associated components: the heterogeneity
criterion V_i, the number of patterns associated with it, and a list
of these patterns. Moreover, each pattern on this list comes with
a degree of belongingness (membership) to that node. We pro-
vide a formal definition of the C-decision trees at a later stage
once we cover the pertinent mechanisms of their development.
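Before turning to those mechanisms, the growth policy described above can be illustrated in code. The following is a hypothetical Python sketch (ours, not the authors' implementation): `split` stands in for the FCM-based refinement detailed in Section III, and the unweighted `variability` is a placeholder for the heterogeneity criterion formalized there.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    data: list                       # (x, y) patterns routed to this node
    V: float = 0.0                   # heterogeneity of the node's outputs
    children: list = field(default_factory=list)

def variability(data):
    # spread of the outputs around their mean (placeholder criterion)
    ys = [y for _, y in data]
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def leaves(node):
    if not node.children:
        return [node]
    return [lf for ch in node.children for lf in leaves(ch)]

def grow(root, split, c=2, min_size=2, steps=10):
    """Expand, step by step, the leaf with the highest variability;
    `split` partitions a node's data into c subsets (e.g., via FCM)."""
    for _ in range(steps):
        cand = [n for n in leaves(root) if len(n.data) > max(c, min_size)]
        if not cand:
            break
        node = max(cand, key=lambda n: n.V)
        if node.V == 0.0:            # perfectly homogeneous: stop refining
            break
        node.children = [Node(part, variability(part))
                         for part in split(node.data, c)]
    return root
```

Whether a given expansion deepens or widens the tree is not prescribed here; it simply falls out of which leaf happens to be the most heterogeneous, exactly as in the depth- and breadth-intensive patterns of Fig. 2.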
III. DEVELOPMENT OF THE TREE
In this section, we concentrate on the functional details and
ensuing algorithmic aspects. The crux of the overall design is
the clustering mechanism and the manipulations realized at the level of information granules formed in this manner.
A. Fuzzy Clustering
Fuzzy clustering is a core functional part of the overall tree.
It builds the clusters and provides their full description. We con-
fine ourselves to the standard fuzzy C-means (FCM), which is
an omnipresent technique of information granulation. The de-
scription of this algorithm is well documented in the literature.
We refer the reader to [2] and [11] and revisit it here in the setting of the decision trees. The FCM algorithm is an example
of an objective-oriented fuzzy clustering where the clusters are
built through a minimization of some objective function. The
standard objective function assumes the format

Q = Σ_{i=1}^{c} Σ_{k=1}^{N} u_{ik}^m d_{ik}²    (1)

with U = [u_{ik}], i = 1, 2, ..., c, k = 1, 2, ..., N, being a partition matrix (here U denotes a family of c-by-N matrices) that satisfies a series of conditions: 1) all elements of the partition matrix are confined to the unit interval; 2) the sum over each column is equal to 1; and 3) the sum of the membership grades in each row is contained in the range (0, N).
The number of clusters is denoted by c. The data set to be clustered consists of N patterns. m is a fuzzification factor (usually m = 2.0), and d_{ik} is a distance function between the k-th data point (pattern) and the i-th prototype. The prototype of the cluster
can be treated as a typical vector or a representative of the data
forming the cluster. The feature space in which the clustering
is carried out requires a thorough description. Referring to the
format of the data we have started with, let us note that they come as ordered pairs (x_k, y_k). For the purpose of clustering, we concatenate the pairs and use the notation

z_k = [x_k  y_k]    (2)

This implies that the clustering takes place in the (n + 1)-dimensional space and involves the data distributed in the input and output space. Likewise, the resulting prototype v_i is positioned in R^{n+1}. For future use, we distinguish between the coordinates of the prototype in the input and output space by splitting them into two sections (blocks of the variables) as follows:

v_i = [ṽ_i  w_i],  with ṽ_i ∈ R^n and w_i ∈ R

It is worth emphasizing that ṽ_i describes a prototype located in the input space; this description will be of particular interest when utilizing the tree to carry out prediction tasks.
In essence, the FCM is an iterative optimization process in which we successively update the partition matrix and prototypes until some termination criterion has been satisfied. The updates of u_{ik} and v_i are governed by the following well-known expressions (cf. [2]):
Fig. 3. Node splitting controlled by the variability criterion V .
partition update

u_{ik} = 1 / Σ_{j=1}^{c} (d_{ik}/d_{jk})^{2/(m-1)}    (3)

prototype update

v_i = Σ_{k=1}^{N} u_{ik}^m z_k / Σ_{k=1}^{N} u_{ik}^m    (4)
The series of iterations is started from a randomly initiated par-
tition matrix and involves the calculations of the prototypes and
partition matrices.
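The alternating scheme of (3) and (4) can be sketched as follows; this is a minimal Python illustration of the standard FCM loop (our code, not tied to the paper's implementation; it works with squared Euclidean distances, so the exponent in the partition update becomes 1/(m-1)):

```python
import random

def dist2(a, b):
    # squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fcm(Z, c, m=2.0, iters=50, eps=1e-4, seed=0):
    """Fuzzy C-means sketch: alternate the prototype update (4) and the
    partition update (3), starting from a random partition matrix."""
    rng = random.Random(seed)
    N, dim = len(Z), len(Z[0])
    # random partition matrix whose columns sum to 1
    U = [[rng.random() for _ in range(N)] for _ in range(c)]
    for k in range(N):
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s
    V = [[0.0] * dim for _ in range(c)]
    for _ in range(iters):
        # prototype update (4): weighted means with weights u_ik^m
        for i in range(c):
            w = [U[i][k] ** m for k in range(N)]
            sw = sum(w)
            V[i] = [sum(w[k] * Z[k][j] for k in range(N)) / sw
                    for j in range(dim)]
        # partition update (3), written with squared distances
        newU = [[0.0] * N for _ in range(c)]
        for k in range(N):
            d = [max(dist2(Z[k], V[i]), 1e-12) for i in range(c)]
            for i in range(c):
                newU[i][k] = 1.0 / sum((d[i] / d[j]) ** (1.0 / (m - 1))
                                       for j in range(c))
        shift = max(abs(newU[i][k] - U[i][k])
                    for i in range(c) for k in range(N))
        U = newU
        if shift < eps:              # termination criterion
            break
    return U, V
```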
B. Node Splitting Criterion
The growth process of the tree is pursued by quantifying the
diversity of data located at the individual nodes of the tree and
splitting the nodes that exhibit the highest diversity. This intu-
itively appealing criterion takes into account the variability of
the data, finds the node with the highest value of the criterion,
and splits it into nodes that occur at the consecutive lower level
of the tree (see Fig. 3). The essence of the diversity (variability)
criterion is to quantify a dispersion of the data allocated to the
given cluster so that higher dispersion of data results in higher
values of the criterion. Recall that individual data points (pat-
terns) belong to the clusters with different membership grades;
however, for each pattern, there is a dominant cluster to which
they exhibit the highest degree of belongingness (membership).
More formally, let us represent the i-th node of the tree as an ordered triple

N_i = (X_i, Y_i, U_i)    (5)

Here, X_i denotes all elements of the data set that belong to this node in virtue of the highest membership grade

X_i = { x_k ∈ X : u_{ik} = max_j u_{jk} }

where the index j pertains to the nodes originating from the same parent.
The second set Y_i collects the output coordinates of the elements that have already been assigned to X_i, as follows:

Y_i = { y_k : x_k ∈ X_i }    (6)
Likewise, U_i is a vector of the grades of membership of the elements in X_i, as follows:

U_i = [ u_{ik} : x_k ∈ X_i ]    (7)

We define the representative of this node positioned in the output space as the weighted sum (note that in the construct hereafter we include only those elements that contribute to the cluster, so the summation is taken over Y_i and U_i), as follows:

m_i = Σ_{x_k ∈ X_i} u_{ik} y_k / Σ_{x_k ∈ X_i} u_{ik}    (8)

The variability of the data in the output space existing at this node is taken as a spread around the representative m_i, where again we consider a partial involvement of the elements in X_i by weighting the distance by the associated membership grade

V_i = Σ_{x_k ∈ X_i} u_{ik} (y_k - m_i)²    (9)
In the next step, we select the node of the tree (leaf) that has the highest value of V_i, say node i_0, and expand this node by forming its children by applying the clustering of the associated data set into c clusters. The process is then repeated: we examine the leaves of the tree and expand the one with the highest value of the diversity criterion.
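This splitting criterion admits a compact sketch; the following is our illustrative Python (function names are ours), mirroring (8) and (9):

```python
def representative(ys, us):
    """Node representative in the output space: the weighted mean of (8)."""
    return sum(u * y for u, y in zip(us, ys)) / sum(us)

def variability(ys, us):
    """Membership-weighted spread (9) of the outputs around (8)."""
    m = representative(ys, us)
    return sum(u * (y - m) ** 2 for u, y in zip(us, ys))

def most_diverse(leaf_data):
    """Return the leaf id with the highest variability; `leaf_data` maps
    a leaf id to the (outputs, memberships) of the patterns routed there."""
    return max(leaf_data, key=lambda i: variability(*leaf_data[i]))
```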
We treat a C-decision tree as a regular tree structure whose nodes are described by (5) and in which each nonterminal node has c children.
The growth of the tree is controlled by conditions under which
the clusters can be further expanded (split). We envision two in-
tuitively appealing conditions that tackle the nature of the data
behind each node. The first one is self-evident: a given node can be expanded if it contains enough data points. With c clusters, we require this number to be greater than the number of the clusters; otherwise, the clusters cannot be formed. While this is the lower bound on the cardinality of the data, practically we would expect this number to be some multiple of c. The second stopping condition pertains to the structure of
data that we attempt to discover through clustering. It becomes
obvious that once we approach smaller subsets of data, the dominant structure (which is strongly visible at the level of the entire and far more numerous data set) may not manifest that profoundly in the subset. It is likely that the smaller the data, the less pronounced its structure. This becomes reflected in the entries of the partition matrix, which tend to be equal to each other and equal to 1/c. If no structure is present, this equal distribution of membership grades occurs across each column of the partition matrix. This lack of visible structure can be quantified by the expression (for the k-th pattern)

σ_k = (c/(c-1)) Σ_{i=1}^{c} (u_{ik} - 1/c)²    (10)

If all entries of the partition matrix are equal to 1/c, then the result is equal to zero. If we encounter a full membership to a certain cluster, then the resulting value is equal to 1 (that is a
Fig. 4. Structurability index viewed as a function of c and plotted for several selected values of ε.
maximal value of the above expression). To describe the struc-
tural dependencies within the entire data set in a certain node,
we carry out calculations over all patterns located at the node of the tree

σ = (1/card(X_i)) Σ_{x_k ∈ X_i} σ_k    (11)

Again, with no structure present in the data, this expression returns a zero value.
To gain a better feel as to the lack of structure and the ensuing values of (11), let us consider a case where all entries in a certain column of the partition matrix (pattern) are equal to 1/c with some slight deviation δ. In 50% of the cases, we consider that these entries are higher than 1/c, and we put u_{ik} = 1/c + δ; in the remaining 50%, we consider the decrease over 1/c and have u_{ik} = 1/c - δ. Furthermore, let us treat δ as a fraction of the original membership grade, that is, make it equal to δ = ε/c, with ε in the interval (0, 1/2). Then, (11) reads as

σ = ε²/(c - 1)    (12)

The plot of this relationship treated as a function of ε is shown in Fig. 4. It shows how the departure from the situation where no structure has been detected (ε = 0) quantifies in the values of the structurability expression. The plot shows several curves over the number of clusters c; higher values of c lead to a substantial drop in the values of the index.
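The structurability computation can be sketched as follows; note that the c/(c-1) normalization below is our reconstruction of the index, chosen so that the uniform and crisp boundary cases evaluate to 0 and 1, as the text requires:

```python
def structurability_pattern(u_col):
    """Per-pattern index in the spirit of (10): 0 for a uniform column
    (all entries equal to 1/c), 1 for crisp membership in one cluster.
    The c/(c-1) factor is our reconstruction of the normalization."""
    c = len(u_col)
    return (c / (c - 1)) * sum((u - 1.0 / c) ** 2 for u in u_col)

def structurability(U):
    """Node-level index in the spirit of (11): the average of the
    per-pattern values over all patterns located at the node."""
    columns = list(zip(*U))          # U is c-by-N; one column per pattern
    return sum(structurability_pattern(col) for col in columns) / len(columns)
```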
The two measures introduced previously can be used as a
stopping criterion in the expansion (growth) of the tree. We can
leave the node intact once the number of patterns falls under the
assumed threshold and/or the structurability index is too low.
The first index is a sort of precondition: if not satisfied, it pre-
vents us from expanding the node. The second index comes in
a form of some postcondition: to compute its value, we have
to cluster the data first and then determine its value. It is also
stronger, as one may encounter cases where there is a significant number of data points to warrant clustering in light of the
TABLE I. EXPERIMENTAL SETTING OF THE FCM ALGORITHM
Fig. 5. Traversing a C-fuzzy tree: an implicit mode.
first criterion; however, the second one concludes that the un-
derlying structure is too weak, and this may advise us to back-
track and refuse to expand this particular node.
These two indexes support decision making that focuses on
the structural aspects of the data (namely, we decide whether to
split the data). It does not guarantee that the resulting C-tree will
be the best from the point of view of classification or prediction
of a continuous output variable. The diversity criterion (sum of V_i at the leaves) can also be viewed as another termination criterion. While conceptually appealing, we may have difficulties in translating its values into a more tangible and appealing
descriptor of the tree (obviously the lower, the better). Another
possible termination option (which may equally well apply to
each of these three indexes) is to monitor their changes along
the nodes of the tree once being built; an evident saturation of
the values of each of them could be treated as a signal to stop
growing the tree.
IV. USE OF THE TREE IN THE CLASSIFICATION
(PREDICTION) MODE
Once the C-tree has been constructed, it can be used to classify a new input x or predict a value of the associated output
Fig. 6. Classification boundaries (thick lines) for some configurations of the prototypes formed by (13) and hyperboxes: (a) v_1 = [1.5 1.2], v_2 = [2.5 2.1], v_3 = [0.6 3.5], and (b) v_1 = [1.5 1.5], v_2 = [1.5 2.6], v_3 = [0.6 3.5]. Also shown are contour plots of the membership functions of the three clusters.
variable (denoted here by ŷ). In the calculations, we rely on the membership grades computed for each cluster as follows:

u_i(x) = 1 / Σ_{j=1}^{c} (d(x, ṽ_i)/d(x, ṽ_j))^{2/(m-1)}    (13)

where d(x, ṽ_i) is a distance computed between x and ṽ_i (as a matter of fact, we have the same expression as used in the FCM method; refer to (3)). The calculations pertain to the leaves of the C-tree, so for several levels of depth we have to traverse the tree first to reach the specific leaves. This is done by computing
Fig. 7. Two-dimensional training data (239 patterns).
Fig. 8. Two-dimensional testing data (61 patterns).
u_i(x) for each level of the tree, selecting the corresponding path, and moving down (Fig. 5). At some level, we determine the path by following the node i_0 for which u_{i_0}(x) = max_i u_i(x), i = 1, 2, ..., c. Once at the i_0-th node, we repeat the process, that is, determine u_i(x), i = 1, 2, ..., c (here, we are dealing with the clusters at the successive level of the tree). The process repeats for each level of the tree. The predicted value ŷ occurring at the final leaf node is equal to the representative m_{i_0} of that node (refer to (8)).
It is of interest to show the boundaries of the classification
regions produced by the clusters (i.e., the implicit method) and
contrast them with the geometry of classification regions gener-
ated by the decision trees. In the first case, we use a straightforward classification rule:

assign x to class i_0 if u_{i_0}(x) exceeds the values of the membership in all remaining clusters, that is

u_{i_0}(x) = max_{i=1,...,c} u_i(x)
For the decision trees, the boundaries are guillotine cuts. As a
result, we get hyperboxes whose faces are parallel to the coor-
dinates. When dealing with the FCM, we can exercise the fol-
lowing method. For the given prototypes, we can project them
on the individual coordinates (variables), take averages of the
successive projected prototypes, and build hyperboxes around
the prototypes in the entire space. This approach is conceptually close to the decision trees as leading to the same geometric character of the classifier. The obvious rule holds: assign x to
Fig. 9. Top level of the C-decision tree; note two clusters of different values
of the variability index; the cluster with its higher value is shaded.
Fig. 10. Decision tree after the second expansion (iteration).
class i if it falls into the hyperbox formed around prototype v_i.
Some examples of the classification boundaries are shown in
Fig. 6. As Fig. 6 reveals, the hyperbox model of the classification
boundaries is far more conservative than the one based on
the maximal membership rule. This is intuitively appealing as
in the process of forming the hyperboxes we allowed only for
cuts that are parallel to the coordinates. It becomes apparent
that the geometry of the decision tree induced in this way
varies substantially from the far more diversified geometry of
the FCM-based class boundaries.
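The traversal of Section IV can be sketched as follows (hypothetical Python; the dict-based node layout and field names are our own illustrative choices): each child carries the input-space prototype of its cluster, each leaf the representative (8), and at every level we descend into the child of maximal membership (13).

```python
def dist2(a, b):
    # squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def memberships(x, prototypes, m=2.0):
    """Membership of x in each child cluster, eq. (13); with squared
    distances the exponent becomes 1/(m-1), as in the FCM update."""
    d = [max(dist2(x, v), 1e-12) for v in prototypes]
    return [1.0 / sum((d[i] / d[j]) ** (1.0 / (m - 1)) for j in range(len(d)))
            for i in range(len(d))]

def predict(node, x):
    """Descend the C-tree by following, at each level, the child with the
    highest membership; return the representative (8) at the reached leaf."""
    while node["children"]:
        u = memberships(x, [ch["proto"] for ch in node["children"]])
        node = node["children"][u.index(max(u))]
    return node["m"]
```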
V. EXPERIMENTAL STUDIES
The experiments conducted in the study involve both
prediction problems (in which the output variable is continuous) and those of a classification nature (where we encounter
Fig. 11. Complete C-tree; note that the values of the variability criterion have reached zero at all leaves, and this terminates further growth of the tree.
several discrete classes). Experiments 1 and 2 concern two-dimensional (2-D) synthetic data sets. Data used in experi-
ments 3 and 4 come from the Machine Learning repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html), which
makes the experiments fully reproducible and facilitates further
comparative analysis. The data sets we experimented with are
as follows: (experiment 3) auto-mpg [9] and (a) pima diabetes
[9], (b) ionosphere [9], (c) hepatitis [9], (d) dermatology [9],
and (e) auto data [14] in experiment 4. The first one deals
with a continuous problem, while the other ones concern dis-
crete-class data. In all experiments, we use the FCM algorithm
with the settings summarized in Table I. As far as learning and
prediction abilities of the tree are concerned, we proceed with a
fivefold cross-validation that generates an 80-20 split by taking
80% of the data as a training set and testing the tree on the
remaining 20% of the data set. Furthermore, the experiments
are repeated five times by taking splits of data into the training
and testing part, respectively.
A. Experiment 1
Experiment 1 is a 2-D synthetic data set generated by uniform
distribution random generators. The training set comprises 239
data points (patterns), while the testing set consists of 61 pat-
terns. These two data sets are visualized in Figs. 7 and 8, re-
spectively.
The results of the FCM (with c = 2) are visualized in Fig. 9. Here, we report the prototypes of each cluster, the values of the splitting criterion (V_i), the number of the data points from the training set allocated to the cluster, and the predicted value at the node (class), which is rounded to the nearest integer value of the node representative (8); this is evident as we are dealing with two discrete classes (labels) of the patterns.
In the next step, we select the first node of the tree, which
is characterized by the highest value of the variability index,
and expand it by forming two children nodes by applying the
FCM algorithm to data associated with this original node. The
decision tree grown in this manner is visualized in Fig. 10.
At the next step, we select the second node of the tree (the
one with the highest variability) and expand it in the same way
as before (see Fig. 11). As expected, the classification error is equal to zero both for
the training and testing set. This is not surprising considering
that the classes are positioned quite apart from each other.
B. Experiment 2
Two-dimensional synthetic patterns used in this experiment
are normally distributed with some overlap between the classes
(see Figs. 12 and 13). The resulting C-decision tree is visualized
in Fig. 14. For this tree, an average classification error on the
training data is equal to 0.001250 (with a standard deviation equal to 0.002013). For the testing data, these numbers are equal to 0.188333 and 0.043780, respectively.
Fig. 12. Two-dimensional training data (240 patterns).
Fig. 13. Two-dimensional testing data (60 patterns).
C. Experiment 3
This auto-mpg data set [9] involves a collection of vehicles
described by a number of features (such as weight of a vehicle,
number of cylinders, and fuel consumption). We complete a series of experiments in which we sweep through the number of the clusters (c), varying it from 1 to 20, and carrying out 20 expansions (iterations) of the tree (the tree is expanded step by step,
leading either to its in-depth or in-breadth expansion). The variability observed at all leaves of the tree, taken as the sum of V_i over the leaves, characterizes a process of the growth of the tree (refer to Figs. 1 and 2).
The variability measure is reported for the training and testing
set as visualized in Figs. 15 and 16. The variability goes down
with the growth of the tree, and this becomes evident for the
training and testing data. It is also clear that most of the changes
in the reduction of the variability occur at the early stages of the
growth of the trees; afterwards, the changes are quite limited.
Likewise, we note that the drop in the variability values becomes
visible when moving from two to three or four clusters. Noticeably, for the increased number of the clusters, the variability is
Fig. 14. Complete C-tree; as the values of the variability criterion have reached zero at all leaves, this terminates further growth of the tree.
Fig. 15. Variability of the tree reported during the growth of the tree (training data) for a selected number of clusters.
Fig. 16. Variability of the tree reported during the growth of the tree (testing data) for a selected number of clusters.
practically left unaffected (we see a series of barely distinguishable plots for c greater than five). This effect is present for the
training and testing data.
After taking a careful look at the variability of the tree, we
conclude that the optimal configuration occurs at five clusters with the number of expansions equal to seven. In this case, the
resulting tree is portrayed in Fig. 17. The arrows shown there
along with the labels (numbers) visualize a growth of the tree,
namely a way in which the tree is grown (expanded in consecu-
tive iterations). The numbers in circles denote the node number.
The last digit of the node number denotes the number of clus-
ters while the beginning digits denote the parent node number.
A detailed description of the nodes is given in Table II. Again, we report the details of the tree, including the number of patterns residing at each node as well as their variability V_i and the predicted value (8) at the node.
While the variability criterion is an underlying measure in the
design process, the predictive capabilities of the C-decision tree
are quantified by the following performance index:
Q = (1/N) Σ_{k=1}^{N} (ŷ_k - y_k)²    (14)

In the above expression, ŷ_k denotes the predicted value occurring at the corresponding terminal node (refer to (8)). More specifically, the representative of this node in the output space is calculated as a weighted sum of those elements from the training
Fig. 17. Detailed C-decision tree for the optimal number of clusters and iterations; see a detailed description in text.
set that contribute to this node, while y_k is the output value encountered in the data set. For a discrete (classification) problem,
this index is simply a classification error that is determined by
counting the number of patterns that have been misclassified by
the C-decision tree.
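This performance index can be sketched as follows; since the exact algebraic form of (14) is not fully recoverable from the scan, the continuous branch below assumes a mean-squared-error form, while the discrete branch counts misclassifications as the text states:

```python
def performance_index(y_true, y_pred, discrete=False):
    """Average prediction error in the spirit of (14): a mean-squared-error
    form (our assumption) for continuous outputs, and the misclassification
    rate for discrete class labels."""
    n = len(y_true)
    if discrete:
        # fraction of patterns misclassified by the tree
        return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / n
    # mean squared deviation between outputs and node representatives
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n
```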
The values of the error obtained on the training set for a
number of different configurations of the tree (number of clus-
ters and iterations) are shown in Fig. 18. Again, most of the
changes occur for low values of the clusters and are character-
istic of the early phases of the growth of the tree.
We see that low values of c do not reduce the error even with a substantial growth (number of iterations) of the tree. Similarly,
we observe that the same effect occurs for the testing set (see
Fig. 19) (obviously, these results are reported for the sake of
completeness; in practice, we choose the best tree on a basis
of the training set and test it by means of the testing data).
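The selection rule mentioned in passing here (choose the best tree on the basis of the training set, then evaluate it on the testing data) can be sketched as follows; the dictionary layout and the function name are our assumptions for this illustration:

```python
def best_configuration(training_errors):
    # Model selection as described above: among the examined
    # configurations (number of clusters, number of iterations), choose
    # the one with the lowest error on the training set; the testing set
    # is used only to report the performance of that final choice.
    # The (c, iterations) -> error mapping is an illustrative layout.
    return min(training_errors, key=training_errors.get)
```

For instance, with training errors {(2, 5): 0.31, (3, 4): 0.12, (6, 10): 0.20}, the configuration (3, 4) would be retained.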
D. Experiment 4
Again, we use a data set from the Machine Learning repository [9]: the two-class pima-diabetes set consisting of 768 patterns distributed in an eight-dimensional feature space. In the design of the C-tree, we use the development procedure outlined in the previous section: a node with the maximal value of diversity
is chosen as a potential candidate to expand (split into clusters). Prior to this node unfolding, we check whether a sufficient number of patterns is located there (here, we consider the criterion satisfied when this number is greater than the number of clusters). Once this holds, we proceed with the clustering and then look at the structurability index (9), whose value should be greater than or equal to 0.05 (this threshold has been selected arbitrarily), to accept the unfolding of the node.
The number of iterations is set to 10. The plots of the variability shown in Fig. 20 indicate that the number of iterations does not have a substantial effect on the value of V; the changes occur mainly during the first few iterations (expansions of the tree). Similarly, there are changes when we increase the number of clusters from two up to six, but beyond this the changes are not very significant.
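The expansion policy described above (pick the node of maximal variability, check its cardinality, then accept the split only if the structurability index reaches the threshold) can be sketched as follows. The node representation as a dictionary, the key names, and the function name are illustrative assumptions; in particular, the structurability value is assumed to have been computed via (9) after clustering:

```python
def pick_node_to_expand(nodes, c, struct_threshold=0.05):
    # Sketch of the node-expansion policy: nodes are tried in decreasing
    # order of their variability V; a node qualifies when it holds more
    # patterns than the intended number of clusters c (cardinality
    # criterion) and its structurability index reaches the threshold.
    for node in sorted(nodes, key=lambda n: n['V'], reverse=True):
        if len(node['X']) <= c:                      # too few patterns
            continue
        if node['structurability'] < struct_threshold:
            continue
        return node
    return None  # no node qualifies: the growth of the tree stops
```

Note that the highest-variability node is not necessarily the one expanded; it may be rejected on cardinality or structurability grounds, in which case the next candidate is examined.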
When using the C-decision tree in the predictive mode, its
performance is evaluated by means of (14). The collection of
pertinent plots is shown in Figs. 21 and 22 for the training data
(the performance results are averaged over the series of exper-
iments). Similarly, the results of the tree on the testing set are
visualized in Fig. 23. Evidently, with the increase in the number
of clusters, we note a drop in the error values; yet, values of c that are too high lead to some fluctuations of the error (so it is not evident that growing larger trees is still fully legitimate). Such fluctuations are even more profound when studying the plots of error
TABLE II. DESCRIPTION OF THE NODES OF THE C-DECISION TREE INCLUDED IN FIG. 17
Fig. 18. Average error of the C-decision tree reported for the training set.
reported for the testing set. In a search for the optimal con-
figuration, we have found that the number of clusters between
three and five and a few iterations led to the best results (see
Fig. 24). We observe an evident tendency: while growing larger trees is definitely beneficial in the case of the training data (generally, the error is reduced, with a few exceptions), the error changes only marginally on the testing data (where the changes are in the range of 1%).
Fig. 19. Average error of the C-decision tree reported for the testing set.

It is of interest to compare the results produced by the C-decision tree with those obtained when applying standard decision trees, namely C4.5. In this analysis, we
have experimented with the software available on the Web
(http://www.cse.unsw.edu.au/~quinlan/), which is C4.5 revision 8 run with the standard settings (i.e., selection of the attribute that maximizes the information gain; no pruning
was used). The results are summarized in Table III. Following
the experimental scenario outlined at the beginning of the
section, we report the mean values and the standard deviation
of the error. For the C-decision trees, the number of nodes is equal to the number of clusters multiplied by the number of iterations.
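This bookkeeping can be stated as a one-line computation; the function name is ours, and the relation is taken directly from the sentence above:

```python
def c_tree_node_count(clusters, iterations):
    # Node count of a C-decision tree as reported for Table III:
    # the number of clusters multiplied by the number of iterations.
    return clusters * iterations
```

For example, halving the number of iterations at a fixed number of clusters halves the node count, which is consistent with the observation below that the smaller C-tree is one half the size of the larger one.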
Fig. 20. Variability (V) for the pima-diabetes data (training set).
Fig. 21. Error (e) as a function of iterations (expansions of the tree) for the training set for a selected number of clusters.
Fig. 22. Error (e) as a function of the number of clusters for selected iterations.
Overall, we note that the C-tree is more compact (in terms of
the number of nodes). This is not surprising as its nodes are more
complex than those in the original decision tree. If our intent is
to have smaller and more compact structures, C-trees become
quite an appealing architecture. The results on the training sets
are better for the C-trees, at the level of a 3%–6% improvement (for the pima data set). The standard deviation of the error is two times lower for these trees in comparison with C4.5.
For the testing set, we note that the larger out of the two C-trees
in Table III produces almost the same results as the C4.5. With
the smaller C-tree, we note an increase in the classification rate by 1% in comparison with the larger structure; however, the size of the tree has been reduced to one half of that of the larger tree (on the pima data).

Fig. 23. Error (e) for the pima-diabetes data (testing set).

Fig. 24. Classification error for the C-decision tree versus successive iterations (expansions) of the tree; c = 3 and 5; both the training and testing sets are included.

The increase in the size of the tree does not dramatically affect the classification results; the classification rates tend to be better but do not differ significantly from structure to structure. In general, we note that the C-decision tree produces more
consistent results in terms of the classification for the training
and testing sets; these are closer when compared with the results
produced by the C4.5 tree. In some cases, the results of the C-decision tree are better than those of C4.5; this happens for the hepatitis data.
VI. CONCLUSION
The C-decision trees are classification constructs that are built on the basis of information granules, that is, fuzzy clusters. The way
in which these trees are constructed deals with successive re-
finements of the clusters (granules) forming the nodes of the
tree. When growing the tree, the nodes (clusters) are split into
granules of lower diversity (higher homogeneity). In contrast to C4.5-like trees, all features are used at each node at the same time, and such a development approach promotes more compact trees and a versatile geometry of the partition of the feature space. The experimental studies illustrate the main features of the C-trees. One of them is quite profound and highly desirable for any practical
TABLE III. C-DECISION TREE AND C4.5: A COMPARATIVE ANALYSIS FOR SEVERAL MACHINE LEARNING DATA SETS: (a) PIMA-DIABETES, (b) IONOSPHERE, (c) HEPATITIS (IN THIS DATA SET, ALL MISSING VALUES WERE REPLACED BY THE AVERAGES OF THE CORRESPONDING ATTRIBUTES), (d) DERMATOLOGY, AND (e) AUTO DATA
usage: the difference in performance of the C-trees between the training and testing sets is lower than that reported for C4.5.
The C-tree can also be regarded as a certain blueprint for detailed models that can be formed on a local basis by considering
data allocated to the individual nodes. At this stage, the models
are refined by choosing their topology (e.g., linear models and
neural networks) and making a decision about detailed learning.
REFERENCES
[1] W. P. Alexander and S. Grimshaw, "Treed regression," J. Computational and Graphical Statistics, vol. 5, pp. 156–175, 1996.
[2] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[4] E. Cantu-Paz and C. Kamath, "Using evolutionary algorithms to induce oblique decision trees," in Proc. Genetic and Evolutionary Computation Conf. 2000, D. Whitley, D. E. Goldberg, E. Cantu-Paz, L. Spector, L. Partnee, and H.-G. Beyer, Eds., San Francisco, CA, pp. 1053–1060.
[5] A. Dobra and J. Gehrke, "SECRET: A scalable linear regression tree algorithm," in Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Edmonton, AB, Canada, Jul. 2002.
[6] A. Ittner and M. Schlosser, "Non-linear decision trees (NDT)," in Proc. 13th Int. Conf. Machine Learning (ICML'96), Bari, Italy, Jul. 3–6, 1996.
[7] A. Ittner, J. Zeidler, R. Rossius, W. Dilger, and M. Schlosser, "Feature space partitioning by non-linear fuzzy decision trees," in Proc. Int. Fuzzy Systems Assoc., pp. 394–398.
[8] A. K. Jain et al., "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sep. 1999.
[9] C. J. Merz and P. M. Murphy. (1996) UCI Repository for Machine Learning Databases. Dept. of Information and Computer Science, University of California, Irvine, CA. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[10] S. K. Murthy, S. Kasif, and S. Salzberg, "A system for induction of oblique decision trees," J. Artificial Intelligence Res., vol. 2, pp. 1–32, 1994.
[11] W. Pedrycz and Z. A. Sosnowski, "Designing decision trees with the use of fuzzy granulation," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 30, no. 2, pp. 151–159, Mar. 2000.
[12] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, pp. 81–106, 1986.
[13] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann, 1993.
[14] J. S. Siebert, "Vehicle Recognition Using Rule-Based Methods," Research Memo TIRM-87-017, Turing Institute, 1987.
[15] R. Weber, "Fuzzy ID3: A class of methods for automatic knowledge acquisition," in Proc. 2nd Int. Conf. Fuzzy Logic and Neural Networks, Iizuka, Japan, Jul. 17–22, 1992, pp. 265–268.
[16] O. T. Yildiz and E. Alpaydin, "Omnivariate decision trees," IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1539–1546, Nov. 2001.
Witold Pedrycz (M'88, SM'94, F'99) is a Professor and Canada Research Chair (CRC) in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is actively pursuing research in computational intelligence, fuzzy modeling, knowledge discovery and data mining, fuzzy control (including fuzzy controllers), pattern recognition, knowledge-based neural networks, relational computation, bioinformatics, and software engineering. He has published numerous papers in these areas. He is also an author of eight research monographs covering various aspects of computational intelligence and software engineering.
Dr. Pedrycz has been a member of numerous program committees of IEEE conferences in the areas of fuzzy sets and neurocomputing. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, Parts A and B, and the IEEE TRANSACTIONS ON FUZZY SYSTEMS. He is the Editor-in-Chief of Information Sciences, President-Elect of the International Fuzzy Systems Association (IFSA), and President of the North American Fuzzy Information Processing Society (NAFIPS).
Zenon A. Sosnowski (M'99) received the M.Sc. degree in mathematics from the University of Warsaw, Warsaw, Poland, in 1976 and the Ph.D. degree in computer science from the Warsaw University of Technology, Warsaw, Poland, in 1986.
He has been with the Technical University of Bialystok, Bialystok, Poland, since 1976, where he is an Assistant Professor in the Department of Computer Science. In 1988–1989, he spent five months at the Delft University of Technology in the Netherlands. He spent two years (1990–1991) with the Knowledge Systems Laboratory of the National Research Council's Institute for Information Technology, Ottawa, ON, Canada. His research interests include artificial intelligence, expert systems, approximate reasoning, fuzzy sets, and knowledge engineering.
Dr. Sosnowski is a Member of the IEEE Systems, Man, and Cybernetics Society.