
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART C: APPLICATIONS AND REVIEWS, VOL. 35, NO. 4, NOVEMBER 2005

C-Fuzzy Decision Trees

Witold Pedrycz, Fellow, IEEE, and Zenon A. Sosnowski, Member, IEEE

Abstract: This paper introduces a concept and design of decision trees based on information granules: multivariable entities characterized by high homogeneity (low variability). As such granules are developed via fuzzy clustering and play a pivotal role in the growth of the decision trees, they will be referred to as C-fuzzy decision trees. In contrast with standard decision trees, in which one variable (feature) is considered at a time, this form of decision tree involves all variables at each node of the tree. Obviously, this gives rise to a completely new geometry of the partition of the feature space that is quite different from the guillotine cuts implemented by standard decision trees. The growth of the C-decision tree is realized by expanding the node of the tree characterized by the highest variability of the information granule residing there. This paper shows how the tree is grown depending on some additional node expansion criteria, such as the cardinality (number of data) at a given node and the level of structural dependencies (structurability) of the data existing there. A series of experiments is reported using both synthetic and machine learning data sets. The results are compared with those produced by the standard version of the decision tree (namely, C4.5).

Index Terms: Decision trees, depth-and-breadth tree expansion, experimental studies, fuzzy clustering, node variability, tree growing.

    I. INTRODUCTION

DECISION trees [12], [13] are commonly used architectures of machine learning and classification systems. They come with a comprehensive list of various training and pruning schemes, a diversity of discretization (quantization) algorithms, and a series of detailed learning refinements [1], [3]-[7], [10], [11], [15], [16]. In spite of such a variety of underlying development activities, one can easily witness several fundamental properties that cut quite evidently across the entire spectrum of decision trees. First, the trees operate on discrete attributes that assume a finite (usually quite small) number of values. Second, in the design procedure, one attribute is chosen at a time. More specifically, one selects the most discriminative attribute and expands (grows) the tree by adding the node whose attribute values are located at the branches originating from this node. The discriminatory power of the attribute (which stands behind its selection out of the collection of the attributes existing in the problem at hand) is quantified by means of some criterion such as entropy, Gini index, etc. [13], [5]. Third, decision trees in their generic version are predominantly applied to discrete class problems (the continuous prediction problems are handled by regression trees). Interestingly, these three fundamental features somewhat restrict the nature of the trees and identify a range of applications that are pertinent in this setting. When dealing with continuous attributes, it is evident that discretization is a must. As such, it directly impacts the performance of the tree. One may argue that the discretization requires some optimization that, by being guided by the classification error, can be realized once the development of the tree has been finalized. In this sense, the second phase (namely, the way in which the tree has been grown and the sequence of the attributes selected) is inherently affected by the discretization mechanism. In a nutshell, it means that these two design steps cannot be disjointed. The growth of the tree relying on a choice of a single attribute can also be treated as a conceptual drawback. While being quite simple and transparent, it could well be that considering two or more attributes as an indivisible group of variables occurring as the discrimination condition located at some node of the tree may lead to a better tree.

Having these shortcomings clearly identified, the objective of this study is to develop a new class of decision trees that attempts to alleviate these problems. The underlying conjecture is that data can be perceived as a collection of information granules [11]. Thus, the tree becomes spanned over these granules, treated now as fundamental building blocks. In turn, information granules and information granulation are almost a synonym of clusters and clustering [2], [8]. Subscribing to the notion of fuzzy clusters (and fuzzy clustering), the authors intend to capture the continuous nature of the classes so that there is no restriction of the use of these constructs to discrete problems. Furthermore, fuzzy granulation helps link the discretization problem with the formation of the tree in a direct and intimate manner. As fuzzy clusters are the central concept behind the generalized tree, the trees will be referred to as cluster-oriented decision trees or C-decision trees, for short.

The material of this study is organized in the following manner. Section II provides an introduction to the architecture of the tree by discussing the underlying processes of its in-depth and in-breadth growth. Section III brings more details on the development of the C-trees, where we concentrate on the functional and algorithmic details (fuzzy clustering and various criteria of node splitting leading to the specific pattern of tree growing). Next, Section IV elaborates on the use of the trees in the classification or prediction mode. A series of comparative numeric studies is presented in Section V.

Manuscript received February 10, 2004; revised June 9, 2004 and September 9, 2004. This work was supported in part by the Canada Research Chair (CRC) Program of the Natural Science and Engineering Research Council of Canada (NSERC). The work of Z. A. Sosnowski was supported in part by the Technical University of Bialystok under Grant W/WI/8/02.

W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada, and also with the Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland (e-mail: [email protected]).

Z. A. Sosnowski is with the Department of Computer Science, Technical University of Bialystok, Bialystok 15-351, Poland (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSMCC.2004.843205


The study adheres to the standard notation and notions used in the literature of machine learning and fuzzy sets. The way of evaluating the performance of the tree is standard, as a five-fold cross-validation is used. More specifically, in each pass, an 80-20 split of the data is generated into the training and testing set, respectively, and the experiments are repeated for five different splits for training and testing data (rotation method), which helps us gain high confidence about the results. As to the format of the data set, it comes as a family of input-output pairs (x_k, y_k), k = 1, 2, ..., N, where x_k \in R^n and y_k \in R. Note that when we restrict the range of values assumed by y_k to some finite set (say, integers), then we encounter a standard classification problem, while, in general, admitting continuous values assumed by the output, we are concerned with a (continuous) prediction problem.
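As a rough illustration of this evaluation protocol, the following sketch (a generic Python snippet, not the authors' code) produces five 80-20 training/testing splits by rotating through the folds of a shuffled data set; the function name rotation_splits and the fixed seed are choices made here for illustration only.

```python
# A minimal sketch of the 80-20 "rotation" evaluation described above: the data
# are shuffled once, cut into five folds, and each fold in turn serves as the
# 20% testing set while the remaining folds form the training set.
import numpy as np

def rotation_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a five-fold 80-20 rotation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# Example: report the split sizes for a data set of 300 patterns (240/60 per pass).
if __name__ == "__main__":
    for train_idx, test_idx in rotation_splits(300):
        print(len(train_idx), len(test_idx))
```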

II. OVERALL ARCHITECTURE OF THE CLUSTER-BASED DECISION TREE

The architecture of the cluster-based decision tree develops around fuzzy clusters that are treated as generic building blocks of the tree. The training data set X is clustered into c clusters so that the data points (patterns) that are similar are put together. These clusters are completely characterized by their prototypes (centroids). We start with them positioned at the top nodes of the tree structure. The way of building the clusters implies a specific way in which we allocate the elements of X to each of them. In other words, each cluster comes with a subset of X, namely X_1, X_2, ..., X_c. The process of growing the tree is guided by a certain heterogeneity criterion that quantifies the diversity of the data (with respect to the output variable y) falling under the given cluster (node). Denote the values of the heterogeneity criterion by V_1, V_2, ..., V_c, respectively (see also Fig. 1). We then choose the node with the highest value of the criterion and treat it as a candidate for further refinement. Let i_0 be the one for which the criterion assumes a maximal value, that is, i_0 = arg max_i V_i. The i_0-th node is refined by splitting it into c clusters as visualized in Fig. 1.

Again, the resulting nodes (children) of node i_0 come with their own sets of data. The process is repeated by selecting the most heterogeneous node out of all final nodes (see Fig. 2). The growth of the tree is carried out by expanding the nodes and building their consecutive levels that capture more details of the structure. It is noticeable that the node expansion leads to an increase in either the depth or the width (breadth) of the tree. The pattern of the growth is very much implied by the characteristics of the data as well as influenced by the number of the clusters. Some typical patterns of growth are illustrated in Fig. 2. Considering the way in which the tree expands, it is easy to notice that each node of the tree has exactly zero or c children.

By looking at the way of forming the nodes of the tree and their successive splitting (refinement), we can easily observe an interesting analogy between this approach and the well-known hierarchical divisive algorithms. Conceptually, they share the same principle; however, there are a number of technical aspects that differentiate the two.

Fig. 1. Growing a decision tree by expanding nodes (which are viewed as clusters located at its nodes). Shadowed nodes are those with maximal values of the diversity criterion and thus subject to the split operation.

    Fig. 2. Typical growth patterns of the cluster-based trees: (a) depth intensiveand (b) breadth intensive.

For the completeness of the construct, each node is characterized by the following associated components: the heterogeneity criterion V, the number of patterns associated with it, and a list of these patterns. Moreover, each pattern on this list comes with a degree of belongingness (membership) to that node. We provide a formal definition of the C-decision trees at a later stage, once we cover the pertinent mechanisms of their development.
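To make this bookkeeping concrete, the following minimal Python sketch shows one possible way to represent a node of the C-decision tree together with the quantities just listed; the class name CTreeNode and its fields are assumptions of this sketch, not the authors' implementation.

```python
# A minimal sketch (an assumed convenient representation, not the authors'
# implementation) of the bookkeeping attached to every node of the tree:
# the heterogeneity value V, the patterns allocated to the node, and the
# membership grade of each of those patterns.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class CTreeNode:
    prototype: np.ndarray        # cluster prototype (input and output parts concatenated)
    pattern_idx: List[int]       # indices of the patterns assigned to the node
    memberships: np.ndarray      # membership grade of each assigned pattern
    representative: float = 0.0  # output-space representative of the node (predicted value)
    variability: float = 0.0     # heterogeneity criterion V of the node
    children: List["CTreeNode"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        # A node either remains a leaf or is split into exactly c children.
        return not self.children
```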

    III. DEVELOPMENT OF THE TREE

In this section, we concentrate on the functional details and the ensuing algorithmic aspects. The crux of the overall design is the clustering mechanism and the manipulations realized at the level of the information granules formed in this manner.


    A. Fuzzy Clustering

Fuzzy clustering is a core functional part of the overall tree. It builds the clusters and provides their full description. We confine ourselves to the standard fuzzy C-means (FCM), which is an omnipresent technique of information granulation. The description of this algorithm is well documented in the literature. We refer the reader to [2] and [11] and revisit it here in the setting of the decision trees. The FCM algorithm is an example of objective-oriented fuzzy clustering where the clusters are built through a minimization of some objective function. The standard objective function assumes the format

Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} d_{ik}^{2}    (1)

with U = [u_{ik}], i = 1, 2, ..., c, k = 1, 2, ..., N, being a partition matrix (here U belongs to the family of c-by-N matrices that satisfy a series of conditions: 1) all elements of the partition matrix are confined to the unit interval; 2) the sum over each column is equal to 1; and 3) the sum of the membership grades in each row is contained in the range (0, N)). The number of clusters is denoted by c. The data set to be clustered consists of N patterns. m is a fuzzification factor (usually m = 2.0), and d_{ik} is a distance between the k-th data point (pattern) and the i-th prototype. The prototype of the cluster can be treated as a typical vector or a representative of the data forming the cluster. The feature space in which the clustering is carried out requires a thorough description. Referring to the format of the data we have started with, let us note that they come as ordered pairs (x_k, y_k). For the purpose of clustering, we concatenate the pairs and use the notation

z_k = [x_k  y_k]    (2)

This implies that the clustering takes place in the (n+1)-dimensional space and involves the data distributed in the input and output space. Likewise, the resulting prototype v_i is positioned in R^{n+1}. For future use, we distinguish between the coordinates of the prototype in the input and output space by splitting them into two sections (blocks of variables) as follows:

v_i = [ \hat{v}_i   \tilde{v}_i ],   with \hat{v}_i \in R^n and \tilde{v}_i \in R.

It is worth emphasizing that \hat{v}_i describes a prototype located in the input space; this description will be of particular interest when utilizing the tree to carry out prediction tasks.

Fig. 3. Node splitting controlled by the variability criterion V.

In essence, the FCM is an iterative optimization process in which we iteratively update the partition matrix and the prototypes until some termination criterion has been satisfied. The updates of the values of u_{ik} and v_i are governed by the following well-known expressions (cf. [2]):

partition update

u_{ik} = 1 / \sum_{j=1}^{c} ( d_{ik} / d_{jk} )^{2/(m-1)}    (3)

prototype update

v_i = \sum_{k=1}^{N} u_{ik}^{m} z_k / \sum_{k=1}^{N} u_{ik}^{m}    (4)

The series of iterations is started from a randomly initiated partition matrix and involves the calculation of the prototypes and partition matrices.
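For reference, a compact Python sketch of the FCM loop defined by (1)-(4) is given below; it is a generic textbook implementation (random initialization of the partition matrix, Euclidean distance, fuzzification factor m = 2), not the code used in the paper.

```python
# A compact sketch of the standard FCM iteration (equations (1)-(4)).
import numpy as np

def fcm(data, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means: returns (prototypes V, partition matrix U of shape c x N)."""
    rng = np.random.default_rng(seed)
    N = data.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)                      # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ data) / Um.sum(axis=1, keepdims=True)    # prototype update (4)
        d = np.linalg.norm(data[None, :, :] - V[:, None, :], axis=2)   # distances d_ik
        d = np.fmax(d, 1e-12)                              # guard against zero distances
        # partition update (3): u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
        if np.max(np.abs(U_new - U)) < tol:                # stop once the partition settles
            U = U_new
            break
        U = U_new
    Um = U ** m
    V = (Um @ data) / Um.sum(axis=1, keepdims=True)        # prototypes consistent with final U
    return V, U

# Example: V, U = fcm(np.random.default_rng(1).normal(size=(100, 3)), c=4)
```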

    B. Node Splitting Criterion

The growth process of the tree is pursued by quantifying the diversity of the data located at the individual nodes of the tree and splitting the nodes that exhibit the highest diversity. This intuitively appealing criterion takes into account the variability of the data, finds the node with the highest value of the criterion, and splits it into c nodes that occur at the consecutive lower level of the tree (see Fig. 3). The essence of the diversity (variability) criterion is to quantify the dispersion of the data allocated to the given cluster, so that a higher dispersion of the data results in higher values of the criterion. Recall that individual data points (patterns) belong to the clusters with different membership grades; however, for each pattern, there is a dominant cluster to which it exhibits the highest degree of belongingness (membership).

More formally, let us represent the i-th node of the tree as an ordered triple

⟨ X_i, Y_i, U_i ⟩    (5)

Here, X_i denotes all elements of the data set that belong to this node by virtue of the highest membership grade,

X_i = { x_k : u_{ik} ≥ u_{jk} for all j ≠ i }

where the index j pertains to the nodes originating from the same parent.

The second set, Y_i, collects the output coordinates of the elements that have already been assigned to X_i, as follows:

Y_i = { y_k : x_k ∈ X_i }    (6)

Likewise, U_i is a vector of the grades of membership of the elements in X_i, as follows:

U_i = [ u_{ik} : x_k ∈ X_i ]    (7)

We define the representative of this node positioned in the output space as the weighted sum (note that in the construct hereafter we include only those elements that contribute to the cluster, so the summation is taken over X_i and Y_i), as follows:

m_i = \sum_{x_k ∈ X_i} u_{ik} y_k / \sum_{x_k ∈ X_i} u_{ik}    (8)

The variability of the data in the output space existing at this node is taken as a spread around the representative m_i, where again we consider a partial involvement of the elements in X_i by weighting the distance by the associated membership grade

V_i = \sum_{x_k ∈ X_i} u_{ik} ( y_k - m_i )^2    (9)

In the next step, we select the node of the tree (leaf) that has the highest value of V_i and expand the node by forming its children, applying the clustering of the associated data set into c clusters. The process is then repeated: we examine the leaves of the tree and expand the one with the highest value of the diversity criterion.
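The computation of the node representative (8) and the variability (9) can be sketched as follows; the assignment of patterns by maximal membership follows (5), while the use of a squared distance in (9) is an assumption consistent with the description above, not a form confirmed by the paper.

```python
# A short sketch of the node statistics defined above.
import numpy as np

def node_statistics(U, y):
    """U: c x N partition matrix of a node's data, y: array of the N output values.
    For every cluster i, return (assigned indices X_i, representative m_i, variability V_i)."""
    c, _ = U.shape
    winners = U.argmax(axis=0)                   # dominant cluster of each pattern
    stats = []
    for i in range(c):
        idx = np.flatnonzero(winners == i)       # patterns for which cluster i wins
        if idx.size == 0:                        # empty cluster: nothing to spread
            stats.append((idx, 0.0, 0.0))
            continue
        u = U[i, idx]
        m_i = float(np.sum(u * y[idx]) / np.sum(u))           # representative (8)
        V_i = float(np.sum(u * (y[idx] - m_i) ** 2))          # variability (9)
        stats.append((idx, m_i, V_i))
    return stats

# The leaf chosen for expansion is the one with the largest variability:
# i0 = max(range(len(stats)), key=lambda i: stats[i][2])
```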

We treat a C-decision tree as a regular tree structure whose nodes are described by (5) and in which each nonterminal node has c children. The growth of the tree is controlled by conditions under which the clusters can be further expanded (split). We envision two intuitively appealing conditions that tackle the nature of the data behind each node. The first one is self-evident: a given node can be expanded only if it contains enough data points. With c clusters, we require this number to be greater than the number of clusters; otherwise, the clusters cannot be formed. While this is the lower bound on the cardinality of the data, practically we would expect this number to be a small multiple of c. The second stopping condition pertains to the structure of the data that we attempt to discover through clustering. It becomes obvious that once we approach smaller subsets of data, the dominant structure (which is strongly visible at the level of the entire and far more numerous data set) may not manifest itself that profoundly in the subset. It is likely that the smaller the data set, the less pronounced its structure. This becomes reflected in the entries of the partition matrix, which tend to be equal to each other and equal to 1/c. If no structure is present, this equal distribution of membership grades occurs across each column of the partition matrix. This lack of visible structure can be quantified by the following expression (for the k-th pattern):

( c \sum_{i=1}^{c} u_{ik}^{2} - 1 ) / ( c - 1 )    (10)

If all entries of the partition matrix are equal to 1/c, then the result is equal to zero. If we encounter full membership in a certain cluster, then the resulting value is equal to 1 (that is, the maximal value of the above expression). To describe the structural dependencies within the entire data set present at a certain node, we carry out the calculations over all patterns located at the node of the tree, averaging (10) over them:

( 1 / N_i ) \sum_{k=1}^{N_i} ( c \sum_{j=1}^{c} u_{jk}^{2} - 1 ) / ( c - 1 )    (11)

where N_i denotes the number of patterns at the node and u_{jk} are the entries of the partition matrix obtained by clustering the data located there. Again, with no structurability present in the data, this expression returns a value of zero.

Fig. 4. Structurability index (11) viewed as a function of ε, plotted for several selected values of c.

To gain a better feel for the lack of structure and the ensuing values of (11), let us consider a case where all entries in a certain column of the partition matrix (pattern) are equal to 1/c with some slight deviation δ. In 50% of the cases, we consider that these entries are higher than 1/c and put u = 1/c + δ; in the remaining 50%, we consider the decrease over 1/c and have u = 1/c - δ. Furthermore, let us treat δ as a fraction of the original membership grade, that is, make it equal to ε/c, where ε lies in the interval (0, 1/2). Then, (11) reads as

ε^2 / ( c - 1 )    (12)

The plot of this relationship treated as a function of ε is shown in Fig. 4. It shows how the departure from the situation where no structure has been detected (ε = 0) toward the case where ε approaches 1/2 quantifies in the values of the structurability expression. The plot shows several curves over the number of clusters (c); higher values of c lead to a substantial drop in the values of the index (11).
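A small Python sketch of the structurability computation is given below, following the reconstruction of (10) and (11) above; the exact normalization used in the paper may differ, but the form shown satisfies the two boundary properties stated in the text.

```python
# A sketch of the structurability computation: per pattern, the normalized
# deviation of its membership column from the uniform value 1/c (expression (10)),
# averaged over all patterns at the node (index (11)).
import numpy as np

def structurability(U):
    """U: c x N partition matrix obtained by clustering the data at a node."""
    c, _ = U.shape
    per_pattern = (c * np.sum(U ** 2, axis=0) - 1.0) / (c - 1.0)   # expression (10)
    return float(per_pattern.mean())                               # index (11)

# Sanity checks matching the text:
# structurability(np.full((4, 10), 0.25))  -> 0.0  (no structure at all)
# structurability(np.eye(4))               -> 1.0  (crisp partition)
```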

The two measures introduced previously can be used as a stopping criterion in the expansion (growth) of the tree. We can leave the node intact once the number of patterns falls under the assumed threshold and/or the structurability index is too low. The first index is a sort of precondition: if it is not satisfied, it prevents us from expanding the node. The second index comes in the form of a postcondition: to compute its value, we have to cluster the data first and then determine its value. It is also stronger, as one may encounter cases where there is a significant number of data points to warrant clustering in light of the first criterion; however, the second one concludes that the underlying structure is too weak, and this may advise us to backtrack and refuse to expand this particular node.

TABLE I. EXPERIMENTAL SETTING OF THE FCM ALGORITHM

Fig. 5. Traversing a C-fuzzy tree: an implicit mode.

These two indexes support decision making that focuses on the structural aspects of the data (namely, we decide whether to split the data). They do not guarantee that the resulting C-tree will be the best from the point of view of classification or prediction of a continuous output variable. The diversity criterion (the sum of V_i at the leaves) can also be viewed as another termination criterion. While conceptually appealing, we may have difficulties in translating its values into a more tangible and appealing descriptor of the tree (obviously, the lower, the better). Another possible termination option (which may equally well apply to each of these three indexes) is to monitor their changes along the nodes of the tree as it is being built; an evident saturation of the values of each of them could be treated as a signal to stop growing the tree.
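Putting the two conditions together, a node-expansion check might look as follows; this is a sketch only, assuming that the fcm() and structurability() functions from the earlier snippets are in scope, and the thresholds shown (a cardinality bound and a structurability bound of 0.05, as used later in experiment 4) are illustrative rather than prescribed.

```python
# A minimal sketch of combining the cardinality precondition and the
# structurability postcondition when deciding whether to unfold a node.
# fcm() and structurability() are the sketch functions defined earlier.
import numpy as np

def try_expand(node_inputs, node_outputs, c, min_patterns=None, sigma_min=0.05):
    """Return (prototypes, partition matrix) if the node may be unfolded, else None."""
    n = node_inputs.shape[0]
    # Precondition on cardinality: at least more patterns than clusters
    # (in practice a small multiple of c would be demanded).
    threshold = c if min_patterns is None else min_patterns
    if n <= threshold:
        return None
    z = np.column_stack([node_inputs, node_outputs])   # concatenated patterns, eq. (2)
    V, U = fcm(z, c)                                   # cluster the data residing at the node
    # Postcondition on structure: refuse the split if the partition is too uniform.
    if structurability(U) < sigma_min:
        return None
    return V, U
```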

IV. USE OF THE TREE IN THE CLASSIFICATION (PREDICTION) MODE

Once the C-tree has been constructed, it can be used to classify a new input x or predict the value of the associated output variable (denoted here by ŷ). In the calculations, we rely on the membership grades computed for each cluster as follows:

u_i(x) = 1 / \sum_{j=1}^{c} ( ||x - \hat{v}_i|| / ||x - \hat{v}_j|| )^{2/(m-1)}    (13)

where ||x - \hat{v}_i|| is the distance computed between x and \hat{v}_i (as a matter of fact, we have the same expression as used in the FCM method; refer to (3)).

Fig. 6. Classification boundaries (thick lines) for some configurations of the prototypes formed by (13) and hyperboxes: (a) v_1 = [1.5 1.2], v_2 = [2.5 2.1], v_3 = [0.6 3.5] and (b) v_1 = [1.5 1.5], v_2 = [1.5 2.6], v_3 = [0.6 3.5]. Also shown are contour plots of the membership functions of the three clusters.

The calculations pertain to the leaves of the C-tree, so for several levels of depth we have to traverse the tree first to reach the specific leaves.


This is done by computing u_i(x) for each level of the tree, selecting the corresponding path, and moving down (Fig. 5). At some level, we determine the path i_0, where i_0 = arg max_i u_i(x). Once at the i_0-th node, we repeat the process, that is, determine u_i(x), i = 1, 2, ..., c (here, we are dealing with the clusters at the successive level of the tree). The process repeats for each level of the tree. The predicted value ŷ occurring at the final leaf node is equal to m_{i_0} (refer to (8)).

Fig. 7. Two-dimensional training data (239 patterns).

Fig. 8. Two-dimensional testing data (61 patterns).
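The descent described above can be sketched as follows, reusing the CTreeNode layout introduced earlier (an assumption of these sketches rather than the authors' data structure); the membership grades are computed with (13) against the input-space parts of the children's prototypes, and the value returned at the leaf is its representative m_i from (8).

```python
# A sketch of the implicit classification/prediction mode: descend level by level
# into the child with the largest membership grade and return the leaf's representative.
import numpy as np

def memberships(x, prototypes_in, m=2.0):
    """Membership grades (13) of input x with respect to the input-space prototypes."""
    d = np.maximum(np.linalg.norm(prototypes_in - x, axis=1), 1e-12)
    return 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (m - 1.0)), axis=1)

def predict(root, x):
    """Traverse a C-tree of CTreeNode objects from the root to a leaf."""
    node = root
    while not node.is_leaf:
        # Input-space parts of the children's prototypes (the last coordinate is the output part).
        protos = np.stack([child.prototype[:-1] for child in node.children])
        u = memberships(np.asarray(x, dtype=float), protos)
        node = node.children[int(np.argmax(u))]
    return node.representative        # m_i of the final leaf, eq. (8)
```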

It is of interest to show the boundaries of the classification regions produced by the clusters (i.e., the implicit method) and contrast them with the geometry of the classification regions generated by the decision trees. In the first case, we use a straightforward classification rule: assign x to class ω_i if u_i(x) exceeds the values of the membership in all remaining clusters, that is, u_i(x) > u_j(x) for all j ≠ i.

For the decision trees, the boundaries are guillotine cuts. As a result, we get hyperboxes whose faces are parallel to the coordinates. When dealing with the FCM, we can exercise the following method. For the given prototypes, we can project them on the individual coordinates (variables), take averages of the successive projected prototypes, and build hyperboxes around the prototypes in the entire space. This approach is conceptually close to the decision trees, as it leads to the same geometric character of the classifier. The obvious rule holds: assign x to class ω_i if it falls into the hyperbox formed around prototype v_i.

Some examples of the classification boundaries are shown in Fig. 6. As Fig. 6 reveals, the hyperbox model of the classification boundaries is far more conservative than the one based on the maximal membership rule. This is intuitively appealing, as in the process of forming the hyperboxes we allowed only for cuts that are parallel to the coordinates. It becomes apparent that the geometry of the decision tree induced in this way varies substantially from the far more diversified geometry of the FCM-based class boundaries.

Fig. 9. Top level of the C-decision tree; note the two clusters with different values of the variability index; the cluster with the higher value is shaded.

Fig. 10. Decision tree after the second expansion (iteration).
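One plausible reading of this hyperbox construction is sketched below: along every input coordinate, the cut points are taken as the averages of successive projected prototypes, and each prototype owns the axis-parallel box between its neighboring cuts; points falling outside every box remain unassigned, which is consistent with the conservative character of the boundaries noted above. This is an illustrative interpretation, not the authors' exact procedure.

```python
# A sketch of axis-parallel hyperboxes built around FCM prototypes.
import numpy as np

def hyperboxes(prototypes):
    """prototypes: c x n array of input-space prototypes.
    Returns (lower, upper) bounds of shape c x n (infinite on the outer sides)."""
    c, n = prototypes.shape
    lower = np.full((c, n), -np.inf)
    upper = np.full((c, n), np.inf)
    for j in range(n):
        order = np.argsort(prototypes[:, j])          # prototypes sorted along coordinate j
        cuts = 0.5 * (prototypes[order[:-1], j] + prototypes[order[1:], j])
        for pos, i in enumerate(order):
            if pos > 0:
                lower[i, j] = cuts[pos - 1]
            if pos < c - 1:
                upper[i, j] = cuts[pos]
    return lower, upper

def hyperbox_class(x, prototypes):
    """Index of the hyperbox containing x, or None if x falls outside all of them."""
    lower, upper = hyperboxes(prototypes)
    inside = np.all((np.asarray(x) >= lower) & (np.asarray(x) < upper), axis=1)
    hits = np.flatnonzero(inside)
    return int(hits[0]) if hits.size else None
```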

    V. EXPERIMENTAL STUDIES

The experiments conducted in the study involve both prediction problems (in which the output variable is continuous) and those of a classification nature (where we encounter several discrete classes). Experiments 1 and 2 concern two-dimensional (2-D) synthetic data sets. Data used in experiments 3 and 4 come from the Machine Learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), which makes the experiments fully reproducible and facilitates further comparative analysis. The data sets we experimented with are as follows: auto-mpg [9] in experiment 3, and (a) pima diabetes [9], (b) ionosphere [9], (c) hepatitis [9], (d) dermatology [9], and (e) auto data [14] in experiment 4. The first one deals with a continuous problem, while the other ones concern discrete-class data. In all experiments, we use the FCM algorithm with the settings summarized in Table I. As far as the learning and prediction abilities of the tree are concerned, we proceed with a fivefold cross-validation that generates an 80-20 split by taking 80% of the data as a training set and testing the tree on the remaining 20% of the data set. Furthermore, the experiments are repeated five times by taking different splits of the data into the training and testing parts.

Fig. 11. Complete C-tree; note that the values of the variability criterion have reached zero at all leaves, and this terminates further growth of the tree.

    A. Experiment 1

Experiment 1 is a 2-D synthetic data set generated by uniform distribution random generators. The training set comprises 239 data points (patterns), while the testing set consists of 61 patterns. These two data sets are visualized in Figs. 7 and 8, respectively.

The results of the FCM (with c = 2) are visualized in Fig. 9. Here, we report the prototypes of each cluster, the values of the splitting criterion V_i, the number of data points from the training set allocated to the cluster, and the predicted value at the node (class), which is rounded to the nearest integer value of m_i; this is evident as we are dealing with two discrete classes (labels) of the patterns.

In the next step, we select the first node of the tree, which is characterized by the highest value of the variability index, and expand it by forming two children nodes by applying the FCM algorithm to the data associated with this original node. The decision tree grown in this manner is visualized in Fig. 10.

At the next step, we select the second node of the tree (the one with the highest variability) and expand it in the same way as before (see Fig. 11). As expected, the classification error is equal to zero both for the training and the testing set. This is not surprising considering that the classes are positioned quite far apart from each other.

    B. Experiment 2

The two-dimensional synthetic patterns used in this experiment are normally distributed, with some overlap between the classes (see Figs. 12 and 13). The resulting C-decision tree is visualized in Fig. 14. For this tree, the average classification error on the training data is equal to 0.001250 (with a standard deviation equal to 0.002013). For the testing data, these numbers are equal to 0.188333 and 0.043780, respectively.


    Fig. 12. Two-dimensional training data (240 patterns).

    Fig. 13. Two-dimensional testing data (60 patterns).

    C. Experiment 3

This auto-mpg data set [9] involves a collection of vehicles described by a number of features (such as the weight of a vehicle, the number of cylinders, and fuel consumption). We complete a series of experiments in which we sweep through the number of clusters (c), varying it from 1 to 20, and carry out 20 expansions (iterations) of the tree (the tree is expanded step by step, leading either to its in-depth or in-breadth expansion). The variability observed at all leaves of the tree (the sum of the values of V_i taken over the leaves) characterizes the process of the growth of the tree (refer to Figs. 1 and 2).

The variability measure is reported for the training and testing set as visualized in Figs. 15 and 16. The variability goes down with the growth of the tree, and this becomes evident for the training and testing data. It is also clear that most of the changes in the reduction of the variability occur at the early stages of the growth of the trees; afterwards, the changes are quite limited. Likewise, we note that the drop in the variability values becomes visible when moving from two to three or four clusters. Noticeably, for an increased number of clusters, the variability is practically left unaffected (we see a series of barely distinguishable plots for c greater than five). This effect is present for the training and testing data.

Fig. 14. Complete C-tree; as the values of the variability criterion have reached zero at all leaves, this terminates further growth of the tree.

Fig. 15. Variability of the tree reported during the growth of the tree (training data) for a selected number of clusters.

Fig. 16. Variability of the tree reported during the growth of the tree (testing data) for a selected number of clusters.

After taking a careful look at the variability of the tree, we conclude that the optimal configuration occurs at c = 5 clusters with the number of expansions equal to seven. In this case, the resulting tree is portrayed in Fig. 17. The arrows shown there, along with the labels (numbers), visualize the growth of the tree, namely, the way in which the tree is grown (expanded in consecutive iterations). The numbers in circles denote the node number. The last digit of the node number denotes the number of clusters, while the beginning digits denote the parent node number. A detailed description of the nodes is given in Table II. Again, we report the details of the tree, including the number of patterns residing at each node, their variability V_i, and the predicted value at the node computed by (8).

While the variability criterion is an underlying measure in the design process, the predictive capabilities of the C-decision tree are quantified by the following performance index:

e = ( 1 / N ) \sum_{k=1}^{N} ( y_k - ŷ_k )^2    (14)

In the above expression, ŷ_k denotes the predicted value occurring at the corresponding terminal node (refer to (8)). More specifically, the representative of this node in the output space is calculated as a weighted sum of those elements from the training set that contribute to this node, while y_k is the output value encountered in the data set. For a discrete (classification) problem, this index is simply a classification error that is determined by counting the number of patterns that have been misclassified by the C-decision tree.

Fig. 17. Detailed C-decision tree for the optimal number of clusters and iterations; see the detailed description in the text.
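The performance index can be sketched as follows for the two cases; the mean-squared form for the continuous case follows the reconstruction of (14) above and is an assumption, while the classification error is the misclassification rate described in the text.

```python
# A sketch of the performance index in its two variants.
import numpy as np

def prediction_error(y_true, y_pred):
    """Continuous case of (14): mean squared deviation between predicted and observed outputs."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def classification_error(y_true, y_pred):
    """Discrete case: fraction of patterns misclassified by the C-decision tree."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```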

The values of the error obtained on the training set for a number of different configurations of the tree (number of clusters and iterations) are shown in Fig. 18. Again, most of the changes occur for low numbers of clusters and are characteristic of the early phases of the growth of the tree.

We see that low values of c do not reduce the error even with a substantial growth (number of iterations) of the tree. Similarly, we observe that the same effect occurs for the testing set (see Fig. 19). (Obviously, these results are reported for the sake of completeness; in practice, we choose the best tree on the basis of the training set and test it by means of the testing data.)

    D. Experiment 4

Again, we use a data set from the Machine Learning repository [9], the two-class pima-diabetes data consisting of 768 patterns distributed in an eight-dimensional feature space. In the design of the C-tree, we use the development procedure outlined in the previous section: a node with the maximal value of diversity is chosen as a potential candidate to expand (split into clusters). Prior to this node unfolding, we check whether there is a sufficient number of patterns located there (here, we consider that the criterion is satisfied when this number is greater than the number of clusters). Once this holds, we proceed with the clustering and then look at the structurability index (11), whose value should be greater than or equal to 0.05 (this threshold value has been selected arbitrarily) to accept the unfolding of the node. The number of iterations is set to 10. The plots of the variability shown in Fig. 20 point out that the number of iterations does not have a substantial effect on the value of V; the changes occur mainly during the first few iterations (expansions of the tree). Similarly, there are changes when we increase the number of clusters from two up to six, but beyond this the changes are not very significant.

When using the C-decision tree in the predictive mode, its performance is evaluated by means of (14). The collection of pertinent plots is shown in Figs. 21 and 22 for the training data (the performance results are averaged over the series of experiments). Similarly, the results of the tree on the testing set are visualized in Fig. 23. Evidently, with the increase in the number of clusters, we note a drop in the error values; yet, values of c that are too high lead to some fluctuations of the error (so it is not evident that growing larger trees is still fully legitimate). Such fluctuations are even more profound when studying the plots of the error reported for the testing set. In a search for the optimal configuration, we have found that a number of clusters between three and five and a few iterations led to the best results (see Fig. 24). We observe an evident tendency: while growing larger trees is definitely beneficial in the case of the training data (generally, the error is reduced, with a few exceptions), the error does not change very visibly on the testing data (where the changes are in the range of 1%).

TABLE II. DESCRIPTION OF THE NODES OF THE C-DECISION TREE INCLUDED IN FIG. 17

Fig. 18. Average error of the C-decision tree reported for the training set.

It is of interest to compare the results produced by the C-decision tree with those obtained when applying standard decision trees, namely C4.5. In this analysis, we have experimented with the software available on the Web (http://www.cse.unsw.edu.au/~quinlan/), which is C4.5 revision 8 run with the standard settings (i.e., selection of the attribute that maximizes the information gain; no pruning was used). The results are summarized in Table III. Following the experimental scenario outlined at the beginning of the section, we report the mean values and the standard deviation of the error. For the C-decision trees, the number of nodes is equal to the number of clusters multiplied by the number of iterations.

Fig. 19. Average error of the C-decision tree reported for the testing set.


Fig. 20. Variability (V) for the pima-diabetes data (training set).

Fig. 21. Error (e) as a function of iterations (expansion of the tree) for the training set for a selected number of clusters.

Fig. 22. Error (e) as a function of the number of clusters for selected iterations.

Overall, we note that the C-tree is more compact (in terms of the number of nodes). This is not surprising, as its nodes are more complex than those in the original decision tree. If our intent is to have smaller and more compact structures, C-trees become quite an appealing architecture. The results on the training sets are better for the C-trees at the level of a 3%-6% improvement (for the pima data set). The standard deviation of the error is two times lower for these trees in comparison with C4.5. For the testing set, we note that the larger of the two C-trees in Table III produces almost the same results as C4.5. With the smaller C-tree, we note an increase in the classification rate by 1% in comparison with the larger structure; however, the size of the tree has been reduced to one half of the larger tree (on the pima data). The increase in the size of the tree does not dramatically affect the classification results; the classification rates tend to be better but do not differ significantly from structure to structure. In general, we note that the C-decision tree produces more consistent results in terms of the classification for the training and testing sets; these are closer when compared with the results produced by the C4.5 tree. In some cases, the results of the C-decision tree are better than those of C4.5; this happens for the hepatitis data.

Fig. 23. Error (e) for the pima-diabetes data (testing set).

Fig. 24. Classification error for the C-decision tree versus successive iterations (expansions) of the tree; c = 3 and 5; both training and testing sets are included.

    VI. CONCLUSION

The C-decision trees are classification constructs that are built on the basis of information granules, namely fuzzy clusters. The way in which these trees are constructed deals with successive refinements of the clusters (granules) forming the nodes of the tree. When growing the tree, the nodes (clusters) are split into granules of lower diversity (higher homogeneity). In contrast to C4.5-like trees, all features are used at once rather than one at a time, and such a development approach promotes more compact trees and a versatile geometry of the partition of the feature space. The experimental studies illustrate the main features of the C-trees. One of them is quite profound and highly desirable for any practical usage: the difference in performance of the C-trees on the training and testing sets is lower than the one reported for C4.5.

The C-tree is also sought as a certain blueprint for more detailed models that can be formed on a local basis by considering the data allocated to the individual nodes. At this stage, the models are refined by choosing their topology (e.g., linear models and neural networks) and making decisions about detailed learning.

TABLE III. C-DECISION TREE AND C4.5: A COMPARATIVE ANALYSIS FOR SEVERAL MACHINE LEARNING DATA SETS: (a) PIMA-DIABETES, (b) IONOSPHERE, (c) HEPATITIS (IN THIS DATA SET, ALL MISSING VALUES WERE REPLACED BY THE AVERAGES OF THE CORRESPONDING ATTRIBUTES), (d) DERMATOLOGY, AND (e) AUTO DATA

    REFERENCES

[1] W. P. Alexander and S. Grimshaw, "Treed regression," J. Computational and Graphical Statistics, vol. 5, pp. 156-175, 1996.
[2] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[4] E. Cantu-Paz and C. Kamath, "Using evolutionary algorithms to induce oblique decision trees," in Proc. Genetic and Evolutionary Computation Conf. 2000, D. Whitley, D. E. Goldberg, E. Cantu-Paz, L. Spector, I. Parmee, and H.-G. Beyer, Eds., San Francisco, CA, pp. 1053-1060.
[5] A. Dobra and J. Gehrke, "SECRET: A scalable linear regression tree algorithm," in Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Edmonton, AB, Canada, Jul. 2002.
[6] A. Ittner and M. Schlosser, "Non-linear decision trees (NDT)," in Proc. 13th Int. Conf. Machine Learning (ICML'96), Bari, Italy, Jul. 3-6, 1996.
[7] A. Ittner, J. Zeidler, R. Rossius, W. Dilger, and M. Schlosser, "Feature space partitioning by non-linear fuzzy decision trees," in Proc. Int. Fuzzy Systems Assoc. Congress, pp. 394-398.
[8] A. K. Jain et al., "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264-323, Sep. 1999.
[9] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California, Irvine, CA, 1996. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[10] S. K. Murthy, S. Kasif, and S. Salzberg, "A system for induction of oblique decision trees," J. Artificial Intelligence Res., vol. 2, pp. 1-32, 1994.
[11] W. Pedrycz and Z. A. Sosnowski, "Designing decision trees with the use of fuzzy granulation," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 30, no. 2, pp. 151-159, Mar. 2000.
[12] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, pp. 81-106, 1986.
[13] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann, 1993.
[14] J. S. Siebert, "Vehicle recognition using rule-based methods," Research Memo TIRM-87-017, Turing Institute, 1987.
[15] R. Weber, "Fuzzy ID3: A class of methods for automatic knowledge acquisition," in Proc. 2nd Int. Conf. Fuzzy Logic and Neural Networks, Iizuka, Japan, Jul. 17-22, 1992, pp. 265-268.
[16] O. T. Yildiz and E. Alpaydin, "Omnivariate decision trees," IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1539-1546, Nov. 2001.

Witold Pedrycz (M'88, SM'94, F'99) is a Professor and Canada Research Chair (CRC) in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is actively pursuing research in computational intelligence, fuzzy modeling, knowledge discovery and data mining, fuzzy control (including fuzzy controllers), pattern recognition, knowledge-based neural networks, relational computation, bioinformatics, and software engineering. He has published numerous papers in this area. He is also the author of eight research monographs covering various aspects of computational intelligence and software engineering.

Dr. Pedrycz has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, Parts A and B, and the IEEE TRANSACTIONS ON FUZZY SYSTEMS. He is the Editor-in-Chief of Information Sciences, and he is President-Elect of the International Fuzzy Systems Association (IFSA) and President of the North American Fuzzy Information Processing Society (NAFIPS).

Zenon A. Sosnowski (M'99) received the M.Sc. degree in mathematics from the University of Warsaw, Warsaw, Poland, in 1976 and the Ph.D. degree in computer science from the Warsaw University of Technology, Warsaw, Poland, in 1986.

He has been with the Technical University of Bialystok, Bialystok, Poland, since 1976, where he is an Assistant Professor in the Department of Computer Science. In 1988-1989, he spent five months at the Delft University of Technology in the Netherlands. He spent two years (1990-1991) with the Knowledge Systems Laboratory of the National Research Council's Institute for Information Technology, Ottawa, ON, Canada. His research interests include artificial intelligence, expert systems, approximate reasoning, fuzzy sets, and knowledge engineering.

Dr. Sosnowski is a Member of the IEEE Systems, Man, and Cybernetics Society.