
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART C: APPLICATIONS AND REVIEWS, VOL. 35, NO. 4, NOVEMBER 2005

C-Fuzzy Decision Trees

Witold Pedrycz, Fellow, IEEE, and Zenon A. Sosnowski, Member, IEEE

Abstract: This paper introduces a concept and design of decision trees based on information granules: multivariable entities characterized by high homogeneity (low variability). As such granules are developed via fuzzy clustering and play a pivotal role in the growth of the decision trees, they will be referred to as C-fuzzy decision trees. In contrast with standard decision trees, in which one variable (feature) is considered at a time, this form of decision tree involves all variables at each node of the tree. Obviously, this gives rise to a completely new geometry of the partition of the feature space that is quite different from the guillotine cuts implemented by standard decision trees. The growth of the C-decision tree is realized by expanding the node of the tree characterized by the highest variability of the information granule residing there. This paper shows how the tree is grown depending on some additional node expansion criteria, such as the cardinality (number of data) at a given node and the level of structural dependencies (structurability) of the data existing there. A series of experiments is reported using both synthetic and machine learning data sets. The results are compared with those produced by the standard version of the decision tree (namely, C4.5).

Index Terms: Decision trees, depth-and-breadth tree expansion, experimental studies, fuzzy clustering, node variability, tree growing.

    I. INTRODUCTION

DECISION trees [12], [13] are commonly used architectures of machine learning and classification systems. They come with a comprehensive list of various training and pruning schemes, a diversity of discretization (quantization) algorithms, and a series of detailed learning refinements [1], [3]-[7], [10], [11], [15], [16]. In spite of such a variety of underlying development activities, one can easily witness several fundamental properties that cut quite evidently across the entire spectrum of decision trees. First, the trees operate on discrete attributes that assume a finite (usually quite small) number of values. Second, in the design procedure, one attribute is chosen at a time. More specifically, one selects the most discriminative attribute and expands (grows) the tree by adding the node whose attribute values are located at the branches originating from this node. The discriminatory power of the attribute (which stands behind its selection out of the collection of the attributes existing in the problem at hand) is quantified by means of some criterion such as entropy, Gini index, etc. [13], [5]. Third, decision trees in their generic version are predominantly applied to discrete class problems (the continuous prediction problems are handled by regression trees). Interestingly, these three fundamental features somewhat restrict the nature of the trees and identify a range of applications that are pertinent in this setting. When dealing with continuous attributes, it is evident that discretization is a must. As such, it directly impacts the performance of the tree. One may argue that the discretization requires some optimization that, by being guided by the classification error, can be realized once the development of the tree has been finalized. In this sense, the second phase (namely, the way in which the tree has been grown and the sequence of the attributes selected) is inherently affected by the discretization mechanism. In a nutshell, it means that these two design steps cannot be disjointed. The growth of the tree relying on a choice of a single attribute can also be treated as a conceptual drawback. While being quite simple and transparent, it could well be that considering two or more attributes as an indivisible group of variables occurring as the discrimination condition located at some node of the tree may lead to a better tree.

Having these shortcomings clearly identified, the objective of this study is to develop a new class of decision trees that attempts to alleviate these problems. The underlying conjecture is that data can be perceived as a collection of information granules [11]. Thus, the tree becomes spanned over these granules, treated now as fundamental building blocks. In turn, information granules and information granulation are almost a synonym of clusters and clustering [2], [8]. Subscribing to the notion of fuzzy clusters (and fuzzy clustering), the authors intend to capture the continuous nature of the classes so that there is no restriction of the use of these constructs to discrete problems. Furthermore, fuzzy granulation helps link the discretization problem with the formation of the tree in a direct and intimate manner. As fuzzy clusters are the central concept behind the generalized tree, the trees will be referred to as cluster-oriented decision trees or C-decision trees, for short.

The material of this study is organized in the following manner. Section II provides an introduction to the architecture of the tree by discussing the underlying processes of its in-depth and in-breadth growth. Section III brings more details on the development of the C-trees, where we concentrate on the functional and algorithmic details (fuzzy clustering and various criteria of node splitting leading to the specific pattern of tree growing). Next, Section IV elaborates on the use of the trees in the classification or prediction mode. A series of comparative numeric studies is presented in Section V.

Manuscript received February 10, 2004; revised June 9, 2004 and September 9, 2004. This work was supported in part by the Canada Research Chair (CRC) Program of the Natural Science and Engineering Research Council of Canada (NSERC). The work of Z. A. Sosnowski was supported in part by the Technical University of Bialystok under Grant W/WI/8/02.

W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada, and also with the Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland (e-mail: [email protected]).

Z. A. Sosnowski is with the Department of Computer Science, Technical University of Bialystok, Bialystok 15-351, Poland (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSMCC.2004.843205


The study adheres to the standard notation and notions used in the literature of machine learning and fuzzy sets. The way of evaluating the performance of the tree is standard, as a five-fold cross-validation is used. More specifically, in each pass, an 80-20 split of the data is generated into the training and testing set, respectively, and the experiments are repeated for five different splits for training and testing data (rotation method), which helps us gain high confidence about the results. As to the format of the data set, it comes as a family of input-output pairs (x_k, y_k), k = 1, 2, ..., N, where x_k \in R^n and y_k \in R. Note that when we restrict the range of values assumed by y_k to some finite set (say, integers), then we encounter a standard classification problem, while, in general, admitting continuous values assumed by the output, we are concerned with a (continuous) prediction problem.
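As a rough illustration of this evaluation protocol, the following sketch (a generic Python snippet, not the authors' code) produces five 80-20 training/testing splits by rotating through the folds of a shuffled data set; the function name rotation_splits and the fixed seed are choices made here for illustration only.

```python
# A minimal sketch of the 80-20 "rotation" evaluation described above: the data
# are shuffled once, cut into five folds, and each fold in turn serves as the
# 20% testing set while the remaining folds form the training set.
import numpy as np

def rotation_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a five-fold 80-20 rotation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# Example: report the split sizes for a data set of 300 patterns (240/60 per pass).
if __name__ == "__main__":
    for train_idx, test_idx in rotation_splits(300):
        print(len(train_idx), len(test_idx))
```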

II. OVERALL ARCHITECTURE OF THE CLUSTER-BASED DECISION TREE

The architecture of the cluster-based decision tree develops around fuzzy clusters that are treated as generic building blocks of the tree. The training data set X is clustered into c clusters so that the data points (patterns) that are similar are put together. These clusters are completely characterized by their prototypes (centroids). We start with them positioned at the top nodes of the tree structure. The way of building the clusters implies a specific way in which we allocate the elements of X to each of them. In other words, each cluster comes with a subset of X, namely X_1, X_2, ..., X_c. The process of growing the tree is guided by a certain heterogeneity criterion that quantifies the diversity of the data (with respect to the output variable y) falling under the given cluster (node). Denote the values of the heterogeneity criterion by V_1, V_2, ..., V_c, respectively (see also Fig. 1). We then choose the node with the highest value of the criterion and treat it as a candidate for further refinement. Let i_0 be the one for which the criterion assumes a maximal value, that is, i_0 = arg max_i V_i. The i_0-th node is refined by splitting it into c clusters as visualized in Fig. 1.

Again, the resulting nodes (children) of node i_0 come with their own sets of data. The process is repeated by selecting the most heterogeneous node out of all final nodes (see Fig. 2). The growth of the tree is carried out by expanding the nodes and building their consecutive levels that capture more details of the structure. It is noticeable that the node expansion leads to an increase in either the depth or the width (breadth) of the tree. The pattern of the growth is very much implied by the characteristics of the data as well as influenced by the number of the clusters. Some typical patterns of growth are illustrated in Fig. 2. Considering the way in which the tree expands, it is easy to notice that each node of the tree has exactly zero or c children.

By looking at the way of forming the nodes of the tree and their successive splitting (refinement), we can easily observe an interesting analogy between this approach and the well-known hierarchical divisive algorithms. Conceptually, they share the same principle; however, there are a number of technical aspects that differentiate the two.

Fig. 1. Growing a decision tree by expanding nodes (which are viewed as clusters located at its nodes). Shadowed nodes are those with maximal values of the diversity criterion and thus subject to the split operation.

    Fig. 2. Typical growth patterns of the cluster-based trees: (a) depth intensiveand (b) breadth intensive.

For the completeness of the construct, each node is characterized by the following associated components: the heterogeneity criterion V, the number of patterns associated with it, and a list of these patterns. Moreover, each pattern on this list comes with a degree of belongingness (membership) to that node. We provide a formal definition of the C-decision trees at a later stage, once we cover the pertinent mechanisms of their development.
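To make this bookkeeping concrete, the following minimal Python sketch shows one possible way to represent a node of the C-decision tree together with the quantities just listed; the class name CTreeNode and its fields are assumptions of this sketch, not the authors' implementation.

```python
# A minimal sketch (an assumed convenient representation, not the authors'
# implementation) of the bookkeeping attached to every node of the tree:
# the heterogeneity value V, the patterns allocated to the node, and the
# membership grade of each of those patterns.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class CTreeNode:
    prototype: np.ndarray        # cluster prototype (input and output parts concatenated)
    pattern_idx: List[int]       # indices of the patterns assigned to the node
    memberships: np.ndarray      # membership grade of each assigned pattern
    representative: float = 0.0  # output-space representative of the node (predicted value)
    variability: float = 0.0     # heterogeneity criterion V of the node
    children: List["CTreeNode"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        # A node either remains a leaf or is split into exactly c children.
        return not self.children
```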

    III. DEVELOPMENT OF THE TREE

In this section, we concentrate on the functional details and the ensuing algorithmic aspects. The crux of the overall design is the clustering mechanism and the manipulations realized at the level of the information granules formed in this manner.


    A. Fuzzy Clustering

Fuzzy clustering is a core functional part of the overall tree. It builds the clusters and provides their full description. We confine ourselves to the standard fuzzy C-means (FCM), which is an omnipresent technique of information granulation. The description of this algorithm is well documented in the literature. We refer the reader to [2] and [11] and revisit it here in the setting of the decision trees. The FCM algorithm is an example of objective-oriented fuzzy clustering where the clusters are built through a minimization of some objective function. The standard objective function assumes the format

Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} d_{ik}^{2}    (1)

with U = [u_{ik}], i = 1, 2, ..., c, k = 1, 2, ..., N, being a partition matrix (here U belongs to the family of c-by-N matrices that satisfy a series of conditions: 1) all elements of the partition matrix are confined to the unit interval; 2) the sum over each column is equal to 1; and 3) the sum of the membership grades in each row is contained in the range (0, N)). The number of clusters is denoted by c. The data set to be clustered consists of N patterns. m is a fuzzification factor (usually m = 2.0), and d_{ik} is a distance between the k-th data point (pattern) and the i-th prototype. The prototype of the cluster can be treated as a typical vector or a representative of the data forming the cluster. The feature space in which the clustering is carried out requires a thorough description. Referring to the format of the data we have started with, let us note that they come as ordered pairs (x_k, y_k). For the purpose of clustering, we concatenate the pairs and use the notation

z_k = [x_k  y_k]    (2)

This implies that the clustering takes place in the (n+1)-dimensional space and involves the data distributed in the input and output space. Likewise, the resulting prototype v_i is positioned in R^{n+1}. For future use, we distinguish between the coordinates of the prototype in the input and output space by splitting them into two sections (blocks of variables) as follows:

v_i = [ \hat{v}_i   \tilde{v}_i ],   with \hat{v}_i \in R^n and \tilde{v}_i \in R.

It is worth emphasizing that \hat{v}_i describes a prototype located in the input space; this description will be of particular interest when utilizing the tree to carry out prediction tasks.

Fig. 3. Node splitting controlled by the variability criterion V.

In essence, the FCM is an iterative optimization process in which we iteratively update the partition matrix and the prototypes until some termination criterion has been satisfied. The updates of the values of u_{ik} and v_i are governed by the following well-known expressions (cf. [2]):

partition update

u_{ik} = 1 / \sum_{j=1}^{c} ( d_{ik} / d_{jk} )^{2/(m-1)}    (3)

prototype update

v_i = \sum_{k=1}^{N} u_{ik}^{m} z_k / \sum_{k=1}^{N} u_{ik}^{m}    (4)

The series of iterations is started from a randomly initiated partition matrix and involves the calculation of the prototypes and partition matrices.
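For reference, a compact Python sketch of the FCM loop defined by (1)-(4) is given below; it is a generic textbook implementation (random initialization of the partition matrix, Euclidean distance, fuzzification factor m = 2), not the code used in the paper.

```python
# A compact sketch of the standard FCM iteration (equations (1)-(4)).
import numpy as np

def fcm(data, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means: returns (prototypes V, partition matrix U of shape c x N)."""
    rng = np.random.default_rng(seed)
    N = data.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)                      # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ data) / Um.sum(axis=1, keepdims=True)    # prototype update (4)
        d = np.linalg.norm(data[None, :, :] - V[:, None, :], axis=2)   # distances d_ik
        d = np.fmax(d, 1e-12)                              # guard against zero distances
        # partition update (3): u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
        if np.max(np.abs(U_new - U)) < tol:                # stop once the partition settles
            U = U_new
            break
        U = U_new
    Um = U ** m
    V = (Um @ data) / Um.sum(axis=1, keepdims=True)        # prototypes consistent with final U
    return V, U

# Example: V, U = fcm(np.random.default_rng(1).normal(size=(100, 3)), c=4)
```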

    B. Node Splitting Criterion

The growth process of the tree is pursued by quantifying the diversity of the data located at the individual nodes of the tree and splitting the nodes that exhibit the highest diversity. This intuitively appealing criterion takes into account the variability of the data, finds the node with the highest value of the criterion, and splits it into c nodes that occur at the consecutive lower level of the tree (see Fig. 3). The essence of the diversity (variability) criterion is to quantify the dispersion of the data allocated to the given cluster, so that a higher dispersion of the data results in higher values of the criterion. Recall that individual data points (patterns) belong to the clusters with different membership grades; however, for each pattern, there is a dominant cluster to which it exhibits the highest degree of belongingness (membership).

More formally, let us represent the i-th node of the tree as an ordered triple

⟨ X_i, Y_i, U_i ⟩    (5)

Here, X_i denotes all elements of the data set that belong to this node by virtue of the highest membership grade,

X_i = { x_k : u_{ik} ≥ u_{jk} for all j ≠ i }

where the index j pertains to the nodes originating from the same parent.

The second set, Y_i, collects the output coordinates of the elements that have already been assigned to X_i, as follows:

Y_i = { y_k : x_k ∈ X_i }    (6)

Likewise, U_i is a vector of the grades of membership of the elements in X_i, as follows:

U_i = [ u_{ik} : x_k ∈ X_i ]    (7)

We define the representative of this node positioned in the output space as the weighted sum (note that in the construct hereafter we include only those elements that contribute to the cluster, so the summation is taken over X_i and Y_i), as follows:

m_i = \sum_{x_k ∈ X_i} u_{ik} y_k / \sum_{x_k ∈ X_i} u_{ik}    (8)

The variability of the data in the output space existing at this node is taken as a spread around the representative m_i, where again we consider a partial involvement of the elements in X_i by weighting the distance by the associated membership grade

V_i = \sum_{x_k ∈ X_i} u_{ik} ( y_k - m_i )^2    (9)

In the next step, we select the node of the tree (leaf) that has the highest value of V_i and expand the node by forming its children, applying the clustering of the associated data set into c clusters. The process is then repeated: we examine the leaves of the tree and expand the one with the highest value of the diversity criterion.
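The computation of the node representative (8) and the variability (9) can be sketched as follows; the assignment of patterns by maximal membership follows (5), while the use of a squared distance in (9) is an assumption consistent with the description above, not a form confirmed by the paper.

```python
# A short sketch of the node statistics defined above.
import numpy as np

def node_statistics(U, y):
    """U: c x N partition matrix of a node's data, y: array of the N output values.
    For every cluster i, return (assigned indices X_i, representative m_i, variability V_i)."""
    c, _ = U.shape
    winners = U.argmax(axis=0)                   # dominant cluster of each pattern
    stats = []
    for i in range(c):
        idx = np.flatnonzero(winners == i)       # patterns for which cluster i wins
        if idx.size == 0:                        # empty cluster: nothing to spread
            stats.append((idx, 0.0, 0.0))
            continue
        u = U[i, idx]
        m_i = float(np.sum(u * y[idx]) / np.sum(u))           # representative (8)
        V_i = float(np.sum(u * (y[idx] - m_i) ** 2))          # variability (9)
        stats.append((idx, m_i, V_i))
    return stats

# The leaf chosen for expansion is the one with the largest variability:
# i0 = max(range(len(stats)), key=lambda i: stats[i][2])
```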

We treat a C-decision tree as a regular tree structure whose nodes are described by (5) and in which each nonterminal node has c children. The growth of the tree is controlled by conditions under which the clusters can be further expanded (split). We envision two intuitively appealing conditions that tackle the nature of the data behind each node. The first one is self-evident: a given node can be expanded only if it contains enough data points. With c clusters, we require this number to be greater than the number of clusters; otherwise, the clusters cannot be formed. While this is the lower bound on the cardinality of the data, practically we would expect this number to be a small multiple of c. The second stopping condition pertains to the structure of the data that we attempt to discover through clustering. It becomes obvious that once we approach smaller subsets of data, the dominant structure (which is strongly visible at the level of the entire and far more numerous data set) may not manifest itself that profoundly in the subset. It is likely that the smaller the data set, the less pronounced its structure. This becomes reflected in the entries of the partition matrix, which tend to be equal to each other and equal to 1/c. If no structure is present, this equal distribution of membership grades occurs across each column of the partition matrix. This lack of visible structure can be quantified by the following expression (for the k-th pattern):

( c \sum_{i=1}^{c} u_{ik}^{2} - 1 ) / ( c - 1 )    (10)

If all entries of the partition matrix are equal to 1/c, then the result is equal to zero. If we encounter full membership in a certain cluster, then the resulting value is equal to 1 (that is, the maximal value of the above expression). To describe the structural dependencies within the entire data set present at a certain node, we carry out the calculations over all patterns located at the node of the tree, averaging (10) over them:

( 1 / N_i ) \sum_{k=1}^{N_i} ( c \sum_{j=1}^{c} u_{jk}^{2} - 1 ) / ( c - 1 )    (11)

where N_i denotes the number of patterns at the node and u_{jk} are the entries of the partition matrix obtained by clustering the data located there. Again, with no structurability present in the data, this expression returns a value of zero.

Fig. 4. Structurability index (11) viewed as a function of ε, plotted for several selected values of c.

To gain a better feel for the lack of structure and the ensuing values of (11), let us consider a case where all entries in a certain column of the partition matrix (pattern) are equal to 1/c with some slight deviation δ. In 50% of the cases, we consider that these entries are higher than 1/c and put u = 1/c + δ; in the remaining 50%, we consider the decrease over 1/c and have u = 1/c - δ. Furthermore, let us treat δ as a fraction of the original membership grade, that is, make it equal to ε/c, where ε lies in the interval (0, 1/2). Then, (11) reads as

ε^2 / ( c - 1 )    (12)

The plot of this relationship treated as a function of ε is shown in Fig. 4. It shows how the departure from the situation where no structure has been detected (ε = 0) toward the case where ε approaches 1/2 quantifies in the values of the structurability expression. The plot shows several curves over the number of clusters (c); higher values of c lead to a substantial drop in the values of the index (11).
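A small Python sketch of the structurability computation is given below, following the reconstruction of (10) and (11) above; the exact normalization used in the paper may differ, but the form shown satisfies the two boundary properties stated in the text.

```python
# A sketch of the structurability computation: per pattern, the normalized
# deviation of its membership column from the uniform value 1/c (expression (10)),
# averaged over all patterns at the node (index (11)).
import numpy as np

def structurability(U):
    """U: c x N partition matrix obtained by clustering the data at a node."""
    c, _ = U.shape
    per_pattern = (c * np.sum(U ** 2, axis=0) - 1.0) / (c - 1.0)   # expression (10)
    return float(per_pattern.mean())                               # index (11)

# Sanity checks matching the text:
# structurability(np.full((4, 10), 0.25))  -> 0.0  (no structure at all)
# structurability(np.eye(4))               -> 1.0  (crisp partition)
```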

The two measures introduced previously can be used as a stopping criterion in the expansion (growth) of the tree. We can leave the node intact once the number of patterns falls under the assumed threshold and/or the structurability index is too low. The first index is a sort of precondition: if it is not satisfied, it prevents us from expanding the node. The second index comes in the form of a postcondition: to compute its value, we have to cluster the data first and then determine its value. It is also stronger, as one may encounter cases where there is a significant number of data points to warrant clustering in light of the first criterion; however, the second one concludes that the underlying structure is too weak, and this may advise us to backtrack and refuse to expand this particular node.

TABLE I. EXPERIMENTAL SETTING OF THE FCM ALGORITHM

Fig. 5. Traversing a C-fuzzy tree: an implicit mode.

These two indexes support decision making that focuses on the structural aspects of the data (namely, we decide whether to split the data). They do not guarantee that the resulting C-tree will be the best from the point of view of classification or prediction of a continuous output variable. The diversity criterion (the sum of V_i at the leaves) can also be viewed as another termination criterion. While conceptually appealing, we may have difficulties in translating its values into a more tangible and appealing descriptor of the tree (obviously, the lower, the better). Another possible termination option (which may equally well apply to each of these three indexes) is to monitor their changes along the nodes of the tree as it is being built; an evident saturation of the values of each of them could be treated as a signal to stop growing the tree.
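Putting the two conditions together, a node-expansion check might look as follows; this is a sketch only, assuming that the fcm() and structurability() functions from the earlier snippets are in scope, and the thresholds shown (a cardinality bound and a structurability bound of 0.05, as used later in experiment 4) are illustrative rather than prescribed.

```python
# A minimal sketch of combining the cardinality precondition and the
# structurability postcondition when deciding whether to unfold a node.
# fcm() and structurability() are the sketch functions defined earlier.
import numpy as np

def try_expand(node_inputs, node_outputs, c, min_patterns=None, sigma_min=0.05):
    """Return (prototypes, partition matrix) if the node may be unfolded, else None."""
    n = node_inputs.shape[0]
    # Precondition on cardinality: at least more patterns than clusters
    # (in practice a small multiple of c would be demanded).
    threshold = c if min_patterns is None else min_patterns
    if n <= threshold:
        return None
    z = np.column_stack([node_inputs, node_outputs])   # concatenated patterns, eq. (2)
    V, U = fcm(z, c)                                   # cluster the data residing at the node
    # Postcondition on structure: refuse the split if the partition is too uniform.
    if structurability(U) < sigma_min:
        return None
    return V, U
```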

IV. USE OF THE TREE IN THE CLASSIFICATION (PREDICTION) MODE

Once the C-tree has been constructed, it can be used to classify a new input x or predict the value of the associated output variable (denoted here by ŷ). In the calculations, we rely on the membership grades computed for each cluster as follows:

u_i(x) = 1 / \sum_{j=1}^{c} ( ||x - \hat{v}_i|| / ||x - \hat{v}_j|| )^{2/(m-1)}    (13)

where ||x - \hat{v}_i|| is the distance computed between x and \hat{v}_i (as a matter of fact, we have the same expression as used in the FCM method; refer to (3)).

Fig. 6. Classification boundaries (thick lines) for some configurations of the prototypes formed by (13) and hyperboxes: (a) v_1 = [1.5 1.2], v_2 = [2.5 2.1], v_3 = [0.6 3.5] and (b) v_1 = [1.5 1.5], v_2 = [1.5 2.6], v_3 = [0.6 3.5]. Also shown are contour plots of the membership functions of the three clusters.

The calculations pertain to the leaves of the C-tree, so for several levels of depth we have to traverse the tree first to reach the specific leaves.


This is done by computing u_i(x) for each level of the tree, selecting the corresponding path, and moving down (Fig. 5). At some level, we determine the path i_0, where i_0 = arg max_i u_i(x). Once at the i_0-th node, we repeat the process, that is, determine u_i(x), i = 1, 2, ..., c (here, we are dealing with the clusters at the successive level of the tree). The process repeats for each level of the tree. The predicted value ŷ occurring at the final leaf node is equal to m_{i_0} (refer to (8)).

Fig. 7. Two-dimensional training data (239 patterns).

Fig. 8. Two-dimensional testing data (61 patterns).
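The descent described above can be sketched as follows, reusing the CTreeNode layout introduced earlier (an assumption of these sketches rather than the authors' data structure); the membership grades are computed with (13) against the input-space parts of the children's prototypes, and the value returned at the leaf is its representative m_i from (8).

```python
# A sketch of the implicit classification/prediction mode: descend level by level
# into the child with the largest membership grade and return the leaf's representative.
import numpy as np

def memberships(x, prototypes_in, m=2.0):
    """Membership grades (13) of input x with respect to the input-space prototypes."""
    d = np.maximum(np.linalg.norm(prototypes_in - x, axis=1), 1e-12)
    return 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (m - 1.0)), axis=1)

def predict(root, x):
    """Traverse a C-tree of CTreeNode objects from the root to a leaf."""
    node = root
    while not node.is_leaf:
        # Input-space parts of the children's prototypes (the last coordinate is the output part).
        protos = np.stack([child.prototype[:-1] for child in node.children])
        u = memberships(np.asarray(x, dtype=float), protos)
        node = node.children[int(np.argmax(u))]
    return node.representative        # m_i of the final leaf, eq. (8)
```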

It is of interest to show the boundaries of the classification regions produced by the clusters (i.e., the implicit method) and contrast them with the geometry of the classification regions generated by the decision trees. In the first case, we use a straightforward classification rule: assign x to class ω_i if u_i(x) exceeds the values of the membership in all remaining clusters, that is, u_i(x) > u_j(x) for all j ≠ i.

For the decision trees, the boundaries are guillotine cuts. As a result, we get hyperboxes whose faces are parallel to the coordinates. When dealing with the FCM, we can exercise the following method. For the given prototypes, we can project them on the individual coordinates (variables), take averages of the successive projected prototypes, and build hyperboxes around the prototypes in the entire space. This approach is conceptually close to the decision trees, as it leads to the same geometric character of the classifier. The obvious rule holds: assign x to class ω_i if it falls into the hyperbox formed around prototype v_i.

Some examples of the classification boundaries are shown in Fig. 6. As Fig. 6 reveals, the hyperbox model of the classification boundaries is far more conservative than the one based on the maximal membership rule. This is intuitively appealing, as in the process of forming the hyperboxes we allowed only for cuts that are parallel to the coordinates. It becomes apparent that the geometry of the decision tree induced in this way varies substantially from the far more diversified geometry of the FCM-based class boundaries.

Fig. 9. Top level of the C-decision tree; note the two clusters with different values of the variability index; the cluster with the higher value is shaded.

Fig. 10. Decision tree after the second expansion (iteration).
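One plausible reading of this hyperbox construction is sketched below: along every input coordinate, the cut points are taken as the averages of successive projected prototypes, and each prototype owns the axis-parallel box between its neighboring cuts; points falling outside every box remain unassigned, which is consistent with the conservative character of the boundaries noted above. This is an illustrative interpretation, not the authors' exact procedure.

```python
# A sketch of axis-parallel hyperboxes built around FCM prototypes.
import numpy as np

def hyperboxes(prototypes):
    """prototypes: c x n array of input-space prototypes.
    Returns (lower, upper) bounds of shape c x n (infinite on the outer sides)."""
    c, n = prototypes.shape
    lower = np.full((c, n), -np.inf)
    upper = np.full((c, n), np.inf)
    for j in range(n):
        order = np.argsort(prototypes[:, j])          # prototypes sorted along coordinate j
        cuts = 0.5 * (prototypes[order[:-1], j] + prototypes[order[1:], j])
        for pos, i in enumerate(order):
            if pos > 0:
                lower[i, j] = cuts[pos - 1]
            if pos < c - 1:
                upper[i, j] = cuts[pos]
    return lower, upper

def hyperbox_class(x, prototypes):
    """Index of the hyperbox containing x, or None if x falls outside all of them."""
    lower, upper = hyperboxes(prototypes)
    inside = np.all((np.asarray(x) >= lower) & (np.asarray(x) < upper), axis=1)
    hits = np.flatnonzero(inside)
    return int(hits[0]) if hits.size else None
```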

    V. EXPERIMENTAL STUDIES

The experiments conducted in the study involve both prediction problems (in which the output variable is continuous) and those of a classification nature (where we encounter several discrete classes). Experiments 1 and 2 concern two-dimensional (2-D) synthetic data sets. Data used in experiments 3 and 4 come from the Machine Learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), which makes the experiments fully reproducible and facilitates further comparative analysis. The data sets we experimented with are as follows: auto-mpg [9] in experiment 3, and (a) pima diabetes [9], (b) ionosphere [9], (c) hepatitis [9], (d) dermatology [9], and (e) auto data [14] in experiment 4. The first one deals with a continuous problem, while the other ones concern discrete-class data. In all experiments, we use the FCM algorithm with the settings summarized in Table I. As far as the learning and prediction abilities of the tree are concerned, we proceed with a fivefold cross-validation that generates an 80-20 split by taking 80% of the data as a training set and testing the tree on the remaining 20% of the data set. Furthermore, the experiments are repeated five times by taking different splits of the data into the training and testing parts.

Fig. 11. Complete C-tree; note that the values of the variability criterion have reached zero at all leaves, and this terminates further growth of the tree.

    A. Experiment 1

Experiment 1 is a 2-D synthetic data set generated by uniform distribution random generators. The training set comprises 239 data points (patterns), while the testing set consists of 61 patterns. These two data sets are visualized in Figs. 7 and 8, respectively.

The results of the FCM (with c = 2) are visualized in Fig. 9. Here, we report the prototypes of each cluster, the values of the splitting criterion V_i, the number of data points from the training set allocated to the cluster, and the predicted value at the node (class), which is rounded to the nearest integer value of m_i; this is evident as we are dealing with two discrete classes (labels) of the patterns.

In the next step, we select the first node of the tree, which is characterized by the highest value of the variability index, and expand it by forming two children nodes by applying the FCM algorithm to the data associated with this original node. The decision tree grown in this manner is visualized in Fig. 10.

At the next step, we select the second node of the tree (the one with the highest variability) and expand it in the same way as before (see Fig. 11). As expected, the classification error is equal to zero both for the training and the testing set. This is not surprising considering that the classes are positioned quite far apart from each other.

    B. Experiment 2

The two-dimensional synthetic patterns used in this experiment are normally distributed, with some overlap between the classes (see Figs. 12 and 13). The resulting C-decision tree is visualized in Fig. 14. For this tree, the average classification error on the training data is equal to 0.001250 (with a standard deviation equal to 0.002013). For the testing data, these numbers are equal to 0.188333 and 0.043780, respectively.


    Fig. 12. Two-dimensional training data (240 patterns).

    Fig. 13. Two-dimensional testing data (60 patterns).

    C. Experiment 3

This auto-mpg data set [9] involves a collection of vehicles described by a number of features (such as the weight of a vehicle, the number of cylinders, and fuel consumption). We complete a series of experiments in which we sweep through the number of clusters (c), varying it from 1 to 20, and carry out 20 expansions (iterations) of the tree (the tree is expanded step by step, leading either to its in-depth or in-breadth expansion). The variability observed at all leaves of the tree (the sum of the values of V_i taken over the leaves) characterizes the process of the growth of the tree (refer to Figs. 1 and 2).

The variability measure is reported for the training and testing set as visualized in Figs. 15 and 16. The variability goes down with the growth of the tree, and this becomes evident for the training and testing data. It is also clear that most of the changes in the reduction of the variability occur at the early stages of the growth of the trees; afterwards, the changes are quite limited. Likewise, we note that the drop in the variability values becomes visible when moving from two to three or four clusters. Noticeably, for an increased number of clusters, the variability is practically left unaffected (we see a series of barely distinguishable plots for c greater than five). This effect is present for the training and testing data.

Fig. 14. Complete C-tree; as the values of the variability criterion have reached zero at all leaves, this terminates further growth of the tree.

Fig. 15. Variability of the tree reported during the growth of the tree (training data) for a selected number of clusters.

Fig. 16. Variability of the tree reported during the growth of the tree (testing data) for a selected number of clusters.

After taking a careful look at the variability of the tree, we conclude that the optimal configuration occurs at c = 5 clusters with the number of expansions equal to seven. In this case, the resulting tree is portrayed in Fig. 17. The arrows shown there, along with the labels (numbers), visualize the growth of the tree, namely, the way in which the tree is grown (expanded in consecutive iterations). The numbers in circles denote the node number. The last digit of the node number denotes the number of clusters, while the beginning digits denote the parent node number. A detailed description of the nodes is given in Table II. Again, we report the details of the tree, including the number of patterns residing at each node, their variability V_i, and the predicted value at the node computed by (8).

While the variability criterion is an underlying measure in the design process, the predictive capabilities of the C-decision tree are quantified by the following performance index:

e = ( 1 / N ) \sum_{k=1}^{N} ( y_k - ŷ_k )^2    (14)

In the above expression, ŷ_k denotes the predicted value occurring at the corresponding terminal node (refer to (8)). More specifically, the representative of this node in the output space is calculated as a weighted sum of those elements from the training set that contribute to this node, while y_k is the output value encountered in the data set. For a discrete (classification) problem, this index is simply a classification error that is determined by counting the number of patterns that have been misclassified by the C-decision tree.

Fig. 17. Detailed C-decision tree for the optimal number of clusters and iterations; see the detailed description in the text.
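The performance index can be sketched as follows for the two cases; the mean-squared form for the continuous case follows the reconstruction of (14) above and is an assumption, while the classification error is the misclassification rate described in the text.

```python
# A sketch of the performance index in its two variants.
import numpy as np

def prediction_error(y_true, y_pred):
    """Continuous case of (14): mean squared deviation between predicted and observed outputs."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def classification_error(y_true, y_pred):
    """Discrete case: fraction of patterns misclassified by the C-decision tree."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```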

The values of the error obtained on the training set for a number of different configurations of the tree (number of clusters and iterations) are shown in Fig. 18. Again, most of the changes occur for low numbers of clusters and are characteristic of the early phases of the growth of the tree.

We see that low values of c do not reduce the error even with a substantial growth (number of iterations) of the tree. Similarly, we observe that the same effect occurs for the testing set (see Fig. 19). (Obviously, these results are reported for the sake of completeness; in practice, we choose the best tree on the basis of the training set and test it by means of the testing data.)

    D. Experiment 4

Again, we use a data set from the Machine Learning repository [9], the two-class pima-diabetes data consisting of 768 patterns distributed in an eight-dimensional feature space. In the design of the C-tree, we use the development procedure outlined in the previous section: a node with the maximal value of diversity is chosen as a potential candidate to expand (split into clusters). Prior to this node unfolding, we check whether there is a sufficient number of patterns located there (here, we consider that the criterion is satisfied when this number is greater than the number of clusters). Once this holds, we proceed with the clustering and then look at the structurability index (11), whose value should be greater than or equal to 0.05 (this threshold value has been selected arbitrarily) to accept the unfolding of the node. The number of iterations is set to 10. The plots of the variability shown in Fig. 20 point out that the number of iterations does not have a substantial effect on the value of V; the changes occur mainly during the first few iterations (expansions of the tree). Similarly, there are changes when we increase the number of clusters from two up to six, but beyond this the changes are not very significant.

When using the C-decision tree in the predictive mode, its performance is evaluated by means of (14). The collection of pertinent plots is shown in Figs. 21 and 22 for the training data (the performance results are averaged over the series of experiments). Similarly, the results of the tree on the testing set are visualized in Fig. 23. Evidently, with the increase in the number of clusters, we note a drop in the error values; yet, values of c that are too high lead to some fluctuations of the error (so it is not evident that growing larger trees is still fully legitimate). Such fluctuations are even more profound when studying the plots of the error reported for the testing set. In a search for the optimal configuration, we have found that a number of clusters between three and five and a few iterations led to the best results (see Fig. 24). We observe an evident tendency: while growing larger trees is definitely beneficial in the case of the training data (generally, the error is reduced, with a few exceptions), the error does not change very visibly on the testing data (where the changes are in the range of 1%).

TABLE II. DESCRIPTION OF THE NODES OF THE C-DECISION TREE INCLUDED IN FIG. 17

Fig. 18. Average error of the C-decision tree reported for the training set.

It is of interest to compare the results produced by the C-decision tree with those obtained when applying standard decision trees, namely C4.5. In this analysis, we have experimented with the software available on the Web (http://www.cse.unsw.edu.au/~quinlan/), which is C4.5 revision 8 run with the standard settings (i.e., selection of the attribute that maximizes the information gain; no pruning was used). The results are summarized in Table III. Following the experimental scenario outlined at the beginning of the section, we report the mean values and the standard deviation of the error. For the C-decision trees, the number of nodes is equal to the number of clusters multiplied by the number of iterations.

Fig. 19. Average error of the C-decision tree reported for the testing set.


Fig. 20. Variability (V) for the pima-diabetes data (training set).

Fig. 21. Error (e) as a function of iterations (expansion of the tree) for the training set for a selected number of clusters.

Fig. 22. Error (e) as a function of the number of clusters for selected iterations.

Overall, we note that the C-tree is more compact (in terms of the number of nodes). This is not surprising, as its nodes are more complex than those in the original decision tree. If our intent is to have smaller and more compact structures, C-trees become quite an appealing architecture. The results on the training sets are better for the C-trees at the level of a 3%-6% improvement (for the pima data set). The standard deviation of the error is two times lower for these trees in comparison with C4.5. For the testing set, we note that the larger of the two C-trees in Table III produces almost the same results as C4.5. With the smaller C-tree, we note an increase in the classification rate by 1% in comparison with the larger structure; however, the size of the tree has been reduced to one half of the larger tree (on the pima data). The increase in the size of the tree does not dramatically affect the classification results; the classification rates tend to be better but do not differ significantly from structure to structure. In general, we note that the C-decision tree produces more consistent results in terms of the classification for the training and testing sets; these are closer when compared with the results produced by the C4.5 tree. In some cases, the results of the C-decision tree are better than those of C4.5; this happens for the hepatitis data.

Fig. 23. Error (e) for the pima-diabetes data (testing set).

Fig. 24. Classification error for the C-decision tree versus successive iterations (expansions) of the tree; c = 3 and 5; both training and testing sets are included.

    VI. CONCLUSION

The C-decision trees are classification constructs that are built on the basis of information granules, namely fuzzy clusters. The way in which these trees are constructed deals with successive refinements of the clusters (granules) forming the nodes of the tree. When growing the tree, the nodes (clusters) are split into granules of lower diversity (higher homogeneity). In contrast to C4.5-like trees, all features are used at once rather than one at a time, and such a development approach promotes more compact trees and a versatile geometry of the partition of the feature space. The experimental studies illustrate the main features of the C-trees. One of them is quite profound and highly desirable for any practical usage: the difference in performance of the C-trees on the training and testing sets is lower than the one reported for C4.5.

The C-tree is also sought as a certain blueprint for more detailed models that can be formed on a local basis by considering the data allocated to the individual nodes. At this stage, the models are refined by choosing their topology (e.g., linear models and neural networks) and making decisions about detailed learning.

TABLE III. C-DECISION TREE AND C4.5: A COMPARATIVE ANALYSIS FOR SEVERAL MACHINE LEARNING DATA SETS: (a) PIMA-DIABETES, (b) IONOSPHERE, (c) HEPATITIS (IN THIS DATA SET, ALL MISSING VALUES WERE REPLACED BY THE AVERAGES OF THE CORRESPONDING ATTRIBUTES), (d) DERMATOLOGY, AND (e) AUTO DATA

    REFERENCES

[1] W. P. Alexander and S. Grimshaw, "Treed regression," J. Computational and Graphical Statistics, vol. 5, pp. 156-175, 1996.
[2] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[4] E. Cantu-Paz and C. Kamath, "Using evolutionary algorithms to induce oblique decision trees," in Proc. Genetic and Evolutionary Computation Conf. 2000, D. Whitley, D. E. Goldberg, E. Cantu-Paz, L. Spector, I. Parmee, and H.-G. Beyer, Eds., San Francisco, CA, pp. 1053-1060.
[5] A. Dobra and J. Gehrke, "SECRET: A scalable linear regression tree algorithm," in Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Edmonton, AB, Canada, Jul. 2002.
[6] A. Ittner and M. Schlosser, "Non-linear decision trees (NDT)," in Proc. 13th Int. Conf. Machine Learning (ICML'96), Bari, Italy, Jul. 3-6, 1996.
[7] A. Ittner, J. Zeidler, R. Rossius, W. Dilger, and M. Schlosser, "Feature space partitioning by non-linear fuzzy decision trees," in Proc. Int. Fuzzy Systems Assoc. Congress, pp. 394-398.
[8] A. K. Jain et al., "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264-323, Sep. 1999.
[9] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California, Irvine, CA, 1996. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[10] S. K. Murthy, S. Kasif, and S. Salzberg, "A system for induction of oblique decision trees," J. Artificial Intelligence Res., vol. 2, pp. 1-32, 1994.
[11] W. Pedrycz and Z. A. Sosnowski, "Designing decision trees with the use of fuzzy granulation," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 30, no. 2, pp. 151-159, Mar. 2000.
[12] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, pp. 81-106, 1986.
[13] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann, 1993.
[14] J. S. Siebert, "Vehicle recognition using rule-based methods," Research Memo TIRM-87-017, Turing Institute, 1987.
[15] R. Weber, "Fuzzy ID3: A class of methods for automatic knowledge acquisition," in Proc. 2nd Int. Conf. Fuzzy Logic and Neural Networks, Iizuka, Japan, Jul. 17-22, 1992, pp. 265-268.
[16] O. T. Yildiz and E. Alpaydin, "Omnivariate decision trees," IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1539-1546, Nov. 2001.

Witold Pedrycz (M'88, SM'94, F'99) is a Professor and Canada Research Chair (CRC) in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is actively pursuing research in computational intelligence, fuzzy modeling, knowledge discovery and data mining, fuzzy control (including fuzzy controllers), pattern recognition, knowledge-based neural networks, relational computation, bioinformatics, and software engineering. He has published numerous papers in this area. He is also the author of eight research monographs covering various aspects of computational intelligence and software engineering.

Dr. Pedrycz has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, Parts A and B, and the IEEE TRANSACTIONS ON FUZZY SYSTEMS. He is the Editor-in-Chief of Information Sciences, and he is President-Elect of the International Fuzzy Systems Association (IFSA) and President of the North American Fuzzy Information Processing Society (NAFIPS).

Zenon A. Sosnowski (M'99) received the M.Sc. degree in mathematics from the University of Warsaw, Warsaw, Poland, in 1976 and the Ph.D. degree in computer science from the Warsaw University of Technology, Warsaw, Poland, in 1986.

He has been with the Technical University of Bialystok, Bialystok, Poland, since 1976, where he is an Assistant Professor in the Department of Computer Science. In 1988-1989, he spent five months at the Delft University of Technology in the Netherlands. He spent two years (1990-1991) with the Knowledge Systems Laboratory of the National Research Council's Institute for Information Technology, Ottawa, ON, Canada. His research interests include artificial intelligence, expert systems, approximate reasoning, fuzzy sets, and knowledge engineering.

Dr. Sosnowski is a Member of the IEEE Systems, Man, and Cybernetics Society.