Adaptive Neuro-Fuzzy Inference with Statistical Distance Metrics and Hybrid Learning Algorithms for Data Mining of Symbolic Data

S. Papadimitriou 1,2, S. Mavroudi 1, L. Vladutu 1, G. Pavlides 2, A. Bezerianos 1

    1. Department of Medical Physics, School of Medicine, University of Patras,

    26500 Patras, Greece, tel: +30-61-996115,

    email: [email protected]

    2. Dept. of Computer Engineering and Informatics,

University of Patras, 26500 Patras, Greece, tel: +30-61-997804,

    email: [email protected]

Abstract: The application of neuro-fuzzy systems to domains involving prediction and classification of symbolic data requires a reconsideration and a careful definition of the concept of distance between patterns. Traditional distances are inadequate for capturing the proximity between symbolic patterns. This work proposes a new neuro-fuzzy architecture, the Symbolic Adaptive Neuro Fuzzy Inference System (SANFIS), that effectively utilizes a statistically extracted distance measure. The learning approach is a hybrid one and consists of a sequence of steps, some of which are essential while others further optimize performance. Initially, a Statistical Distance Metric space is computed from the information provided by the training set. The premise parameters are subsequently evaluated with a three-phase Instance Based Learning (IBL) scheme that estimates the input membership function centers and spreads and constructs the corresponding fuzzy rules. The first phase of the IBL scheme explores heuristic approaches that can uncover information about the relative importance and reliability of the examples. The second phase exploits this information and extracts an adequate subset of the training patterns for the construction of the fuzzy rules. The concept of fuzzy adaptive subsethood is used at the third phase to reduce the number of fuzzy sets used as input membership functions. The consequent parameters are estimated with an efficient linear least squares formulation. The performance obtained from a SANFIS trained with the hybrid learning methods is significantly better than that of traditional nearest neighbor Instance Based Learning schemes on many data mining problems, and the system offers enhanced explanation ability.

    Keywords: Neuro-fuzzy Learning, Data Mining, Symbolic Data Classification, Radial Basis Functions, Heuristic

    Learning, Instance Based Learning.


    1 Introduction

    The emergence of neuro-fuzzy network technology [1,4] offers valuable insight to confront complicated data

    mining problems. In this context, neuro-fuzzy networks can be viewed as advanced mathematical models for

    discovering complex dependencies between variables of physical processes from a set of perturbed

observations. In designing a neuro-fuzzy network model of a complex physical process, we are in effect building a nonlinear model of the physical process that generates the attribute sets and the corresponding

outcomes [2,3,4]. However, although the application of neuro-fuzzy networks has proven very successful in problem domains such as signal processing and pattern recognition, it has not yielded adequate performance in the domain of data mining of symbolic data. Patterns arising both from commercial databases and from many engineering databases (such as those describing biosequences [10]) involve data defined over a space that lacks the fundamental properties of distance metric spaces. In order to confront these difficulties, the paper presents the Symbolic Adaptive Neuro Fuzzy Inference System (SANFIS) architecture and learning algorithms.

This work first adapts, within the context of the peculiarities of SANFIS, a distance metric for expressing the distance between values of features in symbolic domains. This metric was initially proposed in the context of nearest neighbor schemes as a means of capturing effectively the proximity information of symbolic patterns [7,14,16]. The data mining techniques that the paper presents have a wide span of possible applications, since they are adapted to heterogeneous data records having both symbolic and numeric attributes. For the symbolic attributes, the Statistical Distance Metric adapted from the Modified Value Difference Metric (MVDM) [7,16] has yielded the best performance. However, for the numeric attributes the optimal policy for evaluating the distance is not obvious. In many cases a generalization improvement is obtained by handling the numeric values with a normalized numeric distance, e.g. a Euclidean distance metric normalized with the standard deviation. Nevertheless, numeric attribute domains are frequently better handled with a discretized numeric distance metric augmented with an interpolation that alleviates the discretization effects (e.g. the Interpolated Value Difference Metric [16]).

After the computation of a Statistical Distance Metric space for the symbolic attributes, the main steps of the presented approach for the construction of the SANFIS system for data mining can be summarized as follows:

Initial estimation of input Membership Functions (MFs) and rules: Estimation of the reliability and importance of each training example with Instance Based Learning techniques. The outcome of this learning phase is an ordering of the examples in terms of their significance and reliability. Weight parameters are computed that quantify the importance of the examples. The SANFIS network can be designed to exploit effectively the irregularity of the problem's state space through the selection of the proper training examples as rule centers and the determination of parameters for each such rule center. These parameters account for the significance and reliability of the corresponding example. We describe three different Instance Based Learning (IBL) algorithms for the implementation of this learning step.


where we use Gaussian membership functions, i.e. $\mu_{A_i}(x) = \exp\left( -\left( \frac{x - m_i}{\sigma_i} \right)^2 \right)$, where $m_i$ and $\sigma_i$ are the premise parameters that correspond to the centers and spreads of the Gaussians. We should note here that for numeric features, $x$ is simply the numeric value as it is fed from Layer 1.

However, for symbolic features the statistical distance $SD(x, m_i)$ between the input feature value $x$ and the symbolic value $m_i$ that forms the center of the corresponding MF is computed. The distance $SD(x, m_i)$ and the spread $\sigma_i$ are computed according to the methods described in the next section. For symbolic features, Layer 2 of SANFIS evaluates the following Gaussian membership function:

$$\mu_{A_i}(x) = \exp\left( -\left( \frac{SD(x, m_i)}{\sigma_i} \right)^2 \right).$$

Layer 3: The conjunction layer computes the firing strength of each rule. The algebraic product is used as the conjunction operator. Each node of this layer is a circle node labeled $\Pi$ that forms the product of the incoming signals, i.e. for Figure 1, $w_i = \mu_{A_i}(x_1)\,\mu_{B_i}(x_2)$, $i = 1, 2$.

Layer 4: The normalization layer normalizes the firing strengths of the rules by dividing by the sum of these strengths, i.e. $\bar{w}_i = \frac{w_i}{w_1 + w_2}$, $i = 1, 2$. The outputs of this layer are called normalized firing strengths.

Layer 5: A zeroth-order Takagi-Sugeno type is used for the specification of the outputs of the fuzzy rules [24], i.e. $O_j(\mathbf{x}) = \bar{w}_j(\mathbf{x})\, b_j$.

Layer 6: A summation operation performed by this layer computes the total output of the fuzzy system as the sum of the local rule outputs.

    3. The Statistical Distance Metric (SDM)

The key problem for applications involving symbolic features is the definition of the distance metric. In domains where features are numeric, it is straightforward to compute the distance between two points in the pattern space in terms of a geometric distance (e.g. Euclidean, Manhattan). Indeed, the traditional neurofuzzy learning algorithms have been formulated on these distance metrics and operate effectively in numeric domains with such distances. However, when the features are symbolic (as usually happens in bioinformatics and in commercial data mining applications), the utilization of the traditional types of distances yields inadequate performance. Two common approaches for handling symbolic information are the overlap method and the orthogonal representation [5, 7]. The overlap method simply counts the number of feature values that two


instances have in common. This distance metric oversimplifies the pattern space, ignores all the information present within the training set, and yields (as expected) poor performance. The orthogonal representation encodes the symbolic features as binary vectors in such a way that the numerical distance between different feature values is the same. This method suffers from the same inability to extract any information embedded within the training set.

In order to obtain an effective formulation of the distances between patterns with symbolic feature values, we have adapted distance measures of the type proposed in [7, 16]. The statistical distance measure takes into account the overall similarity of classification of all instances for each possible value of each feature. This method extracts from the training set, with a statistical approach, a matrix that defines the distances between all possible values of a given feature. Therefore, a separate matrix is obtained for each feature. The distance measure for a specific feature $f$ is defined according to the following equation:

$$SD_f(V_A, V_B) = \sum_{c=1}^{N_c} \left| \frac{C_{A,c}}{C_A} - \frac{C_{B,c}}{C_B} \right|^{k} \qquad (1)$$

In the equation above, $V_A$ and $V_B$ denote two possible values of the feature $f$; e.g. for the DNA promoter data they are two nucleotides. The distance between the values is a sum over all the $N_c$ classes. For example, for the DNA promoter data (discussed below) there are two classes: either the sequence is a promoter (i.e. a sequence that initiates a process called transcription) or not. The number of patterns whose feature $f$ has value $V_A$ ($V_B$) and that are classified to class $c$ is denoted by $C_{A,c}$ ($C_{B,c}$). Also, the total number of patterns that have value $V_A$ ($V_B$) for feature $f$ is denoted by $C_A$ ($C_B$), and $k$ is a constant usually set to 1. These counts are computed over all patterns of the training set. It becomes evident from (1) that the distance between feature values labeled with the same relative frequency for all possible classes is zero.

Furthermore, the more correlated the classifications pertaining to two values of a feature are, the smaller their statistical distance computed with equation (1). Therefore, for feature values with similar classifications a small statistical distance will be computed. Equation (1) accounts for the overall similarity of classification of all training instances for each possible class.
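To make the construction concrete, the following is a minimal sketch in Python (our choice; the function name and the data layout, with patterns as sequences of symbolic values and a parallel list of class labels, are illustrative assumptions) of how the per-feature value-difference tables of equation (1) can be estimated from a training set:

from collections import defaultdict

def sdm_tables(patterns, labels, k=1):
    # For each feature f, estimate SD_f(V_A, V_B) of equation (1)
    # between every pair of symbolic values observed for f.
    n_features = len(patterns[0])
    classes = sorted(set(labels))
    tables = []
    for f in range(n_features):
        cac = defaultdict(lambda: defaultdict(int))  # C_{A,c} counts
        ca = defaultdict(int)                        # C_A totals
        for x, y in zip(patterns, labels):
            cac[x[f]][y] += 1
            ca[x[f]] += 1
        table = {}
        for va in ca:
            for vb in ca:
                table[(va, vb)] = sum(
                    abs(cac[va][c] / ca[va] - cac[vb][c] / ca[vb]) ** k
                    for c in classes)
        tables.append(table)
    return tables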

The distance $D(X, Y)$ between two patterns $X, Y$ is obtained by a weighted sum of the distances between the values of the individual features of these patterns (obtained from equation (1)):

$$D(X, Y) = \sum_{i=1}^{F} w_{f_i} \, SD_{f_i}\!\left(V_{X,f_i}, V_{Y,f_i}\right)^{r} \qquad (2)$$

where $F$ is the number of features, $w_f$ is the weight assigned to feature $f$ reflecting its significance, and $r$ is a parameter that controls how distances between individual features scale in the computation of the total pattern distance (usually $r = 1$ or $r = 2$). Also, $V_{X,f_i}$ and $V_{Y,f_i}$ denote the values of the $i$th feature of $X$ and $Y$.
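Given such tables, equation (2) is a direct weighted sum; a sketch, assuming the sdm_tables output above, with uniform feature weights and $r = 2$ as illustrative defaults:

def pattern_distance(x, y, tables, weights=None, r=2):
    # Equation (2): weighted sum of per-feature statistical distances.
    # The feature values of x and y are assumed to occur in the training set.
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * tables[f][(xf, yf)] ** r
               for f, (w, xf, yf) in enumerate(zip(weights, x, y)))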

The SDM is defined for symbolic attributes in a way that resembles the Modified Value Difference Metric (MVDM) used in the PEBLS system [19]. However, for numeric features we have two major choices:

a) to adopt a numerical distance measure along these dimensions, thereby obtaining a distance metric of heterogeneous type [16];

b) to discretize the numerical ranges and extend the functionality of the SDM. A parameter of particular importance in this discretization is the number $s$ of equal-width intervals. Although it is difficult to define general guidelines, the heuristic rule of setting $s$ to the larger of 5 and $C$, where $C$ is the number of output classes of the problem domain, proves effective in practice (see the sketch below).

Experience gained from real data sets indicates that it is difficult to judge beforehand which of these approaches is better.
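A sketch of the equal-width discretization of choice (b), with the $s = \max(5, C)$ heuristic; the interpolation against discretization effects (as in the IVDM [16]) is omitted here:

def discretize(values, n_classes):
    # Equal-width discretization into s = max(5, C) intervals.
    s = max(5, n_classes)
    lo, hi = min(values), max(values)
    width = (hi - lo) / s or 1.0   # guard against a constant attribute
    # map each numeric value to the index of its interval (0 .. s-1)
    return [min(int((v - lo) / width), s - 1) for v in values]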

4. Neurofuzzy inference with the Statistical Distance Metric (SDM)

In contrast to example-based nearest neighbor learning schemes, SANFIS learns a smooth functional that weights the contribution of each exemplar. This provides some intuition for the superior performance of SANFIS networks relative to simple Instance Based Learning (IBL) schemes. Nevertheless, in order for the SANFIS solution to obtain better generalization performance than the nearest neighbor schemes, a careful design of its parameters is necessary. This section discusses how the peculiarities of the Statistical Distance Metric space can be described properly through the tuning of SANFIS parameters. Actually, the extraction of the Statistical Distance Metric space (described in the previous section) and the corresponding tuning of SANFIS parameters (described in the current section) can be viewed as a preprocessing stage that extracts from the representation of the symbolic features useful distance measures for neurofuzzy inference.

A parameter of particular importance is the region of influence of the Gaussian Radial Basis Function MFs, which is determined by their spread parameter. The determination of proper spread parameters for the membership functions becomes more complicated within the domain of statistical distances. In the neural network literature significant work exists on the estimation of spreads for Radial Basis Function (RBF) networks [2,3,6,8,9]. The heuristic suggestion of [1], which computes these spreads for RBF networks as $\sigma = d_{max} / \sqrt{2m}$, where $d_{max}$ is the maximum distance between patterns and $m$ is the number of RBF centers, has not proven very effective in practice in the SDM case. One basic reason for the inefficiency of the above formula is that symbolic features can differ significantly from one another. Therefore, the spread of the MFs corresponding to each feature dimension should be computed by designing the corresponding distance metric independently of the other features. Thus, the design of the SANFIS network proceeds by computing different feature spreading scaling factors for each feature and different weights for the rule centers. The weight parameters adjust the


region of influence of a rule center, which relates to the importance and the reliability of the example; their computation is considered in the next section.

The feature spreading scaling factors adjust the spread of the Gaussian kernels along the dimension that corresponds to the feature. In order to obtain an effective general setting for the computation of the feature spreading factors, a sensible approach is to first obtain an estimate of the average distance $d_{av,f}$ of patterns within the space defined by the SDM, independently for each feature $f$. The parameter $d_{av,f}$ is used as the feature spreading scaling factor, since it is learned from the peculiarities of the particular feature and effectively normalizes the distances along the dimension determined by the feature. Then the region of influence of the MF kernels is designed by requiring that at a particular distance $Spread$ from a MF center, expressed in units of $d_{av,f}$, the influence is attenuated down to $a$. The meaning of these parameters in practice is illustrated in Figure 2. The figure illustrates the envelope of the Gaussian that is used as membership function for a feature dimension $f$ with an average distance parameter $d_{av,f} = 0.5$. The requirement fulfilled by this MF is that the set membership reduces to 10% (i.e. the attenuation parameter is $a = 0.1$) at a distance three times $d_{av,f}$ (i.e. $Spread = 3$).

We should note that with this method for the estimation of the MF spreads, the number of rules is considered only implicitly, since this number determines to a large extent the average distance parameter $d_{av,f}$. Mathematically, the requirements for the rate of decay of the MFs along each feature dimension are expressed by seeking a parameter $\sigma_f$ such that:

$$\exp(-\sigma_f \cdot Spread \cdot d_{av,f}) = a \qquad (3)$$

and therefore the required parameter $\sigma_f$ is derived as:

$$\sigma_f = \frac{-\log(a)}{Spread \cdot d_{av,f}} \qquad (4)$$

Values of these parameters that yield good results are those of the example presented in Figure 2, i.e. $Spread = 3$ and $a = 0.1$. These parameters imply that at a distance from a rule center along the corresponding feature dimension 3 times larger than the average distance between patterns, the influence of the MF for the particular feature $f$ is attenuated by a factor of 0.1. For these parameters we obtain $\sigma_f = -\log(a) / (Spread \cdot d_{av,f}) = 1.5351$, and the Gaussian envelope plotted in Figure 2 corresponds to $\exp(-1.5351\, x)$.
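The computation of equation (4) is direct; a small sketch reproducing the worked example above:

import math

def feature_spreading(d_av_f, spread=3.0, a=0.1):
    # Equation (4): sigma_f such that exp(-sigma_f * Spread * d_av_f) = a.
    return -math.log(a) / (spread * d_av_f)

print(feature_spreading(0.5))   # -> 1.5351 (approximately), as in Figure 2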

The evaluation of the rule activation proceeds by computing the response of the conjunctive premise conditions along each feature dimension independently and by multiplying the individual responses. The MF evaluation corresponding to a distance $x_f$ from the center of the MF $i$ for feature $f$ is $MF_{i,f}(x_f) = \exp(-\sigma_f x_f)$. Note that all the MFs of a feature dimension $f$ have the same spread $\sigma_f$ at all the learning steps except the last, i.e. the MF merging step, which generally modifies the Gaussian MF spreads by merging the appropriate MFs. Finally, the total activation of the rule, $RULE(\mathbf{x})$, is obtained by the multiplicative conjunction rule as $RULE(\mathbf{x}) = \prod_f \exp(-\sigma_f x_f) = \exp\left(-\sum_f \sigma_f x_f\right)$. This scheme is easily augmented to account for feature weighting as $RULE'(\mathbf{x}) = \prod_f w_f \exp(-\sigma_f x_f) = \exp\left(-\sum_f (\sigma_f x_f - w'_f)\right)$, where $w'_f = \log w_f$.

We can elaborate further on the design of the SANFIS neurofuzzy network by exploiting the results of the Instance Based Learning step, described in the next section, about the significance and reliability of each example. Based on these results, for each example $\mathbf{x}_r$, $r = 1, \ldots, R$, used for the construction of the fuzzy system, a weight parameter $w_r$ is determined. This weight parameter is assigned to the corresponding rule $r$, such that the spread $\sigma_{f,r}$ of each MF that corresponds to feature $f$ and belongs to the premise part of the same rule $r$ is further defined as $\sigma_{f,r} = \sigma_f / w_r$. The designed MF centers exert an influence at a distance $x$ from their center given by $\exp(-\sigma_{f,r}\, x)$ for each feature dimension. Clearly, the larger the value of the weight parameter, the smaller the parameter $\sigma_{f,r}$ and therefore the more extended the region of influence of the corresponding Gaussian MF. Consequently, this means that the rule created from example $\mathbf{x}_r$ is a reliable classifier.

Therefore, the above scheme, with the spreads $\sigma_{f,r}$ dependent on both the rule $r$ and the feature $f$, trains the spreads of the MFs locally, accounting for the peculiarities and irregularities of the state space. The Instance Based Learning (IBL) step can estimate the relative importance of each rule, summarized in the parameter $w_r$, and therefore can improve the performance of the designed neurofuzzy solution.
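A sketch tying the two mechanisms together: per-feature spreads $\sigma_f$ from equation (4), scaled per rule by a reliability weight $w_r$, and the multiplicative conjunction of the exponential memberships. The rule representation and the convention that larger $w_r$ marks a more reliable example are our illustrative assumptions:

import math

def rule_activation(x, center, sigma_f, w_r, sd):
    # Product over features of exp(-sigma_{f,r} * distance), with the
    # per-rule spreads sigma_{f,r} = sigma_f / w_r widening the region
    # of influence of rules built from reliable examples.
    act = 1.0
    for f, (xf, mf) in enumerate(zip(x, center)):
        d = sd(f, xf, mf)     # statistical distance along feature f
        act *= math.exp(-(sigma_f[f] / w_r) * d)
    return act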


5. SANFIS Learning

Although the architecture of SANFIS is similar to ANFIS [20,21,22,23], the training algorithms are significantly different. Instead of using gradient descent optimization of the premise and consequent parameters, we use a two-stage learning strategy. The first stage performs structure learning with a multistep Instance Based Learning (IBL) approach. At this stage the reliability and importance of the examples is estimated, as described in Subsection 5.1. The examples are ordered in a list according to their significance. The next step, described in Subsection 5.2, is an incremental fuzzy system construction process that exploits the information gathered about the reliability of the examples and about the peculiarities of the statistical distance space along each feature dimension. Exploiting the similarity of membership functions with a fuzzy mutual subsethood measure reduces their number and can simplify the SANFIS structure, generally without degradation of the generalization performance. This pruning process is described in Subsection 5.3. Finally, in Subsection 5.4 the consequent parameters are learned with a linear least squares formulation that can be computed efficiently with the pseudoinverse solution.

5.1 Instance Based Learning (IBL) for the selection of MF centers and the determination of their parameters

Some examples of the training set are more reliable classifiers than others. It is highly desirable to detect the reliable examples and to exploit them for the rule construction. Also, the more reliable an example is, the larger should be the region of influence of the corresponding MF when the example is used as a rule center. The extent of the region of influence is expressed by the spread parameter of the rule center. Instead of using clustering self-organizing techniques [2] or an approach that exploits the principle of structural risk minimization [15], a heuristically driven learning strategy is adopted for determining the examples that should be used as rules and their widths. The former approaches are well suited to the effective learning of smooth functionals, even when these lie in high dimensional spaces, but the irregularities of the statistical distance metric space extracted from symbolic data and the need to cope with many artefacted examples pose several problems in the application of their mathematical framework.

The proposed neurofuzzy training approach consists of two phases. In the first phase, the Instance Based Learning (IBL) step, the potential of each example to serve as a rule (i.e. how representative and reliable the example is) is evaluated. This step is of heuristic type, and it tries to discover the reliability and the importance of the training examples with an Instance Based Learning (IBL) scheme that resembles the functionality of PEBLS [7]. The IBL learning step offers the potential to detect the noisy examples of the designed classification system at the initial approximate solution, and therefore the possibility of removing them from the second neurofuzzy learning phase. The IBL step is implemented with nearest neighbor based schemes [14, 16]. The classification function of IBL, viewed as an input-output mapping, tends to have many class boundaries and


discontinuous islands of misclassified regions placed near erroneously classified examples. The structure of the decision boundaries is smoothed and most of the artefacted regions are removed, in order to reject the influence of noisy examples on the designed classification system. The examples that do not yield satisfactory performance at the initial IBL learning step are detected and marked, in order to avoid their use in the rule discovery process. This learning step detects the good examples with Instance Based Learning and uses them to construct rules, by placing the MF centers and by tuning the MF spreads according to the structure of the SDM along the corresponding dimension. This can be viewed as a structure identification step.

The second learning phase is the neurofuzzy training phase. The first learning step of this phase, the premise parameter identification, is accomplished with Instance Based Learning techniques. These techniques construct the MFs, place their centers and compute their spreads. All these parameters are computed within the Statistical Distance Metric space. This step can be considered to perform an initial approximation to the structure of the solution space, analogous to the one performed by Instance Based Learning nearest neighbor schemes. For every example $r$ and feature $f$, a membership function of the form $close\_to(r, f)$ is constructed along the dimension of $f$, in order to describe how the outcome depends on the closeness to the example $r$. Thus, the rules have the following form:

if $close\_to(r, 1) \wedge close\_to(r, 2) \wedge \ldots \wedge close\_to(r, F)$ then $class\_of\_example(r)$

where the consequent $class\_of\_example(r)$ is expressed with a scalar constant $b_r$ and $F$ denotes the number of features. Clearly, these rules constitute a zeroth-order Takagi-Sugeno system.

Then, for the second phase, we avoid the formulation of gradient descent based techniques to further tune the premise parameters and to compute the consequent parameters. The approximate structure and the lack of pattern coordinates in the Statistical Distance Metric space significantly perplex the formulation of the gradient. At the second learning step, the consequent parameter learning step, the fuzzy system is designed to construct a smooth solution that fits the training set well.

Three basic approaches have been explored for the implementation of the structure identification learning pass, based on Instance Based Learning techniques. The one-pass approach is an exemplar weighting method that is used in combination with the nearest neighbor parameter $k$, which must attain a value larger than one. The learning is accomplished with only one pass through the training examples. At this pass, for each training instance, its $k$ nearest neighbors are detected among the remaining training set. If $j$ neighbors have a matching class, then a weight is assigned to the current instance according to the simple formula $weight = 1 + k - j$. Therefore, the more the class of the exemplar is reinforced by its neighbors, the smaller the weight (i.e. the more reliable the exemplar). Algorithmically, the one-pass instance based learning algorithm takes the form:

foreach pattern P of the training set do
    Detect the k nearest neighbors of P in the training set according to the Statistical Distance Metric;
    Let j = number of nearest neighbors with the same class label as the actual class of P;
    Set the weight parameter that quantifies the reliability of the exemplar as weight(P) = 1 + k - j;
endfor;
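A runnable sketch of this pass, assuming the pattern_distance helper defined earlier; ties in the neighbor ranking are broken arbitrarily:

def one_pass_weights(patterns, labels, tables, k=6):
    # weight(P) = 1 + k - j, where j of P's k nearest neighbors
    # (by statistical distance) share P's class label.
    weights = []
    for i, (p, c) in enumerate(zip(patterns, labels)):
        neighbors = sorted((j for j in range(len(patterns)) if j != i),
                           key=lambda j: pattern_distance(p, patterns[j], tables))
        j_match = sum(1 for j in neighbors[:k] if labels[j] == c)
        weights.append(1 + k - j_match)  # 1 = fully reinforced, k+1 = isolated
    return weights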

As a particular example, consider the case where the nearest neighborhood parameter is set to six. The exemplar weights will range from one to seven, depending on the number of neighbors that have a matching class. A weight of one means that all the neighbors reinforce the class assignment of the example (i.e. $j = k$), and therefore this example is reliable. Conversely, at the other extreme, a weight of seven implies with high probability that the isolated exemplar is artefacted and therefore should not count too much toward the final solution.

The one-pass technique succeeds in significantly attenuating the effect of artefacted exemplars. Evidently, it also attenuates the exemplars placed near class separation boundaries. Even so, since the strength of the boundary exemplars is symmetrically reduced, the algorithm is not biased towards favoring a particular class.

An alternative IBL heuristic weighting approach for the determination of the influence of the rule centers evaluates the performance history related to each exemplar. The performance history is evaluated by running a large number of classification trials. Usually, all the patterns of the training set are used in those trials. At each such trial a sample pattern $P$ is picked from the training set and its classification $class(P)$ is evaluated as the classification of its nearest neighbor $P_n$, i.e. $class(P) = class(P_n)$. The classification $class(P)$ assigned with the nearest neighbor rule is compared with the actual class of the pattern. Exemplars that have been used successfully to classify their neighbors are assigned a small weight (correspondingly, a large region of influence). Therefore, the spread of the MFs is adjusted by considering the success ratio of the exemplar, as evaluated from the percentage of times that it was used to classify correctly. This percentage evaluates the effectiveness of a rule as a classifier. Denoting by $used(i)$ the number of times the exemplar $i$ (that serves as the rule center) is used to classify (as the nearest neighbor example), and by $correct(i)$ the number of times it is used correctly, the formula for weighting becomes:

$$weight(i) = used(i) / correct(i)$$

We should note that at the evaluation of the nearest neighbor a weighting scheme for the exemplars is not used, i.e. all examples are treated equally in the distance computations.

The algorithm for the used-correct approach is formulated as:

forall patterns p of the training set do
    used[p] = correct[p] = 1;
endfor;

forall patterns p of the training set do
    Cp = class(p); // actual class of pattern p
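A sketch of the complete used/correct pass as described in the prose above (assuming the pattern_distance helper; each trial classifies a pattern by its nearest neighbor and credits that neighbor):

def used_correct_weights(patterns, labels, tables):
    # weight(i) = used(i) / correct(i); reliable exemplars get small weights.
    n = len(patterns)
    used = [1] * n        # initialized to 1, as in the listing above
    correct = [1] * n
    for i in range(n):
        # nearest neighbor of pattern i among the remaining patterns
        nn = min((j for j in range(n) if j != i),
                 key=lambda j: pattern_distance(patterns[i], patterns[j], tables))
        used[nn] += 1
        if labels[nn] == labels[i]:
            correct[nn] += 1
    return [u / c for u, c in zip(used, correct)]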


$$F(\mathbf{x} \mid \theta) = \frac{\displaystyle\sum_{r=1}^{R} b_r \prod_{f=1}^{F} \exp\left( -\frac{1}{2} \left( \frac{SD_f(x_f, m_{f,r})}{\sigma_{f,r}} \right)^{2} \right)}{\displaystyle\sum_{r=1}^{R} \prod_{f=1}^{F} \exp\left( -\frac{1}{2} \left( \frac{SD_f(x_f, m_{f,r})}{\sigma_{f,r}} \right)^{2} \right)} \qquad (5)$$

where:

$\mathbf{x} = [x_1\; x_2\; \cdots\; x_F]$ is the input vector.

$\theta = \{ b_r, m_{f,r}, \sigma_{f,r} \}$, $r = 1, \ldots, R$, $f = 1, \ldots, F$, denotes the parameter set of the SANFIS system.

$R$ is the number of fuzzy rules.

$F$ is the number of input variables, i.e. the number of features.

$m_{f,r}$ is a numeric value assigned to the symbolic feature where the $f$th membership function of the $r$th fuzzy rule achieves its maximum. The actual numeric value of $m_{f,r}$ does not have a particular meaning for the application, because it is usually an integer mapped to the symbolic feature arbitrarily (e.g. consecutive integers can be used for tokenizing features).

$\sigma_{f,r}$ is the width (spread) of the membership function for the $f$th feature and the $r$th fuzzy rule.

$SD_f(x_f, m_{f,r})$ is the statistical distance between the symbolic feature value $x_f$ and the symbolic feature rule center $m_{f,r}$, computed according to equation (1).

$b_r$ is the scalar output of the $r$th Takagi-Sugeno zeroth-order fuzzy rule.
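As a sketch, equation (5) can be evaluated as follows; the dictionary-per-rule representation and the callable sd supplying the per-feature statistical distance are illustrative choices:

import math

def sanfis_output(x, rules, sd):
    # Equation (5): normalized weighted sum of the rule outputs b_r,
    # with Gaussian firing strengths over statistical distances.
    num = den = 0.0
    for rule in rules:        # rule: {'b': b_r, 'm': centers, 'sigma': spreads}
        w = 1.0
        for f, xf in enumerate(x):
            d = sd(f, xf, rule['m'][f])
            w *= math.exp(-0.5 * (d / rule['sigma'][f]) ** 2)
        num += rule['b'] * w
        den += w
    return num / den if den else 0.0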

The algorithmic steps for the fuzzy system construction are as follows:

1. Retrieve the most significant example $(\mathbf{x}, d)$, where $\mathbf{x}$ denotes the attribute vector and $d$ the corresponding class label, and construct the first fuzzy rule with the ExampleToFuzzyRule construction algorithm.
2. while the list of examples is not empty do
3.   Retrieve the next significant example $(\mathbf{x}, d)$.
4.   if the currently constructed fuzzy system cannot explain adequately the mapping $(\mathbf{x}, d)$ then
5.     Construct a new fuzzy rule for $\mathbf{x}$ and expand the fuzzy system.
     endif
   endwhile

The fuzzy system construction proceeds by considering, before expansion, whether the current set of fuzzy rules is adequate to explain the currently considered example $\mathbf{x}$. If


$$\left| F(\mathbf{x} \mid \theta) - y \right| \le \epsilon_F$$

then the fuzzy system $F$ already satisfactorily represents the information in the corresponding example; hence no rule is added to $F$, and the next training example is considered by performing the same type of test. Suppose instead that

$$\left| F(\mathbf{x} \mid \theta) - y \right| > \epsilon_F$$

Then a new fuzzy rule $r_{new}$ is added to represent the information contained in $(\mathbf{x}, y)$. The widths $\sigma_{f,r_{new}}$ of the new rule $r_{new}$ are adapted in order to adjust the spacing between the membership functions, so that the new rule does not distort what has already been learned by the fuzzy system and a smooth interpolation is performed between training points.

This modification of the $\sigma_{f,r_{new}}$, for each feature dimension $f$, is accomplished by determining the nearest neighbor $m_{f,r_{near}}$ of the $f$th MF center $m_{f,r_{new}}$, among all the MF centers of the same feature $f$ of the fuzzy system constructed up to this point, according to the statistical distance metric. The spread $\sigma_{f,r_{new}}$ is then set according to

$$\sigma_{f,r_{new}} = \frac{1}{\lambda} \left| m_{f,r_{new}} - m_{f,r_{near}} \right| \qquad \text{for } f = 1, \ldots, F,$$

where $\lambda$ is a weighting factor that determines the amount of overlap of the membership functions. The weighting factor $\lambda$ and the widths $\sigma_{f,r_{new}}$ have an inverse relationship. A value of $\lambda = 2$ attains good results in practice.
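A sketch of this incremental construction, assuming the sanfis_output function above; ExampleToFuzzyRule is reduced here to appending a rule with default spreads, later narrowed by the nearest-center update, and eps stands for the tolerance of the adequacy test:

def build_rules(examples, sd, sigma_default, eps=0.1, lam=2.0):
    # Add a rule only when the current system cannot explain the
    # example within the tolerance eps (the examples are assumed
    # ordered by decreasing significance).
    rules = []
    for x, d in examples:
        if rules and abs(sanfis_output(x, rules, sd) - d) <= eps:
            continue                       # adequately explained; skip
        new = {'b': d, 'm': list(x), 'sigma': list(sigma_default)}
        for f in range(len(x)):
            # spread from the nearest existing center along feature f
            near = min((sd(f, x[f], r['m'][f]) for r in rules), default=0.0)
            if near > 0.0:
                new['sigma'][f] = near / lam   # sigma = |m_new - m_near| / lambda
        rules.append(new)
    return rules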

5.3 The Mutual Subsethood measure and Membership Function aggregation

Membership functions $\mu_{f,r}$, $r = 1, \ldots, R$, $f = 1, \ldots, F$, are used for all the $R$ reliable examples, in order to provide the region of influence of each example along the corresponding attribute (i.e. dimension) coordinates $f = 1, \ldots, F$ within the Statistical Distance Metric space. Each membership function $\mu_{f,r}$ has a center at the corresponding feature coordinate and a dispersion computed by evaluating the dispersion of the statistical distance metric along the corresponding feature dimension, according to the formulation of (3). However, this technique has a tendency to produce a large number of MFs. Clearly, for $R$ reliable examples of feature dimensionality $F$ used for the SANFIS construction, the resulting fuzzy system has $F \cdot R$ MFs and $R$ rules. The initially large fuzzy system can in most cases be significantly reduced by agglomerating adjacent Gaussian MFs that belong to the same feature $f$ and have a large degree of overlap into equivalent ones. This aggregation is controlled by computing a measure that evaluates the degree of equivalence among the


corresponding fuzzy sets of the MFs. The algorithms for the reduction of MFs based on the mutual subsethood measure are adapted from [25], Chapter 13.

The measure of mutual subsethood [25] quantifies the degree to which two fuzzy sets are equal, or how similar they are. Suppose $a : X \to [0, 1]$ and $b : X \to [0, 1]$ are the set functions of fuzzy sets $A$ and $B$. The fuzzy sets $A$ and $B$ are equal if $a(\mathbf{x}) = b(\mathbf{x})$ for all $\mathbf{x} \in X$. If $A = B$ then $A \subset B$ and $B \subset A$; here $A \subset B$ if $a(\mathbf{x}) \le b(\mathbf{x})$ for every $\mathbf{x} \in X$. Then a fuzzy measure of equality, or mutual subsethood measure, that quantifies the degree to which fuzzy set $A$ equals fuzzy set $B$ is defined as

$$E(A, B) = Degree(A = B) = Degree(A \subset B \text{ and } B \subset A)$$

To resolve the above definition, the notion of size or cardinality of a fuzzy set $A$, denoted by $c(A)$, needs to be defined. It equals the sum of the fit values of $A$, i.e. $c(A) = \sum_{\mathbf{x} \in X} a(\mathbf{x})$.

The mutual subsethood measure $E(A, B)$ can be computed in terms of a ratio of counts, as in [25]:

$$E(A, B) = \frac{c(A \cap B)}{c(A \cup B)}.$$

An approximate mutual subsethood measure is proposed in [25] for the quantification of the similarity between two Gaussian membership functions. According to [25], for Gaussian MFs with centers $m_1, m_2$ and spreads $\sigma_1, \sigma_2$ this measure can be approximated as:

$$E(A, B) = \frac{c(A \cap B)}{c(A \cup B)} = \frac{c(A \cap B)}{c(A) + c(B) - c(A \cap B)} \qquad (6)$$

where $c(A \cap B)$ is given in [25] by a closed-form expression in terms of $m_1, m_2, \sigma_1, \sigma_2$ and the function $h(x) = \max(0, x)$, obtained by integrating the minimum of the two Gaussians piecewise between their crossover points.

We have applied the neurofuzzy based solutions to many standard data sets from the UCI machine learning repository. The results are illustrated in Table 1. This table compares the PEBLS performance with that obtained with the proposed neurofuzzy solution. The third column displays the average generalization performance of the neurofuzzy system without the Instance Based Learning (IBL) estimation of parameters. The next three columns (4th to 6th) summarize the corresponding average classification performance obtained by applying the neurofuzzy system with the three IBL approaches described, i.e. the one-pass, the used-correct and the increment approaches. The improvements achieved with the IBL learning are evident. It is important to note at this point that the utilization of IBL learning allows the construction of the fuzzy rules to consider only a small percentage (about 5%-20%, depending on the particular application) of the training examples, namely the most important and reliable ones. Therefore, we can obtain a small SANFIS system that generalizes well.

Table 2 presents results for the same data sets as in Table 1, with the application of the mutual subsethood pruning of the SANFIS system. The additional column displays the percentage of reduction of the MFs. This percentage depends on the distribution of the training data points and tends to increase with the size of the training set. We conclude that the merging of MFs results in a simpler SANFIS system of similar overall performance. Clearly, from the results displayed in Tables 1 and 2 we can conclude that the neurofuzzy solutions outperform the simple nearest neighbor schemes.

Database          PEBLS   NeuroFuzzy   NeuroFuzzy+IBL   NeuroFuzzy+IBL   NeuroFuzzy+IBL
                                       One-pass         Used-Correct     Increment
Hypothyroid       97.90   98.14        98.33            98.44            98.37
Iris              94.62   95.40        96.32            96.8             96.12
Hepatitis         76.59   78.33        79.56            84.37            81.34
Breast Cancer     94.23   95.90        96.11            96.22            96.18
Heart Disease     81.90   83.41        82.15            83.87            85.21
Audiology         77.9    78.12        81.26            81.14            79.94
Liver Disorders   63.45   63.20        65.91            72.54            74.58

Table 1: Performance of the proposed NeuroFuzzy + IBL data mining algorithms compared to the plain IBL as implemented with the PEBLS learning system. We can observe that the utilization of IBL within the framework improves the generalization results. However, we cannot easily conclude that a particular IBL learning approach is better.

Database          PEBLS   NeuroFuzzy   NeuroFuzzy+IBL   NeuroFuzzy+IBL   NeuroFuzzy+IBL   Reduction
                                       One-pass         Used-Correct     Increment        of MFs
Hypothyroid       97.90   98.04        98.46            98.52            97.73            42 %
Iris              94.62   95.30        96.32            96.8             96.14            36 %
Hepatitis         76.59   78.43        79.56            85.11            82.14            29 %
Breast Cancer     94.23   95.93        96.11            96.13            96.18            41 %
Heart Disease     81.90   84.52        82.18            84.12            85.10            43 %
Audiology         77.9    79.21        81.15            82.14            80.12            45 %
Liver Disorders   63.45   63.25        64.92            71.89            73.67            47 %

Table 2: Results for the same data sets as in Table 1, with the application of the mutual subsethood pruning of the SANFIS system. The additional column displays the percentage of reduction of the MFs. We conclude that the merging of MFs results in a simpler SANFIS system of similar overall performance.

Below we describe in more detail one application from bioinformatics and one from the data mining of commercial databases. The application from bioinformatics concerns the prediction of promoter sequences [5,10]. This task involves predicting whether or not a given subsequence of a DNA sequence is a promoter, i.e. a sequence that initiates a process called transcription. The data set contains 106 examples, 53 of which are positive examples (promoters) and the rest negative ones. A training pattern consists of a sequence of 57 nucleotides (features) from the alphabet a, c, g, t, with the respective classification (promoter or not promoter). Since the available number of patterns was small, the classification performance was tested with the leave-one-out methodology, i.e. repeated trials have been performed by training on 105 examples and testing on the remaining one. The computed performance was 2/106 (i.e. an average of 2 errors over 106 trials), versus 4/106 for a competitive experiment that used the KBANN neural network model [12].
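A sketch of the leave-one-out protocol, with train and classify standing in for the SANFIS construction and inference routines (illustrative placeholders):

def leave_one_out(patterns, labels, train, classify):
    # Train on n-1 examples, test on the held-out one; count the errors.
    errors = 0
    for i in range(len(patterns)):
        rest_x = patterns[:i] + patterns[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)
        if classify(model, patterns[i]) != labels[i]:
            errors += 1
    return errors    # e.g. 2 errors over the 106 promoter trials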

Another application concerns a different domain: the data mining of commercial databases. A large training set was extracted from a database kept by a mobile telecommunications company. This set contains 62000 records concerning attributes of its customers (e.g. job, area, sex, and pay method) and their classification in terms of their quality as customers. This classification has six classes, ranging from 0 to 5 in order of increasing customer quality. The objective for the learning system was to predict well the class of a new customer from its attributes and therefore to guide the company strategy. For this problem the presented statistical distance with neurofuzzy learning obtained the best classification performance, i.e. around 60% success at the prediction of the customer class. In contrast, a nearest neighbor classification scheme based on the same distance obtains only around 30%, and a Self-Organizing Map (SOM) operating with the traditional distance types and a numerical coding of feature values (it is not easy to adapt the statistical distance to the context of the SOM) obtains a performance around 42%.

6. Conclusions

Neuro-fuzzy network learning algorithms are very effective in domains in which all features have numeric values. In these domains the examples are treated as points, and the distance metrics obey standard definitions. However, the usual domain of data mining applications is the symbolic domain, where the utilization of the traditional distance metrics usually yields poor results.

This work has adapted a Statistical Distance Metric (SDM) for expressing the distance between values of features in symbolic domains, originally proposed for nearest neighbor schemes, to the context of the peculiarities of the Symbolic Adaptive Neuro-Fuzzy Inference System (SANFIS). The potential of this distance metric to regularize the solution of the neural fuzzy networks is the theoretical justification of the improved performance relative to the simple nearest neighbor schemes.

Additional performance improvement has been obtained by exploiting the fact that the examples of the training set are not all of the same importance and reliability. Therefore, a learning mechanism is implemented for the estimation of the reliability and significance of each example. The SANFIS can be designed to exploit effectively the irregularity of the problem's state space through the selection of the proper training examples for the construction of the fuzzy rules with an Instance Based Learning (IBL) scheme. Also, the weight parameters for each rule are determined with the IBL learning pass. These weight parameters account for the significance and reliability of the corresponding example. We have described three different Instance Based Learning (IBL) algorithms for the implementation of this learning step. The fuzzy rules are constructed with an incremental process that creates a new rule only when the currently assembled system is inadequate to explain the new example. The concept of fuzzy adaptive subsethood is used at the third phase to reduce the number of fuzzy sets used as membership functions. Finally, the consequent parameters are estimated with an efficient linear least squares formulation. The performance obtained with this multilevel learning method is significantly better than that of the traditional nearest neighbor schemes on many data mining problems, and the system offers enhanced explanation ability through the learned fuzzy rules.

The neurofuzzy data mining system that the paper has presented has a wide span of applications and has been adapted to heterogeneous data records having both symbolic and numeric attributes. For the symbolic attributes the Statistical Distance Metric has yielded the best performance. However, for the numeric attributes the optimal policy for evaluating the distance is not obvious. In many cases a generalization improvement is obtained by handling the numeric values with a normalized numeric distance, e.g. a Euclidean distance metric normalized with the standard deviation. Furthermore, some attribute domains are better handled with a discretized numeric distance metric augmented with an interpolation that alleviates the discretization effects (e.g. the Interpolated Value Difference Metric [16]).

Future work to further improve the proposed SANFIS and IBL hybrid data mining algorithms can proceed along many different directions. Specifically, more elaborate schemes for treating numerical attributes by finding optimal multisplits [13] can be incorporated within the context of the current work in order to further enhance the performance. Also, an approach for the effective discretization of continuous attributes using a simulated annealing algorithm [11] could improve the results obtained by treating numeric attributes as discretized symbolic ones. Furthermore, a nonlinear optimization of the consequent parameters with an approach like the Levenberg-Marquardt algorithm [18] can (at least theoretically) obtain better performance, at the cost of a more complex (and perhaps less stable) implementation.


    Acknowledgment

The authors wish to thank the Research Committee of the University of Patras for the partial financial support of this research under the contract KARATHEODORIS, 2454.

    References

[1] Simon Haykin, Neural Networks, MacMillan College Publishing Company, Second Edition, 1999.
[2] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer perceptrons, Science 247 (1990), pp. 978-982.
[3] T. Poggio and F. Girosi, Networks for approximation and learning, Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1991.
[4] V. N. Vapnik, Statistical Learning Theory, New York, Wiley, 1998.
[5] Pierre Baldi, Soren Brunak, Bioinformatics, MIT Press, 1998.
[6] Federico Girosi, "An Equivalence Between Sparse Approximation and Support Vector Machines", Neural Computation, 10:6 (1998), pp. 1455-1480.
[7] C. Stanfill, D. Waltz, Toward memory-based reasoning, Communications of the ACM, 29:12 (1986), pp. 1213-1228.
[8] S. Papadimitriou, A. Bezerianos, A. Bountis, Radial Basis Function Networks as Chaotic Generators for Secure Communication Systems, International Journal On Bifurcation and Chaos, Vol. 9, No. 1 (1999), pp. 221-232.
[9] A. Bezerianos, S. Papadimitriou, D. Alexopoulos, Radial Basis Function Neural Networks for the Characterization of Heart Rate Variability Dynamics, Artificial Intelligence in Medicine 15 (1999), pp. 215-234.
[10] Vladimir Brusic and John Zeleznikow, Knowledge discovery and data mining in biological databases, The Knowledge Engineering Review, Vol. 14:3, 1999, pp. 257-277.
[11] Justin C. W. Debuse and Victor J. Rayward-Smith, Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm, Applied Intelligence 11, pp. 285-295, 1999.
[12] G. Towell, J. Shavlik, M. Noordewier, Refinement of approximate domain theories by knowledge-based neural networks, Proceedings Eighth National Conference on Artificial Intelligence, pp. 861-866, Menlo Park, CA: AAAI Press, 1990.
[13] Tapio Elomaa, Juho Rousu, General and Efficient Multisplitting of Numerical Attributes, Machine Learning, 36, pp. 201-244, 1999.
[14] Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel, Thomas Seidl, Indexing the Solution Space: A New Technique for Nearest Neighbor Search in High-Dimensional Space, IEEE Transactions On Knowledge and Data Engineering, Vol. 12, No. 1, January/February 2000.
[15] Peter Bartlett and John Shawe-Taylor, Generalization Performance of Support Vector Machines and Other Pattern Classifiers, in Advances in Kernel Methods, Support Vector Learning, MIT Press, 1999, pp. 43-54.
[16] D. Randall Wilson, Tony R. Martinez, Improved Heterogeneous Distance Functions, Journal of Artificial Intelligence Research 6 (1997), pp. 1-34.
[17] J. Barry Gomm, Ding Li Yu, Selecting Radial Basis Function Network Centers with Recursive Orthogonal Least Squares Training, IEEE Transactions On Neural Networks, Vol. 11, No. 2, March 2000, pp. 306-314.
[18] Christopher M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1996.
[19] Scott Cost and Steven Salzberg, A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features, Machine Learning, Vol. 10, pp. 57-78, 1993.
[20] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, 1996.
[21] J.-S. R. Jang and C.-T. Sun, Neuro-Fuzzy Modeling and Control, Proceedings of the IEEE, vol. 83, pp. 378-406, Mar. 1995.
[22] J.-S. R. Jang, ANFIS: Adaptive-Network-based Fuzzy Inference Systems, IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, pp. 665-685, May 1993.
[23] J.-S. R. Jang, Self-Learning Fuzzy Controllers Based on Temporal Back-Propagation, IEEE Trans. on Neural Networks, vol. 3, pp. 714-723, Sept. 1992.
[24] T. Takagi and M. Sugeno, Fuzzy identification of systems and its application to modeling and control, IEEE Trans. Syst., Man, Cybern., vol. 15, pp. 116-132, Jan. 1985.
[25] Bart Kosko, Fuzzy Engineering, Prentice Hall, 1997.
[26] Hans Hellendoorn, Dimiter Driankov (Eds.), Fuzzy Model Identification, Springer-Verlag, 1997.


Figure 1: The architecture of the Symbolic Adaptive Neuro Fuzzy System (SANFIS).

Figure 2: Illustration of the design of the Gaussian MF spreading along each feature dimension.


Figure 3: The mutual subsethood measure quantifies the amount of overlap of Gaussian MFs. In the left subplot the amount of overlap is small and the corresponding mutual subsethood measure takes small values. The opposite case is illustrated in the right subplot.