Adaptive Neuro-Fuzzy Inference with Statistical Distance Metrics and Hybrid Learning Algorithms for Data Mining of Symbolic Data

S. Papadimitriou 1,2, S. Mavroudi 1, L. Vladutu 1, G. Pavlides 2, A. Bezerianos 1

    1. Department of Medical Physics, School of Medicine, University of Patras,

    26500 Patras, Greece, tel: +30-61-996115,

    email: [email protected]

    2. Dept. of Computer Engineering and Informatics,

University of Patras, 26500 Patras, Greece, tel: +30-61-997804,

    email: [email protected]

Abstract: The application of neuro-fuzzy systems to domains involving prediction and classification of symbolic data requires a reconsideration and a careful definition of the concept of distance between patterns. Traditional distances are inadequate for capturing the proximity between symbolic patterns. This work proposes a new neuro-fuzzy architecture, the Symbolic Adaptive Neuro Fuzzy Inference System (SANFIS), that effectively utilizes a statistically extracted distance measure. The learning approach is a hybrid one and consists of a sequence of steps, some of which are essential while others further optimize performance. Initially, a Statistical Distance Metric space is computed from the information provided by the training set. The premise parameters are subsequently evaluated with a three-phase Instance Based Learning (IBL) scheme that estimates the input membership function centers and spreads and constructs the corresponding fuzzy rules. The first phase of the IBL scheme explores heuristic approaches that can uncover information about the relative importance and reliability of the examples. The second phase exploits this information and extracts an adequate subset of the training patterns for the construction of the fuzzy rules. The concept of fuzzy adaptive subsethood is used at the third phase to reduce the number of fuzzy sets used as input membership functions. The consequent parameters are estimated with an efficient linear least squares formulation. The performance obtained from a SANFIS trained with the hybrid learning methods is significantly better than that of traditional nearest neighbor Instance Based Learning schemes on many data mining problems, and the system offers enhanced explanation ability.

    Keywords: Neuro-fuzzy Learning, Data Mining, Symbolic Data Classification, Radial Basis Functions, Heuristic

    Learning, Instance Based Learning.


    1 Introduction

    The emergence of neuro-fuzzy network technology [1,4] offers valuable insight to confront complicated data

    mining problems. In this context, neuro-fuzzy networks can be viewed as advanced mathematical models for

    discovering complex dependencies between variables of physical processes from a set of perturbed

observations. In designing a neuro-fuzzy network model of a complex physical process, we are in effect building a nonlinear model of the physical process that generates the attribute sets and the corresponding

outcomes [2,3,4]. However, although the application of neuro-fuzzy networks has proven very successful in problem domains such as signal processing and pattern recognition, it has not yielded adequate performance in the domain of data mining of symbolic data. Patterns arising both from commercial databases and from many engineering databases (such as those describing biosequences [10]) involve data defined over a space that lacks the fundamental properties of distance metric spaces. In order to confront these difficulties, the paper presents the Symbolic Adaptive Neuro Fuzzy Inference System (SANFIS) architecture and learning algorithms.

This work first adapts, within the context of the peculiarities of SANFIS, a distance metric for expressing the distance between values of features in symbolic domains. This metric was initially proposed in the context of nearest neighbor schemes as a means of capturing effectively the proximity information of symbolic patterns [7,14,16]. The data mining techniques that the paper presents have a wide span of possible applications, since they are adapted to heterogeneous data records having both symbolic and numeric attributes. For the symbolic attributes, the Statistical Distance Metric adapted from the Modified Value Difference Metric (MVDM) [7,16] has yielded the best performance. However, for the numeric attributes the optimal policy for evaluating the distance is not obvious. In many cases a generalization improvement is obtained by handling the numeric values with a normalized numeric distance, e.g. a Euclidean distance metric normalized with the standard deviation. Nevertheless, numeric attribute domains are frequently better handled with a discretized numeric distance metric augmented with an interpolation that alleviates the discretization effects (e.g. the Interpolated Value Difference Metric [16]).

After the computation of a Statistical Distance Metric space for the symbolic attributes, the main steps of the presented approach for the construction of the SANFIS system for data mining can be summarized as follows:

Initial estimation of input Membership Functions (MFs) and rules: Estimation of the reliability and importance of each training example with Instance Based Learning techniques. The outcome of this learning phase is an ordering of the examples in terms of their significance and reliability. Weight parameters are computed that quantify the importance of the examples. The SANFIS network can be designed to exploit effectively the irregularity of the problem's state space through the selection of the proper training examples as rule centers and the determination of parameters for each such rule center. These parameters account for the significance and reliability of the corresponding example. We describe three different Instance Based Learning (IBL) algorithms for the implementation of this learning step.


where we use Gaussian membership functions, i.e. $\mu_{A_i}(x) = \exp\left( -\left( \frac{x - m_i}{\sigma_i} \right)^2 \right)$, where $m_i$ and $\sigma_i$ are the premise parameters that correspond to the centers and spreads of the Gaussians. We should note here that for numeric features, $x$ is simply the numeric value as it is fed from Layer 1.

However, for symbolic features the statistical distance $SD(x, m_i)$ between the input feature value $x$ and the symbolic value $m_i$ that forms the center of the corresponding MF is computed. The distance $SD(x, m_i)$ and the spread $\sigma_i$ are computed according to the methods described in the next section. For symbolic features, Layer 2 of SANFIS evaluates the following Gaussian membership function:

$$\mu_{A_i}(x) = \exp\left( -\left( \frac{SD(x, m_i)}{\sigma_i} \right)^2 \right).$$

Layer 3: The conjunction layer computes the firing strength of each rule. The algebraic product is used as the conjunction operator. Each node of this layer is a circle node labeled $\Pi$ that forms the product of the incoming signals, i.e. for Figure 1, $w_i = \mu_{A_i}(x_1)\,\mu_{B_i}(x_2)$, $i = 1, 2$.

Layer 4: The normalization layer normalizes the firing strengths of the rules by dividing by the sum of these strengths, i.e. $\bar{w}_i = \frac{w_i}{w_1 + w_2}$, $i = 1, 2$. The outputs of this layer are called normalized firing strengths.

Layer 5: A zeroth-order Takagi-Sugeno type is used for the specification of the outputs of the fuzzy rules [24], i.e. $O_j(\mathbf{x}) = \bar{w}_j(\mathbf{x})\, b_j$.

Layer 6: A summation operation performed by this layer computes the total output of the fuzzy system as the sum of the local rule outputs.

    3. The Statistical Distance Metric (SDM)

The key problem for applications involving symbolic features is the definition of the distance metric. In domains where features are numeric, it is straightforward to compute the distance between two points in the pattern space in terms of a geometric distance (e.g. Euclidean, Manhattan). Indeed, the traditional neurofuzzy learning algorithms have been formulated on these distance metrics and operate effectively in numeric domains with such distances. However, when the features are symbolic (as usually happens in bioinformatics and in commercial data mining applications), the utilization of the traditional types of distances yields inadequate performance. Two common approaches for handling symbolic information are the overlap method and the orthogonal representation [5, 7]. The overlap method simply counts the number of feature values that two


instances have in common. This distance metric oversimplifies the pattern space, ignores all the information present within the training set, and yields (as expected) poor performance. The orthogonal representation encodes the symbolic features as binary vectors in such a way that the numerical distance between different feature values is the same. This method suffers from the same inability to extract any information embedded within the training set.

In order to obtain an effective formulation of the distances between patterns with symbolic feature values, we have adapted distance measures of the type proposed in [7, 16]. The statistical distance measure takes into account the overall similarity of classification of all instances for each possible value of each feature. This method extracts from the training set, with a statistical approach, a matrix that defines the distances between all possible values of a given feature. Therefore, a separate matrix is obtained for each feature. The distance measure for a specific feature $f$ is defined according to the following equation:

$$SD_f(V_A, V_B) = \sum_{c=1}^{N_c} \left| \frac{C_{A,c}}{C_A} - \frac{C_{B,c}}{C_B} \right|^{k} \qquad (1)$$

In the equation above, $V_A$ and $V_B$ denote two possible values of the feature $f$; e.g. for the DNA promoter data they are two nucleotides. The distance between the values is a sum over all the $N_c$ classes. For example, for the DNA promoter data (discussed below) there are two classes: either the sequence is a promoter (i.e. a sequence that initiates a process called transcription) or not. The number of patterns whose feature $f$ has value $V_A$ ($V_B$) and that are classified to class $c$ is denoted by $C_{A,c}$ ($C_{B,c}$). Also, the total number of patterns that have value $V_A$ ($V_B$) for feature $f$ is denoted by $C_A$ ($C_B$), and $k$ is a constant usually set to 1. These counts are computed over all patterns of the training set. It becomes evident from (1) that the distance between feature values labeled with the same relative frequency for all possible classes is zero.

Furthermore, the more correlated the classifications pertaining to two values of a feature are, the smaller their statistical distance computed with equation (1). Therefore, for feature values with similar classifications a small statistical distance will be computed. Equation (1) accounts for the overall similarity of classification of all training instances for each possible class.
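To make the construction concrete, the following is a minimal sketch in Python (our choice; the function name and the data layout, with patterns as sequences of symbolic values and a parallel list of class labels, are illustrative assumptions) of how the per-feature value-difference tables of equation (1) can be estimated from a training set:

from collections import defaultdict

def sdm_tables(patterns, labels, k=1):
    # For each feature f, estimate SD_f(V_A, V_B) of equation (1)
    # between every pair of symbolic values observed for f.
    n_features = len(patterns[0])
    classes = sorted(set(labels))
    tables = []
    for f in range(n_features):
        cac = defaultdict(lambda: defaultdict(int))  # C_{A,c} counts
        ca = defaultdict(int)                        # C_A totals
        for x, y in zip(patterns, labels):
            cac[x[f]][y] += 1
            ca[x[f]] += 1
        table = {}
        for va in ca:
            for vb in ca:
                table[(va, vb)] = sum(
                    abs(cac[va][c] / ca[va] - cac[vb][c] / ca[vb]) ** k
                    for c in classes)
        tables.append(table)
    return tables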

The distance $D(X, Y)$ between two patterns $X, Y$ is obtained by a weighted sum of the distances between the values of the individual features of these patterns (obtained from equation (1)):

$$D(X, Y) = \sum_{i=1}^{F} w_{f_i} \, SD_{f_i}\!\left(V_{X,f_i}, V_{Y,f_i}\right)^{r} \qquad (2)$$

where $F$ is the number of features, $w_f$ is the weight assigned to feature $f$ reflecting its significance, and $r$ is a parameter that controls how distances between individual features scale in the computation of the total pattern distance (usually $r = 1$ or $r = 2$). Also, $V_{X,f_i}$ and $V_{Y,f_i}$ denote the values of the $i$th feature of $X$ and $Y$.
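Given such tables, equation (2) is a direct weighted sum; a sketch, assuming the sdm_tables output above, with uniform feature weights and $r = 2$ as illustrative defaults:

def pattern_distance(x, y, tables, weights=None, r=2):
    # Equation (2): weighted sum of per-feature statistical distances.
    # The feature values of x and y are assumed to occur in the training set.
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * tables[f][(xf, yf)] ** r
               for f, (w, xf, yf) in enumerate(zip(weights, x, y)))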

The SDM is defined for symbolic attributes in a way that resembles the Modified Value Difference Metric (MVDM) used in the PEBLS system [19]. However, for numeric features we have two major choices:

a) to adopt a numerical distance measure along these dimensions, thereby obtaining a distance metric of heterogeneous type [16];

b) to discretize the numerical ranges and extend the functionality of the SDM. A parameter of particular importance in this discretization is the number $s$ of equal-width intervals. Although it is difficult to define general guidelines, the heuristic rule of setting $s$ to the larger of 5 and $C$, where $C$ is the number of output classes of the problem domain, proves effective in practice (see the sketch below).

Experience gained from real data sets indicates that it is difficult to judge beforehand which of these approaches is better.
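A sketch of the equal-width discretization of choice (b), with the $s = \max(5, C)$ heuristic; the interpolation against discretization effects (as in the IVDM [16]) is omitted here:

def discretize(values, n_classes):
    # Equal-width discretization into s = max(5, C) intervals.
    s = max(5, n_classes)
    lo, hi = min(values), max(values)
    width = (hi - lo) / s or 1.0   # guard against a constant attribute
    # map each numeric value to the index of its interval (0 .. s-1)
    return [min(int((v - lo) / width), s - 1) for v in values]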

4. Neurofuzzy inference with the Statistical Distance Metric (SDM)

In contrast to example-based nearest neighbor learning schemes, SANFIS learns a smooth functional that weights the contribution of each exemplar. This provides some intuition for the superior performance of SANFIS networks relative to simple Instance Based Learning (IBL) schemes. Nevertheless, in order for the SANFIS solution to obtain better generalization performance than the nearest neighbor schemes, a careful design of its parameters is necessary. This section discusses how the peculiarities of the Statistical Distance Metric space can be described properly through the tuning of SANFIS parameters. Actually, the extraction of the Statistical Distance Metric space (described in the previous section) and the corresponding tuning of SANFIS parameters (described in the current section) can be viewed as a preprocessing stage that extracts from the representation of the symbolic features useful distance measures for neurofuzzy inference.

A parameter of particular importance is the region of influence of the Gaussian Radial Basis Function MFs, which is determined by their spread parameter. The determination of proper spread parameters for the membership functions becomes more complicated within the domain of statistical distances. In the neural network literature significant work exists on the estimation of spreads for Radial Basis Function (RBF) networks [2,3,6,8,9]. The heuristic suggestion of [1], which computes these spreads for RBF networks as $\sigma = d_{max} / \sqrt{2m}$, where $d_{max}$ is the maximum distance between patterns and $m$ is the number of RBF centers, has not proven very effective in practice in the SDM case. One basic reason for the inefficiency of the above formula is that symbolic features can differ significantly from one another. Therefore, the spread of the MFs corresponding to each feature dimension should be computed by designing the corresponding distance metric independently of the other features. Thus, the design of the SANFIS network proceeds by computing different feature spreading scaling factors for each feature and different weights for the rule centers. The weight parameters adjust the


region of influence of a rule center, which relates to the importance and the reliability of the example; their computation is considered in the next section.

The feature spreading scaling factors adjust the spread of the Gaussian kernels along the dimension that corresponds to the feature. In order to obtain an effective general setting for the computation of the feature spreading factors, a sensible approach is to first obtain an estimate of the average distance $d_{av,f}$ of patterns within the space defined by the SDM, independently for each feature $f$. The parameter $d_{av,f}$ is used as the feature spreading scaling factor, since it is learned from the peculiarities of the particular feature and effectively normalizes the distances along the dimension determined by the feature. Then the region of influence of the MF kernels is designed by requiring that at a particular distance $Spread$ from a MF center, expressed in units of $d_{av,f}$, the influence is attenuated down to $a$. The meaning of these parameters in practice is illustrated in Figure 2. The figure illustrates the envelope of the Gaussian that is used as membership function for a feature dimension $f$ with an average distance parameter $d_{av,f} = 0.5$. The requirement fulfilled by this MF is that the set membership reduces to 10% (i.e. the attenuation parameter is $a = 0.1$) at a distance three times $d_{av,f}$ (i.e. $Spread = 3$).

We should note that with this method for the estimation of the MF spreads, the number of rules is considered only implicitly, since this number determines to a large extent the average distance parameter $d_{av,f}$. Mathematically, the requirements for the rate of decay of the MFs along each feature dimension are expressed by seeking a parameter $\sigma_f$ such that:

$$\exp(-\sigma_f \cdot Spread \cdot d_{av,f}) = a \qquad (3)$$

and therefore the required parameter $\sigma_f$ is derived as:

$$\sigma_f = \frac{-\log(a)}{Spread \cdot d_{av,f}} \qquad (4)$$

Values of these parameters that yield good results are those of the example presented in Figure 2, i.e. $Spread = 3$ and $a = 0.1$. These parameters imply that at a distance from a rule center along the corresponding feature dimension 3 times larger than the average distance between patterns, the influence of the MF for the particular feature $f$ is attenuated by a factor of 0.1. For these parameters we obtain $\sigma_f = -\log(a) / (Spread \cdot d_{av,f}) = 1.5351$, and the Gaussian envelope plotted in Figure 2 corresponds to $\exp(-1.5351\, x)$.
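The computation of equation (4) is direct; a small sketch reproducing the worked example above:

import math

def feature_spreading(d_av_f, spread=3.0, a=0.1):
    # Equation (4): sigma_f such that exp(-sigma_f * Spread * d_av_f) = a.
    return -math.log(a) / (spread * d_av_f)

print(feature_spreading(0.5))   # -> 1.5351 (approximately), as in Figure 2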

The evaluation of the rule activation proceeds by computing the response of the conjunctive premise conditions along each feature dimension independently and by multiplying the individual responses. The MF evaluation corresponding to a distance $x_f$ from the center of the MF $i$ for feature $f$ is $MF_{i,f}(x_f) = \exp(-\sigma_f x_f)$. Note that all the MFs of a feature dimension $f$ have the same spread $\sigma_f$ at all the learning steps except the last, i.e. the MF merging step, which generally modifies the Gaussian MF spreads by merging the appropriate MFs. Finally, the total activation of the rule, $RULE(\mathbf{x})$, is obtained by the multiplicative conjunction rule as $RULE(\mathbf{x}) = \prod_f \exp(-\sigma_f x_f) = \exp\left(-\sum_f \sigma_f x_f\right)$. This scheme is easily augmented to account for feature weighting as $RULE'(\mathbf{x}) = \prod_f w_f \exp(-\sigma_f x_f) = \exp\left(-\sum_f (\sigma_f x_f - w'_f)\right)$, where $w'_f = \log w_f$.

We can elaborate further on the design of the SANFIS neurofuzzy network by exploiting the results of the Instance Based Learning step, described in the next section, about the significance and reliability of each example. Based on these results, for each example $\mathbf{x}_r$, $r = 1, \ldots, R$, used for the construction of the fuzzy system, a weight parameter $w_r$ is determined. This weight parameter is assigned to the corresponding rule $r$, such that the spread $\sigma_{f,r}$ of each MF that corresponds to feature $f$ and belongs to the premise part of the same rule $r$ is further defined as $\sigma_{f,r} = \sigma_f / w_r$. The designed MF centers exert an influence at a distance $x$ from their center given by $\exp(-\sigma_{f,r}\, x)$ for each feature dimension. Clearly, the larger the value of the weight parameter, the smaller the parameter $\sigma_{f,r}$ and therefore the more extended the region of influence of the corresponding Gaussian MF. Consequently, this means that the rule created from example $\mathbf{x}_r$ is a reliable classifier.

Therefore, the above scheme, with the spreads $\sigma_{f,r}$ dependent on both the rule $r$ and the feature $f$, trains the spreads of the MFs locally, accounting for the peculiarities and irregularities of the state space. The Instance Based Learning (IBL) step can estimate the relative importance of each rule, summarized in the parameter $w_r$, and therefore can improve the performance of the designed neurofuzzy solution.
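A sketch tying the two mechanisms together: per-feature spreads $\sigma_f$ from equation (4), scaled per rule by a reliability weight $w_r$, and the multiplicative conjunction of the exponential memberships. The rule representation and the convention that larger $w_r$ marks a more reliable example are our illustrative assumptions:

import math

def rule_activation(x, center, sigma_f, w_r, sd):
    # Product over features of exp(-sigma_{f,r} * distance), with the
    # per-rule spreads sigma_{f,r} = sigma_f / w_r widening the region
    # of influence of rules built from reliable examples.
    act = 1.0
    for f, (xf, mf) in enumerate(zip(x, center)):
        d = sd(f, xf, mf)     # statistical distance along feature f
        act *= math.exp(-(sigma_f[f] / w_r) * d)
    return act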


5. SANFIS Learning

Although the architecture of SANFIS is similar to ANFIS [20,21,22,23], the training algorithms are significantly different. Instead of using gradient descent optimization of the premise and consequent parameters, we use a two-stage learning strategy. The first stage performs structure learning with a multistep Instance Based Learning (IBL) approach. At this stage the reliability and importance of the examples is estimated, as described in Subsection 5.1. The examples are ordered in a list according to their significance. The next step, described in Subsection 5.2, is an incremental fuzzy system construction process that exploits the information gathered about the reliability of the examples and about the peculiarities of the statistical distance space along each feature dimension. Exploiting the similarity of membership functions with a fuzzy mutual subsethood measure reduces their number and can simplify the SANFIS structure, generally without degradation of the generalization performance. This pruning process is described in Subsection 5.3. Finally, in Subsection 5.4 the consequent parameters are learned with a linear least squares formulation that can be computed efficiently with the pseudoinverse solution.

5.1 Instance Based Learning (IBL) for the selection of MF centers and the determination of their parameters

Some examples of the training set are more reliable classifiers than others. It is highly desirable to detect the reliable examples and to exploit them for the rule construction. Also, the more reliable an example is, the larger should be the region of influence of the corresponding MF when the example is used as a rule center. The extent of the region of influence is expressed by the spread parameter of the rule center. Instead of using clustering self-organizing techniques [2] or an approach that exploits the principle of structural risk minimization [15], a heuristically driven learning strategy is adopted for determining the examples that should be used as rules and their widths. The former approaches are well suited to the effective learning of smooth functionals, even when these lie in high dimensional spaces, but the irregularities of the statistical distance metric space extracted from symbolic data and the need to cope with many artefacted examples pose several problems in the application of their mathematical framework.

The proposed neurofuzzy training approach consists of two phases. In the first phase, the Instance Based Learning (IBL) step, the potential of each example to serve as a rule (i.e. how representative and reliable the example is) is evaluated. This step is of heuristic type, and it tries to discover the reliability and the importance of the training examples with an Instance Based Learning (IBL) scheme that resembles the functionality of PEBLS [7]. The IBL learning step offers the potential to detect the noisy examples of the designed classification system at the initial approximate solution, and therefore the possibility of removing them from the second neurofuzzy learning phase. The IBL step is implemented with nearest neighbor based schemes [14, 16]. The classification function of IBL, viewed as an input-output mapping, tends to have many class boundaries and


discontinuous islands of misclassified regions placed near erroneously classified examples. The structure of the decision boundaries is smoothed and most of the artefacted regions are removed, in order to reject the influence of noisy examples on the designed classification system. The examples that do not yield satisfactory performance at the initial IBL learning step are detected and marked, in order to avoid their use in the rule discovery process. This learning step detects the good examples with Instance Based Learning and uses them to construct rules, by placing the MF centers and by tuning the MF spreads according to the structure of the SDM along the corresponding dimension. This can be viewed as a structure identification step.

The second learning phase is the neurofuzzy training phase. The first learning step of this phase, the premise parameter identification, is accomplished with Instance Based Learning techniques. These techniques construct the MFs, place their centers and compute their spreads. All these parameters are computed within the Statistical Distance Metric space. This step can be considered to perform an initial approximation to the structure of the solution space, analogous to the one performed by Instance Based Learning nearest neighbor schemes. For every example $r$ and feature $f$, a membership function of the form $close\_to(r, f)$ is constructed along the dimension of $f$, in order to describe how the outcome depends on the closeness to the example $r$. Thus, the rules have the following form:

if $close\_to(r, 1) \wedge close\_to(r, 2) \wedge \ldots \wedge close\_to(r, F)$ then $class\_of\_example(r)$

where the consequent $class\_of\_example(r)$ is expressed with a scalar constant $b_r$ and $F$ denotes the number of features. Clearly, these rules constitute a zeroth-order Takagi-Sugeno system.

Then, for the second phase, we avoid the formulation of gradient descent based techniques to further tune the premise parameters and to compute the consequent parameters. The approximate structure and the lack of pattern coordinates in the Statistical Distance Metric space significantly perplex the formulation of the gradient. At the second learning step, the consequent parameter learning step, the fuzzy system is designed to construct a smooth solution that fits the training set well.

Three basic approaches have been explored for the implementation of the structure identification learning pass, based on Instance Based Learning techniques. The one-pass approach is an exemplar weighting method that is used in combination with the nearest neighbor parameter $k$, which must attain a value larger than one. The learning is accomplished with only one pass through the training examples. At this pass, for each training instance, its $k$ nearest neighbors are detected among the remaining training set. If $j$ neighbors have a matching class, then a weight is assigned to the current instance according to the simple formula $weight = 1 + k - j$. Therefore, the more the class of the exemplar is reinforced by its neighbors, the smaller the weight (i.e. the more reliable the exemplar). Algorithmically, the one-pass instance based learning algorithm takes the form:

foreach pattern P of the training set do
    Detect the k nearest neighbors of P in the training set according to the Statistical Distance Metric;
    Let j = number of nearest neighbors with the same class label as the actual class of P;
    Set the weight parameter that quantifies the reliability of the exemplar as weight(P) = 1 + k - j;
endfor;
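A runnable sketch of this pass, assuming the pattern_distance helper defined earlier; ties in the neighbor ranking are broken arbitrarily:

def one_pass_weights(patterns, labels, tables, k=6):
    # weight(P) = 1 + k - j, where j of P's k nearest neighbors
    # (by statistical distance) share P's class label.
    weights = []
    for i, (p, c) in enumerate(zip(patterns, labels)):
        neighbors = sorted((j for j in range(len(patterns)) if j != i),
                           key=lambda j: pattern_distance(p, patterns[j], tables))
        j_match = sum(1 for j in neighbors[:k] if labels[j] == c)
        weights.append(1 + k - j_match)  # 1 = fully reinforced, k+1 = isolated
    return weights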

As a particular example, consider the case where the nearest neighborhood parameter is set to six. The exemplar weights will range from one to seven, depending on the number of neighbors that have a matching class. A weight of one means that all the neighbors reinforce the class assignment of the example (i.e. $j = k$), and therefore this example is reliable. Conversely, at the other extreme, a weight of seven implies with high probability that the isolated exemplar is artefacted and therefore should not count too much toward the final solution.

The one-pass technique succeeds in significantly attenuating the effect of artefacted exemplars. Evidently, it also attenuates the exemplars placed near class separation boundaries. Even so, since the strength of the boundary exemplars is symmetrically reduced, the algorithm is not biased towards favoring a particular class.

An alternative IBL heuristic weighting approach for the determination of the influence of the rule centers evaluates the performance history related to each exemplar. The performance history is evaluated by running a large number of classification trials. Usually, all the patterns of the training set are used in those trials. At each such trial a sample pattern $P$ is picked from the training set and its classification $class(P)$ is evaluated as the classification of its nearest neighbor $P_n$, i.e. $class(P) = class(P_n)$. The classification $class(P)$ assigned with the nearest neighbor rule is compared with the actual class of the pattern. Exemplars that have been used successfully to classify their neighbors are assigned a small weight (correspondingly, a large region of influence). Therefore, the spread of the MFs is adjusted by considering the success ratio of the exemplar, as evaluated from the percentage of times that it was used to classify correctly. This percentage evaluates the effectiveness of a rule as a classifier. Denoting by $used(i)$ the number of times the exemplar $i$ (that serves as the rule center) is used to classify (as the nearest neighbor example), and by $correct(i)$ the number of times it is used correctly, the formula for weighting becomes:

$$weight(i) = used(i) / correct(i)$$

We should note that at the evaluation of the nearest neighbor a weighting scheme for the exemplars is not used, i.e. all examples are treated equally in the distance computations.

The algorithm for the used-correct approach is formulated as:

forall patterns p of the training set do
    used[p] = correct[p] = 1;
endfor;

forall patterns p of the training set do
    Cp = class(p); // actual class of pattern p
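A sketch of the complete used/correct pass as described in the prose above (assuming the pattern_distance helper; each trial classifies a pattern by its nearest neighbor and credits that neighbor):

def used_correct_weights(patterns, labels, tables):
    # weight(i) = used(i) / correct(i); reliable exemplars get small weights.
    n = len(patterns)
    used = [1] * n        # initialized to 1, as in the listing above
    correct = [1] * n
    for i in range(n):
        # nearest neighbor of pattern i among the remaining patterns
        nn = min((j for j in range(n) if j != i),
                 key=lambda j: pattern_distance(patterns[i], patterns[j], tables))
        used[nn] += 1
        if labels[nn] == labels[i]:
            correct[nn] += 1
    return [u / c for u, c in zip(used, correct)]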


$$F(\mathbf{x} \mid \theta) = \frac{\displaystyle\sum_{r=1}^{R} b_r \prod_{f=1}^{F} \exp\left( -\frac{1}{2} \left( \frac{SD_f(x_f, m_{f,r})}{\sigma_{f,r}} \right)^{2} \right)}{\displaystyle\sum_{r=1}^{R} \prod_{f=1}^{F} \exp\left( -\frac{1}{2} \left( \frac{SD_f(x_f, m_{f,r})}{\sigma_{f,r}} \right)^{2} \right)} \qquad (5)$$

where:

$\mathbf{x} = [x_1\; x_2\; \cdots\; x_F]$ is the input vector.

$\theta = \{ b_r, m_{f,r}, \sigma_{f,r} \}$, $r = 1, \ldots, R$, $f = 1, \ldots, F$, denotes the parameter set of the SANFIS system.

$R$ is the number of fuzzy rules.

$F$ is the number of input variables, i.e. the number of features.

$m_{f,r}$ is a numeric value assigned to the symbolic feature where the $f$th membership function of the $r$th fuzzy rule achieves its maximum. The actual numeric value of $m_{f,r}$ does not have a particular meaning for the application, because it is usually an integer mapped to the symbolic feature arbitrarily (e.g. consecutive integers can be used for tokenizing features).

$\sigma_{f,r}$ is the width (spread) of the membership function for the $f$th feature and the $r$th fuzzy rule.

$SD_f(x_f, m_{f,r})$ is the statistical distance between the symbolic feature value $x_f$ and the symbolic feature rule center $m_{f,r}$, computed according to equation (1).

$b_r$ is the scalar output of the $r$th Takagi-Sugeno zeroth-order fuzzy rule.
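As a sketch, equation (5) can be evaluated as follows; the dictionary-per-rule representation and the callable sd supplying the per-feature statistical distance are illustrative choices:

import math

def sanfis_output(x, rules, sd):
    # Equation (5): normalized weighted sum of the rule outputs b_r,
    # with Gaussian firing strengths over statistical distances.
    num = den = 0.0
    for rule in rules:        # rule: {'b': b_r, 'm': centers, 'sigma': spreads}
        w = 1.0
        for f, xf in enumerate(x):
            d = sd(f, xf, rule['m'][f])
            w *= math.exp(-0.5 * (d / rule['sigma'][f]) ** 2)
        num += rule['b'] * w
        den += w
    return num / den if den else 0.0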

The algorithmic steps for the fuzzy system construction are as follows:

1. Retrieve the most significant example $(\mathbf{x}, d)$, where $\mathbf{x}$ denotes the attribute vector and $d$ the corresponding class label, and construct the first fuzzy rule with the ExampleToFuzzyRule construction algorithm.
2. while the list of examples is not empty do
3.   Retrieve the next significant example $(\mathbf{x}, d)$.
4.   if the currently constructed fuzzy system cannot explain adequately the mapping $(\mathbf{x}, d)$ then
5.     Construct a new fuzzy rule for $\mathbf{x}$ and expand the fuzzy system.
     endif
   endwhile

The fuzzy system construction proceeds by considering, before expansion, whether the current set of fuzzy rules is adequate to explain the currently considered example $\mathbf{x}$. If


$$\left| F(\mathbf{x} \mid \theta) - y \right| \le \epsilon_F$$

then the fuzzy system $F$ already satisfactorily represents the information in the corresponding example; hence no rule is added to $F$, and the next training example is considered by performing the same type of test. Suppose instead that

$$\left| F(\mathbf{x} \mid \theta) - y \right| > \epsilon_F$$

Then a new fuzzy rule $r_{new}$ is added to represent the information contained in $(\mathbf{x}, y)$. The widths $\sigma_{f,r_{new}}$ of the new rule $r_{new}$ are adapted in order to adjust the spacing between the membership functions, so that the new rule does not distort what has already been learned by the fuzzy system and a smooth interpolation is performed between training points.

This modification of the $\sigma_{f,r_{new}}$, for each feature dimension $f$, is accomplished by determining the nearest neighbor $m_{f,r_{near}}$ of the $f$th MF center $m_{f,r_{new}}$, among all the MF centers of the same feature $f$ of the fuzzy system constructed up to this point, according to the statistical distance metric. The spread $\sigma_{f,r_{new}}$ is then set according to

$$\sigma_{f,r_{new}} = \frac{1}{\lambda} \left| m_{f,r_{new}} - m_{f,r_{near}} \right| \qquad \text{for } f = 1, \ldots, F,$$

where $\lambda$ is a weighting factor that determines the amount of overlap of the membership functions. The weighting factor $\lambda$ and the widths $\sigma_{f,r_{new}}$ have an inverse relationship. A value of $\lambda = 2$ attains good results in practice.
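A sketch of this incremental construction, assuming the sanfis_output function above; ExampleToFuzzyRule is reduced here to appending a rule with default spreads, later narrowed by the nearest-center update, and eps stands for the tolerance of the adequacy test:

def build_rules(examples, sd, sigma_default, eps=0.1, lam=2.0):
    # Add a rule only when the current system cannot explain the
    # example within the tolerance eps (the examples are assumed
    # ordered by decreasing significance).
    rules = []
    for x, d in examples:
        if rules and abs(sanfis_output(x, rules, sd) - d) <= eps:
            continue                       # adequately explained; skip
        new = {'b': d, 'm': list(x), 'sigma': list(sigma_default)}
        for f in range(len(x)):
            # spread from the nearest existing center along feature f
            near = min((sd(f, x[f], r['m'][f]) for r in rules), default=0.0)
            if near > 0.0:
                new['sigma'][f] = near / lam   # sigma = |m_new - m_near| / lambda
        rules.append(new)
    return rules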

5.3 The Mutual Subsethood measure and Membership Function aggregation

Membership functions $\mu_{f,r}$, $r = 1, \ldots, R$, $f = 1, \ldots, F$, are used for all the $R$ reliable examples, in order to provide the region of influence of each example along the corresponding attribute (i.e. dimension) coordinates $f = 1, \ldots, F$ within the Statistical Distance Metric space. Each membership function $\mu_{f,r}$ has a center at the corresponding feature coordinate and a dispersion computed by evaluating the dispersion of the statistical distance metric along the corresponding feature dimension, according to the formulation of (3). However, this technique has a tendency to produce a large number of MFs. Clearly, for $R$ reliable examples of feature dimensionality $F$ used for the SANFIS construction, the resulting fuzzy system has $F \cdot R$ MFs and $R$ rules. The initially large fuzzy system can in most cases be significantly reduced by agglomerating adjacent Gaussian MFs that belong to the same feature $f$ and have a large degree of overlap into equivalent ones. This aggregation is controlled by computing a measure that evaluates the degree of equivalence among the


corresponding fuzzy sets of the MFs. The algorithms for the reduction of MFs based on the mutual subsethood measure are adapted from [25], Chapter 13.

The measure of mutual subsethood [25] quantifies the degree to which two fuzzy sets are equal, or how similar they are. Suppose $a : X \to [0, 1]$ and $b : X \to [0, 1]$ are the set functions of fuzzy sets $A$ and $B$. The fuzzy sets $A$ and $B$ are equal if $a(\mathbf{x}) = b(\mathbf{x})$ for all $\mathbf{x} \in X$. If $A = B$ then $A \subset B$ and $B \subset A$; here $A \subset B$ if $a(\mathbf{x}) \le b(\mathbf{x})$ for every $\mathbf{x} \in X$. Then a fuzzy measure of equality, or mutual subsethood measure, that quantifies the degree to which fuzzy set $A$ equals fuzzy set $B$ is defined as

$$E(A, B) = Degree(A = B) = Degree(A \subset B \text{ and } B \subset A)$$

To resolve the above definition, the notion of size or cardinality of a fuzzy set $A$, denoted by $c(A)$, needs to be defined. It equals the sum of the fit values of $A$, i.e. $c(A) = \sum_{\mathbf{x} \in X} a(\mathbf{x})$.

The mutual subsethood measure $E(A, B)$ can be computed in terms of a ratio of counts, as in [25]:

$$E(A, B) = \frac{c(A \cap B)}{c(A \cup B)}.$$

An approximate mutual subsethood measure is proposed in [25] for the quantification of the similarity between two Gaussian membership functions. According to [25], for Gaussian MFs with centers $m_1, m_2$ and spreads $\sigma_1, \sigma_2$ this measure can be approximated as:

$$E(A, B) = \frac{c(A \cap B)}{c(A \cup B)} = \frac{c(A \cap B)}{c(A) + c(B) - c(A \cap B)} \qquad (6)$$

where $c(A \cap B)$ is given in [25] by a closed-form expression in terms of $m_1, m_2, \sigma_1, \sigma_2$ and the function $h(x) = \max(0, x)$, obtained by integrating the minimum of the two Gaussians piecewise between their crossover points.

We have applied the neurofuzzy based solutions to many standard data sets from the UCI machine learning repository. The results are illustrated in Table 1. This table compares the PEBLS performance with that obtained with the proposed neurofuzzy solution. The third column displays the average generalization performance of the neurofuzzy system without the Instance Based Learning (IBL) estimation of parameters. The next three columns (4th to 6th) summarize the corresponding average classification performance obtained by applying the neurofuzzy system with the three IBL approaches described, i.e. the one-pass, the used-correct and the increment approaches. The improvements achieved with the IBL learning are evident. It is important to note at this point that the utilization of IBL learning allows the construction of the fuzzy rules to consider only a small percentage (about 5%-20%, depending on the particular application) of the training examples, namely the most important and reliable ones. Therefore, we can obtain a small SANFIS system that generalizes well.

Table 2 presents results for the same data sets as in Table 1, with the application of the mutual subsethood pruning of the SANFIS system. The additional column displays the percentage of reduction of the MFs. This percentage depends on the distribution of the training data points and tends to increase with the size of the training set. We conclude that the merging of MFs results in a simpler SANFIS system of similar overall performance. Clearly, from the results displayed in Tables 1 and 2 we can conclude that the neurofuzzy solutions outperform the simple nearest neighbor schemes.

Database          PEBLS   NeuroFuzzy   NeuroFuzzy+IBL   NeuroFuzzy+IBL   NeuroFuzzy+IBL
                                       One-pass         Used-Correct     Increment
Hypothyroid       97.90   98.14        98.33            98.44            98.37
Iris              94.62   95.40        96.32            96.8             96.12
Hepatitis         76.59   78.33        79.56            84.37            81.34
Breast Cancer     94.23   95.90        96.11            96.22            96.18
Heart Disease     81.90   83.41        82.15            83.87            85.21
Audiology         77.9    78.12        81.26            81.14            79.94
Liver Disorders   63.45   63.20        65.91            72.54            74.58

Table 1: Performance of the proposed NeuroFuzzy + IBL data mining algorithms compared to the plain IBL as implemented with the PEBLS learning system. We can observe that the utilization of IBL within the framework improves the generalization results. However, we cannot easily conclude that a particular IBL learning approach is better.

Database          PEBLS   NeuroFuzzy   NeuroFuzzy+IBL   NeuroFuzzy+IBL   NeuroFuzzy+IBL   Reduction
                                       One-pass         Used-Correct     Increment        of MFs
Hypothyroid       97.90   98.04        98.46            98.52            97.73            42 %
Iris              94.62   95.30        96.32            96.8             96.14            36 %
Hepatitis         76.59   78.43        79.56            85.11            82.14            29 %
Breast Cancer     94.23   95.93        96.11            96.13            96.18            41 %
Heart Disease     81.90   84.52        82.18            84.12            85.10            43 %
Audiology         77.9    79.21        81.15            82.14            80.12            45 %
Liver Disorders   63.45   63.25        64.92            71.89            73.67            47 %

Table 2: Results for the same data sets as in Table 1, with the application of the mutual subsethood pruning of the SANFIS system. The additional column displays the percentage of reduction of the MFs. We conclude that the merging of MFs results in a simpler SANFIS system of similar overall performance.

Below we describe in more detail one application from bioinformatics and one from the data mining of commercial databases. The application from bioinformatics concerns the prediction of promoter sequences [5,10]. This task involves predicting whether or not a given subsequence of a DNA sequence is a promoter, i.e. a sequence that initiates a process called transcription. The data set contains 106 examples, 53 of which are positive examples (promoters) and the rest negative ones. A training pattern consists of a sequence of 57 nucleotides (features) from the alphabet a, c, g, t, with the respective classification (promoter or not promoter). Since the available number of patterns was small, the classification performance was tested with the leave-one-out methodology, i.e. repeated trials have been performed by training on 105 examples and testing on the remaining one. The computed performance was 2/106 (i.e. an average of 2 errors over 106 trials), versus 4/106 for a competitive experiment that used the KBANN neural network model [12].
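A sketch of the leave-one-out protocol, with train and classify standing in for the SANFIS construction and inference routines (illustrative placeholders):

def leave_one_out(patterns, labels, train, classify):
    # Train on n-1 examples, test on the held-out one; count the errors.
    errors = 0
    for i in range(len(patterns)):
        rest_x = patterns[:i] + patterns[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)
        if classify(model, patterns[i]) != labels[i]:
            errors += 1
    return errors    # e.g. 2 errors over the 106 promoter trials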

Another application concerns a different domain: the data mining of commercial databases. A large training set was extracted from a database kept by a mobile telecommunications company. This set contains 62000 records concerning attributes of its customers (e.g. job, area, sex, and pay method) and their classification in terms of their quality as customers. This classification has six classes, ranging from 0 to 5 in order of increasing customer quality. The objective for the learning system was to predict well the class of a new customer from its attributes and therefore to guide the company strategy. For this problem the presented statistical distance with neurofuzzy learning obtained the best classification performance, i.e. around 60% success at the prediction of the customer class. In contrast, a nearest neighbor classification scheme based on the same distance obtains only around 30%, and a Self-Organizing Map (SOM) operating with the traditional distance types and a numerical coding of feature values (it is not easy to adapt the statistical distance to the context of the SOM) obtains a performance around 42%.

6. Conclusions

Neuro-fuzzy network learning algorithms are very effective in domains in which all features have numeric values. In these domains the examples are treated as points, and the distance metrics obey standard definitions. However, the usual domain of data mining applications is the symbolic domain, where the utilization of the traditional distance metrics usually yields poor results.

This work has adapted a Statistical Distance Metric (SDM) for expressing the distance between values of features in symbolic domains, originally proposed for nearest neighbor schemes, to the context of the peculiarities of the Symbolic Adaptive Neuro-Fuzzy Inference System (SANFIS). The potential of this distance metric to regularize the solution of the neural fuzzy networks is the theoretical justification of the improved performance relative to the simple nearest neighbor schemes.

Additional performance improvement has been obtained by exploiting the fact that the examples of the training set are not all of the same importance and reliability. Therefore, a learning mechanism is implemented for the estimation of the reliability and significance of each example. The SANFIS can be designed to exploit effectively the irregularity of the problem's state space through the selection of the proper training examples for the construction of the fuzzy rules with an Instance Based Learning (IBL) scheme. Also, the weight parameters for each rule are determined with the IBL learning pass. These weight parameters account for the significance and reliability of the corresponding example. We have described three different Instance Based Learning (IBL) algorithms for the implementation of this learning step. The fuzzy rules are constructed with an incremental process that creates a new rule only when the currently assembled system is inadequate to explain the new example. The concept of fuzzy adaptive subsethood is used at the third phase to reduce the number of fuzzy sets used as membership functions. Finally, the consequent parameters are estimated with an efficient linear least squares formulation. The performance obtained with this multilevel learning method is significantly better than that of the traditional nearest neighbor schemes on many data mining problems, and the system offers enhanced explanation ability through the learned fuzzy rules.

The neurofuzzy data mining system that the paper has presented has a wide span of applications and has been adapted to heterogeneous data records having both symbolic and numeric attributes. For the symbolic attributes the Statistical Distance Metric has yielded the best performance. However, for the numeric attributes the optimal policy for evaluating the distance is not obvious. In many cases a generalization improvement is obtained by handling the numeric values with a normalized numeric distance, e.g. a Euclidean distance metric normalized with the standard deviation. Furthermore, some attribute domains are better handled with a discretized numeric distance metric augmented with an interpolation that alleviates the discretization effects (e.g. the Interpolated Value Difference Metric [16]).

Future work to further improve the proposed SANFIS and IBL hybrid data mining algorithms can proceed along many different directions. Specifically, more elaborate schemes for treating numerical attributes by finding optimal multisplits [13] can be incorporated within the context of the current work in order to further enhance the performance. Also, an approach for the effective discretization of continuous attributes using a simulated annealing algorithm [11] could improve the results obtained by treating numeric attributes as discretized symbolic ones. Furthermore, a nonlinear optimization of the consequent parameters with an approach like the Levenberg-Marquardt algorithm [18] can (at least theoretically) obtain better performance, at the cost of a more complex (and perhaps less stable) implementation.


    Acknowledgment

The authors wish to thank the Research Committee of the University of Patras for the partial financial support of this research under the contract KARATHEODORIS, 2454.

    References

[1] Simon Haykin, Neural Networks, MacMillan College Publishing Company, Second Edition, 1999.
[2] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer perceptrons, Science 247 (1990), pp. 978-982.
[3] T. Poggio and F. Girosi, Networks for approximation and learning, Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1991.
[4] V. N. Vapnik, Statistical Learning Theory, New York, Wiley, 1998.
[5] Pierre Baldi, Soren Brunak, Bioinformatics, MIT Press, 1998.
[6] Federico Girosi, "An Equivalence Between Sparse Approximation and Support Vector Machines", Neural Computation, 10:6 (1998), pp. 1455-1480.
[7] C. Stanfill, D. Waltz, Toward memory-based reasoning, Communications of the ACM, 29:12 (1986), pp. 1213-1228.
[8] S. Papadimitriou, A. Bezerianos, A. Bountis, Radial Basis Function Networks as Chaotic Generators for Secure Communication Systems, International Journal On Bifurcation and Chaos, Vol. 9, No. 1 (1999), pp. 221-232.
[9] A. Bezerianos, S. Papadimitriou, D. Alexopoulos, Radial Basis Function Neural Networks for the Characterization of Heart Rate Variability Dynamics, Artificial Intelligence in Medicine 15 (1999), pp. 215-234.
[10] Vladimir Brusic and John Zeleznikow, Knowledge discovery and data mining in biological databases, The Knowledge Engineering Review, Vol. 14:3, 1999, pp. 257-277.
[11] Justin C. W. Debuse and Victor J. Rayward-Smith, Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm, Applied Intelligence 11, pp. 285-295, 1999.
[12] G. Towell, J. Shavlik, M. Noordewier, Refinement of approximate domain theories by knowledge-based neural networks, Proceedings Eighth National Conference on Artificial Intelligence, pp. 861-866, Menlo Park, CA: AAAI Press, 1990.
[13] Tapio Elomaa, Juho Rousu, General and Efficient Multisplitting of Numerical Attributes, Machine Learning, 36, pp. 201-244, 1999.
[14] Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel, Thomas Seidl, Indexing the Solution Space: A New Technique for Nearest Neighbor Search in High-Dimensional Space, IEEE Transactions On Knowledge and Data Engineering, Vol. 12, No. 1, January/February 2000.
[15] Peter Bartlett and John Shawe-Taylor, Generalization Performance of Support Vector Machines and Other Pattern Classifiers, in Advances in Kernel Methods, Support Vector Learning, MIT Press, 1999, pp. 43-54.
[16] D. Randall Wilson, Tony R. Martinez, Improved Heterogeneous Distance Functions, Journal of Artificial Intelligence Research 6 (1997), pp. 1-34.
[17] J. Barry Gomm, Ding Li Yu, Selecting Radial Basis Function Network Centers with Recursive Orthogonal Least Squares Training, IEEE Transactions On Neural Networks, Vol. 11, No. 2, March 2000, pp. 306-314.
[18] Christopher M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1996.
[19] Scott Cost and Steven Salzberg, A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features, Machine Learning, Vol. 10, pp. 57-78, 1993.
[20] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, 1996.
[21] J.-S. R. Jang and C.-T. Sun, Neuro-Fuzzy Modeling and Control, Proceedings of the IEEE, vol. 83, pp. 378-406, Mar. 1995.
[22] J.-S. R. Jang, ANFIS: Adaptive-Network-based Fuzzy Inference Systems, IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, pp. 665-685, May 1993.
[23] J.-S. R. Jang, Self-Learning Fuzzy Controllers Based on Temporal Back-Propagation, IEEE Trans. on Neural Networks, vol. 3, pp. 714-723, Sept. 1992.
[24] T. Takagi and M. Sugeno, Fuzzy identification of systems and its application to modeling and control, IEEE Trans. Syst., Man, Cybern., vol. 15, pp. 116-132, Jan. 1985.
[25] Bart Kosko, Fuzzy Engineering, Prentice Hall, 1997.
[26] Hans Hellendoorn, Dimiter Driankov (Eds.), Fuzzy Model Identification, Springer-Verlag, 1997.


Figure 1: The architecture of the Symbolic Adaptive Neuro Fuzzy System (SANFIS).

Figure 2: Illustration of the design of the Gaussian MF spreading along each feature dimension.


Figure 3: The mutual subsethood measure quantifies the amount of overlap of Gaussian MFs. In the left subplot the amount of overlap is small and the corresponding mutual subsethood measure takes small values. The opposite case is illustrated in the right subplot.