An evidence-theoretic k-NN rule with parameter optimization

L. M. Zouhal*† and T. Denœux*

* Université de Technologie de Compiègne, U.M.R. CNRS Heudiasyc, Compiègne cedex, France
email: Thierry.Denoeux@hds.utc.fr
† Centre International des Techniques Informatiques, Lyonnaise des Eaux

Technical Report Heudiasyc. To appear in IEEE Transactions on Systems, Man and Cybernetics.
Abstract
This paper presents a learning procedure for optimizing the parameters in the evidence-theoretic k-nearest neighbor rule, a pattern classification method based on the Dempster-Shafer theory of belief functions. In this approach, each neighbor of a pattern to be classified is considered as an item of evidence supporting certain hypotheses concerning the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset of the set of classes. Such masses are obtained for each of the k nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination. In many situations, this method was found experimentally to yield lower error rates than other methods using the same information. However, the problem of tuning the parameters of the classification rule was so far unresolved. In this paper, we propose to determine optimal or near-optimal parameter values from the data by minimizing an error function. This refinement of the original method is shown experimentally to result in a substantial improvement of classification accuracy.
1 Introduction
In the classical approach to statistical pattern recognition, the entities to be classified are assumed to be selected by some form of random experiment. The feature vector describing each entity is then a random vector with a well-defined, though unknown, probability density function depending on the pattern category. Based on these densities and on the prior probability of each class, posterior probabilities can be defined, and the optimal Bayes decision rule can then theoretically be used for classifying an arbitrary pattern with minimal expected risk. Since the class-conditional densities and prior probabilities are usually unknown, they need to be estimated from the data. Many methods have been proposed for building consistent estimators of the posterior probabilities under various assumptions. However, for finite sample size, the resulting estimates generally do not provide a faithful representation of the fundamental uncertainty pertaining to the class of a pattern to be classified. For example, if only a relatively small number of training vectors is available, and a new pattern is encountered that is very dissimilar from all previous patterns, the uncertainty is quite high, and this situation of near-ignorance is not reflected by the outputs of a conventional parametric or nonparametric statistical classifier, whose principle fundamentally relies on asymptotic assumptions. This problem is particularly acute in situations in which decisions need to be made based on weak information, as commonly encountered in system diagnosis applications, for example.
As an attempt to provide an answer to the above problem, it was recently suggested to re-formulate the pattern classification problem by considering the following question: given a training set of finite size containing feature vectors with known (or partly known) classification, and a suitable distance measure, how can the uncertainty pertaining to the class of a new pattern be characterized? In a recent paper [2], an answer to this question was proposed based on the Dempster-Shafer theory of evidence [12]. The approach consists in considering each neighbor of a pattern to be classified as an item of evidence supporting certain hypotheses concerning the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset of the set of classes. Such masses are obtained for each of the k nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination. Given a finite set of actions and losses associated to each action and each class, decisions can then be made by using some generalization of the Bayes decision theory.
In many situations, this method was found experimentally to yield lower error rates than other methods based on the same information. However, the problem of optimizing the parameters involved in the classification rule was so far unresolved. In this paper, we propose to determine optimal or near-optimal parameter values from the data by minimizing a certain error function. This refinement of the original method is shown experimentally to result in a substantial improvement of classification accuracy.
The paper is organized as follows. The evidence-theoretic k-NN rule is first recalled in Section 2. The basic concepts of the Dempster-Shafer theory are assumed to be known to the reader, who is invited to refer to [12] and [14] for detailed presentations, and to [2] for a short introduction. Section 3 describes the learning procedure, as well as an approximation to it allowing the error function to be near-optimized very efficiently. Simulation results are then presented in Section 4, and Section 5 concludes the paper.
2 The evidence-theoretic k-NN rule
We consider the problem of classifying entities into M categories or classes. The set of classes is denoted by Ω = {ω_1, …, ω_M}. The available information is assumed to consist in a training set T = {(x^(1), ω^(1)), …, (x^(N), ω^(N))} of N n-dimensional patterns x^(i), i = 1, …, N, and their corresponding class labels¹ ω^(i), i = 1, …, N, taking values in Ω. The similarity between patterns is assumed to be correctly measured by a certain distance function d(·, ·).
Let x be a new vector to be classified on the basis of the information contained in T. Each pair (x^(i), ω^(i)) constitutes a distinct item of evidence regarding the class membership of x. If x is "close" to x^(i) according to the relevant metric d, then one will be inclined to believe that both vectors belong to the same class. On the contrary, if d(x, x^(i)) is very large, then the consideration of x^(i) will leave us in a situation of almost complete ignorance concerning the class of x. Consequently, this item of evidence may be postulated to induce a basic belief assignment (BBA) m(·|x^(i)) over Ω defined by:
    m({ω_q} | x^(i)) = α φ_q(d^(i))                                   (1)
    m(Ω | x^(i)) = 1 − α φ_q(d^(i))                                   (2)
    m(A | x^(i)) = 0   ∀A ∈ 2^Ω \ {Ω, {ω_q}}                          (3)
where d^(i) = d(x, x^(i)), ω_q is the class of x^(i) (ω^(i) = ω_q), α is a parameter such that 0 < α < 1, and φ_q is a decreasing function verifying φ_q(0) = 1 and lim_{d→∞} φ_q(d) = 0. Note that m(·|x^(i)) reduces to the vacuous belief function (m(Ω|x^(i)) = 1) when the distance between x and x^(i) tends to infinity, reflecting a state of total ignorance. When d denotes the Euclidean distance, a rational choice for φ_q was shown in [2] to be:
    φ_q(d) = exp(−γ_q d²)                                             (4)
γ_q being a positive parameter associated to class ω_q. As a result of considering each training pattern in turn, we obtain N BBAs that can be combined using Dempster's rule of combination to form a resulting BBA m synthesizing one's final belief regarding the class of x:

    m = m(·|x^(1)) ⊕ ⋯ ⊕ m(·|x^(N))                                   (5)
¹In this paper, we assume for simplicity the class of each training vector to be known with certainty. The more general situation in which the training set is only imperfectly labeled has been introduced in [2]. However, the problem of optimizing the parameters in the general case is not completely solved yet (see Section 5).
Since those training patterns situated far from x actually provide very little information, it is sufficient to consider the k nearest neighbors of x in this sum. An alternative definition of m is therefore:

    m = m(·|x^(i_1)) ⊕ ⋯ ⊕ m(·|x^(i_k))                               (6)

where I_k = {i_1, …, i_k} contains the indices of the k nearest neighbors of x in T. Adopting this latter definition, m can be shown [2] to have the following expression:
    m({ω_q}) = (1/K) (1 − ∏_{i∈I_{k,q}} (1 − α φ_q(d^(i)))) ∏_{r≠q} ∏_{i∈I_{k,r}} (1 − α φ_r(d^(i)))   ∀q ∈ {1, …, M}   (7)

    m(Ω) = (1/K) ∏_{r=1}^{M} ∏_{i∈I_{k,r}} (1 − α φ_r(d^(i)))         (8)
where I_{k,q} is the subset of I_k corresponding to those neighbors of x belonging to class ω_q, and K is a normalizing factor. Hence, the focal elements of m are singletons and the whole frame Ω. Consequently, the credibility and the plausibility of each class ω_q are respectively equal to:
    bel({ω_q}) = m({ω_q})                                             (9)
    pl({ω_q}) = m({ω_q}) + m(Ω)                                       (10)
The pignistic probability distribution as defined by Smets [14] is given by:

    BetP({ω_q}) = Σ_{A⊆Ω, ω_q∈A} m(A)/|A| = m({ω_q}) + m(Ω)/M         (11)
for q = 1, …, M. Let us now assume that, based on this evidential corpus, a decision has to be made regarding the assignment of x to a class, and let us denote by α_q the action of assigning x to class ω_q. Let us further assume that the loss incurred in case of a wrong classification is equal to 1, while the loss corresponding to a correct classification is equal to 0. Then, the lower and the upper expected losses [3] associated to action α_q are respectively equal to:
    R_*(α_q | x) = 1 − pl({ω_q})                                      (12)
    R^*(α_q | x) = 1 − bel({ω_q})                                     (13)
The expected loss relative to the pignistic distribution is:

    R_bet(α_q | x) = 1 − BetP({ω_q})                                  (14)
Given the particular form of m, the three strategies consisting in minimizing R_*, R^* and R_bet lead to the same decision in that case: the pattern is assigned to the class with maximum belief assignment. Other decision strategies, including the possibility of pattern rejection as well as the existence of unknown classes, are studied in [3, 4].
3 Parameter optimization
3.1 The approach
In the above description of the evidence-theoretic k-NN rule, we left open the question of the choice of the parameters α and γ = (γ_1, …, γ_M)^t appearing in Equations (1) and (4). Whereas the value of α proves in practice not to be too critical, the tuning of the other parameters was found experimentally to have a significant influence on classification accuracy. In [2], it was proposed to set α = 0.95 and γ_q to the inverse of the mean distance between training patterns belonging to class ω_q. Although this heuristic yields good results on average, the efficiency of the classification procedure can be improved if these parameters are determined as the values optimizing a performance criterion. Such a criterion can be defined as follows.
Let us consider a training pattern x^(ℓ) belonging to class ω_q. The class membership of x^(ℓ) can be encoded as a vector t^(ℓ) = (t_1^(ℓ), …, t_M^(ℓ))^t of M binary indicator variables t_j^(ℓ) defined by t_j^(ℓ) = 1 if j = q and t_j^(ℓ) = 0 otherwise. By considering the k nearest neighbors of x^(ℓ) in the training set, one obtains a "leave-one-out" BBA m^(ℓ) characterizing one's belief concerning the class of x^(ℓ) if this pattern was to be classified using the other training patterns. Based on m^(ℓ), an output vector P^(ℓ) = (BetP^(ℓ)({ω_1}), …, BetP^(ℓ)({ω_M}))^t of pignistic probabilities can be computed, BetP^(ℓ) being the pignistic probability distribution associated to m^(ℓ). Ideally, vector P^(ℓ) should be as "close" as possible to vector t^(ℓ), closeness being defined, for example, according to the squared error E(x^(ℓ)):

    E(x^(ℓ)) = (P^(ℓ) − t^(ℓ))^t (P^(ℓ) − t^(ℓ)) = Σ_{q=1}^{M} (P_q^(ℓ) − t_q^(ℓ))²   (15)
The mean squared error over the whole training set T of size N is finally equal to:

    E = (1/N) Σ_{ℓ=1}^{N} E(x^(ℓ))                                    (16)
Function E can be used as a cost function for tuning the parameter vector γ. The analytical expression of the gradient of E(x^(ℓ)) with respect to γ can be calculated, allowing the parameters γ_q to be determined iteratively by a gradient search procedure (see Appendix A). Alternatively, the minimum of function E can be approximated in one step for large N using the approach described in the sequel.
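For illustration, the iterative procedure can be sketched in pure Python with derivatives estimated by central finite differences instead of the analytic gradient of Appendix A, together with the reparametrization γ_q = η_q² used there to keep γ_q nonnegative. The data layout, step sizes and iteration counts below are our own arbitrary choices.

```python
import math

def loo_error(train, alpha, gamma, k):
    """Leave-one-out mean squared error of the pignistic outputs (sketch)."""
    classes = sorted(gamma)
    M = len(classes)
    total = 0.0
    for l, (xl, cl) in enumerate(train):
        others = [p for j, p in enumerate(train) if j != l]
        neigh = sorted((sum((a - b) ** 2 for a, b in zip(xl, xi)), c)
                       for xi, c in others)[:k]
        prod = {c: 1.0 for c in classes}
        for d2, c in neigh:
            prod[c] *= 1.0 - alpha * math.exp(-gamma[c] * d2)
        m_omega_u = 1.0
        for c in classes:
            m_omega_u *= prod[c]
        K = sum((1.0 - prod[c]) * m_omega_u / prod[c] for c in classes) + m_omega_u
        for c in classes:
            p = (1.0 - prod[c]) * m_omega_u / (prod[c] * K) + m_omega_u / (K * M)
            total += (p - (1.0 if c == cl else 0.0)) ** 2
    return total / len(train)

def tune_gamma(train, alpha, k, steps=40, lr=0.05, h=1e-5):
    """Gradient descent on the leave-one-out error with gamma_q = eta_q ** 2."""
    classes = sorted({c for _, c in train})
    eta = {c: 1.0 for c in classes}
    best_e, best_g = float('inf'), None
    for _ in range(steps):
        g = {c: eta[c] ** 2 for c in classes}
        e = loo_error(train, alpha, g, k)
        if e < best_e:
            best_e, best_g = e, g
        for c in classes:               # central finite-difference gradient
            gp, gm = dict(g), dict(g)
            gp[c] = (eta[c] + h) ** 2
            gm[c] = (eta[c] - h) ** 2
            de = (loo_error(train, alpha, gp, k)
                  - loo_error(train, alpha, gm, k)) / (2 * h)
            eta[c] -= lr * de
    return best_g
```

The finite-difference gradient is only a stand-in for the exact expressions; with the analytic gradient, each iteration costs one pass over the training set instead of 2M + 1.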
3.2 One-step procedure
For an arbitrary training pattern x^(ℓ) and fixed parameters, vector P^(ℓ) can be regarded as a function of two vectors:

1. a vector d² = (d(i_1)², …, d(i_k)²)^t of squared distances between x^(ℓ) and its k nearest neighbors, and

2. a vector containing the class labels of these neighbors.
For small k and large N, d² can be assumed to be close to zero², allowing each component P_q^(ℓ) to be approximated by a Taylor series expansion around 0 up to the first order:

    P_q^(ℓ)(d²) ≈ P_q^(ℓ)(0) + ∇_{d²} P_q^(ℓ)(0)^t d²                 (17)
The first term in this expression can be readily obtained from Equations (7) and (8) by setting d^(i) to 0 for all i, which leads to:

    P_q^(ℓ)(0) = ((1 − α)^k / K(0)) ((1 − α)^{−k_q} − 1 + 1/M)        (18)

where k_q is the number of neighbors of x^(ℓ) in class ω_q and

    K(0) = (1 − α)^k (Σ_{q=1}^{M} (1 − α)^{−k_q} − M + 1)             (19)
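Since the zeroth-order term depends only on the class counts k_q among the k neighbors, it is easy to check numerically that it defines a proper probability distribution over the classes. A small sketch of our own:

```python
def betp_zero(counts, alpha):
    """Pignistic probabilities when all neighbor distances are set to zero.

    counts[q] is the number k_q of neighbors in class q (sum(counts) = k).
    """
    M = len(counts)
    k = sum(counts)
    u = 1.0 - alpha
    k0 = u ** k * (sum(u ** (-kq) for kq in counts) - M + 1)   # K(0)
    return [u ** k / k0 * (u ** (-kq) - 1.0 + 1.0 / M) for kq in counts]
```

By construction the values sum to one, and classes holding more of the k neighbors receive a higher probability.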
The computation of the first-order term,

    ∇_{d²} P_q^(ℓ)(0)^t d² = Σ_{i=1}^{k} (∂P_q^(ℓ)/∂d(i)²)(0) d(i)²   (20)

is more involved (see Appendix B). This term can be shown to be of the form (A^(ℓ) γ)_q, where A^(ℓ) is a square matrix of size M. As a consequence, both E(x^(ℓ)) and E can be approximated by quadratic forms of γ, which allows the minimum to be approached directly by solving a system of linear equations.
Figures 1 and 2 show the quality of this approximation in the case of two Gaussian classes with distinct mean vectors μ_1 and μ_2 and covariance matrices Σ_1 = Σ_2 = I. Displayed are the mean squared error as a function of (γ_1, γ_2) (Figure 1) and its quadratic approximation (Figure 2). The minima of the two functions are very close, which proves the relevance of the approximation in that case. Note that the quality of the approximation depends on both k and N.
4 Numerical experiments
The performances of the above methods were compared to those of the voting k-NN rule with randomly resolved ties, the distance-weighted k-NN rule [6], the fuzzy k-NN rule proposed by Keller [9], and the evidence-theoretic rule without parameter optimization [2]. Experiments were carried out on a set of standard artificial and
²This assumption is justified by the following result [1]: regarding the training set as a sample drawn from some probability distribution, the k-th nearest neighbor of x^(ℓ) converges to x^(ℓ) with probability one as the sample size increases with k fixed.
real-world benchmark classification tasks. The main characteristics of the data sets used are summarized in Table 1.
Data sets B1 and B2 were generated using a method proposed in [7]. The data consists in three Gaussian classes in 10 dimensions, with diagonal covariance matrices D_1, D_2 and D_3. The i-th diagonal element D_{qi} of D_q is defined as a function of two parameters a and b:

    D_{1i}(a, b) = a + (b − a) (i − 1)/(n − 1)
    D_{2i}(a, b) = a + (b − a) (n − i)/(n − 1)
    D_{3i}(a, b) = a + (b − a) min(i, n − i)/(n/2)

where n = 10 is the input dimension. The three classes had distinct mean vectors μ_1, μ_2 and μ_3 and covariance matrices Σ_1 = D_1(a, b), Σ_2 = D_2(a, b) and Σ_3 = D_3(a, b), for fixed values of a and b.
The ionosphere data set (Ion) was collected by a radar system consisting of a phased array of 16 high-frequency antennas with a total transmitted power of the order of 6.4 kilowatts [11]. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere; "bad" returns are those that do not.
The vehicle data set (Veh) was collected from silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS. Four model vehicles were used for the experiment: a bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. The data was used to distinguish 3D objects within a 2D silhouette of the objects [11].
The sonar data (Son) were used by Gorman and Sejnowski in a study of the classification of sonar signals using a neural network [8]. The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.
Test error rates are represented as a function of k in Figures 3 to 12. The results for synthetic data are averages over several independent training sets. Test error rates obtained for the values of k yielding the best classification of training vectors are presented in Table 2.
As can be seen from these results, the evidence-theoretic rule with optimized γ presented in this paper always performed as well as or better than the four other rules tested, and improved significantly, at the 95% confidence level, over the evidence-theoretic rule with fixed γ on data sets B1 and B2 when considering the best results obtained over the range of k values tested. However, the most distinctive feature of this rule seems to be its robustness with respect to the number k of neighbors taken into consideration (Figures 3 to 12). By optimizing γ, the method learns to discard those neighbors whose distance to the pattern under consideration is too high. Practically, this property is of great importance since it relieves the designer of the system from the burden of searching for the optimal value of k. When the number of training patterns is large, the amount of computation may be further reduced by adopting the approximate one-step procedure for optimizing γ, which gives reasonably good results for small k (Figures 3, 5, 7, 9 and 11). However, the use of the exact procedure should be preferred for small and medium-sized training sets.
5 Concluding remarks
A technique for optimizing the parameters in the evidence-theoretic k-NN rule has been presented. The classification rule obtained by this method has proved superior to the voting, distance-weighted and fuzzy rules on a number of benchmark problems. A remarkable property achieved with this approach is the relative insensitivity of the results to the choice of k.
The method can be generalized in several ways. First of all, one can assume a more general metric than the Euclidean one considered so far, and apply the principles described in this paper to search for the optimal metric [15]. For instance, let Γ_q be a positive definite diagonal matrix with diagonal elements γ_{q,1}, …, γ_{q,n}. The squared distance between an input vector x and a learning vector x^(i) belonging to class ω_q can be defined as:

    d²(x, x^(i)) = (x − x^(i))^t Γ_q (x − x^(i)) = Σ_{j=1}^{n} γ_{q,j} (x_j − x_j^(i))²   (21)

The parameters γ_{q,j} for 1 ≤ q ≤ M and 1 ≤ j ≤ n can then be optimized using exactly the same approach as described in this paper, which may in some cases result in a further improvement of classification results. A more general form could even be assumed for Γ_q, with however the risk of a dramatic increase in the number of parameters for large input dimensions.
More fundamentally, the method can also be extended to handle the more general situation in which the class membership of training patterns is itself affected by uncertainty. For example, let us assume that the class of each training pattern x^(i) is only known to lie in a subset A^(i) of Ω (such a situation may typically arise, e.g., in medical diagnosis problems in which some records in a database are related to patients for which only a partial or uncertain diagnosis is available). A natural extension of Equations (1)-(3) is then:

    m(A^(i) | x^(i)) = α φ(d^(i))
    m(Ω | x^(i)) = 1 − α φ(d^(i))
    m(B | x^(i)) = 0   ∀B ∈ 2^Ω \ {Ω, A^(i)}

with φ(d) = exp(−γ d²), γ being a positive parameter (note that we cannot define a separate parameter for each class in this case, since the class of x^(i) is only partially known). The BBAs defined in that way correspond to simple belief functions and can be combined in linear time with respect to the number of classes. For optimizing γ, the error criterion defined in Equation (15) has to be generalized in some way. With the same notations as in Section 3.1, a possible expression for the error concerning pattern x^(ℓ) is:

    E(x^(ℓ)) = (1 − BetP^(ℓ)(A^(ℓ)))²                                 (22)

which reflects the fact that the pignistic probability of x^(ℓ) belonging to A^(ℓ), given the other training patterns, should be as high as possible. The value of γ minimizing the mean error may then be determined using an iterative search procedure. Experiments with this approach are currently under way and will be reported in future publications.
A Computation of the derivatives of E w.r.t. γ_q
Let x^(ℓ) be a training pattern and m^(ℓ) the BBA obtained by classifying x^(ℓ) using its k nearest neighbors in the training set. Function m^(ℓ) is computed according to Equations (7) and (8):

    m^(ℓ)({ω_q}) = (1/K) (1 − ∏_{i∈I^(ℓ)_{k,q}} (1 − α φ_q(d^(ℓ,i)))) ∏_{r≠q} ∏_{i∈I^(ℓ)_{k,r}} (1 − α φ_r(d^(ℓ,i)))   ∀q ∈ {1, …, M}   (23)

    m^(ℓ)(Ω) = (1/K) ∏_{r=1}^{M} ∏_{i∈I^(ℓ)_{k,r}} (1 − α φ_r(d^(ℓ,i)))   (24)

where I^(ℓ)_{k,r} denotes the set of indices of the k nearest neighbors of pattern x^(ℓ) belonging to class ω_r, d^(ℓ,i) is the distance between x^(ℓ) and x^(i), and K is a normalizing factor. In the following, we shall assume that:

    φ_q(d^(ℓ,i)) = exp(−γ_q (d^(ℓ,i))²)                               (25)
which will simply be denoted by φ_q^(ℓ,i). To simplify the calculations, we further introduce the unnormalized orthogonal sum [14] m̄^(ℓ), defined as m̄^(ℓ)(A) = K m^(ℓ)(A) for all A ⊆ Ω. We also denote by m̄^(ℓ,i) the unnormalized orthogonal sum of the m(·|x^(j)) for all j ∈ I^(ℓ)_k, j ≠ i; that is, m̄^(ℓ) = m̄^(ℓ,i) ⊙ m(·|x^(i)), where ⊙ denotes the unnormalized orthogonal sum operator. More precisely, we have:

    m̄^(ℓ)({ω_q}) = m({ω_q}|x^(i)) (m̄^(ℓ,i)({ω_q}) + m̄^(ℓ,i)(Ω)) + m(Ω|x^(i)) m̄^(ℓ,i)({ω_q})   ∀q ∈ {1, …, M}   (26)

    m̄^(ℓ)(Ω) = m̄^(ℓ,i)(Ω) m(Ω|x^(i))                                 (27)
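This pairwise recursion can be checked numerically against the closed-form product expressions of Section 2. The following self-contained sketch (ours) represents a BBA whose focal sets are the singletons and Ω as a pair (list of singleton masses, mass on Ω) and combines neighbors one at a time:

```python
def neighbor_bba(q, s, n_classes):
    """One neighbor of class q: mass s on its singleton, 1 - s on Omega."""
    singles = [0.0] * n_classes
    singles[q] = s
    return (singles, 1.0 - s)

def ucombine(m1, m2):
    """Unnormalized orthogonal sum of two singleton-plus-Omega BBAs;
    mass sent to the empty set is simply dropped."""
    (s1, o1), (s2, o2) = m1, m2
    singles = [s1[q] * (s2[q] + o2) + o1 * s2[q] for q in range(len(s1))]
    return (singles, o1 * o2)
```

Combining three neighbors with (class, s) = (0, 0.7), (0, 0.4), (1, 0.5) reproduces the product form: mass (1 - 0.3 * 0.6) * 0.5 = 0.41 on class 0, (1 - 0.5) * 0.18 = 0.09 on class 1, and 0.18 * 0.5 = 0.09 on Ω.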
The error for pattern x^(ℓ) is:

    E(x^(ℓ)) = Σ_{q=1}^{M} (P_q^(ℓ) − t_q^(ℓ))²                       (28)

where t^(ℓ) is the class membership vector for pattern x^(ℓ) and P_q^(ℓ) is the pignistic probability of class ω_q computed from m^(ℓ) as P_q^(ℓ) = m^(ℓ)({ω_q}) + m^(ℓ)(Ω)/M.
The derivative of E(x^(ℓ)) with respect to each parameter γ_q can be computed as:

    ∂E(x^(ℓ))/∂γ_q = Σ_{i∈I^(ℓ)_{k,q}} (∂E(x^(ℓ))/∂φ_q^(ℓ,i)) (∂φ_q^(ℓ,i)/∂γ_q)   (29)

with

    ∂E(x^(ℓ))/∂φ_q^(ℓ,i) = Σ_{r=1}^{M} (∂E(x^(ℓ))/∂P_r^(ℓ)) (∂P_r^(ℓ)/∂φ_q^(ℓ,i))   (30)
                         = Σ_{r=1}^{M} 2 (P_r^(ℓ) − t_r^(ℓ)) (∂m^(ℓ)({ω_r})/∂φ_q^(ℓ,i) + (1/M) ∂m^(ℓ)(Ω)/∂φ_q^(ℓ,i))   (31)

and

    ∂φ_q^(ℓ,i)/∂γ_q = −(d^(ℓ,i))² φ_q^(ℓ,i)                           (32)
The derivatives in Equation (31) can be computed as:

    ∂m^(ℓ)({ω_r})/∂φ_q^(ℓ,i) = ∂/∂φ_q^(ℓ,i) (m̄^(ℓ)({ω_r})/K)
                             = (1/K²) (K ∂m̄^(ℓ)({ω_r})/∂φ_q^(ℓ,i) − m̄^(ℓ)({ω_r}) ∂K/∂φ_q^(ℓ,i))   (33)

    ∂m^(ℓ)(Ω)/∂φ_q^(ℓ,i) = (1/K²) (K ∂m̄^(ℓ)(Ω)/∂φ_q^(ℓ,i) − m̄^(ℓ)(Ω) ∂K/∂φ_q^(ℓ,i))   (34)

    ∂K/∂φ_q^(ℓ,i) = Σ_{r=1}^{M} ∂m̄^(ℓ)({ω_r})/∂φ_q^(ℓ,i) + ∂m̄^(ℓ)(Ω)/∂φ_q^(ℓ,i)   (35)

Finally,

    ∂m̄^(ℓ)({ω_r})/∂φ_q^(ℓ,i) = α (m̄^(ℓ,i)({ω_r}) + m̄^(ℓ,i)(Ω)) δ_rq − α m̄^(ℓ,i)({ω_r})   (36)

where δ is the Kronecker symbol, and

    ∂m̄^(ℓ)(Ω)/∂φ_q^(ℓ,i) = −α m̄^(ℓ,i)(Ω)                             (37)
which completes the calculation of the gradient of E(x^(ℓ)) w.r.t. γ_q. To account for the constraint γ_q ≥ 0, we introduce new parameters η_q (q = 1, …, M) such that:

    γ_q = η_q²                                                        (38)

and we compute ∂E(x^(ℓ))/∂η_q as:

    ∂E(x^(ℓ))/∂η_q = (∂E(x^(ℓ))/∂γ_q) (∂γ_q/∂η_q) = 2 η_q ∂E(x^(ℓ))/∂γ_q   (39)
B Linearization
We consider the expansion around d² = 0 of P_q^(ℓ) by a Taylor series up to the first order:

    P_q^(ℓ)(d²) ≈ P_q^(ℓ)(0) + ∇_{d²} P_q^(ℓ)(0)^t d²                 (40)

where P_q^(ℓ)(0) is given by Equations (18) and (19). In the following, we shall compute the first-order term in the above equation, and deduce from that result a method for determining an approximation to the optimal parameter vector. To simplify the notation, the superscript (ℓ) will be omitted from the following calculations.
As a result of the definition of the pignistic probability, we have:

    ∂P_q/∂d(i)² = ∂m({ω_q})/∂d(i)² + (1/M) ∂m(Ω)/∂d(i)²               (41)

The derivatives of m({ω_q}) and m(Ω) can be more conveniently expressed as functions of the unnormalized BBA m̄:

    ∂m({ω_q})/∂d(i)² = (1/K²) (K ∂m̄({ω_q})/∂d(i)² − m̄({ω_q}) ∂K/∂d(i)²)   (42)

    ∂m(Ω)/∂d(i)² = (1/K²) (K ∂m̄(Ω)/∂d(i)² − m̄(Ω) ∂K/∂d(i)²)          (43)
To compute ∂m̄({ω_q})/∂d(i)² and ∂m̄(Ω)/∂d(i)², we need to distinguish two cases.

Case 1: i ∈ I_{k,q}. We then have:

    ∂m̄({ω_q})/∂d(i)² = −α γ_q exp(−γ_q d(i)²) ∏_{j∈I_{k,q}, j≠i} (1 − α exp(−γ_q d(j)²)) ∏_{r≠q} ∏_{j∈I_{k,r}} (1 − α exp(−γ_r d(j)²))   (44)

    ∂m̄(Ω)/∂d(i)² = α γ_q exp(−γ_q d(i)²) ∏_{r=1}^{M} ∏_{j∈I_{k,r}, j≠i} (1 − α exp(−γ_r d(j)²))   (45)

Setting all distances to 0 in the above equations, we have:

    ∂m̄({ω_q})/∂d(i)² |_{d²=0} = −α γ_q (1 − α)^{k−1}                  (46)

    ∂m̄(Ω)/∂d(i)² |_{d²=0} = α γ_q (1 − α)^{k−1}                       (47)
Case 2: i ∈ I_{k,l}, l ≠ q. We have:

    ∂m̄({ω_q})/∂d(i)² = α γ_l exp(−γ_l d(i)²) (1 − ∏_{j∈I_{k,q}} (1 − α exp(−γ_q d(j)²))) ∏_{r≠q} ∏_{j∈I_{k,r}, j≠i} (1 − α exp(−γ_r d(j)²))   (48)

    ∂m̄(Ω)/∂d(i)² = α γ_l exp(−γ_l d(i)²) ∏_{r=1}^{M} ∏_{j∈I_{k,r}, j≠i} (1 − α exp(−γ_r d(j)²))   (49)

Setting the distances to zero in the above equations:

    ∂m̄({ω_q})/∂d(i)² |_{d²=0} = α γ_l (1 − (1 − α)^{k_q}) (1 − α)^{k−k_q−1}   (50)

    ∂m̄(Ω)/∂d(i)² |_{d²=0} = α γ_l (1 − α)^{k−1}                       (51)
where k_q = |I_{k,q}|. The derivatives of K are simply obtained as follows:

    ∂K/∂d(i)² = Σ_{q=1}^{M} ∂m̄({ω_q})/∂d(i)² + ∂m̄(Ω)/∂d(i)²          (52)

Hence, for i ∈ I_{k,q}:

    ∂K/∂d(i)² |_{d²=0} = α γ_q Σ_{r=1, r≠q}^{M} (1 − (1 − α)^{k_r}) (1 − α)^{k−k_r−1}   (53)
It follows from the preceding calculations that, for i ∈ I_{k,r}, the derivatives of m̄({ω_q}), m̄(Ω) and K for d² = 0 are proportional to γ_r. Since m̄({ω_q}), m̄(Ω) and K do not themselves depend on γ for d² = 0, the derivative of P_q is also proportional to γ_r. Hence, we have:

    Σ_{i∈I_{k,r}} (∂P_q/∂d(i)²)|_{d²=0} d(i)² = A_{q,r} γ_r           (54)

for all r ∈ {1, …, M}, A_{q,r} being some constant not depending on γ. Consequently, we can write:

    P_q ≈ P_q(0) + Σ_{r=1}^{M} A_{q,r} γ_r                            (55)

and, expressing this result in matrix form:

    P ≈ P(0) + A γ                                                    (56)
with A = (A_{i,j}) a square matrix of size M. The above calculations have been performed for an arbitrary training pattern x. Reintroducing the pattern index (ℓ), we have:

    P^(ℓ) ≈ P^(ℓ)(0) + A^(ℓ) γ                                        (57)
Introducing these terms into the mean squared error, we have:

    E = (1/N) Σ_{ℓ=1}^{N} (P^(ℓ) − t^(ℓ))^t (P^(ℓ) − t^(ℓ))
      ≈ (1/N) Σ_{ℓ=1}^{N} (P^(ℓ)(0) − t^(ℓ) + A^(ℓ) γ)^t (P^(ℓ)(0) − t^(ℓ) + A^(ℓ) γ)   (58)
      = (1/N) Σ_{ℓ=1}^{N} [(P^(ℓ)(0) − t^(ℓ))^t (P^(ℓ)(0) − t^(ℓ)) + 2 γ^t (A^(ℓ))^t (P^(ℓ)(0) − t^(ℓ)) + γ^t (A^(ℓ))^t A^(ℓ) γ]

The gradient of E with respect to γ is therefore given by:

    ∇_γ E = (2/N) (Σ_{ℓ=1}^{N} (A^(ℓ))^t (P^(ℓ)(0) − t^(ℓ)) + Σ_{ℓ=1}^{N} (A^(ℓ))^t A^(ℓ) γ)   (59)
Minimizing E under the constraint γ ≥ 0 is a nonnegative least squares problem that may be solved efficiently using, for instance, the algorithm described in [10].
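As a sketch of this last step, a simple projected-gradient iteration can stand in for the finite-step active-set algorithm of Lawson and Hanson; here B and c correspond to stacking the matrices A^(ℓ) and the vectors t^(ℓ) − P^(ℓ)(0), and the implementation choices are our own.

```python
def nnls_pg(b_mat, c, steps=5000):
    """Minimize ||B g - c||^2 subject to g >= 0 by projected gradient descent."""
    m, n = len(b_mat), len(b_mat[0])
    g = [0.0] * n
    # conservative step: trace(B^T B) bounds the largest eigenvalue from above
    step = 0.5 / sum(b_mat[i][j] ** 2 for i in range(m) for j in range(n))
    for _ in range(steps):
        r = [sum(b_mat[i][j] * g[j] for j in range(n)) - c[i] for i in range(m)]
        grad = [2.0 * sum(b_mat[i][j] * r[i] for i in range(m))
                for j in range(n)]
        g = [max(0.0, gj - step * dj) for gj, dj in zip(g, grad)]
    return g
```

Unlike the active-set method, this iteration only converges in the limit, but it keeps the sketch self-contained and the constraint set is handled by a simple clipping projection.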
References
[1] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1):21-27, 1967.

[2] T. Denœux. A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man and Cybernetics, 25(5):804-813, 1995.

[3] T. Denœux. Analysis of evidence-theoretic decision rules for pattern classification. Pattern Recognition, 30(7):1095-1107, 1997.

[4] T. Denœux. Application du modèle des croyances transférables en reconnaissance de formes. Traitement du Signal (in press).

[5] T. Denœux and G. Govaert. Combined supervised and unsupervised learning for system diagnosis using Dempster-Shafer theory. In P. Borne et al., editor, CESA IMACS Multiconference, Symposium on Control, Optimization and Supervision, Lille, July 1996.

[6] S. A. Dudani. The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, SMC-6(4):325-327, 1976.

[7] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.

[8] R. P. Gorman and T. J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1:75-89, 1988.

[9] J. M. Keller, M. R. Gray, and J. A. Givens. A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man and Cybernetics, SMC-15(4):580-585, 1985.

[10] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. Prentice-Hall, 1974.

[11] P. M. Murphy and D. W. Aha. UCI Repository of machine learning databases. Machine-readable data repository, University of California, Department of Information and Computer Science, Irvine, CA.

[12] G. Shafer. A mathematical theory of evidence. Princeton University Press, Princeton, N.J., 1976.

[13] P. Smets. The combination of evidence in the Transferable Belief Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5):447-458, 1990.

[14] P. Smets and R. Kennes. The Transferable Belief Model. Artificial Intelligence, 66:191-234, 1994.
[15] L. M. Zouhal. Contribution à l'application de la théorie des fonctions de croyance en reconnaissance des formes. PhD thesis, Université de Technologie de Compiègne.
Biography
Lalla Meriem Zouhal
Lalla Meriem Zouhal received an M.S. degree in electronics from the Faculté des Sciences Hassan II, Casablanca, Morocco, a DEA (Diplôme d'Etudes Approfondies) in system control from the Université de Technologie de Compiègne, France, and a PhD from the same institution. Her research interests concern pattern classification, Dempster-Shafer theory and fuzzy logic.
Thierry Denœux
Thierry Denœux graduated as an engineer from the Ecole Nationale des Ponts et Chaussées in Paris, and earned a PhD from the same institution. He obtained the "Habilitation à diriger des Recherches" from the Institut National Polytechnique de Lorraine. He was then employed by the Lyonnaise des Eaux water company, where he was in charge of research projects concerning the application of neural networks to forecasting and diagnosis. Dr. Denœux joined the Université de Technologie de Compiègne as an assistant professor. His research interests include artificial neural networks, statistical pattern recognition, uncertainty modeling and data fusion.
List of Tables
1. Main characteristics of data sets: number of classes (M), training set size (N), test set size (Nt) and input dimension (n)
2. Test error rates obtained with the voting, distance-weighted, fuzzy and evidence-theoretic classification rules for the best value of k (in brackets), with 95% confidence intervals. ETF: evidence-theoretic classifier with fixed γ; ETO: evidence-theoretic classifier with optimized γ
List of Figures
1. Contour lines of the error function for different values of γ, using the gradient-descent method
2. Contour lines of the error function for different values of γ, using the linearization method for pignistic probability vectors
3. Test error rates on data set B1 as a function of k, for the ETF and ETO k-NN rules (gradient and linearization methods)
4. Test error rates on data set B1 as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules
5. Test error rates on data set B2 as a function of k, for the ETF and ETO k-NN rules (gradient and linearization methods)
6. Test error rates on data set B2 as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules
7. Test error rates on ionosphere data as a function of k, for the ETF and ETO k-NN rules (gradient and linearization methods)
8. Test error rates on ionosphere data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules
9. Test error rates on vehicle data as a function of k, for the ETF and ETO k-NN rules (gradient and linearization methods)
10. Test error rates on vehicle data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules
11. Test error rates on sonar data as a function of k, for the ETF and ETO k-NN rules (gradient and linearization methods)
12. Test error rates on sonar data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules
Table 1: Main characteristics of data sets: number of classes (M), training set size (N), test set size (Nt) and input dimension (n).

data set   M    N    Nt   n
B1         3    -    -    10
B2         3    -    -    10
Ion        2    -    -    34
Veh        4    -    -    18
Son        2    104  104  60
Table 2: Test error rates obtained with the voting, distance-weighted, fuzzy and evidence-theoretic classification rules for the best value of k (in brackets), with 95% confidence intervals. ETF: evidence-theoretic classifier with fixed γ; ETO: evidence-theoretic classifier with optimized γ.

data set   voting   ETF   ETO   weighted   fuzzy
B1         -        -     -     -          -
B2         -        -     -     -          -
Ion        -        -     -     -          -
Veh        -        -     -     -          -
Son        -        -     -     -          -
Figure 1: Contour lines of the error function for different values of γ (axes γ_1 and γ_2 from 0.2 to 2; contour levels from 0.01859 to 0.018809), using the gradient-descent method.
Figure 2: Contour lines of the error function for different values of γ (axes γ_1 and γ_2 from 0.2 to 2; contour levels from 0.018585 to 0.018676), using the linearization method for pignistic probability vectors.
Figure 3: Test error rates on data set B1 as a function of k (k from 0 to 15; error rate from 0.32 to 0.46), for the ETF and ETO k-NN rules (gradient and linearization methods).
Figure 4: Test error rates on data set B1 as a function of k (k from 0 to 40; error rate from 0.3 to 0.6), for the voting, ETO, fuzzy and distance-weighted k-NN rules.
Figure 5: Test error rates on data set B2 as a function of k (k from 0 to 15; error rate from 0.24 to 0.36), for the ETF and ETO k-NN rules (gradient and linearization methods).
Figure 6: Test error rates on data set B2 as a function of k (k from 0 to 40; error rate from 0.2 to 0.4), for the voting, ETO, fuzzy and distance-weighted k-NN rules.
Figure 7: Test error rates on ionosphere data as a function of k (k from 0 to 15; error rate from 0.06 to 0.26), for the ETF and ETO k-NN rules (gradient and linearization methods).
Figure 8: Test error rates on ionosphere data as a function of k (k from 0 to 40; error rate from 0.05 to 0.4), for the voting, ETO, fuzzy and distance-weighted k-NN rules.
Figure 9: Test error rates on vehicle data as a function of k (k from 0 to 15; error rate from 0.32 to 0.4), for the ETF and ETO k-NN rules (gradient and linearization methods).
Figure 10: Test error rates on vehicle data as a function of k (k from 0 to 40; error rate from 0.3 to 0.44), for the voting, ETO, fuzzy and distance-weighted k-NN rules.
Figure 11: Test error rates on sonar data as a function of k (k from 0 to 15; error rate from 0.1 to 0.4), for the ETF and ETO k-NN rules (gradient and linearization methods).
Figure 12: Test error rates on sonar data as a function of k (k from 0 to 40; error rate from 0.1 to 0.45), for the voting, ETO, fuzzy and distance-weighted k-NN rules.