[IEEE 2009 IEEE/ACS International Conference on Computer Systems and Applications - Rabat, Morocco...

Possibilistic Ordination-Based Analysis of an Imperfect Database

Anas DAHABIAH, John PUENTES, and Basel SOLAIMAN TELECOM Bretagne, Département Image et Traitement de l’information, Brest, France

INSERM, U650, Laboratoire de traitement de l’Information Médicale, Brest, France {anas.dahabiah, john.puentes, basel.solaiman}@telecom-bretagne.eu

Abstract— An approach that aims to reveal and to explain the pattern of information potentially present in a dataset consisting of n objects by ordering them using the possibility-based Robinsonian similarity matrix is proposed. The similarity is estimated between objects containing imperfect and heterogeneously-assigned data. A graph-based model is proposed to visualize these patterns. This method is applied to a medical database. Without any a priori medical knowledge and without knowing the key attributes of the pathologies, the objects have been ranked according to their corresponding classes.

Keywords- ordination; similarity; possibility theory; Robinsonian matrix; ultrametric matrix; dendrogram; partition.

I. INTRODUCTION

Any system whose goal is to analyze or organize a set of data or knowledge automatically must use a similarity operator to evaluate the resemblance or the relations that exist in the processed information. Reordering the rows and, congruently, the columns of the similarity matrix to obtain a Robinsonian form can be very useful in data analysis, since it reveals the structure of the data either in its own right or as a preliminary step to further analysis. In this paper, we aim to ordinate a medical database concerning the upper gastrointestinal tract (esophagus, stomach, and duodenum) using the ordination methods presented in section II. However, the limits of the traditional similarity measures discussed in section III lead us to propose another approach, based on the possibility theory explained in section IV, to estimate the similarity on which the ordination methods rely (section V). The tested database is described in section VI. The results presented in section VII show that the objects belonging to a given type of pathology lie next to each other. This could be very useful for retrieving and diagnosing similar objects and for extracting additional knowledge from the database. These points are discussed, along with some perspectives, in section VIII.

II. ORDINATION

Ordination, sometimes called seriation or sequencing, is a frequent issue in data analysis and mining. It aims to arrange all the objects of a set in a linear order that reveals underlying structural information, so that simple geometrical and relational structures between objects can be visualized to explain the pattern of information potentially present in a numerically given dissimilarity matrix [1]. Specifically, to order a set of $n$ objects $S = \{O_1, O_2, \ldots, O_n\}$, one typically starts with an $n \times n$ symmetric dissimilarity matrix $\Delta \equiv \{\delta_{ij}\}$ (where $\delta_{ij}$, for $1 \le i, j \le n$, represents the dissimilarity between objects $O_i$ and $O_j$, and $\delta_{ii} = 0$ for all $i$) and a permutation function $\Psi$ (a function which reorders the objects in $\Delta$ by simultaneously permuting the rows and the columns of the dissimilarity matrix). The goal of the ordination is to find a permutation function $\Psi^*$ which optimizes the value of a given loss function $L$ (or a merit function $M$) as follows:

$$\Psi^* = \arg\min_{\Psi} L(\Psi(\Delta)) \tag{1}$$

$$\Psi^* = \arg\max_{\Psi} M(\Psi(\Delta)) \tag{2}$$

This yields a Robinsonian dissimilarity matrix $\Delta' = \Psi^*(\Delta)$ in which the small values of dissimilarity (large values of similarity) are concentrated as closely as possible around the main diagonal, whereas the large values of dissimilarity (small values of similarity) fall as far as possible from it. In other words, the matrix $\Delta'$ respects the following two gradient properties [2]:

$\delta_{ik} \le \delta_{ij}$ for $1 \le i < k < j \le n$ (for the rows)

$\delta_{kj} \le \delta_{ij}$ for $1 \le i < k < j \le n$ (for the columns)
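The two gradient properties can be checked mechanically on a small matrix. The following Python sketch is only an illustration: the helper names `is_robinsonian` and `permute`, and the four-object example built from positions on a line, are hypothetical and do not come from the tested database.

```python
def is_robinsonian(delta):
    """Return True if delta satisfies both gradient properties for all i < k < j."""
    n = len(delta)
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(k + 1, n):
                # Row property: delta[i][k] <= delta[i][j]
                # Column property: delta[k][j] <= delta[i][j]
                if delta[i][k] > delta[i][j] or delta[k][j] > delta[i][j]:
                    return False
    return True


def permute(delta, order):
    """Apply the same permutation to the rows and the columns of delta."""
    return [[delta[a][b] for b in order] for a in order]


# Hypothetical example: four objects located at positions 0, 1, 3, 6 on a line,
# with the absolute position difference as dissimilarity.
positions = [0, 1, 3, 6]
delta = [[abs(p - q) for q in positions] for p in positions]

print(is_robinsonian(delta))                          # objects in line order
print(is_robinsonian(permute(delta, [2, 0, 3, 1])))   # objects shuffled
```

In line order the small dissimilarities sit next to the diagonal and the check succeeds; after shuffling, the same values no longer grow away from the diagonal and the check fails.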

Many loss and merit functions have been proposed in the ordination literature [1] such as:

1- The loss function proposed by Chen [3] to quantify the divergence of a matrix from the Robinsonian form given by:

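$$L(\Delta) = \sum_{i<k<j} f(\delta_{ik}, \delta_{ij}) + \sum_{i<k<j} f(\delta_{kj}, \delta_{ij}) \tag{3}$$

where

$$f(z, y) = I(z, y) = \begin{cases} 1 & \text{if } z > y \\ 0 & \text{otherwise} \end{cases} \tag{4}$$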

978-1-4244-3806-8/09/$25.00 © 2009 IEEE 199

Here $f(\cdot,\cdot)$ is a function that defines how a violation of the two aforementioned gradient properties is counted for an object triple ($O_i$, $O_k$, and $O_j$), and $I(\cdot)$ is an indicator function returning 1 for a violation.

2- The loss function proposed by Caraux [4] to quantify the deviations between the dissimilarity in Δ and the rank differences of the objects defined by:

$$L(\Delta) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\delta_{ij} - |i - j|\right)^2 \tag{5}$$

where $|i - j|$ is the rank difference or gap between $O_i$ and $O_j$.

3- The Hamiltonian path length loss function [4], [11] given in equation 6 to optimize the ordination with respect to dissimilarities between neighboring objects that constitute the vertices in a weighted graph:

$$L(\Delta) = \sum_{i=1}^{n-1} \delta_{i,i+1} \tag{6}$$

4- The merit function proposed by Hubert [5] to quantify the divergence of a matrix from the Robinsonian form:

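$$M(\Delta) = \sum_{i<k<j} f(\delta_{ik}, \delta_{ij}) + \sum_{i<k<j} f(\delta_{kj}, \delta_{ij}) \tag{7}$$

where $f(\cdot,\cdot)$ is a function that defines how a violation or a satisfaction of the gradient properties is counted for an object triple ($O_i$, $O_k$, and $O_j$):

$$f(z, y) = \operatorname{sign}(y - z) = \begin{cases} +1 & \text{if } z < y \\ 0 & \text{if } z = y \\ -1 & \text{if } z > y \end{cases} \tag{8}$$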

5- The merit function proposed by Caraux [4] to represent the moment of inertia of dissimilarity values around the diagonal as:

$$M(\Delta) = \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{ij}\, |i - j|^2 \tag{9}$$

6- The merit function proposed by McCormick [6] called “the measure of effectiveness”:

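$$M(\Delta) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{ij} \left[\delta_{i,j+1} + \delta_{i,j-1} + \delta_{i+1,j} + \delta_{i-1,j}\right] \tag{10}$$

where entries with an index outside $1, \ldots, n$ are taken to be zero.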
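To make the six criteria concrete, the following Python sketch evaluates each of them on a small matrix and on a shuffled copy of it. It is an illustration under stated assumptions, not the implementation used in this work: all function names are hypothetical, the violation count treats a triple as violating a gradient property when the entry closer to the diagonal is strictly larger, and the four-object example is invented.

```python
def sign(v):
    """Return +1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def triples(n):
    """Yield all index triples i < k < j."""
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(k + 1, n):
                yield i, k, j


def violation_count(d):
    """Loss in the spirit of (3): number of violated gradient conditions."""
    return sum((d[i][k] > d[i][j]) + (d[k][j] > d[i][j]) for i, k, j in triples(len(d)))


def least_squares(d):
    """Loss (5): squared deviation between dissimilarities and rank gaps."""
    n = len(d)
    return sum((d[i][j] - abs(i - j)) ** 2 for i in range(n) for j in range(n))


def path_length(d):
    """Loss (6): sum of dissimilarities between neighboring objects."""
    return sum(d[i][i + 1] for i in range(len(d) - 1))


def gradient_measure(d):
    """Merit (7)-(8): satisfied minus violated gradient conditions."""
    return sum(sign(d[i][j] - d[i][k]) + sign(d[i][j] - d[k][j]) for i, k, j in triples(len(d)))


def inertia(d):
    """Merit (9): moment of inertia of the dissimilarities around the diagonal."""
    n = len(d)
    return sum(d[i][j] * abs(i - j) ** 2 for i in range(n) for j in range(n))


def measure_of_effectiveness(d):
    """Merit (10): out-of-range neighbors are counted as 0."""
    n = len(d)

    def at(i, j):
        return d[i][j] if 0 <= i < n and 0 <= j < n else 0

    return 0.5 * sum(
        d[i][j] * (at(i, j + 1) + at(i, j - 1) + at(i + 1, j) + at(i - 1, j))
        for i in range(n) for j in range(n))


# Hypothetical example: objects at positions 0, 1, 3, 6 on a line.
positions = [0, 1, 3, 6]
ordered = [[abs(p - q) for q in positions] for p in positions]
order = [2, 0, 3, 1]
shuffled = [[ordered[a][b] for b in order] for a in order]

for fn in (violation_count, least_squares, path_length,
           gradient_measure, inertia, measure_of_effectiveness):
    print(fn.__name__, fn(ordered), fn(shuffled))
```

On this example the losses are smaller and the gradient merit is larger for the line order than for the shuffled order, which is the behavior an ordination procedure exploits when it searches over permutations.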

In fact, any of the six functions listed above can be used, without particular preference, because they are all simple and easy to optimize. It should also be noted that $\Delta$ can be brought into a perfect Robinsonian form $\Delta'$ by row and column permutation whenever $\Delta$ is an ultrametric, that is, $\delta_{ij} \le \max[\delta_{ik}, \delta_{jk}]$ for all $1 \le i, j, k \le n$ (the three-point condition) and $\delta_{ij} = 0$ when $i = j$ [2]. However, for most data only an approximation to the Robinsonian form is possible. Hubert [2] shows that a best-fitting ultrametric, say $\Delta^* \equiv \{d^*_{ij}\}$, to a given dissimilarity matrix $\Delta \equiv \{\delta_{ij}\}$ can be generated by applying the iterative projection strategy of Dykstra [7] to the system defined by the ultrametric matrix constraints, minimizing $f$ given as follows:

$$f = \sum_{i<j} (\delta_{ij} - d_{ij})^2 \tag{11}$$

$$d^*_{ij} = \arg\min_{d_{ij}} \sum_{i<j} (\delta_{ij} - d_{ij})^2 \tag{12}$$

Thus, the Robinsonian matrix is obtained by applying the permutation function $\Psi^*$ to $\Delta^*$:

$$\Delta' = \Psi^*(\Delta^*) \tag{13}$$

where

$$\Psi^* = \arg\min_{\Psi} L(\Psi(\Delta^*)) \tag{14}$$
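The two-stage procedure (fit an ultrametric, then reorder it) can be sketched in a few lines of Python. The sketch below deliberately swaps in simpler techniques and should not be read as the procedure used in this work: instead of the least-squares fit obtained by Dykstra's iterative projection, it computes the subdominant ultrametric (minimax path distances, equivalent to single linkage), and instead of a dedicated optimizer it searches all permutations exhaustively, which is only feasible for a handful of objects. All names and the example matrix are hypothetical.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """Check the three-point condition d[i][j] <= max(d[i][k], d[j][k])."""
    n = len(d)
    return all(d[i][j] <= max(d[i][k], d[j][k]) + tol
               for i in range(n) for j in range(n) for k in range(n))


def subdominant_ultrametric(d):
    """Largest ultrametric below d: minimax path distance between every pair."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # A path through k costs the larger of its two legs.
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


def permute(d, order):
    """Apply the same permutation to rows and columns."""
    return [[d[a][b] for b in order] for a in order]


def violation_count(d):
    """Number of violated gradient conditions over all triples i < k < j."""
    n = len(d)
    return sum((d[i][k] > d[i][j]) + (d[k][j] > d[i][j])
               for i in range(n) for k in range(i + 1, n) for j in range(k + 1, n))


def best_order(d):
    """Exhaustive search for an order minimizing the violation count."""
    return min(permutations(range(len(d))),
               key=lambda order: violation_count(permute(d, order)))


# Hypothetical dissimilarity matrix that is not an ultrametric.
d = [[0, 1, 4, 3],
     [1, 0, 5, 2],
     [4, 5, 0, 6],
     [3, 2, 6, 0]]

u = subdominant_ultrametric(d)
order = best_order(u)
print(is_ultrametric(d), is_ultrametric(u))
print(order, permute(u, order))
```

Because the fitted matrix is an ultrametric, some permutation brings it into a perfect Robinsonian form, so the best order found has no gradient violation.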

In this paper, $\Delta'$ is calculated in the manner described above, for two reasons. On the one hand, since $\Delta'$ is Robinsonian, it can be used to reveal the potential structure of the set of objects (ordination) [1]. On the other hand, since $\Delta'$ is a reordered ultrametric, it is very useful for obtaining hierarchical partitions and tree representations of the set of objects [2].

As with other data mining techniques, the construction of a meaningful and robust dissimilarity matrix is essential to reveal the structure of the data. We therefore propose an approach that estimates the similarity between objects having heterogeneous types of attributes (quantitative, qualitative, ordinal, distribution, etc.) and imperfect data (missing, imprecise, and/or uncertain data). Although this approach is general and can be applied to any type of database, in this work it is applied to a heterogeneous medical database.

III. TRADITIONAL SIMILARITY MEASURES LIMITS

Traditional similarity (dissimilarity) measures (Minkowski, Canberra, Hamming, Jaccard, etc.) [8] generally suppose that the value of each attribute is precise (disregarding imprecise data), certain (disregarding uncertain values), and given (disregarding missing values), whereas real databases contain a considerable amount of incomplete and imperfect values. Furthermore, some constraints and conditions must be considered when dealing with each measure. For instance, we must avoid the division by zero that can occur in many of these measures, and we need to know the nature of each variable in records that contain heterogeneous attributes (quantitative, qualitative, ordinal, etc.) in order to choose a suitable measure. Moreover, the similarity interval must be taken into account during the aggregation and the interpretation of the resulting value ([0,1] is the similarity interval of most of the proposed measures; nevertheless, some measures take values in [-1,1], like the angular separation measure, which represents the cosine of an angle). In reality, the value of an attribute can be given in different ways. For example, if we examine the attribute “age” in some patient records, it could be assigned as {18 years, close to 18 years, more than 15 years, young, between 15 and 20, unknown, 18 or 19, quite possible to be 18 or 19 and somewhat possible to be 17 or 20, defined by a probability distribution, …}. Similarity calculation with the traditional measures cannot easily be carried out between two heterogeneously given values, for example, between a value given as 25 and another given as close to 25 or as a probability distribution. For these reasons, and in order to construct a general approach, we do not recommend the traditional measures, which are burdened with many conditions and constraints. Instead, we propose to use the possibility theory measures developed by Prade, Dubois, and Rakoto [9-10] to build the similarity (dissimilarity) matrix between the objects of our set.

IV. POSSIBILITY THEORY

Possibility theory provides a method to formalize subjective uncertainties of events, that is to say, a means of assessing to what extent the occurrence of an event is possible and to what extent we are certain of its occurrence, without being able to measure the exact probability of this occurrence, either because no analogous event is known to which we could refer or because the uncertainty results from the unreliability of the observation instruments. Let us attribute to each event defined on the universe of discourse $\Omega$ (in other words, to each element of the power set $\mathcal{P}(\Omega)$) a coefficient ranging between 0 and 1 that assesses to which degree the occurrence of the event is possible, where the value 1 means that the event is completely possible and the value 0 means that the event is impossible. To define this coefficient, we introduce the possibility measure $\Pi$, a function defined over $\mathcal{P}(\Omega)$ and taking its values in [0,1], such that:

Axiom 1: $\Pi(\emptyset) = 0$ (15)

Axiom 2: $\Pi(\Omega) = 1$ (16)

Axiom 3: $\forall A_1, A_2, \ldots \in \mathcal{P}(\Omega)$: $\Pi\left(\bigcup_{i=1,2,\ldots} A_i\right) = \mathrm{SUP}_{i=1,2,\ldots}\, \Pi(A_i)$ (17)

where SUP indicates the supremum of the concerned values.

The possibility measure is totally defined if a possibility coefficient can be attributed to every singleton of $\Omega$. Consequently, a possibility distribution function $\pi$, defined on $\Omega$ with values in [0,1] and such that $\sup_{x \in \Omega} \pi(x) = 1$, must be defined. The function $\Pi$ can then be defined from the function $\pi$ by equation 18:

$$\forall A \in \mathcal{P}(\Omega): \quad \Pi(A) = \sup_{x \in A} \pi(x) \tag{18}$$

Reciprocally, $\pi$ can be defined from $\Pi$ by equation 19:

$$\forall x \in \Omega: \quad \pi(x) = \Pi(\{x\}) \tag{19}$$

We should also mention that the indicator function (characteristic function) of a subset of $\Omega$ can be considered as a possibility distribution $\pi$ defined on $\Omega$. To calculate the possibility degree of the couple $(x, y)$, given that $x \in \Omega_1$ and $y \in \Omega_2$ where $\Omega_1$ and $\Omega_2$ are two non-interactive universes of discourse, the joint possibility distribution defined on the Cartesian product $\Omega_1 \times \Omega_2$ is calculated from equation 20:

$$\forall x \in \Omega_1,\ \forall y \in \Omega_2: \quad \pi(x, y) = \min(\pi_X(x), \pi_Y(y)) \tag{20}$$

In fact, the possibility measure alone is not sufficient to describe the uncertainty about the occurrence of an event because, contrary to probability theory, both an event $A$ and its complement $A^c$ can sometimes be completely possible simultaneously ($\Pi(A) = 1$ and $\Pi(A^c) = 1$ at the same time). In this particular case, it is impossible to decide on the occurrence of $A$ from the estimated possibility measure. For this reason, another function, called the necessity measure (denoted $N$), defined on $\mathcal{P}(\Omega)$ with values in [0,1], is introduced as follows:

Axiom 1: $N(\emptyset) = 0$ (21)

Axiom 2: $N(\Omega) = 1$ (22)

Axiom 3: $\forall A_1, A_2, \ldots \in \mathcal{P}(\Omega)$: $N\left(\bigcap_{i=1,2,\ldots} A_i\right) = \mathrm{INF}_{i=1,2,\ldots}\, N(A_i)$ (23)

where INF stands for infimum.
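On a finite universe of discourse, both measures can be computed directly from a possibility distribution. The Python sketch below is an illustration only. It assumes the standard duality $N(A) = 1 - \Pi(A^c)$, which is not stated explicitly in this section, and the distribution for the attribute "age" is hypothetical.

```python
def possibility(pi, event):
    """Pi(A): supremum of pi(x) over x in A, with Pi(empty set) = 0."""
    return max((pi[x] for x in event), default=0.0)


def necessity(pi, event):
    """N(A) = 1 - Pi(complement of A); this duality is an assumption of the sketch."""
    complement = set(pi) - set(event)
    return 1.0 - possibility(pi, complement)


# Hypothetical normalized distribution: the age is "18 or 19, perhaps 17 or 20".
pi_age = {17: 0.5, 18: 1.0, 19: 1.0, 20: 0.5, 21: 0.0}

for event in ({18, 19}, {18}, {17}, set(pi_age), set()):
    print(sorted(event), possibility(pi_age, event), necessity(pi_age, event))
```

The event {18} shows why the possibility measure alone is not decisive: both {18} and its complement are completely possible, and the necessity of {18} is accordingly zero, whereas the event {18, 19} has full possibility and a necessity of 0.5.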

V. POSSIBILITY-BASED SIMILARITY ESTIMATION

Suppose that we have two objects $O_j$ and $O_k$, each described by $S$ attributes: $O_j = [x_{1j}, x_{2j}, \ldots, x_{ij}, \ldots, x_{Sj}]$ and $O_k = [x_{1k}, x_{2k}, \ldots, x_{ik}, \ldots, x_{Sk}]$.

Each attribute can take a precise or an imprecise value modeled by its possibility distribution, and this value can be either numerical or nominal. The values of some attributes may be unassigned (missing values). In addition, each attribute is associated with a “tolerance function”, defined by an expert as a formula or as a table, which describes mathematically to which degree two values of this attribute are considered similar. An example of a tolerance function is the function that we call “close to” (figure 1-a). Such a function can be defined by the following formula:

$$\mu_a(a_x, a_y) = \begin{cases} 1 - \dfrac{|a_x - a_y|}{\Delta} & \text{if } |a_x - a_y| \le \Delta \\ 0 & \text{otherwise} \end{cases} \tag{24}$$

where $\Delta$ (here a scalar, distinct from the dissimilarity matrix of section II) is a parameter that controls the slope of the function and consequently the notion of “close to”. The tolerance function can also be:

- The function "True/false": two values of an attribute are similar if they are identical (similarity equal to 1); if the values are different, the similarity is null. This type of function is used especially for nominal variables with independent categories. For ordinal variables, the function “close to” must be used.

- The "ad hoc" tolerance functions which are defined by the experts to reflect their point of view about the similarities between the attributes.
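The three kinds of tolerance function described above can be written as small Python callables. This is a minimal sketch: the names `close_to`, `true_false` and `from_table`, the width of 5 years, and the similarity table for a size attribute are all hypothetical choices made for illustration.

```python
def close_to(width):
    """Build a "close to" tolerance function following formula (24)."""
    def mu(a, b):
        gap = abs(a - b)
        return 1.0 - gap / width if gap <= width else 0.0
    return mu


def true_false(a, b):
    """"True/false" tolerance: 1 for identical values, 0 otherwise."""
    return 1.0 if a == b else 0.0


def from_table(table):
    """Build an "ad hoc" tolerance function from an expert-defined symmetric table."""
    def mu(a, b):
        if a == b:
            return 1.0
        return table.get((a, b), table.get((b, a), 0.0))
    return mu


age_tolerance = close_to(5.0)  # hypothetical width: ages up to 5 years apart are related
size_tolerance = from_table({("small", "medium"): 0.5, ("medium", "large"): 0.5})

print(age_tolerance(18, 18), age_tolerance(18, 20), age_tolerance(18, 30))
print(true_false("ring", "ring"), true_false("ring", "web"))
print(size_tolerance("medium", "small"), size_tolerance("small", "large"))
```

Returning a closure keeps the width or the table attached to the attribute it belongs to, so each attribute of an object can carry its own tolerance function.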

In our approach, the similarity between the two objects $O_j$ and $O_k$ is estimated by means of two measures: the possibility degree of similarity between $O_j$ and $O_k$, which tells us to which degree it is possible that these vectors are similar, and the necessity degree of similarity, which tells us to which degree we are certain of their similarity. The probability of the similarity between $O_j$ and $O_k$ lies between the necessity degree, which represents the lower limit, and the possibility degree, which represents the upper limit. To calculate the possibility and necessity degrees of resemblance, we calculate the local possibility and necessity degrees between the corresponding attributes and aggregate them, for example by taking their average, in order to decide on the total similarity. The local possibility and necessity degrees of similarity between $x_{ij}$, given by its possibility distribution $\pi_{X_j, x_{ij}}$, and $x_{ik}$, given by its possibility distribution $\pi_{X_k, x_{ik}}$, for all $i \in \{1, 2, \ldots, S\}$ (see figure 1-b, in which each possibility distribution is assumed, for clarity, to be a fuzzy number) are calculated as follows.

Let $D$ be the definition domain of the considered attribute $x_i$, let $U = D \times D$, and let $\mu$ be the tolerance function associated with this attribute. The joint possibility distribution $\pi_D$ (figure 1-c) is calculated as:

$$\pi_D(x, y) = \min\left(\pi_{X_j, x_{ij}}(x),\ \pi_{X_k, x_{ik}}(y)\right), \quad (x, y) \in U \tag{25}$$

In this case, the local possibility degree of similarity $\pi_i$ can be calculated (figure 1-c, 1-d) as:

$$\pi_i(x_{ij}, x_{ik}) = \mathrm{SUP}_{u \in U}\left[\min(\mu(u), \pi_D(u))\right] \tag{26}$$

The local necessity degree of similarity $N_i$ can be calculated (figure 1-f) as:

$$N_i(x_{ij}, x_{ik}) = \mathrm{INF}_{u \in U}\left[\max(\mu(u), 1 - \pi_D(u))\right] \tag{27}$$
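For an attribute with a finite domain, relations (25) to (27) reduce to a double loop over the values that have a non-zero possibility. The Python sketch below illustrates this under stated assumptions: possibility distributions are represented as dictionaries mapping a value to its possibility (values that are absent are taken to have possibility 0, so they affect neither the supremum nor the infimum), and the function name `local_degrees`, the tolerance width and the age example are hypothetical.

```python
def local_degrees(pi_x, pi_y, mu):
    """Return the local (possibility, necessity) degrees of similarity.

    pi_x, pi_y: dicts mapping attribute values to possibility degrees.
    mu: tolerance function taking two attribute values.
    """
    poss, nec = 0.0, 1.0
    for x, px in pi_x.items():
        for y, py in pi_y.items():
            joint = min(px, py)                      # relation (25)
            tol = mu(x, y)
            poss = max(poss, min(tol, joint))        # relation (26)
            nec = min(nec, max(tol, 1.0 - joint))    # relation (27)
    return poss, nec


def close_to_2(a, b):
    """Hypothetical "close to" tolerance with a width of 2 units."""
    return max(0.0, 1.0 - abs(a - b) / 2.0)


exactly_25 = {25: 1.0}
about_25 = {24: 0.5, 25: 1.0, 26: 0.5}
exactly_40 = {40: 1.0}

print(local_degrees(exactly_25, exactly_25, close_to_2))
print(local_degrees(exactly_25, about_25, close_to_2))
print(local_degrees(exactly_25, exactly_40, close_to_2))
```

Comparing a precise value with an imprecise one keeps the similarity fully possible while lowering the certainty, which is the behavior that lets heterogeneously assigned values be compared with a single pair of formulas.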

Figure 1. Local possibility and necessity calculation. X represents the first fuzzy proposition concerning the value of the attribute in the first object. Y represents the second fuzzy proposition concerning the value of the same attribute in the second object. μ represents the possibility or the necessity degree.

We consider that if the value of an attribute is given in one object and unassigned in the other (the case of missing values), it is completely possible that these values are similar ($\pi_i = 1$), but we are entirely uncertain of it ($N_i = 0$). Once the local possibility and necessity degrees have been calculated between the attributes, the global possibility and necessity degrees between the objects can be obtained by averaging the local degrees. The average possibility $\Pi_{jk}$ and the average necessity $N_{jk}$ are calculated from the following equations:

$$\Pi_{jk} = \frac{\sum_{i=1}^{S} \pi_i}{S} \tag{28}$$

$$N_{jk} = \frac{\sum_{i=1}^{S} N_i}{S} \tag{29}$$
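The aggregation step, including the treatment of missing values, can be sketched as follows in Python. The sketch is illustrative only: objects are represented as lists of possibility distributions (dictionaries) with `None` marking an unassigned attribute, a missing value on either side is given full possibility and zero necessity (treating the case where both values are missing in the same way is an assumption of the sketch), and the two patient records are hypothetical.

```python
def local_degrees(pi_x, pi_y, mu):
    """Local (possibility, necessity) degrees on a finite domain."""
    poss, nec = 0.0, 1.0
    for x, px in pi_x.items():
        for y, py in pi_y.items():
            joint = min(px, py)
            poss = max(poss, min(mu(x, y), joint))
            nec = min(nec, max(mu(x, y), 1.0 - joint))
    return poss, nec


def global_degrees(obj_j, obj_k, tolerances):
    """Average the local degrees over the S attributes, as in (28) and (29)."""
    pairs = []
    for xj, xk, mu in zip(obj_j, obj_k, tolerances):
        if xj is None or xk is None:
            pairs.append((1.0, 0.0))  # missing value: fully possible, not certain at all
        else:
            pairs.append(local_degrees(xj, xk, mu))
    s = len(pairs)
    return sum(p for p, _ in pairs) / s, sum(n for _, n in pairs) / s


def close_to_2(a, b):
    """Hypothetical "close to" tolerance with a width of 2 units."""
    return max(0.0, 1.0 - abs(a - b) / 2.0)


def true_false(a, b):
    """Tolerance for nominal attributes with independent categories."""
    return 1.0 if a == b else 0.0


# Hypothetical records with three attributes: age, sex, and a nominal finding.
record_j = [{25: 1.0}, {"F": 1.0}, None]
record_k = [{24: 0.5, 25: 1.0, 26: 0.5}, {"F": 1.0}, {"ulcer": 1.0}]
tolerances = [close_to_2, true_false, true_false]

print(global_degrees(record_j, record_k, tolerances))
```

In this example the missing third attribute leaves the global possibility at its maximum but pulls the global necessity down, so incomplete records are penalized only in the certainty of their similarity.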

where $S$ is the number of attributes. In this paper, the similarity is modeled by its lower limit, represented by the necessity degree of similarity. The resulting similarity matrix can be transformed into a dissimilarity matrix by applying a decreasing function to it, for instance:

$$\Delta = I - S \tag{30}$$

where $\Delta$ is the dissimilarity matrix, $S$ is the similarity matrix, and $I$ is the $n \times n$ matrix whose entries are all equal to 1, so that $\delta_{jk} = 1 - N_{jk}$.
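As a small illustration of this transformation, the Python sketch below converts a similarity matrix into a dissimilarity matrix entry by entry. It assumes that the decreasing function is $1 - s$ applied to every entry, and it additionally forces the diagonal to 0 so that the result satisfies the requirement $\delta_{ii} = 0$ of section II; that second choice is an assumption of the sketch, as are the function name and the example values.

```python
def to_dissimilarity(similarity):
    """Map each similarity s in [0, 1] to 1 - s, with a zero diagonal."""
    n = len(similarity)
    return [[0.0 if j == k else 1.0 - similarity[j][k] for k in range(n)]
            for j in range(n)]


# Hypothetical necessity-based similarity matrix for three objects.
similarity = [[1.0, 0.75, 0.25],
              [0.75, 1.0, 0.5],
              [0.25, 0.5, 1.0]]

print(to_dissimilarity(similarity))
```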

VI. THE TESTED BASE

Our digestive endoscopy database [12] consists of images, lesion descriptions, and scene information concerning the upper gastrointestinal tract (esophagus, stomach, and duodenum). The endoscopic findings (pathologies) constitute the objects, which are depicted using an exhaustive description mode. Each object is described by 24 attributes with 145 modalities (33 attributes with 206 modalities if a sub-object exists), and each attribute is associated with the set of all its possible choices. Because the sub-object features depend on the “non-homogeneous state” of the type feature, there are further relationships between modalities and features (for example, an object whose Density is “unique” has no Spatial Organization feature, and an object whose Shape is “ring-tube” has no Minor Axis feature) or between modalities of different features (for example, modalities of the Relief and Thickness features, or modalities of object sizes and sub-object sizes). As for the scene information, a scene is depicted by a patient profile (the sex and age prevalence features as well as a predefined set of clinical contexts denoting antecedents, circumstances, and symptoms), by the objects (at least one), by possible spatial relations between objects, and by the complementary procedures to be envisaged to confirm the disease diagnosis. The attributes of this base are qualitative, quantitative, or unevaluated (missing values). In our test, we calculated the necessity degree of similarity (the average of the local necessity degrees of similarity) between a given profile and the others, and then represented the similarity matrix using the Robinsonian matrix and the hierarchical partition. The base contains the following pathologies: Dilated Lumen, Stenosis, Extrinsic Compression, Web, Ring, Hiatal Hernia, Food, Liquid Blood, Blood Clot, Z-line, Spot, Circular Barrett’s, Moniliasis, Simple Erosion, Ulcer (edge), and Petechial Mucosa.

VII. EXPERIMENTS AND RESULTS

A similarity matrix containing the necessity degrees of similarity between all the objects of the tested database is constructed by applying the steps described in section V and is transformed into a dissimilarity matrix by applying (30). We have chosen a graph-based model that relies on the Robinsonian form of the matrix to depict a tree representing the hierarchical partitions of our base. This has been achieved by fitting an ultrametric matrix to the constructed dissimilarity matrix and then reordering the ultrametric matrix with the permutation that optimizes the merit function given by (7) [5]. We observed that the ordination places all objects belonging to a given type of pathology next to each other, and that a group of similar objects belonging to a given pathology always lies next to the group of the most similar pathology. This approach also makes it possible to distinguish the similarity degrees between objects belonging to the same pathology class.

For clarity and simplicity, we give in the following an example on a smaller database (consisting of the first fourteen objects of the main database, which represent three distinct pathologies), so that the interpretation of the structure of the data can be followed. In this case, the possibility-based dissimilarity matrix is a submatrix of the main matrix that represents the whole dataset. Suppose that $S_1 = \{O_1, O_2, \ldots, O_{14}\}$ is a database consisting of 14 objects, where $P_1 = \{O_1, O_2\}$ is the set of objects whose pathology class is “Dilated Lumen”, $P_2 = \{O_3, O_4, O_5, O_6, O_7, O_8, O_9, O_{10}\}$ is the set of objects whose pathology class is “Stenosis (esophagus)”, and $P_3 = \{O_{11}, O_{12}, O_{13}, O_{14}\}$ is the set of objects whose pathology class is “Extrinsic Compression”. Table I depicts the Robinsonian matrix of $S_1$ (the ultrametric-fitted dissimilarity matrix reordered using the row and column order $O_4 \prec O_3 \prec O_6 \prec O_5 \prec O_7 \prec O_8 \prec \ldots$). Here the blocks of equal-valued entries are highlighted with the same color, indicating the partition hierarchy induced by the ultrametric (depicted in table II, which is a clearer equivalent version of table I). Note that at the beginning each object represents an individual class; then, at the first level 0.3987 (the smallest value), the two most similar classes merge (agglomerate) to form one class. The agglomeration continues gradually until all the objects gather in a single class (at the highest level 0.7540), which represents our entire dataset. For the partition hierarchy just given, the alternative tree structure (dendrogram) is given in figure 2 (the terminal nodes (vertices) of this structure correspond to the 14 objects of $S_1$). As expected, objects having the same class lie next to each other in the resulting ordination. Other interesting remarks and interpretations can be made. For example, $O_{11}$ and $O_{13}$ are the most similar objects in $S_1$ (actually, they are the most similar objects among all the objects of our database $S$). We can also see that the pathology $P_3$ is a distinct pathology in $S_1$ (and in $S$ as well) because it can be distinguished easily from the other classes of pathologies. We can also remark that object $O_4$ is the least similar to the other members of the class $P_2$.
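The way a partition hierarchy is read off an ultrametric matrix can be illustrated with a short Python sketch. It relies on the fact that, for an ultrametric, grouping objects whose dissimilarity does not exceed a given level always yields a proper partition. The function names and the four-object matrix are hypothetical; the values are not those of table I.

```python
def partition_at(u, level):
    """Classes obtained by grouping objects whose ultrametric dissimilarity is <= level."""
    n = len(u)
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        members = [j for j in range(n) if u[i][j] <= level]
        assigned.update(members)
        classes.append(members)
    return classes


def hierarchy(u):
    """List the partition obtained at each distinct level, from smallest to largest."""
    n = len(u)
    levels = sorted({u[i][j] for i in range(n) for j in range(i + 1, n)})
    return [(level, partition_at(u, level)) for level in levels]


# Hypothetical ultrametric dissimilarity matrix for four objects.
u = [[0, 1, 4, 2],
     [1, 0, 4, 2],
     [4, 4, 0, 4],
     [2, 2, 4, 0]]

for level, classes in hierarchy(u):
    print(level, classes)
```

Each printed line corresponds to one level of a dendrogram: the two closest objects merge first, and the agglomeration continues until a single class contains every object.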

According to the levels of the members of a given class, we can divide this class into smaller homogeneous groups (subclasses) in order to depict the details and the degrees of a given pathology (for example ulcer grade I, grade II, and grade III). Segmenting and organizing objects in this way can be very useful for doctors to find similar cases (content-based case retrieval) or to help make a decision or find a solution (diagnosis) for a similar case (case-based reasoning). It also helps to study the associations and relations that exist between the pathologies and the lesions on the one hand, and those that exist between the objects composing the same pathology on the other. In fact, this representative model provides doctors with a simple-to-interpret tree that explicitly visualizes the relations between the objects and the pathologies of our database.

TABLE I. THE ROBINSONIAN MATRIX OF S1

TABLE II. THE PARTITION HIERARCHY INDUCED BY THE ULTRAMETRIC ROBINSONIAN MATRIX (NOTE THAT THE OBJECTS THAT REPRESENT A GIVEN PATHOLOGY TEND TO AGGLOMERATE IN THE SAME CLUSTER).

Figure 2. The dendrogram of S1

VIII. DISCUSSION AND CONCLUSION

A novel approach to ordinate the objects of a database and to find potential relations and structures between them has been proposed. These objects have heterogeneous types of variables and imperfect data. Thanks to possibility theory, our approach can be applied without complicated data preprocessing steps to deal with missing, imperfect, and heterogeneously assigned values. The similarity estimation phase based on this theory is very general and can be applied to other problems in data mining (segmentation, classification, association, etc.). In our approach, we used a graph-based representation (the dendrogram) to visualize the dissimilarity matrix because it is simple and clear, both to depict and to be interpreted by doctors. Interestingly, without any a priori medical knowledge and without knowing the key attributes of the pathologies (which are generally represented by assigning a weight to each attribute), we have been able to reveal the structure of the objects of the database. Here, our approach was applied to a digestive database. Thanks to its generality, it could be applied to other types of pathologies or in other domains without any modification, and it could be developed to extract potentially unknown medical rules. This method can also be used in semantic image reorganization and in image understanding when the attributes of an object represent descriptions of images.

REFERENCES

[1] M. Hahsler, K. Hornik, and C. Buchta. “Getting things in order: an introduction to the R package seriation”. Journal of Statistical Software, volume 25, issue 3, pages 1-34, 2008.

[2] L. Hubert, P. Arabie, and J. Meulman. “Graph-theoretic representations for proximity matrices through strongly-anti-Robinson or circular strongly-anti-Robinson matrices”. Psychometrika (SpringerLink), volume 63 (4), pages 341-358, 2006.

[3] C. H. Chen. “Generalized association plots: information visualization via iteratively generated correlation matrices”. Statistica Sinica, volume 12 (1), pages 7-29, 2002.

[4] G. Caraux and S. Pinloche. “PermutMatrix: a graphical environment to arrange gene expression profiles in optimal linear order”. Bioinformatics, volume 21 (7), pages 1280-1281, 2005.

[5] L. Hubert, P. Arabie, and J. Meulman. “Combinatorial data analysis: optimization by dynamic programming”. Society for Industrial Mathematics, ISBN 9780898714784, 1987.

[6] W. McCormick, P. Schweitzer, and T. White. “Problem decomposition and data reorganization by a clustering technique”. Operations Research, volume 20 (5), pages 993-1009, 1972.

[7] R. Dykstra. “An algorithm for restricted least squares regression”. Journal of the American Statistical Association, volume 78, no. 384, pages 837-842, 1983.

[8] G. Bisson. “La similarité: une notion symbolique/numérique”. Apprentissage symbolique-numérique, tome 2. Eds Moulet, Brito. Edition CEPADUES, 2000.

[9] B. Bouchon-Meunier. La logique floue et ses applications. Addison Wesley France, chapter 2, 1995.

[10] H. Rakoto, J. Hermosillo, and M. Ruet. “Integration of experience based decision support in industrial processes”. IEEE conference on system, man, cybernetics, volume 7, 6 pp., 2002.

[11] Z. Bar-Joseph, D. Gifford, and T. Jaakkola. “Fast optimal leaf ordering for hierarchical clustering”. Bioinformatics, 17, pages 22-29, 2001.

[12] C. Le Guillou and J.-M. Cauvin. “From Endoscopic Imaging and Knowledge to Semantic Formal Images”. Springer, computer science, volume 4370, pages 189-201, 2007.
