

A Fuzzy Linguistic Approach Generalizing Boolean Information Retrieval: A Model and Its Evaluation

Gloria Bordogna and Gabriella Pasi
Istituto di Fisica Cosmica e Tecnologie Relative, Consiglio Nazionale delle Ricerche, Via Ampere 56, 20131 Milano, Italy

The generalization of Boolean Information Retrieval Systems (IRS) is still an open research field. Although such systems are widespread on the market, they present some limitations; one of the main features lacking in these systems is the ability to deal with the “imprecision” and “subjectivity” characterizing retrieval activity. However, the replacement of such systems would be much more costly than their evolution through the incorporation of new features to enhance their efficiency and effectiveness. Previous efforts in this area have led to the introduction of numeric weights to improve both document representation and query language. By attaching a numeric weight to a term in a query, a user can provide a quantitative description of the “importance” of that term in the documents he or she is looking for. However, the use of weights requires a clear knowledge of their semantics for translating a fuzzy concept into a precise numeric value. Our acquaintance with these problems led us to define, starting from an existing weighted Boolean retrieval model, a linguistic extension, formalized within fuzzy set theory, in which numeric query weights are replaced by linguistic descriptors which specify the degree of importance of the terms. This fuzzy linguistic model is defined and an evaluation is made of its implementation on a Boolean IRS.

Introduction

The most crucial phases of information retrieval (IR) activity are the unambiguous formulation of information requirements, and the selection, from a given archive, of the information satisfying them; both these phases are based on subjective and often imprecise judgments. In fact, in many practical cases, a precise description of information needs cannot be provided, as there is no a priori exact idea of what is being sought. Moreover, as information can satisfy only to some extent a particular request, it is important to know how much the selected information satisfies the specified needs.

Received May 25, 1992; revised September 21, 1992; accepted September 21, 1992.

Not subject to copyright within the United States. Published by John Wiley & Sons, Inc.

In an information retrieval system which automatically performs the IR activity, both information (organized in documents) and user needs must be formally represented in a consistent way so that the retrieval mechanism can match the user query against the collection of documents, making the pertinent ones available. The automatic retrieval process is then activated by the user’s request, which must be expressed in the system query language. In practice, a human intermediary, expert in the query language, is often needed to “efficiently” translate the user’s informal request into a query; this process further increases the imprecision in the expression of the user’s needs.

Although these problems are well known, the Boolean retrieval model is still the basis of the majority of commercial IRSs (Van Rijsbergen, 1979; Salton, 1984) even though this model oversimplifies the retrieval activity, not taking into account the problems due to subjectivity and imprecision. In fact, documents are represented as mathematical sets of terms, and queries are Boolean combinations of terms, thus making them the unique conceptual units for dealing with information. The main limitation of these systems is that they do not allow the importance of terms in the desired documents to be expressed in queries and they are not able to discriminate the retrieved documents by relevance judgments (Cooper, 1988; Salton, 1984).

In order to overcome these limitations some generalizations of Boolean retrieval systems have been provided in which crisp queries are evaluated to yield quantitative relevance judgments by the use of partial matching mechanisms (Radecki, 1988). These models are based on the concept of weight: in a document representation weights are used to express the distinct roles played by terms in qualifying the document’s contents. In query languages, a numeric weight may be assigned to each term (or subexpression), indicating the degree of importance of that term (subexpression) (Bookstein, 1980; Bordogna, Carrara, & Pasi, 1991; Buell & Kraft, 1981; Cater & Kraft, 1987). As a result of the evaluation of a weighted query, a numeric value, called the retrieval status value (RSV), is assigned to each document retrieved; this value indicates the estimated relevance of the document to the user’s information needs.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE 44(2):70-82, 1993. CCC 0002-8231/93/020070-13

However, the main limitation of such an approach is its inadequacy in dealing with imprecise queries. In fact, the use of numeric query weights forces the user to quantify a set of qualitative and rather vague concepts; moreover, when indicating the importance of a term with a numeric weight, one should be well aware of weight semantics.

An IRS would be much more flexible and powerful if it could directly process queries containing linguistic descriptors of user needs such as:

give me all relevant documents dealing strongly with fuzzy models.

In this query, “fuzzy models” are the terms which specify the desired document contents, dealing strongly is a linguistic description of the importance of these terms in qualifying documents, and relevant is a linguistic description qualifying the desired degree of satisfaction of documents with respect to user needs. Several attempts have been made in the DBMS field (Bosc & Pivert, 1992; Buckles, Petry, & Cheung, 1989; Kacprzyk & Ziolkowski, 1986; Zemankova, 1989) to deal with this kind of linguistic description through fuzzy models, while in information retrieval fuzzy linguistic approaches have seldom been adopted in operational systems (Biswas et al., 1987; Bolt, Kowalski, & Kozlowska, 1985; Doszkocs, 1986).

In this article, we tackle the problem of defining a fuzzy retrieval model as the basis for extending a weighted Boolean IRS. In this fuzzy model, queries are defined as linguistic generalizations of queries in the weighted model. To this end, linguistic descriptors are introduced in the query language to express the importance that a term must have in the desired documents and in the classification mechanism to label the retrieved documents in relevance classes.

Through this model, a user is allowed to characterize the contents of the desired documents by explicitly associating a linguistic descriptor, like important or fairly important, to each term in a query. In the same way, the retrieved documents are arranged in relevance classes identified by descriptors, such as very relevant, fairly relevant, or not very relevant.

This retrieval model has been defined within fuzzy set theory, a theory providing a simple and suitable means of dealing with qualitative and imprecise criteria (Zadeh, 1972, 1975, 1978).

The basis of our approach is presented in the next two sections, while the section following those gives the formal definitions of fuzzy queries and of the linguistic variable Importance. The evaluation mechanism of fuzzy queries is then described, followed by the relevance classification mechanism. The last section discusses an evaluation of the implementation of this linguistic model.

Foundations of the Fuzzy Linguistic Model

In the literature, the attempts to extend the Boolean retrieval model have had a twofold aim: to enable IRSs to make relevance judgments on retrieved documents, and to provide users with a means to express their imprecision in specifying information requirements (Cooper, 1988; Salton, 1984).

The theory of fuzzy sets (Zadeh, 1978) has been successfully employed to model relevance judgment formulations based on the extension of the Boolean retrieval model. Through the application of this theory, extensions have been made in document representation and in query languages (Buell & Kraft, 1981; Kerre, Zenner, & De Caluwe, 1986; Koll & Srinivasan, 1990; Kraft & Buell, 1983; Miyamoto, 1990; Radecki, 1979). The fuzzy representation of documents is based on the assumption that the significance of a term in describing the content of a document can be expressed as a number in the range [0, 1], called its index term weight.

Document representation is founded on the notion of a fuzzy set, in which the transition from membership of an element to nonmembership is gradual, rather than abrupt. A document is no longer represented by a set of terms belonging to or associated with it, as in the Boolean model; it is treated as a fuzzy set of terms, M(d):

$$M(d) = \{(t, \mu_d(t)) \mid t \in T\}$$

in which $\mu_d: T \to [0, 1]$, and T is the set of index terms. $\mu_d(t)$, called the index term weight, is interpreted as the degree of significance of t in representing the content of document d. To make explicit the relation between a document $d \in D$ and a term $t \in T$, the correlation function F is defined as $F: D \times T \to [0, 1]$, with $F(d, t) = \mu_d(t)$.

In automatic full-text indexing, the membership function of a fuzzy set representing a document is a function of term occurrences; for example, it can be defined as:

$$\mu_d(t) = F(d, t) = \frac{DF_{dt}}{MAXOCC_d} \tag{1}$$

in which $DF_{dt}$ is the number of occurrences of term t in document d, and $MAXOCC_d$ is the number of occurrences of the most frequent term in document d. The significance of a term in a document is high when the term has a high frequency in the individual document and a low frequency in the whole collection (Salton, 1984). The function F, as defined in (1), is therefore not sufficient to model this behavior: either definition (1) must be multiplied by an inverse document frequency factor to lower the index term weight of frequent meaningless terms, or a stop list (also called the negative dictionary) containing the most frequent nonsignificant terms (such as articles or prepositions) must be adopted to avoid their use as index terms.
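As a rough illustration of definition (1) combined with a stop list, the following Python sketch computes index term weights for a single document. The tokenization by whitespace, the tiny `STOP_LIST`, and the function name `index_term_weights` are assumptions made for the example; they are not taken from the indexing procedure of any actual system.

```python
from collections import Counter

# Hypothetical stop list; a real negative dictionary would be much larger.
STOP_LIST = {"the", "a", "an", "of", "in", "and", "to", "is"}

def index_term_weights(text, stop_list=STOP_LIST):
    """Return {term: F(d, t)}, where F(d, t) is the number of occurrences
    of t divided by the occurrences of the most frequent term in d."""
    # Assumption: lowercase whitespace tokenization is good enough here.
    tokens = [tok for tok in text.lower().split() if tok not in stop_list]
    counts = Counter(tokens)
    if not counts:
        return {}
    max_occ = max(counts.values())
    return {term: occ / max_occ for term, occ in counts.items()}

doc = "the fuzzy model extends the boolean model and the fuzzy query"
weights = index_term_weights(doc)
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.2f}")
```

Without the stop list, the article "the" would be the most frequent term and would lower the weights of all the other terms, which is the behavior the stop list is meant to avoid.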

The fuzzy representation of documents has been adopted in different fuzzy models defined to extend the Boolean retrieval scheme. Such fuzzy models differ in the query weight semantics introduced in Boolean queries to allow users to specify their information needs in a more detailed way. All these models deal with pure Boolean queries as borderline cases of weighted queries.

With relative relevance semantics, query weights have been interpreted as measures of the relative relevance of each term with respect to the others in the query (Bookstein, 1980). By associating relevance weights to terms in a query, the user requires that the computation of the RSV be dominated by the more heavily weighted terms.

Other approaches proposed a threshold semantics for query weights (Buell & Kraft, 1981): by associating a weight w with a term t, the user requires that the documents be evaluated by considering whether F(d, t) exceeds the threshold w or not. By specifying thresholds, the user is asking to see all the documents sufficiently about the topic being searched for.

The main formal drawback of these models is that, to be consistent with query weight semantics, they must renounce the separability property of the wish-list (Cater & Kraft, 1989; Waller & Kraft, 1979).

In an effort to overcome these formal limitations, a more recent approach introduces the concept of weight in a query at two distinct levels, the query term weight and the query weight, with different semantics (Cater & Kraft, 1987). Models based on this assumption interpret a weighted query as the description of a perfect collection of documents or of a set of ideal documents satisfying a user’s needs (Bordogna, Carrara, & Pasi, 1991; Cater & Kraft, 1987). Thus, the process of evaluating a query selects all documents as much like the perfect collection as possible. This point of view implies that users must be able to precisely specify the characteristics of their ideal documents in a form consistent with document representations.

In the model defined in Bordogna et al. (1991), which is the basis of the fuzzy linguistic model, a query is a Boolean expression on pairs (t, w), in which t is a term and w is a numeric weight belonging to the set [0, 1]. A query term weight w has the meaning of an ideal index term weight: in the query evaluation this value is considered as a constraint on the stored document representations. When associating a weight w to a term t in a query, the user is asking the system to first retrieve documents with an index term weight close to w. If w is low, the user declares an interest in documents which are not much about term t; if w is fairly high, the user declares an interest in documents which are about term t. The retrieval status value (RSV) is then computed as the degree to which all the constraints have been satisfied by each document in the stored collection, i.e., the degree of closeness of a stored document representation to the ideal description. This computation consists in a partial matching between the index term weights in the fuzzy document representation and the query term weights.

The constraint, close-to-w, defined by the weight w, is formally interpreted as a fuzzy restriction on the set [0, 1]: a fuzzy restriction is a fuzzy subset whose membership function is called a compatibility function as it indicates the compatibility between each element of the set and the concept expressed by the constraint label. The compatibility function, $\mu_{\text{close-to-}w}$, has been defined as a normalized parametric Gaussian function centered in w:

$$\mu_{\text{close-to-}w}(F(d, t)) = e^{(F(d, t) - w)^2 \ln(k)} \quad \text{for } F(d, t) \in [0, 1] \tag{2}$$

The concept of closeness can be modulated by varying the k value: when it approaches 1, the Gaussian curve becomes flatter, thus relaxing the constraint of closeness. Conversely, for k tending to 0, the constraint is strengthened. When implementing the model, due to the adopted approximation of real numbers, a k value must be chosen in order to define the set of distinct RSVs produced by the system. For example, in the application of this model to the information retrieval system DOMINO (Bordogna et al., 1990, 1991), k has been set to 0.01; in this way, 100 distinct RSV values are obtained (distributed in the interval [0.01, 1]) when adopting an approximation to the second decimal digit.
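The effect of k can be checked numerically. The sketch below is only an illustration of the Gaussian compatibility function in (2); the function name `mu_close_to_w` and the sample weights are assumptions introduced for the example.

```python
import math

def mu_close_to_w(f, w, k=0.01):
    """Compatibility of an index term weight f with the constraint
    close-to-w, following (2): exp((f - w)^2 * ln(k)) with 0 < k < 1."""
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    return math.exp((f - w) ** 2 * math.log(k))

# Degrees rounded to the second decimal digit, the approximation
# mentioned in the text for k = 0.01.
exact = round(mu_close_to_w(0.7, 0.7), 2)     # index weight equals w
farthest = round(mu_close_to_w(0.0, 1.0), 2)  # largest possible distance

# Same distance from w, two different k values.
flat = mu_close_to_w(0.2, 0.7, k=0.9)    # k near 1: relaxed constraint
steep = mu_close_to_w(0.2, 0.7, k=0.01)  # k near 0: strict constraint
print(exact, farthest, round(flat, 3), round(steep, 3))
```

With k = 0.01 the lowest degree, reached at distance 1 from w, is k itself, which matches the interval [0.01, 1] quoted for the rounded RSVs.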

The evaluation function of a pair (t, w) must establish how well document d satisfies the request expressed by (t, w). On the basis of the interpretation of a pair (t, w), its evaluation function must compute the degree of compatibility between F(d, t), for d varying in D, and the constraint close-to-w identified by w. It follows that the meaning of the pair (t, w) can be defined as:

$$M((t, w)) = \{(d, \mu_{\text{close-to-}w}(F(d, t))) \mid d \in D\} \tag{3}$$
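A minimal sketch of how the fuzzy set in (3) might be materialized over a small collection is given below. The index `F`, the document identifiers, and the convention that a term absent from a document has index term weight 0 are assumptions made for illustration only.

```python
import math

def mu_close_to_w(f, w, k=0.01):
    """Gaussian compatibility function of (2)."""
    return math.exp((f - w) ** 2 * math.log(k))

# Hypothetical index: F[d][t] is the index term weight of term t in document d.
F = {
    "d1": {"fuzzy": 0.9, "retrieval": 0.4},
    "d2": {"fuzzy": 0.3},
    "d3": {"retrieval": 1.0},
}

def meaning(term, w, index, k=0.01):
    """M((t, w)) as a dict mapping each document to its degree of
    satisfaction. Assumption: a missing term counts as weight 0."""
    return {
        d: mu_close_to_w(weights.get(term, 0.0), w, k)
        for d, weights in index.items()
    }

rsv = meaning("fuzzy", 0.9, F)
ranking = sorted(rsv, key=rsv.get, reverse=True)
print(ranking)
```

Here the query pair ("fuzzy", 0.9) ranks d1 first because its stored weight coincides with the ideal weight, while d3, which does not contain the term, receives the lowest degree.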

From the user’s point of view, the main limitation of the approaches based on numeric query weights is that the user is forced to express a qualitative fuzzy concept, such as importance, by a numeric value, the query weight.

Several attempts to model the fuzziness in user queries by means of linguistic expressions have been made in the data base management systems field (Bosc & Pivert, 1992; Buckles, Petry, & Cheung, 1989; Kacprzyk, Zadrozny, & Ziolkowski, 1989; Kacprzyk & Ziolkowski, 1986; Zemankova, 1989). In this context, a classification of linguistic descriptors in database queries has been provided (Kamel, Hadfield, & Ismail, 1990):

• qualitative numeric descriptors: i.e., words describing some numeric values or a range of numeric values, e.g., “large” or “high”;

• qualitative non-numeric descriptors: i.e., words which describe nonnumeric concepts such as “nice”;

• quantification descriptors: i.e., words which describe the quantity of items desired, e.g., “some” or “the major part of.”

In information retrieval a few approaches have been formalized to deal with fuzzy queries (Biswas et al., 1987; Bolt et al., 1985; Doszkocs, 1986; Lucarella, 1990). For example, an IRS prototype (Biswas et al., 1987) has been defined with fuzzy queries in a pseudo-natural language. Queries are interpreted and evaluated through a knowledge-based model defined in the specific topic area of “knowledge representation.” In this system both the indexing of the documents and the definition of the term relations are performed manually by an expert familiar with the topic area. While such approaches give valuable results in a specific domain of information, they can hardly be generalized to deal with large amounts of information from nonspecific domains.

In the information retrieval field, fuzzy queries have never been defined as linguistic generalizations of weighted Boolean queries. To directly interpret linguistic descriptors of terms in queries, we deal with fuzziness in IR Boolean queries by evolving the retrieval model based on numeric query term weights (Bordogna et al., 1991), as briefly discussed earlier.

The Fuzzy Linguistic Model

To introduce qualitative selection requirements in weighted Boolean systems, the numeric weights can be replaced by linguistic descriptors of the desired documents. Thus, linguistic descriptors play the role of fuzzy specifications of ideal index term weights. They are used to fuzzify numeric weights; in fact, weights such as 0.85, 0.86, and 0.9 are so close to each other that they could equally indicate that the term is very important in the desired documents. Linguistic descriptors, such as very important, fairly important, or not very important can be used to fuzzify weights interpreted as ideal index term weights.

In the same way, other linguistic descriptors could be defined to fuzzify query term weight semantics, such as a threshold or relative semantics. For example, at least important or at least fairly important could be defined as linguistic specifications of fuzzy thresholds on the index term weights. In the new model, the query then becomes a fuzzy description of one or more ideal documents, no longer an exact one. Moreover, in response to a query, a classification of the selected documents in relevance classes is provided; each class is labeled by a linguistic descriptor. At this stage, other levels of fuzziness, such as the use of quantification descriptors or the fuzzification of the AND and OR operators, are not formalized.

Quantification descriptors would be useful as a means of fuzzily specifying the number of relevant documents to be retrieved (e.g., to ask for almost all relevant documents); with this interpretation, these descriptors would play the role of thresholds on the RSVs of the documents retrieved.

The need for a fuzzification of the AND, OR connectives arises from the problems caused by their strict interpretation; as has been pointed out (Salton, 1989; Sanchez, 1989), the “strict” interpretation of Boolean connectives as the min and max operators of fuzzy set theory does not always select all the documents which match user needs. To overcome this limitation, some approaches in the literature interpret the fuzzy connectives, AND and OR, in a softer way (Paice, 1984; Salton, Fox, & Wu, 1983); e.g., the averaging operators (Dubois & Prade, 1985) or the OWA operator (Yager, 1992) could be used to model different softer interpretations of linguistic Boolean connectors.
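The difference between the strict and a softer interpretation can be seen on a toy case. In the sketch below the arithmetic mean stands in for a softer AND purely as an illustrative choice of averaging operator; it is not the operator proposed by any of the cited works, and the satisfaction degrees are invented.

```python
def and_strict(degrees):
    """Strict fuzzy AND: the min operator."""
    return min(degrees)

def or_strict(degrees):
    """Strict fuzzy OR: the max operator."""
    return max(degrees)

def and_soft(degrees):
    """Illustrative softer AND: arithmetic mean (an averaging operator)."""
    return sum(degrees) / len(degrees)

# Hypothetical degrees to which two documents satisfy three ANDed terms.
doc_a = [0.9, 0.9, 0.1]  # strong on two terms, weak on one
doc_b = [0.2, 0.2, 0.2]  # uniformly weak

print(and_strict(doc_a), and_strict(doc_b))  # min ranks doc_b above doc_a
print(and_soft(doc_a), and_soft(doc_b))      # mean ranks doc_a above doc_b
```

Under min, a single poorly satisfied term determines the whole result, so doc_a falls below doc_b even though it matches two of the three terms well; an averaging operator lets the well-satisfied terms compensate.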

In the linguistic model, a fuzzy document representation is adopted and the linguistic descriptors used in queries are defined by the fuzzy notion of a linguistic variable. In this way, the matching mechanism may be consistently defined in the fuzzy context; i.e., a descriptor can be directly compared with the numeric index term weights to compute their compatibility.

Linguistic variables are well suited to formalize linguistic descriptors. A linguistic variable is a variable whose values are words or sentences in a natural or artificial language (Zadeh, 1975). From a formal point of view, a linguistic variable is defined by the quintuple {L, T(L), U, G, M}. L is the name of the linguistic variable, and T(L) is its term set, i.e., the collection of all the possible values of L. For example, a term set for the linguistic variable Importance, denoted by T(Importance), can be defined as:

T(Importance) = {important, not important, very important, not very important, fairly important}

A numerical variable, l, called the base variable, is associated with each linguistic variable L, and takes values in the universe of discourse, U. G is the syntactic rule, i.e., a context-free grammar which generates the terms in T(L); and M is a semantic rule which associates the meaning M(x), a fuzzy subset of U, to each linguistic value x. Each $x \in T(L)$, generated by G, is the label for the fuzzy restriction M(x) on the values of the base variable (Zadeh, 1975).

$$M(x) = \{(u, \mu_x(u)) \mid u \in U\}$$

where U is the base variable domain and u is a base variable value. With each value of the base variable, there is associated a number in [0, 1] that represents the degree of compatibility of the base variable value with the concept expressed by the linguistic value x.

A value of the linguistic variable Importance, such as not very important, involves the negation not, the hedge very, and the value important, called the primary term. The semantics of the primary terms is both subjective and context dependent; the semantics of the nonprimary terms is deduced by applying the semantic rule M which deals with connectives (and or or), hedges (e.g., very, fairly, or almost), and negation (not) as modifiers of the meaning of the primary terms. Only in some cases can these modifiers be defined in a context-independent way (Zadeh, 1972).

The Fuzzy Query Definition

Formally, a fuzzy query in the linguistic model is any legitimate Boolean expression whose atomic components are pairs (t, q) belonging to the set $TQ = T \times T(Importance)$; t is an element of the set T of terms, and q is a value of the linguistic variable, Importance, qualifying the importance that term t must have in the desired documents.

The set TQ* of the legitimate queries is defined by the following syntactic rules:

(1) $\forall\, (t, q) \in TQ \to (t, q) \in TQ^*$

(2) $\forall\, Q_1, Q_2 \in TQ^* \to Q_1 \wedge Q_2 \in TQ^*$

(3) $\forall\, Q_1, Q_2 \in TQ^* \to Q_1 \vee Q_2 \in TQ^*$

(4) $\forall\, Q \in TQ^* \to \neg Q \in TQ^*$

(5) Legitimate Boolean queries $Q \in TQ^*$ are those obtained by applying rules (1)-(4) only.
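The recursive definition of the set of legitimate queries maps naturally onto a small tree data type. The sketch below is one possible encoding, assuming the fourth rule is closure under negation; the class names `Pair`, `And`, `Or`, `Not` and the helper `pairs` are hypothetical and serve only to illustrate the structure.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Pair:
    """Atomic component (t, q): a term and a value of Importance."""
    term: str
    descriptor: str

@dataclass(frozen=True)
class And:
    """Conjunction of two legitimate queries."""
    left: "Query"
    right: "Query"

@dataclass(frozen=True)
class Or:
    """Disjunction of two legitimate queries."""
    left: "Query"
    right: "Query"

@dataclass(frozen=True)
class Not:
    """Negation of a legitimate query (assumed reading of the fourth rule)."""
    operand: "Query"

Query = Union[Pair, And, Or, Not]

def pairs(query):
    """Collect the atomic pairs of a legitimate query, left to right."""
    if isinstance(query, Pair):
        return [query]
    if isinstance(query, Not):
        return pairs(query.operand)
    if isinstance(query, (And, Or)):
        return pairs(query.left) + pairs(query.right)
    raise TypeError(f"not a legitimate query: {query!r}")

q = And(Pair("fuzzy", "very important"), Not(Pair("database", "important")))
print([(p.term, p.descriptor) for p in pairs(q)])
```

Because only these four constructors exist, any value built from them is legitimate by construction, which mirrors the closing rule that nothing else belongs to the set.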

When specifying the query (t, important), the user requires that the documents to be retrieved deal “strongly” with the concept expressed by t, thus declaring an interest in those documents that “highly” concern the concept expressed by term t. This interpretation implies that the retrieval process must select all the documents for which the index term weight is high. In the model based on numeric weights (Bordogna et al., 1991), an equivalent request could be expressed by the following disjunctive query:

$$(t, \text{important}) \to \mathrm{OR}_{w \in [i, j]}\,(t, w) \quad \text{in which } 0 \le i < j \le 1 \tag{4}$$

where i and j are numeric weight values delimiting the full satisfaction of the constraint important. Their definition is discussed later in this section.

In order to reflect the vagueness intrinsic in important, the retrieval process should select all documents having an F(d, t) “high” with an RSV equal to 1; this means that both the documents d1 and d2 with F(d1, t) = 0.8 and F(d2, t) = 0.9 must be judged as fully satisfying the query (t, important).

The linguistic variable, Importance, is formally defined as the quintuple:

$$\{Importance,\ T(Importance),\ U_{imp},\ G_{imp},\ M_{imp}\}$$

The definition of the term set T(Importance) is based on the selection of linguistic descriptors which characterize the importance of a term in a document, i.e., the level of concern of the term with respect to document contents. The main aim in defining the term set of the variable Importance is to supply the user with a few words by which he can naturally formulate his information needs. For this reason the primary term, important, is defined:

T(Importance) = {important, very important, not important, fairly important, ...}

The universe of discourse, U, is the set of all values which can be assumed by the base variable: U = [0, 1]. $G_{imp}$ is the context-free grammar defined by the 4-tuple {$T_{imp}$, $N_{imp}$, $P_{imp}$, $S_{imp}$}:

• $T_{imp}$ is the set of the terminal symbols, also called the alphabet, defined as follows: $T_{imp}$ = {important, very, fairly, not, or}

• $N_{imp}$ is the set of nonterminal symbols: $N_{imp}$ = {<term>, <atomic term>, <composite term>, <primary term>, <connective>, <hedge>}

• $P_{imp}$ is the set of the production rules defined in an extended Backus Naur Form in which the square brackets enclose optional elements, the symbol * indicates the possible repetition of the elements which follow, and the symbol | indicates alternative elements:
  $P_{imp}$ = {
  <term> ::= <composite term> | <atomic term>
  <atomic term> ::= [not] [<hedge>] <primary term>
  <composite term> ::= <atomic term> <connective> <atomic term> *[<connective> <atomic term>]
  <primary term> ::= important
  <hedge> ::= fairly | *very
  <connective> ::= or }

• $S_{imp}$ is the start symbol or axiom: $S_{imp}$ = <term>
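The grammar above is small enough to be recognized with a few lines of code. The following sketch checks whether a string is a legitimate value of Importance; it assumes that `*very` denotes one or more repetitions of very, and the function names are hypothetical.

```python
def parse_atomic(tokens, pos):
    """Try to read an atomic term ([not] [hedge] important) starting at
    tokens[pos]. Return the position after it, or None on failure."""
    if pos < len(tokens) and tokens[pos] == "not":
        pos += 1
    if pos < len(tokens) and tokens[pos] == "fairly":
        pos += 1
    else:
        # Assumption: "*very" allows very to be repeated (very very ...).
        while pos < len(tokens) and tokens[pos] == "very":
            pos += 1
    if pos < len(tokens) and tokens[pos] == "important":
        return pos + 1
    return None

def is_importance_value(text):
    """True if text is an atomic term or atomic terms joined by 'or'."""
    tokens = text.split()
    pos = parse_atomic(tokens, 0)
    while pos is not None and pos < len(tokens):
        if tokens[pos] != "or":
            return False
        pos = parse_atomic(tokens, pos + 1)
    return pos is not None

for candidate in ["not very important",
                  "fairly important or very important",
                  "fairly very important",
                  "important or"]:
    print(candidate, "->", is_importance_value(candidate))
```

The first two candidates are accepted; the third is rejected because a hedge is either fairly or a run of very, not both, and the fourth because a connective must be followed by another atomic term.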

Each element of T(Importance) must be given a meaning through the definition of a compatibility function. This is accomplished by first defining the compatibility function associated to the primary term important. As was pointed out earlier, the requirement expressed by specifying the query (t, important) is that all documents highly concerning term t must be selected.

From the interpretation of the query (t, important) formulated in the Boolean weighted query language given in (4) and by assuming the usual definition of the OR operator on fuzzy sets as the $\cup$ operator, the meaning of the pair (t, important) can be defined as:

$$M((t, \text{important})) = \bigcup_{w \in [i, j]} M((t, w))$$

From the definition of M((t, w)) given in (3), it follows:

$$M((t, \text{important})) = \bigcup_{w \in [i, j]} \{(d, \mu_{\text{close-to-}w}(F(d, t))) \mid d \in D\}$$

From the definition of $\mu_{\text{close-to-}w}$ in (2), and by applying the definition of union in fuzzy set theory:

$$M((t, \text{important})) = \{(d, \max_{w \in [i, j]} (e^{(F(d, t) - w)^2 \ln(k)})) \mid d \in D\}$$

From this, the compatibility function of M((t, important)), i.e., the $\mu_{\text{important}}$ function, is then the following:

$$\mu_{\text{important}}(F(d, t)) = \max_{w \in [i, j]} (e^{(F(d, t) - w)^2 \ln(k)})$$

i.e.,

$$\mu_{\text{important}}(F(d, t)) = \begin{cases} e^{(F(d, t) - i)^2 \ln(k)} & 0 \le F(d, t) < i \\ 1 & i \le F(d, t) \le j \\ e^{(F(d, t) - j)^2 \ln(k)} & j < F(d, t) \le 1 \end{cases} \tag{5}$$

The range [i, j] is the core of the compatibility function (5) and it defines the set of values of index term weights which fully satisfy the constraint expressed by important.
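The equivalence between the maximum over w in [i, j] and the piecewise form (5) can be checked numerically. In the sketch below the core bounds i = 0.8 and j = 1, the value k = 0.01, and the grid search used as a stand-in for the exact maximum are all assumptions chosen for illustration.

```python
import math

def mu_important(f, i, j=1.0, k=0.01):
    """Piecewise compatibility function (5): 1 on the core [i, j],
    Gaussian slopes centered at the nearest core bound outside it."""
    if f < i:
        return math.exp((f - i) ** 2 * math.log(k))
    if f > j:
        return math.exp((f - j) ** 2 * math.log(k))
    return 1.0

def mu_important_by_max(f, i, j=1.0, k=0.01, steps=1000):
    """Brute-force maximum of the close-to-w function over a grid of
    w values in [i, j]; it should agree with mu_important."""
    grid = (i + (j - i) * s / steps for s in range(steps + 1))
    return max(math.exp((f - w) ** 2 * math.log(k)) for w in grid)

for f in (0.5, 0.7, 0.8, 0.9):
    print(f, round(mu_important(f, 0.8), 4), round(mu_important_by_max(f, 0.8), 4))
```

With these hypothetical bounds, documents with index term weights 0.8 and 0.9 both receive degree 1, as required for full satisfaction of (t, important), while lower weights decay along the Gaussian slope centered at i.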

74 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE-March 1993

The i and j values can be set by taking into account both the semantics of the index term weights, F(d, t) (the indexing process), and the user concept of importance. For example, when F is computed as defined in (1) and a stop list is used to eliminate nonsignificant frequent terms, j is chosen equal to 1, as the index term weights increase with the importance of the index terms. The i value can be estimated on the basis of heuristic considerations: for example, a set of sample documents is submitted to a group of users familiar with the topic area of the archive. Each user is asked to select a set of terms, belonging to the documents, which he or she judges fully important. The index term weights for these terms are computed and their minimum value is selected. The value of i can be defined as the mean of the minimum values produced by each user.
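The heuristic for i can be sketched as follows. This is a minimal illustration, assuming each user's selection has already been mapped to index term weights F(d, t); the function name `estimate_core_lower_bound` and the sample weights are hypothetical and are not taken from the experiment.

```python
from statistics import mean

def estimate_core_lower_bound(weights_per_user):
    """Mean, over users, of the smallest weight among the terms
    each user judged fully important."""
    return mean(min(weights) for weights in weights_per_user)

# Hypothetical index term weights F(d, t) of the terms selected by three users.
weights_per_user = [
    [0.9, 0.75, 0.8],
    [0.7, 0.95],
    [0.65, 0.85, 0.9],
]

i = estimate_core_lower_bound(weights_per_user)
print(round(i, 2))
```

Taking the minimum per user keeps every term that user considered fully important inside the core, while averaging across users smooths individual differences in judgment.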

The k value in function (5) determines the steepness of the Gaussian slopes and, as a consequence, it affects the strength of the constraint important; the higher the k value, the weaker the constraint. As discussed earlier, the definition of the k value can be made depending on the approximation degree of the implementation. Moreover, in the $\mu_{\text{important}}$ definition, by choosing different k values, it is possible to model different subjective interpretations of the concept important. To this end one can choose the minimum F(d, t) value, let us call it $x_0$, to be considered important. In this way, given an approximation to the nth decimal digit, the k value can be obtained by solving equation (6):

$$k = e^{\ln(10^{-n}) / (x_0 - i)^2} \qquad (6)$$

In the implementation, the values $\mu_{\text{important}}(F(d,t))$ are approximated to 0 when $F(d,t) < x_0$.
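The choice of k can be illustrated numerically. The sketch below assumes that equation (6) amounts to requiring that the left slope of function (5) drop to $10^{-n}$ at $x_0$, i.e. $e^{(x_0 - i)^2 \ln k} = 10^{-n}$; the function name `steepness_k` is hypothetical, and n = 2 is our assumption, chosen because it reproduces the value k = 0.000083 quoted later in the text for i = 0.7 and $x_0$ = 0.

```python
import math

def steepness_k(i, x0, n):
    """Solve exp((x0 - i)**2 * ln k) = 10**(-n) for k (requires x0 < i)."""
    return math.exp(math.log(10.0 ** -n) / (x0 - i) ** 2)

k = steepness_k(i=0.7, x0=0.0, n=2)
print(f"{k:.6f}")  # prints 0.000083
```

A larger n or an $x_0$ closer to i gives a smaller k and therefore steeper slopes, which corresponds to a stricter reading of important.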

In order to define the hedges, the context of the terms, as defined in Yager (1982) can be useful; following this definition, the context of a linguistic value is the window or range in which significant changes in membership degree occur, or, more formally, the range with the largest absolute value of the membership function derivative.

In this context, the hedge very strengthens the concept to which it applies; very important is a stricter requirement on documents than important. It should then satisfy:

$$\mu_{\text{very important}}(F(d,t)) \le \mu_{\text{important}}(F(d,t)) \quad \forall F(d,t) \in [0,1].$$

For this reason, we have interpreted very as a concentration operator. With $\mu_i$ denoting the compatibility function of the linguistic term i:

$$\text{very}(\mu_i(F(d,t))) = [\mu_i(F(d,t))]^2$$

On the contrary, the hedge fairly weakens the concept to which it applies; it is then formalized as a dilation operator:

$$\text{fairly}(\mu_i(F(d,t))) = [\mu_i(F(d,t))]^{0.5}$$

The operator not and the connective or are defined as the complement and the maximum, respectively.
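The definitions of this section can be put together in a few lines. The following is a sketch rather than the DOMINO implementation: the helper names (`make_important`, `very`, `fairly`, `not_`, `or_`) are hypothetical, and the parameter values i = 0.7, j = 1, k = 0.000083 are the ones used later in the text for the worked example.

```python
import math

def make_important(i, j, k):
    """Compatibility function (5): 1 on the core [i, j], Gaussian slopes outside."""
    ln_k = math.log(k)

    def mu(f):
        if f < i:
            return math.exp((f - i) ** 2 * ln_k)
        if f > j:
            return math.exp((f - j) ** 2 * ln_k)
        return 1.0

    return mu

def very(mu):
    """Concentration: square the membership degree."""
    return lambda f: mu(f) ** 2

def fairly(mu):
    """Dilation: take the square root of the membership degree."""
    return lambda f: mu(f) ** 0.5

def not_(mu):
    """Complement of the membership degree."""
    return lambda f: 1.0 - mu(f)

def or_(mu_a, mu_b):
    """Maximum of two membership degrees."""
    return lambda f: max(mu_a(f), mu_b(f))

important = make_important(i=0.7, j=1.0, k=0.000083)
for f in (0.2, 0.47, 0.8):
    print(f, round(important(f), 3),
          round(very(important)(f), 3),
          round(fairly(important)(f), 3))
```

Because every degree lies in [0, 1], squaring can only lower it and the square root can only raise it, so very important ≤ important ≤ fairly important holds at every weight, as required above.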

The Fuzzy Query Evaluation Mechanism

According to the semantics we have given to linguistic descriptors, the association of a value q with a term t is an absolute requirement on the values of index term weights. A fuzzy query Q is evaluated by a function E*, which produces the RSV for each document. As linguistic descriptors can be associated only with query terms, the evaluation of a pair (t, q) is completely independent of other pair evaluations. Thus, function E* is based solely on the evaluations of the atomic components and on their logical Boolean connectors.

Function E* may then be defined on the basis of function E, which evaluates a single pair (t, q) by computing the degree of compatibility of the index term weight F(d, t) with the linguistic value q for each stored document $d \in D$. First, the query evaluation process evaluates the atomic components (t, q), then the Boolean combinations of atomic components, and so on, working in a bottom-up fashion until the whole query is evaluated; this process can be carried out with the separability property holding true (Cater & Kraft, 1989; Waller & Kraft, 1979).

We start by defining the matching function E, which describes the evaluation of an atomic component (t, q):

$$E: D \times TQ \to [0, 1]$$

For a given pair $(t, q) \in TQ$ and $d \in D$, E establishes how well document d satisfies the request expressed by (t, q). Accordingly, for a specified pair (t, q), E can be seen as the membership function of the fuzzy subset M((t, q)):

$$M((t, q)) = \{(d, E(d,(t,q))) \mid d \in D\}$$

In our model, given a pair (t, q), E must evaluate the degree of compatibility between F(d, t), for d varying in D, and the constraint expressed by the term q, on the basis of the definitions given in the previous section.

In order to define E on the closed interval [0, 1], some considerations on the semantics of queries (t, q) are needed. For any q belonging to the term set T(Importance), a query (t, q) asks for documents somehow dealing with term t. Thus, a document with F(d, t) = 0 does not deal with term t, and it has null relevance to the query. This leads us to define function E as follows:

$$E(d,(t,q)) = \begin{cases} \mu_q(F(d,t)) & F(d,t) \ne 0 \\ 0 & F(d,t) = 0 \end{cases}$$

$\mu_q$ being the compatibility function of the linguistic value q.

Function $E^*: D \times TQ^* \to [0, 1]$, which evaluates queries $Q \in TQ^*$, is formalized as the extension of E obtained by recursively applying the following rules:

$$E^*(d, Q) = E(d, (t, q)) \quad \text{for } Q = (t, q)$$

$$E^*(d, Q_1 \wedge Q_2) = \min(E^*(d, Q_1), E^*(d, Q_2))$$

$$E^*(d, Q_1 \vee Q_2) = \max(E^*(d, Q_1), E^*(d, Q_2))$$

$$E^*(d, \neg Q) = 1 - E^*(d, Q)$$

in which $Q, Q_1, Q_2 \in TQ^*$. Thus, for a generic $Q \in TQ^*$, $E^*$ becomes the membership function characterizing M(Q).

Considering the above extension rules, and assuming the usual definitions of ∪, ∩, and ¬ for fuzzy sets as max, min, and complement, respectively, we have:

$$M(Q_1 \wedge Q_2) = M(Q_1) \cap M(Q_2)$$

$$M(Q_1 \vee Q_2) = M(Q_1) \cup M(Q_2)$$

$$M(\neg Q) = 1 - M(Q)$$


in which $Q, Q_1, Q_2 \in TQ^*$. The numeric RSVs greater than 0, which are obtained as a result of a query evaluation, are used for ranking documents and constitute the basis for the subsequent relevance class labeling process.
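The recursive rules for E* translate directly into a bottom-up evaluator. The sketch below is illustrative only: the nested-tuple query representation, the function names, the terms, and the sample weights are all hypothetical, while the core [0.7, 1] and k = 0.000083 are the values the text uses for important.

```python
import math

K = 0.000083       # steepness parameter k of "important" quoted in the text
CORE = (0.7, 1.0)  # core [i, j] of "important" quoted in the text

def mu_important(f):
    """Compatibility function (5) of the primary term important."""
    i, j = CORE
    if f < i:
        return math.exp((f - i) ** 2 * math.log(K))
    if f > j:
        return math.exp((f - j) ** 2 * math.log(K))
    return 1.0

def mu_not_very_important(f):
    """not(very(important)): complement of the squared degree."""
    return 1.0 - mu_important(f) ** 2

def eval_atom(weights, doc, term, mu):
    """E(d, (t, q)): mu_q(F(d, t)), forced to 0 when F(d, t) = 0."""
    f = weights.get((doc, term), 0.0)
    return mu(f) if f != 0 else 0.0

def eval_query(weights, doc, query):
    """E*(d, Q), computed bottom-up over a nested-tuple query (assumed format)."""
    op = query[0]
    if op == "atom":
        _, term, mu = query
        return eval_atom(weights, doc, term, mu)
    if op == "not":
        return 1.0 - eval_query(weights, doc, query[1])
    if op in ("and", "or"):
        left = eval_query(weights, doc, query[1])
        right = eval_query(weights, doc, query[2])
        return min(left, right) if op == "and" else max(left, right)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical index term weights F(d, t) and query.
weights = {("d1", "fuzzy"): 0.9, ("d1", "logic"): 0.3}
query = ("and",
         ("atom", "fuzzy", mu_important),
         ("atom", "logic", mu_not_very_important))
print(round(eval_query(weights, "d1", query), 2))
```

Each atom is evaluated independently of the others, which is what allows the recursion to combine sub-results with min, max, and complement alone.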

Let us illustrate by an example how the query evaluation mechanism works. Given the query:

Q = (satellite; important) and (X-ray; not very important)

the user asks to retrieve all documents highly concerning satellite and weakly concerning X-ray. Let us assume the following values of F(d, t) for the documents d1, d2, d3, and d4 with respect to the terms satellite and X-ray:

F(d1, satellite) = 0.8, F(d1, X-ray) = 0.54;

F(d2, satellite) = 0.47, F(d2, X-ray) = 0.2;

F(d3, satellite) = 0.8, F(d3, X-ray) = 0.2;

F(d4, satellite) = 0.8, F(d4, X-ray) = 0.

Given the core of the $\mu_{\text{important}}$ function [i, j] = [0.7, 1] and k = 0.000083 (estimated by applying definition (6) when $x_0$ = 0), the evaluation mechanism yields the following results:

$$\text{RSV}_1 = E^*(d_1, Q) = \min(\mu_{\text{important}}(0.8), \mu_{\text{not very important}}(0.54)) = \min(1, 0.38) = 0.38$$

$$\text{RSV}_2 = E^*(d_2, Q) = \min(\mu_{\text{important}}(0.47), \mu_{\text{not very important}}(0.2)) = \min(0.6, 0.99) = 0.6$$

$$\text{RSV}_3 = E^*(d_3, Q) = \min(\mu_{\text{important}}(0.8), \mu_{\text{not very important}}(0.2)) = \min(1, 0.99) = 0.99$$

$$\text{RSV}_4 = E^*(d_4, Q) = \min(\mu_{\text{important}}(0.8), \mu_{\text{not very important}}(0)) = \min(1, 0) = 0$$

The documents d3, d2, and d1 are retrieved in response to query Q, while document d4 is rejected.
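As a check on the figures above, the two degrees entering the RSV of d2 can be recomputed from function (5) and from the definitions of very and not, using $\ln 0.000083 \approx -9.397$. The arithmetic below is ours, rounded at each step:

$$\mu_{\text{important}}(0.47) = e^{(0.47 - 0.7)^2 \ln 0.000083} \approx e^{0.0529 \times (-9.397)} \approx e^{-0.497} \approx 0.61$$

$$\mu_{\text{not very important}}(0.2) = 1 - \left[e^{(0.2 - 0.7)^2 \ln 0.000083}\right]^2 \approx 1 - \left[e^{-2.349}\right]^2 \approx 1 - 0.009 \approx 0.99$$

The minimum of the two is about 0.61, in agreement with the value 0.6 reported for d2. For d4 the second component is 0 because F(d4, X-ray) = 0, and function E assigns null relevance in that case regardless of the linguistic value.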

Relevance Classification Process

To give the user a more powerful relevance indication than numeric RSVs for selecting the retrieved documents, a relevance classification mechanism has been defined. This mechanism provides the user with a twofold selection criterion:

(1) to ask the system to rank the retrieved documents according to their compatibility with a predefined linguistic relevance concept;

(2) to ask for a classification of retrieved documents in relevance classes identified by predefined linguistic labels.

The common basis of these approaches is the definition of the linguistic labels of relevance, which are formally interpreted as the values of the linguistic variable, Relevance.

This variable is defined as follows:

{Relevance, T(Relevance), $U_{rel}$, $G_{rel}$, $M_{rel}$}

its term set consists of compound linguistic values of the primary term relevant:

T(Relevance) = {relevant, very relevant, not very relevant, fairly relevant, ...}

The domain, $U_{rel}$, of the base variable is the range of the RSVs, i.e., the set [0, 1] (Zadeh, 1972; Yager, 1982). $G_{rel}$ is the context-free grammar defined by the 4-tuple {$T_{rel}$, $N_{rel}$, $P_{rel}$, $S_{rel}$}:

the set of terminal symbols is: $T_{rel}$ = {relevant, not, very, fairly}
the set of nonterminal symbols is: $N_{rel}$ = {<term>, <primary term>, <hedge>}
the set of the production rules defined with the previous extended BNF is:
$P_{rel}$ = {<term> ::= [not] [<hedge>] <primary term>
<primary term> ::= relevant
<hedge> ::= fairly | very}
the axiom is: $S_{rel}$ = <term>

In the definition of the semantics of the primary term relevant, the first consideration to be taken into account is that the higher the RSV, the more relevant is the corresponding document. To reflect this concept, the compatibility function of relevant should then be a continuous nondecreasing function with continuous derivative in the range [0, 1].

The compatibility function of the fuzzy restriction labeled by the primary term relevant is then defined as follows:

$$\mu_{\text{relevant}}(x) = \begin{cases} e^{(x - 0.7)^2 \ln 0.01} & x \le 0.7 \\ 1 & 0.7 \le x \le 1 \\ e^{(x - 1)^2 \ln 0.01} & 1 \le x \end{cases}$$

in which x is a value of the base variable. The value 0.7 has been set on the basis of empirical considerations so that the range [0.7, 1] of full satisfaction of relevant covers 30% of the set [0, 1].

Assuming this and considering that the concept expressed by relevant is more general than the concept expressed by very relevant, as the base variable value approaches 1, the compatibility function of relevant would increase toward 1 and reach the maximum of 1 sooner than the compatibility function of very relevant; this leads to the interpretation of the compatibility function of very relevant as the compatibility function of relevant shifted to the right in the domain [0, 1]. We have defined the hedge very as a right shift operator in [0, 1]. By indicating with $\mu_i$ the compatibility function of the linguistic term i, then:

$$\text{very}(\mu_i(x)) = \mu_i(x - 0.2)$$

The shift value has been set to 0.2 so that the range of full satisfaction of $\mu_{\text{very relevant}}$ covers 10% of the range of the RSVs. Finally, the value fairly relevant has semantics which could also be expressed by more or less relevant or averagely relevant. Its compatibility function should have a maximum equal to 1 for domain values close to 0.5 and decrease rapidly to 0 by varying the domain values both toward 0 and 1. The hedge fairly is interpreted as an operator which applies a left shift:

$$\text{fairly}(\mu_i(x)) = \mu_i(x + 0.35)$$

The left-shift value of 0.35 allows the core of $\mu_{\text{fairly relevant}}$ to be symmetric with respect to the point 0.5 and to cover 30% of the range [0, 1].
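The coverage figures quoted for the two hedges follow directly from applying the shifts to the core [0.7, 1] of relevant. The short check below is our own restatement of that step, not an additional result:

$$\mu_{\text{very relevant}}(x) = \mu_{\text{relevant}}(x - 0.2) = 1 \iff 0.7 \le x - 0.2 \le 1 \iff 0.9 \le x \le 1.2$$

so that, restricted to the RSV range [0, 1], the core of very relevant is [0.9, 1], i.e., 10% of the range. Similarly,

$$\mu_{\text{fairly relevant}}(x) = \mu_{\text{relevant}}(x + 0.35) = 1 \iff 0.7 \le x + 0.35 \le 1 \iff 0.35 \le x \le 0.65$$

which is an interval of width 0.3 centered at 0.5. The third branch of $\mu_{\text{relevant}}$, defined for arguments of 1 or more, is what makes fairly relevant decrease again for RSVs above 0.65.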

If the user wants to know how much each document satisfies the relevance concept expressed by the linguistic label i E T(Relevance), the system evaluates the degree of compatibility between the document RSVs and the concept i. To this end, the fuzzy relation Ri is evaluated:

$$R_i: Q \times D \to [0, 1]$$

$$R_i = \{((q, d), \mu_{R_i}(q, d)) \mid d \in D, q \in Q\}$$

in which $\mu_{R_i}(q, d) = \mu_i(E^*(d, q))$ and $E^*(d, q)$ is the RSV of document d with respect to the query q.

If the user wants the retrieved documents to be classified into relevance classes, the system activates the classification mechanism. The compatibility function $\mu_i$ is evaluated on $E^*(d, q)$ for each $i \in T(\text{Relevance})$ and then the term with the compatibility function maximized by $E^*(d, q)$ is selected. Formally, this process is accomplished by evaluating the function:

$$\text{LABEL}: [0, 1] \to T(\text{Relevance})$$

defined as:

$$\text{LABEL}(E^*(d, q)) = i \mid \mu_i(E^*(d, q)) = \max\{\mu_k(E^*(d, q)),\ k \in T(\text{Relevance})\}$$

in which $\mu_i$ is the compatibility function of term i. In the example made at the end of the previous section, both documents d1 and d2 are classified as fairly relevant, while d3 is classified as very relevant.

As more than one compatibility function can be maximized by a given RSV, the label corresponding to the more restrictive concept is associated with the document considered. For example, the label very relevant is associated with RSV = 0.9, as the concept very relevant is more restrictive than relevant.

This classification mechanism provides for a multilevel classification of each retrieved item. This is useful in the case in which a given relevance class contains many docu- ments and the user wants to analyze a further classification of the documents in this class. For example, a second level classification of documents classified at the first level as i, is defined as:

$$\text{LABEL}_i(E^*(d, q)) = j \mid \mu_j(E^*(d, q)) = \max\{\mu_k(E^*(d, q)),\ k \in TR(\text{Relevance})\}$$

where TR(Relevance) = T(Relevance) − {i}. In the previous example, at the second level, d1 is classified as not very relevant, while d2 is classified as relevant. By successively reducing the number of elements of TR(Relevance), excluding the linguistic terms obtained as labels of the higher classification levels, a multilevel classification can be obtained. This can be helpful in an IR environment as it supplies several criteria to perform further selections on the retrieved documents belonging to a certain relevance class.
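The labeling and its multilevel refinement can be sketched as follows. Two points are assumptions of ours rather than statements of the text: the term set is restricted to the four labels listed explicitly for T(Relevance), and ties are broken by the order in which the labels are listed in `TERMS`, taken here to run from most to least restrictive. The function names are hypothetical.

```python
import math

LN_001 = math.log(0.01)

def mu_relevant(x):
    """Compatibility function of the primary term relevant (core [0.7, 1])."""
    if x < 0.7:
        return math.exp((x - 0.7) ** 2 * LN_001)
    if x > 1.0:
        return math.exp((x - 1.0) ** 2 * LN_001)
    return 1.0

# Assumed ordering, most restrictive first; it is used only to break ties.
TERMS = {
    "very relevant": lambda x: mu_relevant(x - 0.2),
    "relevant": mu_relevant,
    "fairly relevant": lambda x: mu_relevant(x + 0.35),
    "not very relevant": lambda x: 1.0 - mu_relevant(x - 0.2),
}

def label(rsv, excluded=()):
    """LABEL: the term whose compatibility function is maximized by the RSV."""
    candidates = [term for term in TERMS if term not in excluded]
    # max() keeps the first maximal item, so earlier terms win ties.
    return max(candidates, key=lambda term: TERMS[term](rsv))

def multilevel_labels(rsv, levels=2):
    """Successive labels, each level excluding the labels already assigned."""
    labels = []
    for _ in range(levels):
        labels.append(label(rsv, excluded=labels))
    return labels

for name, rsv in [("d1", 0.38), ("d2", 0.6), ("d3", 0.99)]:
    print(name, multilevel_labels(rsv))
```

With the RSVs of the worked example, this sketch assigns d1 and d2 to fairly relevant and d3 to very relevant at the first level, and d1 to not very relevant and d2 to relevant at the second level, matching the classifications reported in the text.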

Implementation and Evaluation of the Fuzzy Retrieval Model

The model described in this article was implemented as an extension of the information retrieval system DOMINO, and has been evaluated on an archive containing 2500 textual documents describing CNR research projects. DOMINO is an IRS developed in the C programming language at IFCTR-CNR and running under the operating system VM-XA/CMS; at present, it is being installed on a PC-486/386. DOMINO was originally conceived and implemented as a pure Boolean system based on full-text indexing and a traditional inverted file structure (Bordogna et al., 1990, 1991; Salton, 1984). It was designed as a multimedia IRS to manage documents composed of textual and pictorial information. The textual and pictorial parts of documents are managed by two distinct data structures developed independently and presently under integration. The documents managed by DOMINO can be structured into classes which partition a document into logical subparts containing information with well-defined semantics. A first evolution of the retrieval model was provided by the implementation of the weighted Boolean model defined in Bordogna et al. (1991) and briefly described previously.

With the second evolution, the linguistic model described in this study has been implemented and evaluated. An outline of this last extension of the DOMINO system is shown in Figure 1.

The input of the system is a fuzzy query defined as a Boolean expression on pairs (t, q), where t is a term and q, in the case of a textual archive, is a value of the linguistic variable Importance. The query is interpreted by the fuzzy query interpreter (FQI), which builds a binary evaluation tree whose leaves contain the terms with the compatibility functions of their linguistic descriptors q. The intermediate nodes contain the Boolean operators. This tree is supplied to the matching mechanism, which evaluates it in postorder against fuzzy document representations. This procedure is applicable as the linguistic model satisfies the separability property of the wish-list (Cater & Kraft, 1989; Waller & Kraft, 1979). The RSVs produced as a result of this evaluation can be either classified into relevance classes or analyzed to state their compatibility with respect to a specified relevance class by the classification mechanism.

FIG. 1. Outline of DOMINO retrieval activity.

This retrieval model has been tested on an archive of 2500 documents, each describing a CNR research project. In Figure 2 a sample document is shown; the full content of the document is entirely filled in by the author, in this case the scientific leader of the project; each class structuring the research project content is identified by a label, called the class identifier (e.g., in the first column of Fig. 2, Cod-Ricerca is the identifier of the research subject area, Responsabile is the name of the scientific leader of the project, Titolo and Title are the Italian and English titles of the research project, Parole-Chiave and Keywords are the Italian and English keywords). The type and the number of the classes in a document depend on the semantics of the archive, and their definition is made by an expert before starting the archive generation phase. During this phase it is possible to define which classes of the documents are to be indexed. In this experiment a full-text indexing procedure on all the classes, with a stop list used as the negative dictionary, has been adopted to select the index terms, and their index term weights F(d, t) have been computed by applying definition (1). Once the archive was generated, the compatibility function of the linguistic term important was defined by setting the core [i, j] equal to [0.7, 1], and k = 0.000083.

[Figure 2: sample document listing the class identifiers (Cod-Ricerca, Responsabile, Collaboratori, Titolo, Title, Parole-Chiave, and others) in the first column and their content in the second; the sample describes a project titled "X-ray mirrors for the concentrators of the SAX satellite".]

FIG. 2. Structure of a sample document describing a CNR research project.

An evaluation of the effectiveness of the CNR research project archive has been carried out by computing the parameters of recall and precision on a set of sample test queries. Recall, R(q), is the ratio between the number of relevant documents retrieved in response to a query q and the number of relevant documents contained in the whole collection (|K(q)|):

$$R(q) = \frac{|K(q) \cap L(q)|}{|K(q)|}$$

Precision, P(q), is the ratio between the number of relevant documents retrieved in response to a query q and the total number of retrieved documents (|L(q)|):

$$P(q) = \frac{|K(q) \cap L(q)|}{|L(q)|}$$

While recall is a measure of the ability of the system to retrieve useful documents, precision measures an ability to reject useless information.

A crucial aspect in the evaluation of the recall parameter is that a methodology is required to estimate the set K(q) of relevant documents when the considered collection contains thousands of documents. As the browsing of a whole archive is very time consuming, generally a sample subset of the archive is selected on which to perform the R and P measures; due to the reduced cardinality of the sample set, the feasibility and reliability of users' estimation of relevant documents increases. In this experiment, instead of using a random subset of the archive, we have selected a meaningful subset consisting of all documents pertaining to the subject area of the test query. This has been achieved based on the following considerations: in the archive about CNR research projects there are 11 subject areas, and for each document there exists a unique index term t, belonging to the class Cod-Ricerca, univocally associated with a subject area $s \in S$ (see first class in Fig. 2; this document belongs to the subject area physics). A sample collection for each of the 20 test queries was easily obtained by selecting documents in which the content of the class Cod-Ricerca is associated with the subject area of the corresponding test query. The sample collections have been browsed to establish the user's relevance judgments for the test queries. The judgment of the user was aimed at determining all documents worth retrieving, i.e., the set K(q). In fact, the aim of this evaluation was to show the distribution of the documents judged as worth retrieving by the user in the various relevance classes. The results shown in Table 1 have been computed by applying the Salton (1984) increasing output methodology.

Due both to the concise descriptions of the CNR projects (documents with an average length of 40 lines) and to the full-text indexing, which is characterized by high exhaustivity and comparatively low specificity and therefore produces high R values and low P values (Salton, 1984), we obtained a 100% recall value for each of the 20 test queries. The precision values for each query are listed in the first column of Table 1. Let us comment on the first line of Table 1: in response to the query (satellit*; important), DOMINO retrieves 37 documents, 16 of which have been judged worth retrieving by the user (P = 43%). Moreover, 12 documents judged worth retrieving by the user are also classified by DOMINO in the set R1 = very relevant or relevant or fairly relevant, with R = 75% and P = 66%; it follows that DOMINO classifies 18 documents in R1. Thus the set R1, which groups all the relevance classes except the class not very relevant, collects 75% of the documents judged worth retrieving by the user. Finally, six documents judged worth retrieving by the user are also classified by DOMINO as belonging to the class R3 = very relevant, with R = 37% and P = 75%; this means that DOMINO classifies eight documents as very relevant, and this class collects 37% of the documents judged worth retrieving by the user. It can be noticed that, for most of the test queries, the precision increases as more and more restrictive relevance classes are considered: starting from all the documents retrieved (first column), then proceeding with documents classified as R1 (second column), and finally with class R3 (third column).
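The recall and precision figures discussed for the first line of Table 1 can be reproduced with plain set operations. In the sketch below the document identifiers are hypothetical; only the set sizes (37 retrieved, 16 worth retrieving, 18 in R1 with 12 worth retrieving, 8 in R3 with 6 worth retrieving) come from the text.

```python
def precision(relevant, retrieved):
    """|K ∩ L| / |L|: share of the retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """|K ∩ L| / |K|: share of the relevant documents that are retrieved."""
    return len(relevant & retrieved) / len(relevant)

# Hypothetical document identifiers, sized like the first row of Table 1.
W = set(range(16))                         # 16 documents judged worth retrieving
retrieved = set(range(37))                 # 37 retrieved documents, containing all of W
R1 = set(range(12)) | set(range(16, 22))   # 18 documents, 12 of them in W
R3 = set(range(6)) | {16, 17}              # 8 documents, 6 of them in W

for name, docs in [("all retrieved", retrieved), ("R1", R1), ("R3", R3)]:
    print(name, round(precision(W, docs), 3), round(recall(W, docs), 3))
```

The ratios 12/18 and 6/16 are 0.667 and 0.375; the table reports them as 66% and 37%, which suggests the published percentages were truncated rather than rounded.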

The advantage of this classification mechanism is that as it provides for a distribution of the retrieved documents in relevance classes, it offers the user a means to select one or more relevance classes characterized by distinct values for P and R.

It is also worthwhile pointing out that the efficiency, in terms of user effort, response time, and cost of the retrieval activity, is an important component of the evaluation of an IRS. As far as the user effort is concerned, most of DOMINO’s users have considered the linguistic model as more appealing and friendly than the weighted model. With regard to the response time in Table 2, some measurements are listed for each of the 20 queries submitted to the system in proper form.


TABLE 1. Recall and precision measurements on DOMINO IRS.

| Test query | P | No. of retrieved documents | P (R1) | R (R1) | No. of docs. in R1 ∩ W | P (R3) | R (R3) | No. of docs. in R3 ∩ W |
|---|---|---|---|---|---|---|---|---|
| (satellit*; important) | 43% | 37 | 66% | 75% | 12 | 75% | 37% | 6 |
| (enzimi; fairly important) | 71% | 42 | 88% | 53% | 16 | 100% | 16% | 5 |
| (fluid*; very important) | 68% | 22 | 100% | 47% | 7 | 100% | 27% | 4 |
| (semiconduttori; important) | 76% | 38 | 100% | 48% | 14 | 100% | 31% | 9 |
| (algoritmi; fairly important) | 29% | 24 | 75% | 86% | 6 | 100% | 42% | 3 |
| (calcio; important) | 84% | 13 | 100% | 36% | 4 | 100% | 18% | 2 |
| (carbonio; important) | 40% | 24 | 60% | 30% | 3 | 100% | 30% | 3 |
| (catalizzat*; important) | 62% | 39 | 81% | 37% | 9 | 100% | 24% | 6 |
| (digital*; important or fairly important) | 63% | 22 | 75% | 64% | 9 | 100% | 43% | 6 |
| (elettric*; very important or important or fairly important) | 37% | 35 | 58% | 54% | 7 | 50% | 7% | 1 |
| (elettrochimica; important) | 56% | 38 | 100% | 29% | 6 | 100% | 19% | 4 |
| (laser; important) | 76% | 37 | 90% | [illegible] | 26 | 92% | 79% | 22 |
| (fertilizz*; important) | 75% | 3 | 100% | 50% | 1 | 100% | 50% | 1 |
| (fluorescenza; important) | 75% | 14 | 100% | 50% | 5 | 100% | 10% | 1 |
| (fotochimica; important) OR (foton*; important) | 48% | 34 | 80% | 25% | 4 | 100% | 19% | 3 |
| (climatologia; important) OR (precipitazion*; important or fairly important) OR (meteorol*; very imp. or important) | 35% | 34 | 100% | 42% | 5 | 100% | 17% | 2 |
| (inquinamento; important or fairly important) OR (nebbia; very very important) | 60% | 44 | 100% | 54% | 14 | 100% | 15% | 4 |
| (linfociti; very very important) OR (anticorpi; important or fairly important) | 58% | 41 | 77% | 58% | 14 | 88% | 33% | 8 |
| (immagini; important or fairly important or not very important) AND (digital*; important) | 71% | 21 | 100% | 33% | 5 | 100% | 20% | 3 |
| (ischemia; very important or important or fairly important) AND (miocard*; important or fairly important) | 50% | 44 | 70% | 54% | 12 | 66% | [illegible] | 8 |
| Average values | 59% | 30 | 86% | 51% | 9 | 94% | 29% | 5 |

W = set of documents judged worth retrieving by the user; R1 = set of documents classified as very relevant or relevant or fairly relevant by DOMINO; R3 = set of documents classified as very relevant by DOMINO.


TABLE 2. Response and CPU time measurements of query evaluation in DOMINO (release 3.0, on an IBM 3090/VM-XA-CMS).

| Test query | No. of retrieved documents | Response time (sec) | CPU time (sec) |
|---|---|---|---|
| (satellit*; important) | 37 | 2.4 | 0.15 |
| (enzimi; fairly important) | 42 | 1.4 | 0.07 |
| (fluid*; very important) | 22 | 7.0 | 0.37 |
| (semiconduttori; important) | 38 | 1.4 | 0.07 |
| (algoritmi; fairly important) | 24 | 1.3 | 0.06 |
| (calcio; important) | 13 | 2.0 | 0.08 |
| (carbonio; important) | 24 | 1.6 | 0.07 |
| (catalizzat*; important) | 39 | 1.5 | 0.06 |
| (digital*; important or fairly important) | 22 | 4.2 | 0.22 |
| (elettric*; very important or important or fairly important) | 35 | 4.0 | 0.21 |
| (elettrochimica; important) | 38 | 1.4 | 0.05 |
| (laser; important) | 37 | 2.0 | 0.09 |
| (fertilizz*; important) | 3 | 1.5 | 0.06 |
| (fluorescenza; important) | 14 | 1.3 | 0.06 |
| (fotochimica; important) OR (foton*; important) | 34 | 2.8 | 0.12 |
| (climatologia; important) OR (precipitazion*; important or fairly important) OR (meteorol*; very imp. or important) | 34 | 6.0 | 0.30 |
| (inquinamento; important or fairly important) OR (nebbia; very very important) | 44 | 3.2 | 0.12 |
| (linfociti; very very important) OR (anticorpi; important or fairly important) | 41 | 2.7 | 0.09 |
| (immagini; important or fairly important or not very important) AND (digital*; important) | 21 | 5.0 | 2.40 |
| (ischemia; very important or important or fairly important) AND (miocard*; important or fairly important) | 44 | 4.8 | 2.38 |
| Average values | 30 | 2.8 | 0.35 |

Conclusions

The introduction of linguistic weights in queries has improved the expressive power of the weighted Boolean query language. By formulating fuzzy queries, the user is no longer obliged to translate into a precise number the concept of the importance of a term in a desired document. Moreover, the relevance classification partitions the retrieved documents into sets characterized by specific ranges for the recall and precision values, the classes containing the most relevant documents being characterized by a high precision and a low recall. This is a helpful means for directing the user in further selections of the retrieved documents.

Another potential of the approach presented in this article is its straightforward applicability to linguistically evolve other weighted Boolean models based on a threshold or a relative semantics for query weights. Starting from these refinements, the definition of a unifying linguistic model interpreting linguistic query weights with different semantics could be the basis for a single, more powerful and complete, extended Boolean IRS.

Acknowledgments

The authors are grateful to Dr. Paola Carrara for her helpful discussions and contribution to this work and also thank Prof. Giuliano Boella for his support.

References

Biswas, G., Bezdek, J.C., Subramanian, V., & Marques, M. (1987). Knowledge-assisted document retrieval: I. The natural language interface. Journal of the American Society for Information Science, 38, 50-96.

Biswas, G., Bezdek, J.C., Subramanian, V., & Marques, M. (1987). Knowledge-assisted document retrieval: II. The retrieval process. Journal of the American Society for Information Science, 38, 97-110.

Bolc, L., Kowalski, A., & Kozlowska, M. (1985). A natural language information retrieval system with extensions towards fuzzy reasoning. International Journal of Man-Machine Studies, 23, 335-367.

Bookstein, A. (1980). Fuzzy requests: an approach to weighted Boolean searches. Journal of the American Society for Information Science, 31, 240-247.

Bordogna, G., Carrara, P., Gagliardi, I., Merelli, D., Naldi, F., & Padula, M. (1990). A system architecture for Multimedia Information Retrieval. Journal of Information Science, 16, 229-238.


Bordogna, G., Carrara, P., Gagliardi, I., Merelli, D., Naldi, F., Padula, M., & Pasi, G. (1991). Domino: un sistema di information retrieval multimediale. Proceedings of the AICA Annual National Conference. Siena, 25-27 September.

Bordogna, G., Carrara, P., & Pasi, G. (1991). Query term weights as constraints in fuzzy information retrieval. Information Processing & Management, 27, 15-26.

Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. Journal of Intelligent Information Systems (to appear).

Buckles, B. P., Petry, F. E., & Cheung, Y. Y. (1989). Attribute grammars for the heuristic translation of query languages. Information Systems, 14, 507-514.

Buell, D.A. & Kraft, D.H. (1981). Threshold values and Boolean retrieval systems. Information Processing & Management, 17, 127-136.

Buell, D.A. & Kraft, D. H. (1981). A model for a weighted retrieval system. Journal of the American Society for Information Science, 32, 211-216.

Cater, S. C. & Kraft, D. H. (1987, June). TZRS: a topological infor- mation system satisfying the requirements of the Wailer-Krafi Wish List. Paper presented at the 10th Annual ACM-SIGIR Conference on research and development in IR, New Orleans, LA.

Cater, S. C. & Kraft, D. H. (1989). A generalization and clarification of the Waller-Kraft Wish-list. Information Processing & Management, 25, 15-25.

Cooper, W. (1988). Getting beyond Boole. Information Processing & Management, 24, 243-248.

Doszkocs, T. (1986). Natural language processing in information retrieval. Journal of the American Society for Information Science, 37, 191-196.

Dubois, D. & Prade, H. (1985). A review of fuzzy set aggregation connectives. Information Science, 36, 85-121.

Kacprzyk, J. & Ziolkowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man, and Cybernetics, 16, 474-479.

Kacprzyk, J., Zadrozny, S., & Ziolkowski, A. (1989). FQUERY III+: A “human-consistent” database querying system based on fuzzy logic with linguistic quantifiers. Information Systems, 14, 443-453.

Kamel, M., Hadfield, B., & Ismail, M. (1990). Fuzzy query processing using clustering techniques. Information Processing & Management, 26, 279-293.

Kerre, E.E., Zenner, R.B., & De Caluwe, R.M. (1986). The use of fuzzy set theory in information retrieval and databases: A survey. Journal of the American Society for Information Science, 37, 341-345.

Koll, M. & Srinivasan, P. (1990). Fuzzy versus probabilistic models for user judgements. Journal of the American Society for Information Science, 41, 264-271.

Kraft, D.H. & Buell, D.A. (1983). Fuzzy sets and generalized Boolean retrieval systems. International Journal of Man-Machine Studies, 19, 45-56.

Lucarella, D. (1990). Uncertainty in information retrieval: An approach based on fuzzy sets. Proceedings of the International Conference on Computers and Communications, 809-814.

Miyamoto, S. (1990). Fuzzy sets in information retrieval and cluster analysis. Boston: Kluwer.

Paice, C.D. (1984). Soft evaluation of Boolean search queries in information retrieval systems. Information Technology Research in Device Applications, 3, 33-41.

Radecki, T. (1979). Fuzzy set theoretical approach to document retrieval. Information Processing & Management, 15, 247-260.

Radecki, T. (1988). Trends in research in information retrieval. The potential for improvements in conventional Boolean retrieval systems. Information Processing & Management, 24, 219-227.

Salton, G., Fox, E. A., & Wu, H. (1983). Extended Boolean information retrieval. Communications of the ACM, 26, 1022-1036.

Salton, G. & McGill, M. J. (1984). Introduction to modern information retrieval. New York: McGraw-Hill.

Salton, G. (1989). Automatic text processing. The transformation, analysis and retrieval of information by computer. Addison Wesley series in Computer Science. Reading, MA: Addison-Wesley.

Sanchez, E. (1989). Importance in knowledge systems. Information Systems, 14, 455-464.

Van Rijsbergen, C. J. (1979). Information retrieval. London: Butterworths.

Waller, W.G. & Kraft, D.H. (1979). A mathematical model of a weighted Boolean retrieval system. Information Processing & Management, 15, 235-245.

Yager, R. R. (1982). Linguistic hedges: their relation to context and their experimental realization. Cybernetics and Systems: An International Journal, 13, 357-374.

Yager, R. R. (1992). Fuzzy sets and approximate reasoning in decision and control. Proceedings of the First IEEE International Conference on Fuzzy Systems, San Diego, CA, 8-12 March, pp. 219-227.

Zadeh, L. A. (1972). A fuzzy-set theoretic interpretation of linguistic hedges. Journal of Cybernetics, 2, 4-34.

Zadeh, L.A. (1975). The concept of a linguistic variable and its application to approximate reasoning-I, II. Information Science, 8, 199-249, 301-357.

Zadeh, L. A. (1978). Fuzzy sets. Information and Control, 8, 338-353.

Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.

Zemankova, M. (1989). FILIP: A fuzzy intelligent information system with learning capabilities. Information Systems, 14, 473-486.
