A Taxonomy of Information Retrieval Models and Tools · 2017-05-03 · A Taxonomy of Information...



Journal of Computing and Information Technology - CIT 12, 2004, 3, 175–194

A Taxonomy of Information Retrieval Models and Tools

Gerardo Canfora and Luigi Cerulo
RCOST – Research Centre on Software Technology, University of Sannio, Benevento, Italy

Information retrieval is attracting significant attention due to the exponential growth of the amount of information available in digital format. The proliferation of information retrieval objects, including algorithms, methods, technologies, and tools, makes it difficult to assess their capabilities and features and to understand the relationships that exist among them. In addition, the terminology is often confusing and misleading, as different terms are used to denote the same, or similar, tasks.

This paper proposes a taxonomy of information retrieval models and tools and provides precise definitions for the key terms. The taxonomy consists of superimposing two views: a vertical taxonomy, which classifies IR models with respect to a set of basic features, and a horizontal taxonomy, which classifies IR systems and services with respect to the tasks they support.

The aim is to provide a framework for classifying existing information retrieval models and tools and a solid basis for assessing future developments in the field.

Keywords: information retrieval, taxonomy, tools, models.

1. Introduction

In recent years information retrieval has become the subject of much research, because the amount of information available in digital formats has grown exponentially and the need to retrieve relevant information has become crucially important. The World Wide Web and digital libraries have shown a large audience the importance of effective mechanisms and tools to retrieve documents from a very large document collection based on user information needs.

Information Retrieval (IR) is the scientific discipline that deals with the analysis, design and implementation of computerized systems that address the representation, organization of, and access to large amounts of heterogeneous information encoded in digital format [58].

In this paper we focus on text document retrieval, in which the information is represented by text documents. Therefore, for the purposes of this paper, the terms information and documents are used interchangeably. Text document retrieval is the most traditional subfield of IR; however, IR comprises other subfields, such as image retrieval, speech retrieval, information generation, query answering, and text summarization, that we do not cover in this paper.

A key feature of a text IR system is retrieving the documents that can satisfy the information needs of a user from a large collection of documents. Such systems, especially in the context of the web, are usually known as search engines; in the rest of the paper we therefore treat search engine as a synonym of information retrieval system. IR systems prepare the collection of documents for retrieval through an indexing step. User information needs are usually represented by keywords or phrases, which are themselves indexed, although more complex representation languages are available. This representation, which inevitably causes a loss of information, is usually known as a query. Indexing can assume different forms according to the model adopted to represent both the documents in the collection and the user information needs. Many current IR systems exploit ranked IR methods, i.e. they rank the documents in the collection based on a measure of their relevance with respect to the user information needs as represented by a query.

The proliferation of information retrieval algorithms, methods, technologies, and tools is making it more difficult to assess the features and the characteristics of each IR aspect and to understand the relationships that exist among them. The terminology is often confusing; for example, terms such as crawling, indexing, and spidering are often used to denote similar tasks, with no clear distinction of the differences.

In this paper we propose a classification of IR models and tools and provide definitions for the key terms. The classification consists of superimposing two views: one for the IR models and one for the IR objects, either tools or services. A vertical taxonomy classifies IR models with respect to a set of basic features, and a horizontal taxonomy classifies IR objects with respect to their tasks, form, and context. The vertical taxonomy is built by expanding two basic features of any IR model: the representation, that is, the model adopted to represent both the documents and the user queries; and the reasoning, which refers to the framework adopted to resolve a representation similarity problem. The horizontal taxonomy is derived from an analysis of the application areas of IR.

1.1. Related Works

In the literature, several studies have been proposed that outline classifications of IR models and tools. However, most of these studies do not cover the entire spectrum of IR objects; the reasons can be found either in the age of the papers or in the specific objectives of the studies. For example, in 1984 Smith and Warner [69] published a document representation taxonomy with the aim of relating new research to previous work and suggesting new areas of research. Nowadays, this taxonomy is largely incomplete, because it does not consider, for example, the representation of structured documents. In 1987 Belkin and Croft [20] published a classification of the most important retrieval techniques in which no reference is made to the relevance feedback model, because, as the authors explicitly state, relevance feedback is not considered a retrieval technique but rather an aid to refining the retrieval model.

In a more recent work, Paijmans [54] made an interesting analysis of the most important retrieval models. The approach adopted to construct a taxonomy of IR models consists of identifying a generic model that forms a basis for a variety of more specific models. Paijmans identified the vector document model as the basis for building the classification and showed how the vector model can subsume other popular models. Whilst this constitutes a concise style of classification, it is unable to classify IR techniques that are not derived from the vector-based model, such as the logic-based techniques.

Our approach is different, as we start from a classification of the basic features of IR models and proceed with a classification of the objects produced in the various fields of information retrieval in terms of tools and services. The flexibility of this faceted view is evident when we consider that different information retrieval objects can be based on the same information retrieval model, and the same information retrieval model can be exploited to implement different information retrieval objects. For example, the classic vector model, generally presented as a retrieval technique, can be used for building information filtering and document clustering tools, too. The latter are different information retrieval objects that exploit the same information retrieval model.

1.2. Content and Structure of the Paper

There are two main viewpoints that characterize information retrieval: we call these two viewpoints information retrieval objects and information retrieval models. The former is generally an artifact that exists in the form of a tool or a service and responds to the “what” question; the latter is a set of theories on which the information retrieval object is based and responds to the “how” question. The two aspects are related, as one object can be based on more than one model and one model can be the basis for more than one object. On this framework we have built a horizontal taxonomy and a vertical taxonomy. The horizontal taxonomy refers to IR objects, while the vertical one considers IR models.

The remainder of the paper is organized as follows. Sections 2 and 3 introduce the vertical and the horizontal taxonomies, together with examples of their application. Section 4 superimposes the vertical and horizontal taxonomies and shows how this can be used to obtain a mapping of the object’s features on the underlying models.


2. Vertical Taxonomy

Modeling the process of information retrieval is complex, because many parts are, by their nature, vague and difficult to formalize. The human component assumes an important role and many concepts, such as relevance and information needs, are subjective. Therefore, information retrieval models can be very complex and, consequently, their classification can be hard. However, in the definition of any IR model we can identify some common aspects. Generally, the first step is the representation of documents and information needs. From these representations a reasoning strategy is defined that solves a representation similarity problem to compute the relevance of documents with respect to queries. Various strategies have been introduced with the aim of improving the retrieval process: we classify these methodologies under the reasoning component.

Representation and Reasoning can be used to characterize an information retrieval model. For example, in [52] an information retrieval model is characterized as a quadruple $\{D, Q, F, R(q, d)\}$ where:

• $D$ is a set of logical views for the documents in the collection (Representation component);

• $Q$ is a set of logical views for the user information needs (Representation component);

• $F$ is a framework for modeling document representations, queries and their relationships (Reasoning component);

• $R(q, d)$ is a ranking function which associates a real number with a query $q \in Q$ and a document $d \in D$ (Reasoning component).

Fig. 1. Vertical taxonomy.


An information retrieval model can be modeled as a couple $(\mathrm{Rp}, \mathrm{Rs})$ where Rp is the representation model of documents and queries, and Rs is a framework for modeling the relationship between document and query representations, which is the reasoning strategy. Every component can be divided into subcomponents and for every subcomponent we can build a tree of possible approaches and solutions presented in the literature, as shown in Fig. 1.

Defining the approaches used for each component identifies an IR model. For example, the couple $(\mathrm{Rp}, \mathrm{Rs})$:

$\mathrm{Rp}(\text{query}) = \{\text{keyword-based}\}$

$\mathrm{Rp}(\text{document}) = \{\text{weighted vector}\}$

$\mathrm{Rs}(\text{with logic}) = \{\text{vector algebra}\}$

identifies the well-known vector model, as we will discuss later. We will now go into each of these components.
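The couple above can be written down as a small record type. The following is only an illustrative sketch: the class name `IRModel` and its field names are hypothetical and do not come from the paper; only the three values of the vector model are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRModel:
    """Hypothetical record for the couple (Rp, Rs)."""
    query_representation: str      # Rp(query)
    document_representation: str   # Rp(document)
    reasoning_family: str          # which Rs branch: logic, uncertainty, learning
    reasoning_approach: str        # the approach chosen inside that branch


# The vector model, using the values given in the text.
VECTOR_MODEL = IRModel(
    query_representation="keyword-based",
    document_representation="weighted vector",
    reasoning_family="logic",
    reasoning_approach="vector algebra",
)
```

Writing a model this way makes explicit that two models differ as soon as one leaf of either tree differs.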

2.1. Representation

A fundamental component of an IR system is the representation of the information itself: information can be processed if it is represented in some way.

In text information retrieval, representation means representing documents and queries. A document is the representation of the information the author wished to encode; it is the unit of information that can be retrieved by an IR system. Queries are the representation of the information needs of a user.

Any text can be characterized by using four attributes: syntax, structure, semantics, and style. A text has a given syntax and a structure, which are usually dictated by the application or by the person who created it. Text also has a semantics, specified by the author of the document. Additionally, a document may have a presentation style associated with it, which specifies how it should be displayed or printed. In many approaches to text representation the style is coupled with the document syntax and structure (see for example the LaTeX document preparation system [40]). Modern representations, such as XML [80], separate the representation of syntax and structure, which are defined either by a DTD or an XSD, from style, which is captured by XSL.

Whilst documents are characterized by syntax, structure, semantics and style, the structure and semantics of text are generally sufficient to characterize queries.

Query Representation

A query is the representation of a user's information need. The information need originates from a problem that the user has to resolve; it is implicit in the user's mind and its purpose is to bridge a knowledge gap. An information need can be of three types [50]: known item information need, conscious information need, and confused information need. The first is when users search for, or verify the existence of, documents they know. The second is when users search for documents they do not know, but which regard a subject they know. The third is when users know neither the documents nor the subject. The following classes of query representations can be identified:

• Keyword-based. This is the simplest form for a query. It is composed of keywords, and the documents containing such keywords are searched for. Keyword-based queries are popular, because they are intuitive and easy to express. Usually, a keyword query is a single word, but, in general, it can be a more complex combination of (Boolean) operations applied to several words.

  - Single word. It is the most elementary query that can be formulated in a text retrieval system. Depending on the reasoning component, the result of a single word query is generally the set of documents containing at least one occurrence of the searched word.

  - Boolean. It is the oldest and still widely used form of combining the keywords in a query. A Boolean query is an expression whose elements are keywords, Boolean operators and a precedence notation. In addition to classical Boolean operators, several new operators have been proposed, such as the NEAR operator, which allows context search capabilities, and the fuzzy Boolean operator, which relaxes the meaning of canonical AND and OR.

• Pattern-based. It is a more specific query formulation, which allows the specification of text having some properties. A pattern is a set of syntactic features that must occur in a text segment. The segments satisfying the pattern specification are said to match the pattern.

• Structural. Structural queries are a mechanism to improve the retrieval quality of structured information. This mechanism is generally built on top of the basic queries with the addition of structural constraints expressed using containment, proximity, or other restrictions on the structural elements in the documents. Structural queries can be categorized into three main categories: fixed structure, hypertext, and hierarchical structure. The first is the simplest form and, for this reason, it is more restrictive. The documents are divided into a set of fields each of which contains some text. A fixed structural query restricts the search to text contained in certain document fields. The hypertext is probably the most flexible form of structuring. It is a directed graph where the nodes hold some text and the links represent connections between the nodes. However, it is not possible to query the hypertext structural connectivity, but only the text content of the nodes. This transforms the retrieval activity into a navigational activity (browsing task). The hierarchical structure is an intermediate structuring model and represents a natural decomposition for many text collections (books, articles, structured programs, etc.). For example, XML is the most prominent structural representation model and XPath [81] is a query language for addressing pieces of content in the hierarchical structure.
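To make the keyword-based query classes above concrete, the following minimal sketch evaluates single word, Boolean, and NEAR queries over a toy collection. It assumes whitespace tokenization and a positional inverted index; the function names (`build_index`, `match`, `near`) and the sample documents are illustrative and not taken from the paper.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match(index, term):
    """Single word query: documents with at least one occurrence of term."""
    return set(index.get(term.lower(), {}))


def near(index, a, b, window):
    """NEAR query: documents where a and b occur within `window` positions."""
    hits = set()
    for doc_id in match(index, a) & match(index, b):
        pos_a = index[a.lower()][doc_id]
        pos_b = index[b.lower()][doc_id]
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits


DOCS = {
    1: "information retrieval models and tools",
    2: "retrieval of structured information",
    3: "fuzzy set models",
}
index = build_index(DOCS)

# Classical Boolean operators map directly onto set operations.
and_result = match(index, "information") & match(index, "retrieval")  # AND
or_result = match(index, "fuzzy") | match(index, "tools")             # OR
not_result = match(index, "models") - match(index, "fuzzy")           # AND NOT
near_result = near(index, "information", "retrieval", window=1)
```

Here the AND query returns documents 1 and 2, while the NEAR query with a window of one position keeps only document 1, where the two words are adjacent.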

Document Representation

A document is a retrievable element of the document space of an information retrieval system. It can be considered as the minimal resource that an information retrieval system can retrieve. Historically, documents have been represented by a set of terms called keywords, which are usually extracted from the text or inserted by the author. The following are the most significant types of document representation:

• Stream of characters. Text is represented as a stream of characters and no interpretation is made of its structure or semantic content.

• Vector space. The basic principle of this text representation model is that each document is described by a vector of components that are representative of the semantic content of the document. Traditional vector space approaches use a set of keywords, called index terms, but other types of representative components, such as n-grams, are also used. An index term is a word whose semantics helps in identifying the document's main themes. Of course, not all terms of a document are useful for describing the document content. In fact, there are index terms which are vaguer than others. Deciding the importance of terms is not a trivial task. In a large collection of documents a word which appears in each document is useless as an index term, because it does not discriminate between documents. On the other hand, a term that appears in a single document will likely describe the content of this document ([45], [83]). Vector representations can be further categorized as follows.

  - Binary. The text document is represented as a binary vector of terms. Each element of the vector represents a term and its value is ‘1’ if the term appears in the document, ‘0’ otherwise.

  - Weighted. In this case element values are real numbers between 0 and 1, called term weights, and represent the affinity of the term with respect to the document. A widespread method to compute the term weights exploits two factors [58]: Term Frequency (TF) and Inverse Document Frequency (IDF). The first provides a measure of how well the term describes the document contents (intra-cluster similarity); the second measures how well the term can discriminate documents among the collection (inter-cluster dissimilarity). A well-known term weighting scheme, valid for generic collections, is the product of the TF and IDF factors. Several variations are described by Salton and Buckley [66].

    * Latent semantic. In the traditional vector space approach each document is represented by a vector of n components, where n is the number of terms occurring in the collection (dimension of the document space). Latent Semantic Indexing (LSI) [27] reduces the dimension of the document space by capturing term-to-term statistical relationships. The document space is then represented by a new coordinate system of dimension $k < n$, called k-space (or LSI space), in which each of the k dimensions is a derived concept often called LSI factor or LSI feature. LSI features are identified by using a method for matrix decomposition called Singular Value Decomposition (SVD). The derived concepts may be thought of as artificial concepts; they represent extracted common meaning components of many different words and documents.

    * Fuzzy subset. Fuzzy set theories deal with the representation of classes whose boundaries are not well defined. Each element of the class is associated with a membership function that defines the membership degree of the element in the class. In many fuzzy representation approaches the TF-IDF function of the weighted vector model is used as the fuzzy membership function ([35], [37]).

  - N-Gram. The n-gram approach is in some respects an evolution of vector space approaches. In the traditional vector space approaches the dimensions of the document space for a given collection of documents are the words (or sometimes phrases) that occur in the collection. By contrast, in the n-gram approach, the dimensions of the document space are n-grams: strings of n consecutive characters extracted from the text without considering word lengths, and even word boundaries. Hence, the n-gram is a remarkably pure statistical approach, one that measures the statistical properties of strings of text in the given collection and does not consider the vocabulary, lexical, or semantic properties of the natural language in which the documents are written. The n-gram length (n) and the method for extracting n-grams from documents vary from one author to another. In [22] Damashek uses n-grams of length 5 and 6 for clustering text by language and topic. He uses a sliding window approach in which n-grams are obtained by moving a window of n characters through a document or a query, one character at a time. Some authors [82] also use n-grams that cross word boundaries, i.e., that start within one word, end in another word, and include the space characters that separate consecutive words.

• Structural. Structural documents, similarly to structural queries, are a mechanism to improve the retrieval quality. The main idea is to enrich documents with additional information that allows a computer to make part of the semantic content explicit. XML is the most prominent standard for modeling these aspects of information.
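Two of the representations above lend themselves to a short sketch: the weighted vector, built as the product of a TF and an IDF factor, and the character n-gram obtained with a sliding window. The particular formulas used here (term frequency normalized by the most frequent term in the document, and the logarithm of N over the document frequency) are one common choice and are an assumption of this example, as are the function names and the sample collection; Salton and Buckley [66] describe several other variants.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if not counts:  # an empty document has an empty vector
            vectors.append({})
            continue
        max_count = max(counts.values())
        vectors.append({
            # tf is normalized to (0, 1]; idf is 0 for terms in every document
            term: (count / max_count) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def char_ngrams(text, n):
    """Slide a window of n characters over text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


DOCS = [
    "information retrieval models",
    "information retrieval tools",
    "information filtering",
]
vectors = tf_idf_vectors(DOCS)
trigrams = char_ngrams("to be", 3)
```

In this toy collection the term "information" occurs in every document and therefore receives weight 0, which matches the observation that such a word does not discriminate between documents. The weights produced by this variant are not bounded by 1; a further normalization step would be needed to map them into the interval from 0 to 1. The trigrams of "to be" include windows that cross the word boundary.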

2.2. Reasoning

With the term reasoning we refer to the set of methods, models, and technologies used to match document and query representations in a retrieval task. Strictly related to the reasoning component is the concept of relevance. The primary goal of an information retrieval system is to retrieve the documents relevant to a query. The reasoning component defines the framework to measure the relevance between documents and queries using their representations.

A key step in understanding the reasoning component of an IR system is to find a precise definition of relevance. This is still an open problem within the IR community; the literature reports different definitions, but a widespread definition is [67]:

Relevance is the (A) of a (B) existing between a (C) and a (D) as determined by an (E).

Where:

(A). measure, estimate, judgment, ...

(B). utility, matching, satisfaction, ...

(C). document, document representation, information provided, ...

(D). question, question representation, information need, ...

(E). requester, intermediary, expert, ...


An attempt to clarify this definition has been proposed by Mizzaro [51]. Starting from an accurate analysis of the interactions between the users and the system, the paper identifies various types of relevance on which it is possible to define an order relation.

An information retrieval reasoning strategy can be one (or any combination) of: reasoning with logic, reasoning with uncertainty, and reasoning with learning. A reasoning with logic approach deals especially with models developed as logical-mathematical theories. A reasoning with uncertainty approach is useful whenever the system is unable to assess the truth of all the aspects of the environment in which it operates. In these cases its behavior is affected by uncertainty. This can be due to many reasons: the system does not understand the environment properties; there are many variables to process and not enough time available; etc. Reasoning with learning approaches apply inductive machine learning techniques. Machine learning is concerned with systems that learn from experience. In a classical system, the system designer inserts all the knowledge. Whenever the designer does not possess complete knowledge of the system’s application domain, a learning mechanism is the only way to acquire new knowledge. Learning mechanisms are used either to fulfil an objective or to improve on it. In IR the primary goal is to improve retrieval effectiveness, for example, in terms of precision and recall.
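Precision and recall, mentioned above as measures of retrieval effectiveness, have standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch follows; the function name and the convention of returning 0.0 for an empty set are choices made for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
```

In this example half of the retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).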

Most of the classical information retrieval models deal with the reasoning with logic and reasoning with uncertainty strategies. The first category includes, for example, methods based on first order logic ([47], [8], [6]), and methods based on Boolean and vector algebra ([74], [64], [25], [78], [77]). The second includes methods in which the vagueness and uncertainty aspects of IR are treated in terms of probabilistic and fuzzy set approaches. Since many information retrieval aspects are affected by vagueness and uncertainty, many reasoning processes based on uncertainty have been proposed ([59], [13], [14], [76], [10], [53], [49], [48], [63], [70]). Machine learning techniques have gained growing popularity in the past ten years ([23], [16], [43]).

Recently, several novel approaches have been proposed, based on either graph theory ([12], [24], [33], [55]) or formal ontology [31].

Reasoning with Logic

• Logic. The logical approach to information retrieval can be formulated in terms of the logical formula $P(d \rightarrow n)$, where the arrow is the conditional connective formalized by a logic to be chosen and P is the predicate: “the representation of document d is relevant to the representation of information need n”. The central problem is selecting the right implication connective, i.e. selecting the logic whose implication connective best mirrors relevance. An overview of the role of logic in information retrieval is reported in [68].

• Algebra. Algebraic calculus is the most common approach. Under this item we include the reasoning strategies which are based on a set of operations defined in an algebraic field.

  - Boolean algebra. In the conventional Boolean algebra reasoning strategy the query Boolean expression is computed to verify whether a document either satisfies a query (is relevant) or does not satisfy it (is non-relevant). No ranking is possible, and this is a significant limitation. A number of extended Boolean models have been developed to provide ranked output. These extended Boolean models employ extended Boolean operators (also called soft Boolean operators) [42].

  - Vector algebra. Using a weighting scheme for document and query representations the vector algebra approach computes a numeric similarity between the query and each document. The documents can then be ranked according to how similar they are to the query. The usual similarity measure exploited in document vector space is the inner product between the query vector and a given document vector [65]. If both vectors have been cosine normalized, then the inner product represents the cosine of the angle between the two vectors; hence this similarity measure is often called cosine similarity. Other well-known similarity functions are Dice’s coefficient and Jaccard’s coefficient [58].

• Graph theories. Graph theories deal with structures formed by vertices and edges. The application of graph algorithms to information retrieval became more interesting with the advent of the web. Web resources can be well modelled with a graph structure in which documents represent vertices and hyperlinks represent edges. In [24] a Maximum Flow method is introduced to identify web communities. Previous graph-based approaches were applied to bibliographic documents and were principally based on bibliometric methods such as co-citation and bibliographic coupling. Some of these are used in the web context, too. Such algorithms include: the PageRank algorithm [12], on which the Google [104] web search engine is based, the HITS algorithm [33], and the SAE algorithm [55].
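The vector algebra strategy described above can be sketched in a few lines. The example assumes that documents and queries are already available as sparse weighted vectors, stored as dictionaries from term to weight; the function names and the sample weights are illustrative only.

```python
import math


def dot(u, v):
    """Inner product of two sparse vectors stored as {term: weight}."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def cosine(u, v):
    """Inner product of the cosine-normalized vectors, i.e. cos of their angle."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def rank(query, docs):
    """Return (doc_id, similarity) pairs sorted by decreasing similarity."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


DOC_VECTORS = {
    "d1": {"retrieval": 1.0, "models": 1.0},
    "d2": {"retrieval": 1.0},
    "d3": {"fuzzy": 1.0},
}
ranking = rank({"retrieval": 1.0}, DOC_VECTORS)
```

Unlike the Boolean strategy, every document receives a score, so the output is an ordering rather than a partition: d2 matches the query exactly, d1 shares one of its two terms, and d3 shares none.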

Reasoning with Uncertainty

• Probability theories. Probabilistic theories were introduced by Robertson and Sparck Jones [59]. The fundamental reasoning approach is based on the following assumption: given a user query and a document in the collection, the probabilistic reasoning process tries to estimate the probability that the user will find the document interesting. There exist some alternative approaches based on Bayesian networks. In particular, the inference network [71] model has been used in the INQUERY system [13], while reference [57] introduces a generalization called belief network.

• Fuzzy set theories. Fuzzy IR models have been defined to overcome the limitations of the crisp Boolean IR models, in particular to manage the vagueness and incompleteness of users in query formulation. Fuzzy extended Boolean models are a superstructure of the Boolean model by means of which existing Boolean IR systems can be extended without redesigning them completely. The standard Boolean models apply an exact match between the query and the document representations, and then partition the document base into two sets: the retrieved documents and the rejected ones. As a consequence of this crisp behavior, they are liable to reject useful items as a result of too restrictive queries, and to retrieve useless material in reply to excessively general queries. Thus, softening the retrieval activity to rank the retrieved items in decreasing order of relevance to a user query can greatly improve the effectiveness of such systems. This objective can be reached by extending the Boolean model in several ways [35]. In the fuzzy extensions of document representations the aim is to provide more specific and exhaustive representations of the information content of documents, in order to reduce the imprecision and incompleteness of the Boolean indexing. For example, a document can be represented as a fuzzy set of terms. In the fuzzy generalization of the Boolean query language the objective is to obtain a more expressive query language, in order to capture the vagueness of the user needs as well as to simplify the user-system interaction. Various approaches have been proposed. One of these introduces soft connectives of selection criteria [11], characterized by a parametric behavior which can be set between the two extremes “AND” and “OR”. In other approaches, the Boolean query language has been generalized by defining aggregation operators as linguistic quantifiers, such as “at least k” or “about k”.
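The contrast between crisp and fuzzy matching can be sketched as follows. Each document is represented as a fuzzy set of terms (a mapping from term to membership degree), AND and OR are taken as minimum and maximum, and a parametric connective interpolates between the two extremes. The linear interpolation used in `soft_connective` is an assumption made here for illustration; it is not the operator defined in [11], and all names and membership values are hypothetical.

```python
def crisp_and(doc_terms, query_terms):
    """Standard Boolean AND: exact match, so the document is retrieved or rejected."""
    return all(term in doc_terms for term in query_terms)


def fuzzy_and(doc, query_terms):
    """Fuzzy AND: minimum membership degree over the query terms."""
    return min(doc.get(term, 0.0) for term in query_terms)


def fuzzy_or(doc, query_terms):
    """Fuzzy OR: maximum membership degree over the query terms."""
    return max(doc.get(term, 0.0) for term in query_terms)


def soft_connective(doc, query_terms, orness):
    """Illustrative soft connective: orness=0 behaves as AND, orness=1 as OR."""
    return (1.0 - orness) * fuzzy_and(doc, query_terms) + orness * fuzzy_or(doc, query_terms)


# Each document is a fuzzy set of terms: term -> membership degree in [0, 1].
docs = {
    "d1": {"fuzzy": 0.9, "retrieval": 0.8},
    "d2": {"fuzzy": 0.7},
    "d3": {"retrieval": 0.4, "boolean": 0.6},
}
query = ["fuzzy", "retrieval"]


def rank(orness):
    """Rank all documents by decreasing score for the given connective setting."""
    scores = {name: soft_connective(doc, query, orness) for name, doc in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With the crisp AND, `d2` is simply rejected because it lacks one query term; with a softer setting such as `rank(0.5)` it receives a non-zero score and appears in the ranked output below `d1`.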

Reasoning with Learning

Several authors have proposed the use of machine learning approaches in IR. The most frequently used techniques include [16]: multi-layered, feed-forward neural networks such as back propagation networks [62], symbolic and inductive learning algorithms such as ID3 [56] and ID5R [72], and evolution-based algorithms such as genetic algorithms [34].

• Neural networks. Neural network computing seems to fit well with conventional retrieval models such as the vector space model and the probabilistic model. One of the first applications in IR comes from Belew [7]. He developed a three-layer neural network of authors, index terms, and documents. The system used relevance feedback from its users to change its representation of authors, index terms, and documents over time. An evolution of this application has been introduced by Kwok [39], who uses a modified Hebbian learning rule to reformulate probabilistic information retrieval. In other applications the neural network approach has been used for more specific tasks. For example, in [44], Kohonen’s self-organizing feature map was applied to construct a self-organizing representation of the semantic relationships between documents. A neural network document clustering algorithm was developed in [46]. The Hopfield neural network’s parallel relaxation method was used in [17] for concept-based document retrieval and exploration.

• Symbolic learning. In IR the use of symbolic learning is more limited than that of other learning techniques. In [9] a symbolic learning technique is used for automatic text classification. The symbolic learning process represents the numeric classification results in terms of IF-THEN rules. In [26] a regression method and ID3 were used to implement a feature-based indexing technique. In [18] ID3 and the incremental ID5R algorithm were adopted for information retrieval. Both algorithms were able to use user-supplied samples of desired documents to construct decision trees of important keywords which could represent the user’s query.

• Genetic algorithms. Several genetic algorithm implementations have been developed in the context of IR. Reference [29] presents a genetic algorithm-based approach to document indexing, in which competing document descriptions (binary vectors of terms) are associated with a document and altered over time by using genetic mutation and crossover operators. In this design, a keyword represents a gene (bit pattern), a document, which is a vector of keywords (bit string), represents an individual, and a collection of documents, initially judged relevant by a user, represents the initial population. Based on Jaccard’s matching function, the initial population evolves through generations and eventually converges to an optimal, improved population. In [30] a similar approach is adopted for document clustering.
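The genetic design described above can be sketched in a few lines. The code below is a toy illustration and not the procedure of [29] or [30]: it assumes that fitness is the Jaccard coefficient between an individual and a single fixed query bit vector, uses single-point crossover, bit-flip mutation, and elitism, and fixes the random seed so that runs are repeatable. All function names, parameter values, and data are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard coefficient between two equal-length bit vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def crossover(parent1, parent2, rng):
    """Single-point crossover of two bit vectors."""
    point = rng.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:]


def mutate(individual, rate, rng):
    """Flip each bit (gene) independently with probability `rate`."""
    return tuple(1 - bit if rng.random() < rate else bit for bit in individual)


def evolve(population, query, generations=30, mutation_rate=0.05, seed=0):
    """Evolve bit-vector descriptions toward a better Jaccard match with `query`."""
    rng = random.Random(seed)
    population = [tuple(ind) for ind in population]
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: jaccard(ind, query), reverse=True)
        survivors = ranked[: max(2, len(ranked) // 2)]
        children = [ranked[0]]  # elitism: the best individual is never lost
        while len(children) < len(population):
            p1, p2 = rng.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2, rng), mutation_rate, rng))
        population = children
    return max(population, key=lambda ind: jaccard(ind, query))


# Each position is a keyword (gene); each tuple is a document (individual).
query = (1, 1, 0, 1, 0, 0, 1, 0)
initial = [
    (1, 0, 0, 0, 1, 0, 0, 0),
    (0, 1, 1, 0, 0, 0, 1, 0),
    (0, 0, 0, 1, 0, 1, 0, 0),
    (1, 0, 1, 0, 0, 0, 0, 1),
]
best = evolve(initial, query)
```

Because the best individual of each generation is carried over unchanged, the fitness of `best` can never be lower than that of the best individual in `initial`.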

2.3. An Example

As an example of application of the vertical taxonomy, we have taken some relevant works from the IR models field and tried to classify them using the vertical taxonomy. We identify each information retrieval model in relation to the representation and reasoning components described above. This is shown in Tab. 1. A notable aspect is that many models contain the weighted vector as a representation component; this is why Paijmans [54] introduced the vector document model.

Table 1. Vertical taxonomy of a set of Information Retrieval Models.

3. Horizontal Taxonomy

The vertical taxonomy alone is not sufficient to take into account all the objects that have been produced under the IR umbrella. Users do not interact with a model, but generally they use a software tool that is able to solve an information retrieval problem. This calls for the introduction of a further dimension, a new viewpoint that we call horizontal taxonomy. Through the horizontal taxonomy we classify information retrieval objects. An information retrieval object is an artifact that solves a more or less general IR problem. An information retrieval object is identified by three components, as illustrated in Fig. 2: Tasks, Form, and Context.

Fig. 2. Horizontal taxonomy.

3.1. Tasks

Information retrieval tasks are concerned with a particular aspect of information retrieval derived from a user point of view and should not be confused with the tasks in an information retrieval process, such as query formulation, query expansion, comparison, ranking, and document presentation. An information retrieval object can support one or more tasks, and a task can be stand-alone or it can be integrated in a process to perform a larger task. We have identified the following tasks: ad hoc retrieval, known item search, interactive retrieval, filtering, browsing, clustering, mining, gathering, and crawling. Sometimes they are known by different names because they are inherited from various research areas.

Ad Hoc Retrieval

An ad hoc retrieval task is characterized by an arbitrary subject of the search and a short duration [73]. It is typically performed by a researcher doing a literature search in a library. In this environment the retrieval system knows the set of documents to be searched, but cannot anticipate the particular topic that will be investigated [73]. A retrieval system’s response to an ad hoc search is generally a list of documents ranked by decreasing similarity to the query. Internet search engines are examples of information retrieval objects with which one can perform ad hoc searches.
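A ranked response of this kind can be sketched with the cosine measure over term vectors. The example below is a minimal illustration, assuming raw term frequencies as weights and whitespace tokenization; real systems typically use more elaborate weighting schemes. The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ad_hoc_search(query, collection):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    # Assumption: raw term counts serve as weights for both query and documents.
    query_vec = Counter(query.lower().split())
    scored = [
        (doc_id, cosine(query_vec, Counter(text.lower().split())))
        for doc_id, text in collection.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


collection = {
    "d1": "fuzzy sets in information retrieval",
    "d2": "genetic algorithms for document clustering",
    "d3": "neural networks for information retrieval tasks",
}
results = ad_hoc_search("information retrieval", collection)
```

The collection is known in advance, while the query is arbitrary: any string can be passed to `ad_hoc_search` and the same documents are re-ranked for it.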

Known Item Search

A known item search is similar to an ad hoc search, but the target of the search is a particular document (or a small set of documents) that the searcher knows to exist in the collection and wants to find [73]. An information retrieval object that performs this task usually implements a precise query language (for example, a structural query language) with which a searcher can reach parts of a document with known structure and semantics. An example, in the library environment, is a researcher who wants to retrieve all articles by a given author.
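The author example can be sketched as an exact-match query over records with a known structure. This is only an illustration of the idea of querying by field; the `Article` record, the `find_by_field` helper, and the small catalogue are hypothetical and do not correspond to any particular query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """A record with known structure: each field can be queried by name."""
    title: str
    authors: tuple
    year: int


def find_by_field(records, **criteria):
    """Exact-match search over named fields; no ranking is involved."""
    def matches(record):
        for field, wanted in criteria.items():
            value = getattr(record, field)
            if isinstance(value, tuple):
                # Multi-valued field: the wanted value must be one of the entries.
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True

    return [record for record in records if matches(record)]


catalogue = [
    Article("Relevance weighting of search terms", ("Robertson", "Sparck Jones"), 1976),
    Article("Term weighting approaches in automatic retrieval", ("Salton", "Buckley"), 1988),
    Article("Automatic text processing", ("Salton",), 1989),
]
by_salton = find_by_field(catalogue, authors="Salton")
```

Unlike an ad hoc search, the result is an unranked set: a record either satisfies every criterion or is left out.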


Interactive Retrieval

A user’s judgment of the usefulness of a document may vary during an information seeking activity [38]; this can be captured by the system through an interactive information retrieval task. During the interactive task the system attempts to perceive how the user interacts with it and, as a consequence, it can modify the current search strategy [60]. Classical relevance feedback approaches [61] can be seen as early techniques for interactive retrieval; the user interaction is captured as yes/no judgments of document relevance. The system uses these judgments to expand and/or reweight the query [32].
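A common textbook formulation of relevance feedback, usually attributed to Rocchio, moves the query vector toward the documents judged relevant and away from those judged non-relevant. The sketch below follows that formulation; the coefficient values `alpha`, `beta`, and `gamma`, the clipping of negative weights, and the sample vectors are assumptions made here for illustration.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted and expanded query vector (dict term -> weight).

    `relevant` and `non_relevant` are lists of document vectors that the user
    marked yes or no. The coefficient defaults are illustrative assumptions.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)
    updated = {}
    for term in terms:
        pos = sum(doc.get(term, 0.0) for doc in relevant) / len(relevant) if relevant else 0.0
        neg = sum(doc.get(term, 0.0) for doc in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # assumption: terms with non-positive weight are dropped
            updated[term] = weight
    return updated


original_query = {"jaguar": 1.0}
judged_relevant = [{"jaguar": 1.0, "cat": 1.0}]
judged_non_relevant = [{"jaguar": 1.0, "car": 1.0}]
new_query = rocchio(original_query, judged_relevant, judged_non_relevant)
```

Here the query is expanded with the term `cat`, taken from the relevant document, and the weight of `jaguar` is adjusted, while `car`, which occurs only in the non-relevant document, does not enter the new query.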

Filtering

Also known as selective dissemination of information, or text routing, filtering combines aspects of text retrieval and text categorization. Like text categorization, a text filtering system processes documents in real time and assigns them to zero or more classes. However, like text retrieval, each class is typically associated with the information needs of one or a small group of users. Each user, or user group, can typically add, remove, or modify the queries, or profiles, according to their needs. Examples include: NewsSieve [100], a client/server USENET news filtering system that can be used in a desktop environment; NewsWeeder [87], an experimental USENET news filtering service; and SIFT, the Stanford Information Filtering Tool [86], which includes two selective dissemination services, one for computer science technical reports and one for USENET news articles.
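The routing behavior described above can be sketched as follows: each incoming document is compared with a set of standing user profiles and assigned to zero or more of them. The sketch assumes that a profile is a set of keywords that must all occur in the document; this is a simplification chosen here, not the mechanism of NewsSieve, NewsWeeder, or SIFT, and all names and data are hypothetical.

```python
def matching_profiles(document, profiles):
    """Return the users whose profile keywords all occur in the document."""
    words = set(document.lower().split())
    # Assumption: a profile matches when it is a subset of the document's words.
    return sorted(user for user, keywords in profiles.items() if keywords <= words)


profiles = {
    "alice": {"information", "retrieval"},
    "bob": {"genetic", "algorithms"},
}

# Users can add, remove, or modify profiles at any time.
profiles["carol"] = {"retrieval"}

stream = [
    "New results in information retrieval evaluation",
    "Genetic algorithms for scheduling",
    "Weather report for Benevento",
]
# Documents are processed one at a time, as they arrive.
routed = [matching_profiles(doc, profiles) for doc in stream]
```

The roles are reversed with respect to ad hoc retrieval: the profiles are stable and known in advance, while the documents arrive as a stream.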

Browsing

When users are not interested in posing a specific query to the system, but invest some time in exploring the document space, looking for interesting references, then they are browsing the space instead of searching. There are three types of browsing, namely flat, structure-guided, and hypertext. In flat browsing the idea is that the user explores a document space which has a flat organization; for example, files in a directory. In structure-guided browsing the user is generally guided by a hierarchical structure in which documents are organized in categories and subcategories. The hypertext model introduces a navigational structure which allows a user to browse text in a non-sequential manner. The web is the best-known example of a hypertext structure.

Clustering

The term emerges from the statistics community, where it is well known as classification analysis and discriminant analysis [3]. In the artificial intelligence community, the task is often called concept learning. Clustering is the automatic recognition and generation of categories of entities, which can be text documents. It is usually based on some similarity measure between documents, as well as an explicit or implicit definition of what distinguishing characteristics the groups of documents should have. It is generally used to improve the retrieval process, because the search can be restricted to a set of categories of interest. Related to clustering is categorizing, which is the recognition and assignment of a document to one or more pre-existing categories. An example of a categorization tool is CORA (Computer Science Research Paper Search Engine) [84], an automatic categorizing tool for scientific papers. An example of a categorizing service is the Yahoo Directory [99]; in this case the categorization is performed manually, by human experts.
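As an illustration of clustering driven by a similarity measure, the sketch below implements a simple single-pass scheme: each document joins the first cluster whose seed document is similar enough, and otherwise starts a new cluster. The choice of the Jaccard coefficient over term sets, the threshold of 0.25, and the sample documents are assumptions made here; many other clustering algorithms exist.

```python
def jaccard(a, b):
    """Jaccard coefficient between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass_cluster(documents, threshold=0.25):
    """Group documents in one pass; the threshold is an illustrative assumption.

    Each document joins the first cluster whose seed is similar enough,
    otherwise it becomes the seed of a new cluster.
    """
    clusters = []  # each entry: (seed_terms, [doc_ids])
    for doc_id, text in documents.items():
        terms = set(text.lower().split())
        for seed, members in clusters:
            if jaccard(terms, seed) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster was similar enough: start a new category.
            clusters.append((terms, [doc_id]))
    return [members for _, members in clusters]


documents = {
    "d1": "neural network document clustering",
    "d2": "document clustering with genetic algorithms",
    "d3": "legal and business documents",
    "d4": "business documents retrieval",
}
groups = single_pass_cluster(documents)
```

The categories are not given in advance: they are generated from the documents themselves, which is what distinguishes clustering from categorizing.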

Mining

Mining is the process of automatically extracting key information from text documents. Typical mining activities are language identification, feature extraction, terminology extraction, predominant theme extraction, abbreviation extraction, and relation extraction. LEXA [89] is an example of corpus processing software, while the IBM text miner [91] is a mining tool integrated with the homonymous text search engine.


Gathering

This is an activity involving proactive acquisition of information from possibly heterogeneous sources. The metasearch engines exemplify a particular type of gathering task; Metacrawler [92] and InFind [116] are two examples. They combine the outputs of several search engines and present the results as if produced by a single search engine.
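One simple way to combine several ranked lists into a single one is to give each result a score that decreases with its rank in every list and to sum these scores. The sketch below uses reciprocal rank scores with a smoothing constant `k`; this is an assumption made for illustration, since the text does not say how Metacrawler or InFind merge results. The engine outputs and identifiers are hypothetical.

```python
from collections import defaultdict


def merge_rankings(result_lists, k=60):
    """Fuse several ranked lists into one by summing reciprocal-rank scores.

    The constant `k` dampens the influence of top ranks; its value is an
    illustrative assumption.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by decreasing fused score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical outputs of three underlying search engines for the same query.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u1", "u4"]
engine_c = ["u2", "u5"]
merged = merge_rankings([engine_a, engine_b, engine_c])
```

Results returned by several engines accumulate score from each list, so they tend to rise above results returned by only one engine.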

Crawling

Crawling is concerned with the activity of selecting new sources of information, or updating the existing ones, that will be processed by successive activities, for example mining and/or gathering. It is also known as the indexing process and, especially in the Web context, as spidering. Well known examples are Scooter [94], ArchitextSpider [110], Sidewinder [112], Slurp [102], and Guliver [114], which are, respectively, the spiders of Altavista [93], Excite [109], Infoseek [111], Inktomi [101], and Northernlight [113].
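The core loop of a spider can be sketched as a breadth-first traversal of the link graph starting from a set of seed pages. To keep the example self-contained, the web is replaced by an in-memory dictionary and the link extractor is passed in as a function; the names `crawl`, `fetch_links`, and `toy_web` and the page limit are assumptions made here, and a real crawler would also need politeness rules, error handling, and revisit policies.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first traversal: return pages in the order they were visited.

    `fetch_links(url)` must return the links found on a page; it is supplied
    by the caller, so no network access happens inside this function.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same source twice
                seen.add(link)
                frontier.append(link)
    return visited


# In-memory stand-in for the web, so the sketch runs without network access.
toy_web = {
    "home": ["about", "papers"],
    "about": ["home"],
    "papers": ["paper1", "paper2"],
    "paper1": [],
    "paper2": ["home"],
}
order = crawl(["home"], lambda url: toy_web.get(url, []))
```

The visited pages would then be handed to the successive activities, such as indexing or mining.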

3.2. Form

The form refers to the way in which the object is supplied to the final user: as a tool or as a service. When the object is implemented as a software product, it is a tool. It exists because, for example, a company has produced it for commercial purposes, and it can be distributed, installed, sold, etc. When the object exists only in one or a few instances, used to deliver information retrieval services, it is a service. Examples are search engines on the web.

3.3. Context

The context of an information retrieval object concerns its domain of application, which can be general or specific. A general purpose information retrieval object operates on heterogeneous domains and contents, unlike a context-specific system, which operates on document collections belonging to a specific domain, such as legal and business documents, technical papers, etc. Notable examples of the general case are web search engines, where the high heterogeneity of the information calls for a very general purpose approach. Google [104], Altavista [93], and Infoseek [111] are some general purpose engines that currently operate on the web. A specialized retrieval system is one that is developed with a particular application domain in mind. For instance, the LEXIS-NEXIS [119] retrieval system is a specialized retrieval system that provides access to a very large collection of legal and business documents. Similarly, the ResearchIndex service [105] provides free access to a large collection of scientific papers.

3.4. An Example

As we did with the vertical taxonomy, here we apply the horizontal taxonomy to a set of information retrieval objects. We have chosen 31 objects from various sources: research labs, companies, and institutions.

The main classification scheme consists of identifying, for each object, its horizontal components shown in Fig. 2.

This is done by analyzing the object as a black box and trying to gather information about what it does. The result is reported in the Appendix, in which information retrieval objects are listed with some informative notes and references. The presence of a cross indicates that the corresponding horizontal component is supported by the information retrieval object.

4. Concluding Remarks

For the purpose of simplicity, we have conducted the classification on two separate paths: a horizontal taxonomy and a vertical taxonomy. In reality, these taxonomies are not disjoint, and in this concluding section we show how these two important aspects of information retrieval can be combined. We have already remarked that an information retrieval object can be based on more than one model and an information retrieval model can be the basis for more than one object.

The vertical dimension classifies information retrieval models based on a two-component view, namely representation and reasoning. The horizontal dimension classifies information retrieval objects with respect to the application areas. Indeed, objects can themselves be classified with respect to the vertical components, namely representation and reasoning. We call this further classification of an IR object the vertical projection of the object; Tab. 2 shows the vertical projection for the IR objects referred to in the Appendix. Note that a few rows in the table are left blank, as we were not able to access the information needed to produce the vertical projections of the related objects.

Table 2. Vertical projections.

In recent years, information retrieval has assumed an increasing importance because of the dramatic growth of the amount of information available in digital formats. The proliferation of information retrieval algorithms, methods, technologies, and tools calls for the definition of basic concepts and terminology; this is useful to assess the features and the characteristics of each IR object and to understand the relationships that exist between the objects. In this paper we have proposed a taxonomy of IR objects, accompanied by definitions for the key terms. This taxonomy is a tentative first step in classifying IR models and tools, since it does not cover all aspects of IR. The market and the development of IR technologies are still evolving, and this evolution will make some observations contained in this paper obsolete. As a result, this work will need to be updated incrementally as the technology develops. However, we think that the taxonomy presented in this paper provides a good starting point for such a continuous updating.

One of the main limitations of the taxonomy presented in this paper is the fact that it covers only text information retrieval. Indeed, current information needs require more and more integrated retrieval models and tools that combine the traditional retrieval of text documents with the retrieval of multimedia content, such as images and speech, and even structured data from databases. Therefore, there is room for improvement of the proposed taxonomy and we are currently working on extending it in order to include other important aspects of IR not covered here, primarily the retrieval of multimedia content.

5. Acknowledgment

The work described in this paper has been supported by the EUREKA Project E!2235, IKF – Information and Knowledge Fusion.

References

[1] AGOSTI, M., CRESTANI, F., TACHIR: a Tool for the Automated Construction of Hypertexts in Information Retrieval, Proceedings of RIAO, Rockefeller University, (1994), New York (USA).

[2] ANANDEEP S., SYCARA, P.K., A Learning Personal Agent for Text Filtering and Notification, Proceedings of the International Conference of Knowledge Based Systems, (1996), �http���www�ri�cmu�edu�pubs�pub �����html�.

[3] ANDERBERG, M.R., Cluster analysis for applications, Academic Press, New York, 1973.

[4] BAEZA-YATES, R., GONNET, G., Efficient text searching of regular expressions, Proceedings of the 16th International Colloquium on Automata, Languages and Programming, LNCS 372, (1989), pp. 46–62, Berlin (Germany).

[5] BAEZA-YATES, R., NAVARRO, G., Fast approximate string matching, Algorithmica, 23(2), (1999), pp. 127–158.

[6] BEERI, C., KORNATZKY, Y., A logical query language for hypertext systems, Proceedings of the European Conference on Hypertext, (1990), pp. 67–80, Versailles (France).

[7] BELEW, R.K., Adaptive information retrieval, Proceedings of the 12th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 11–20, Cambridge (MA).

[8] BERND T., Logic Programs for Intelligent Web Search, Proceedings of the 11th International Symposium on Methodologies for Intelligent Systems, (1999), LNAI 1609, Warsaw (Poland).

[9] BLOSSEVILLE, M.J., HEBRAIL, G., MONTEIL, M.G., PENOT, N., Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together, Proceedings of the 15th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 51–57, Copenhagen (Denmark).

[10] BOOKSTEIN A., Fuzzy request: an approach to weighted Boolean searches, Journal of the American Society for Information Science, 31, (1980), pp. 240–247.

[11] BORDOGNA, G., PASI, G., A Fuzzy Linguistic Approach Generalizing Boolean Information Retrieval; a Model and Its Evaluation, Journal of the American Society for Information Science, 44, (1993), pp. 70–82.

[12] BRIN, S., PAGE, L., MOTWANI, R., WINOGRAD, T., The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford University, 1998.

[13] BROGLIO, J., CALLAN, J.P., CROFT, W.B., NACHBAR, D.W., Document retrieval and routing using INQUERY system, Proceedings of the 3rd Retrieval Conference TREC, (1995), pp. 29–38, Gaithersburg (Maryland).

[14] CALLAN, J., Document filtering with inference network, Proceedings of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1996), pp. 262–269, Zurich (Switzerland).

[15] CHANG, S.J., RICE, R.E., Browsing: a multidimensional framework, Annual Review of Information Science and Technology, 28, (1993), pp. 231–276.


[16] CHEN, H., Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms, Journal of the American Society for Information Science, 46(3), (1995), pp. 194–216.

[17] CHEN, H., LYNCH, K.J., BASU, K., NG, D.T., Generating, integrating, and activating thesauri for concept-based document retrieval, IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2), (1993), pp. 25–34.

[18] CHEN, H., SHE, L., Inductive query by examples (IQBE): A machine learning approach, Proceedings of the 27th Annual International Conference on System Sciences, Information Sharing and Knowledge Discovery Track, (1994), Maui (Hawaii).

[19] COOPER, W.S., GEY, F.C., DABNEY, D.P., Probabilistic retrieval based on staged logistic regression, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 198–210, Copenhagen (Denmark).

[20] CROFT, W.B., Approaches to intelligent information retrieval, Information Processing and Management, 23(4), (1987), pp. 249–254.

[21] CUTTING, D.R., PEDERSEN, J.O., KARGER, D., TUKEY, J.W., Scatter/gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 318–329, Copenhagen (Denmark).

[22] DAMASHEK, M., Gauging similarity with n-grams: Language-independent categorization of text, Science, 267, (1995), pp. 843–848.

[23] DOSZKOCS, T.E., REGGIA, J., LIN, X., Connectionist models and information retrieval, Annual Review of Information Science and Technology, 25, (1990), pp. 209–260.

[24] FLAKE, G.W., LAWRENCE, S., GILES, C.L., COETZEE, F.M., Self Organization and Identification of Web Communities, Journal of the IEEE Computer Society, 35(3), (2002), pp. 66–71.

[25] FOX, E.A., Extending the Boolean and vector space models of information retrieval with P-norm queries and multiple concept types, PhD thesis, Cornell University, 1983.

[26] FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G., SCHWANTNER, M., TZERAS, K., AIR/X – a rule-based multistage indexing system for large subject fields, Proceedings of the 8th National Conference on Artificial Intelligence, (1990), pp. 789–895, Boston (MA).

[27] FURNAS, G.W., DEERWESTER, S., DUMAIS, S.T., LANDAUER, T.K., HARSHMAN, R.A., STREETER, L.A., LOCHBAUM, K.E., Information retrieval using a singular value decomposition model of latent semantic structure, Proceedings of the 11th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1998), pp. 257–265, Grenoble (France).

[28] GARFIELD, E., Citation Indexing: Its Theory and Application in Science, John Wiley & Sons, New York, 1979.

[29] GORDON, M., Probabilistic and genetic algorithms for document retrieval, Communications of the ACM, 31(10), (1988), pp. 1208–1218.

[30] GORDON, M.D., User-based document clustering by redescribing subject descriptions with a genetic algorithm, Journal of the American Society for Information Science, 42(5), (1991), pp. 311–322.

[31] GUARINO, N., MASOLO, C., VETERE, G., Ontoseek: Content-Based access to the web, IEEE Intelligent Systems, 14(3), (1999), pp. 70–80.

[32] HAINES, D., CROFT, W.B., Relevance feedback and inference networks, Proceedings of the 16th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1993), pp. 2–11, Pittsburgh (USA).

[33] KLEINBERG, J.M., Authoritative Sources in a Hyperlinked Environment, Proceedings of the 9th Annual Int. ACM SIAM Symposium on Discrete Algorithms, (1998), pp. 668–677, New York (USA).

[34] KOZA, J.R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, The MIT Press, Cambridge, MA, 1992.

[35] KRAFT, D., BUELL, D.A., Fuzzy sets and generalized Boolean retrieval systems, International Journal of Man-machine Studies, 19, (1983), pp. 45–56.

[36] KRAFT, D., PETRY, F.E., BUCKLES, B.P., SADASIVAN, T., The use of genetic programming to build queries for information retrieval, IEEE Symposium on Evolutionary Computation, (1994), pp. 468–473, Orlando (USA).

[37] KRAFT, D.H., BORDOGNA, G., PASI, G., Fuzzy set techniques in information retrieval, in J. Bezdek, D. Dubois and H. Prade (eds), Fuzzy Sets in Approximate Reasoning and Information Systems, 3(8), (1999), pp. 469–510, Kluwer Academic Publishers.

[38] KUHLTHAU, C.C., Inside the search process: Information seeking from the user’s perspective, Journal of the American Society for Information Science, 42(5), (1991), pp. 361–371.

[39] KWOK, K.L., A neural network for probabilistic information retrieval, Proceedings of the 12th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 202–210, Cambridge (USA).

[40] LAMPORT, L., LaTeX: A document Preparation System, User’s guide and Reference manual; 2nd edition, Prentice Hall, 1994.

[41] LAYAIDA, R., BOUGHANEM, M., CARON, A., Constructing an information retrieval system with neural networks, Lecture Notes in Computer Science, 856, (1994), pp. 561–570.


[42] LEE, J.H., Properties of extended boolean models in information retrieval, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (1994), pp. 182–190.

[43] LEWIS, D.D., Learning in intelligent information retrieval, Proceedings of the 8th International Workshop on Machine Learning, (1991), pp. 235–239, Morgan Kaufmann.

[44] LIN, X., SOERGEL, D., MARCHIONINI, G., A self-organizing semantic map for information retrieval, Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1991), pp. 262–269, Chicago (IL).

[45] LUHN, H.P., A statistical approach to mechanized encoding and searching of library information, IBM Journal of Research and Development, 1, (1957), pp. 309–317.

[46] MACLEOD, K.J., ROBERTSON, W., A neural algorithm for document clustering, Information Processing & Management, 27(4), (1991), pp. 337–346.

[47] MCCUNE, B., TONG, R., DEAN, J.S., SHAPIRO, D., Rubric: a system for rule-based information retrieval, IEEE Transaction on Software Engineering, 1985, 11(9).

[48] MIYAMOTO, S., NAKAYAMA, K., Fuzzy information retrieval based on a fuzzy pseudo thesaurus, IEEE Transactions on Systems and Man Cybernetics, 1986, 16(2), pp. 278–282.

[49] MIYAMOTO, S., TERUHISA, M., KAZUHIKO, N., Generation of a Pseudothesaurus for Information Retrieval based on co-occurrences and fuzzy set operations, IEEE Transaction Systems, Man and Cybernetics, 13(1), (1983), pp. 62–69.

[50] MIZZARO, S., A cognitive analysis of information retrieval, Proceedings of CoLIS2, (1996), pp. 233–250, Copenhagen (Denmark).

[51] MIZZARO, S., How many relevancies in information retrieval?, Interacting with Computers, 10(3), (1998), pp. 305–322.

[52] BAEZA-YATES, R., RIBEIRO-NETO, B., Modern Information Retrieval, Addison Wesley, New York, 1999.

[53] OGAWA, Y., MORITA, T., KOBAYASHI, K., A fuzzy document retrieval system using the keyword connection matrix and a learning method, Fuzzy Sets and Systems, 39, (1991), pp. 163–179.

[54] PAIJMANS, H., Explorations in the document vector model of information retrieval, Dissertation, Tilburg University, 1999. http���pi����kub�nl�����Paai�Bibliogr�

[55] PIROLLI, P., PITKOW, J., RAO, R., Silk from Sow’s Ear: Extracting Usable Structures from the web, Proceedings of the ACM Conference on Human Factors in Computing Systems, (1996), pp. 118–125, New York (USA).

�56� QUINLAN, J.R., Learning efficient classificationprocedures and their application to chess andgames, Machine Learning, an Artificial Intel-ligence Approach, �1983�, pp. 463–482, TiogaPublishing company, Palo Alto, CA.

�57� RIBEIRO-NETO, B.A., MUNTZ, R., A Belief net-work model for IR, Proceedings of the 19th An-nual Int. ACM SIGIR Conference on Research andDevelopment in Information Retrieval, �1996�, pp.253–260, Zurich �Switzerland�.

�58� RIJSBERGEN, C.J., Information Retrieval, Butter-worths, London, 1979.

�59� ROBERTSON, S.E., SPARCK JONES, K., Relevanceweighting of search terms, Journal of the AmericanSociety for Information Sciences, 27(3), �1976�,pp. 129–146.

�60� ROBINS, D., Interactive Information Retrieval:Context and Basic Notions, Information Science,3(2), �2000�, pp. 57–61.

�61� ROCCHIO, J.J., Relevance Feedback in InformationRetrieval, Prentice Hall, 1971.

�62� RUMELHART, D.E., HINTON, G.E., WILLIAMS, R.J.,Learning Internal Representations by Error Prop-agation, Parallel Distributed Processing, �1986�,pp. 318–362, The MIT Press, Cambridge, MA.

�63� SACHS W.M., An approach to associative retrievalthrough the theory of fuzzy sets, Journal ofthe American Society for Information Sciences,�1976�, pp. 85–87.

[64] SALTON, G., The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice Hall, New York, 1971.

[65] SALTON, G., Automatic text processing: The transformation, analysis, and retrieval of information by computer, Addison-Wesley, 1989.

[66] SALTON, G., BUCKLEY, C., Term weighting approaches in automatic text retrieval, Information Processing and Management, 24(5), (1988), pp. 513–523.

[67] SARACEVIC, T., Relevance: A Review of and a Framework for the thinking on the notion in information science, Journal of the American Society for Information Science, 26(6), (1975), pp. 321–343.

[68] SEBASTIANI, F., On the Role of Logic in Information Retrieval, Information Processing & Management, 34(1), (1998), pp. 1–18.

[69] SMITH, L.C., WARNER, A.J., A taxonomy of representation in information retrieval design, Journal of Information Science, 8, (1984), pp. 113–121.

[70] TAHANI, V.A., A fuzzy model of document retrieval systems, Information Processing and Management, 12, (1976), pp. 177–187.

[71] TURTLE, H., CROFT, W.B., Inference networks for document retrieval, Proceedings of the 13th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1990), pp. 1–24, Brussels (Belgium).

[72] UTGOFF, P.E., Incremental induction of decision trees, Machine Learning, 4, (1989), pp. 161–186.

[73] VOORHEES, E.M., HARMAN, D., Overview of TREC 2001, National Institute of Standards and Technology, 2001.

[74] WALLER, W.G., KRAFT, D.H., A Mathematical Model of a Weighted Boolean Retrieval System, Information Processing & Management, 15(5), (1979), pp. 235–245.

[75] WILKINSON, R., HINGSTON, P., Using the cosine measure in a neural network for document retrieval, Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1991), pp. 202–210, Chicago (USA).

[76] WONG, S.K.M., YAO, Y.Y., On modeling information retrieval with probabilistic inference, ACM Transactions on Information Systems, 13(1), (1995), pp. 39–68.

[77] WONG, S.K.M., ZIARKO, W., RAGHAVAN, V.V., WONG, P.C.N., On Extending the Vector Space Model for Boolean Query Processing, Proceedings of the 9th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1986), pp. 175–185, Pisa (Italy).

[78] WONG, S.K.M., ZIARKO, W., WONG, P.C.N., Generalized vector space model in information retrieval, Proceedings of the 8th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1985), pp. 18–25, New York (USA).

[79] WU, S., MANBER, U., Agrep: a fast approximate pattern matching tool, Proceedings of USENIX Technical Conference, (1992), pp. 153–162, San Francisco (USA).

[80] XML eXtensible Markup Language 1.0 (Second Edition) W3C Recommendation 6 October 2000. http://www.w3.org/XML/

[81] XPath XML Path Language 1.0 W3C Recommendation 16 November 1999. http://www.w3.org/TR/xpath

[82] YANNAKOUDAKIS, E.J., GOYAL, P., HUGGIL, J.A., The generation and use of text fragments for data compression, Information Processing and Management, 18, (1982), pp. 15–21.

[83] ZIPF, G.K., Human Behaviour and the Principle of Least Effort, Addison-Wesley, Cambridge, 1949.

[84] CORA. http://cora.whizbang.com

[85] TACHIR. http://www.dei.unipd.it/~ims/tachir.html

[86] SIFT. ftp://db.stanford.edu/pub/sift/sift�����netnews.tar.Z

[87] NewsWeeder. http://anther.learning.cs.cmu.edu/ifhome.html

[88] Grep. http://www.gnu.org

[89] LEXA. http://nora.hd.uib.no/lexainf.html

[90] OCP. http://info.ox.ac.uk/ctitext/resguide/resources/o���html

[91] IBM text miner. http://www.ibm.com

[92] Metacrawler. http://www.metacrawler.com/

[93] Altavista. http://www.altavista.com

[94] Scooter. http://www.altavista.com

[95] INQUERY. http://www.ciir.cs.umass.edu

[96] SMART. ftp://ftp.cs.cornell.edu/pub/smart/

[97] ILA. Internet Learning Agent. http://www.cs.washington.edu/homes/map/ila.html

[98] WebLearner. http://www.ics.uci.edu/~pazzani/Coldlist.html

[99] Yahoo Directory. http://www.yahoo.com

[100] NewsSieve. http://www.newssieve.com/

[101] Inktomi. http://www.inktomi.com

[102] Slurp. http://www.inktomi.com

[103] Isearch. http://www.cnidr.org/isearch.html

[104] Google. http://www.google.com

[105] ResearchIndex. http://www.researchindex.com

[106] Agrep, Glimpse. http://glimpse.cs.arizona.edu/

[107] Scatter/Gather. http://www.sims.berkeley.edu/~hearst/sg-overview.html

[108] Amalthaea. http://lcs.www.media.mit.edu/~moux/papers/PAAM� �PAAM� �html

[109] Excite. http://www.excite.com

[110] ArchitextSpider. http://www.excite.com

[111] Infoseek. http://www.infoseek.com

[112] Sidewinder. http://www.infoseek.com

[113] Northern Light. http://www.northernlight.com

[114] Gulliver. http://www.northernlight.com

[115] WEBSOM. http://websom.hut.fi/websom

[116] Infind. http://www.infind.com

[117] Lycos. http://www.lycos.com

[118] GeoSearch. http://www.northernlight.com

[119] LEXIS-NEXIS. http://www.lexis-nexis.com

Appendix: Horizontal Taxonomy of a Set of Information Retrieval Objects

Received: September, 2002

Revised: January, 2004

Accepted: May, 2004

Contact address:

Gerardo Canfora
Research Centre on Software Technology
Department of Engineering
University of Sannio
Palazzo ex Poste – Via Traiano
82100 Benevento
ITALY
e-mail: gerardo.canfora@unisannio.it

GERARDO CANFORA received the Laurea degree in electronic engineering from the University of Naples, Federico II, Italy, in 1989. He is currently a full professor of computer science at the Faculty of Engineering and the Director of the Research Centre on Software Technology (RCOST) of the University of Sannio in Benevento, Italy. From 1990 to 1991, he was with the Italian National Research Council (CNR). During 1992, he was at the Department of Informatica e Sistemistica of the University of Naples, Federico II, Italy. From 1992 to 1993, he was a visiting researcher at the Centre for Software Maintenance of the University of Durham, UK. In 1993, he joined the Faculty of Engineering of the University of Sannio in Benevento, Italy. He has served on the program committees of a number of international conferences. He was a program co-chair of the 1997 International Workshop on Program Comprehension and of the 2001 International Conference and the General Chair of the 2003 European Conference on Software Maintenance and Reengineering. His research interests include software maintenance, program comprehension, reverse engineering, workflow management, document and knowledge management, and information retrieval. He serves on the Editorial Board of the IEEE Transactions on Software Engineering. He is a member of the IEEE and the IEEE Computer Society.

LUIGI CERULO received the Laurea degree in computer engineering from the University of Sannio, Italy, in 2001. He is currently an assistant researcher at the Research Centre on Software Technology (RCOST) of the University of Sannio in Benevento, Italy. His research interests include information retrieval, fuzzy logic, and visual languages.