
Information Retrieval, 2, 115–129 (2000). © 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Task-Oriented Non-Interactive Evaluation Methodology for Information Retrieval Systems

JANE REID [email protected]
Department of Computer Science, Queen Mary and Westfield College, University of London, E1 4NS, England

Received September 10, 1998; Revised July 12, 1999

Abstract. Past research has identified many different types of relevance in information retrieval (IR). So far, however, most evaluation of IR systems has been through batch experiments conducted with test collections containing only expert, topical relevance judgements. Recently, there has been some movement away from this traditional approach towards interactive, more user-centred methods of evaluation. However, these are expensive for evaluators in terms both of time and of resources. This paper describes a new evaluation methodology, using a task-oriented test collection, which combines the advantages of traditional non-interactive testing with a more user-centred emphasis. The main features of a task-oriented test collection are the adoption of the task, rather than the query, as the primary unit of evaluation and the naturalistic character of the relevance judgements.

Keywords: user-centred evaluation, test collection, nature of relevance, task framework

1. Introduction

The effectiveness of an IR system depends on its ability to retrieve ‘relevant’ documents and suppress retrieval of ‘non-relevant’ documents (van Rijsbergen 1979, page 6). The meaning of ‘relevance’ in this context is far from clear, however. So far, most evaluation of IR systems has been through experiments conducted with test collections which contain only expert, topical relevance judgements. This narrow view of relevance as an objective and static concept has recently been strongly challenged by researchers from very different viewpoints who have acknowledged its subjective, dynamic and multi-dimensional nature (Wilson 1973, Saracevic 1975, Schamber et al. 1990, Harter 1992, Mizzaro 1998). The implication for IR evaluation is clear: the effectiveness of an IR system should be measured not by its performance according to a ‘perfect solution’, but rather by its performance with regard to the normal context of its use (Park 1994).

In a real-world IR situation, the primary motivation for a user sitting down with an IR system is to retrieve information which will allow him to complete his current task. Any actions which he performs as part of the IR session are motivated by this overall goal, and his degree of satisfaction with the retrieval results will depend on the true ‘task-relevance’ of the results, i.e. their importance in enabling him to complete his task successfully. As Soergel (1976) points out, “The ultimate objective of any information storage and retrieval system is [. . .] improved task performance/problem-solving/decision-making by the user” (page 257). It would seem only realistic, then, that the effectiveness of an IR system should be measured in terms of its ability to retrieve ‘task-relevant’ documents, where a task-relevant document may be defined as one which contributes in some way to the successful completion of the task in hand. The concept of task-relevance therefore embraces non-topical relevance criteria, such as those identified by Park (1993) and Barry and Schamber (1998), among others.

Clearly, this new emphasis implies moving towards more user-centred evaluation methods. However, the techniques influenced by HCI which have been adopted by some researchers have proved expensive for the evaluator in terms both of time and of resources (Saracevic et al. 1988a, b, c, Pejtersen 1996). It has also so far proved impossible to develop a standard interactive evaluation methodology which will allow for comparison across different systems and users. The main aims of this paper are to describe a new, task-oriented framework for the IR process, and to introduce a novel evaluation methodology based on the concept of a task-oriented test collection. This technique will combine the advantages of standard test collections for evaluators, i.e. speed, cheapness and direct comparability between systems, with a more realistic, user-centred emphasis built in at the test collection construction stage.

The remainder of the paper is structured as follows. Section 2 highlights the limitations of current non-interactive IR evaluation. Section 3 describes the task-oriented framework for IR, and Section 4 defines the components of a task-oriented test collection. In Sections 5 and 6, the issues involved in the use and creation of such a test collection are investigated. Section 7 presents a redefinition of traditional statistical measures to be employed in the task-oriented test collection methodology. Section 8 draws some brief conclusions, and Section 9 outlines some implications for future research arising from this work.

2. Non-interactive evaluation

Non-interactive evaluation usually involves two components: a test collection and some statistical measures (Cleverdon et al. 1966) to perform the comparison between the test collection and the system’s results (van Rijsbergen 1979, chapter 7). A basic test collection consists of a collection of documents, a set of queries (or, providing more detail, a set of requests) to be run against the collection, and the associated relevance judgements. Queries and relevance judgements can be generated by a number of methods, but are most commonly created by domain experts.

An end-evaluator may then run the queries on his own IR system, using the collection of documents provided. This results in a retrieved set of documents, which may be considered as the system’s relevance judgements, and these can be compared to the test collection’s relevance judgements using statistical measures. Recall and precision are the two statistical measures most widely used to compare the two sets of relevance judgements and, taken together, measure the system’s ability to retrieve only relevant documents (i.e. to retrieve relevant documents and ignore non-relevant documents).
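As an illustration of this comparison step, the following minimal Python sketch (not part of any test collection distribution; the function and the example document identifiers are invented here) computes recall and precision for a single query by comparing a retrieved document set against a set of binary relevance judgements.

```python
def recall_precision(retrieved, judged_relevant):
    """Compare a system's retrieved set against a test collection's
    binary relevance judgements for one query."""
    retrieved = set(retrieved)
    judged_relevant = set(judged_relevant)
    relevant_retrieved = retrieved & judged_relevant
    recall = len(relevant_retrieved) / len(judged_relevant) if judged_relevant else 0.0
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Example: the system retrieves d2, d3 and d7; the collection judges d2 and d5 relevant.
print(recall_precision(["d2", "d3", "d7"], ["d2", "d5"]))  # (0.5, 0.333...)
```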

Although the test collection approach was first developed as an appropriate method of evaluating early non-interactive IR systems, there are obvious and continuing advantages to the technique. An experiment can be conducted for relatively little expense and effort, and the use of standard statistical measures makes it easy to compare performance across systems.


Unfortunately, the character of test collections has changed little over the years, while the nature of IR systems and our understanding of the IR process have changed dramatically. When test collections were first employed (Cleverdon et al. 1966), relevance judgements were required to be as objective and accurate as possible, since retrieval “was conducted in an environment where putting a question to a system was a major event” (Robertson and Hancock-Beaulieu 1992, page 459). The most objective judgements were considered to be those of domain experts, since they concentrate purely on the content of the documents, in isolation from the IR system. Binary relevance judgements were used, since it was assumed that each document would be simply either relevant or non-relevant.

This view of the IR process, and more specifically the nature of relevance assessment, clearly has limitations. It takes no account of the interactive nature of current IR systems, and is rooted firmly in the systems tradition of IR relevance research.1 Saracevic (1996) identifies four relevance frameworks, the systems, communication, psychological and situational frameworks. Of these four frameworks, the systems framework most closely matches the traditional, test collection view of relevance. The situational framework, which explicitly acknowledges the fundamental importance of situation, context, multi-dimensionality and time, is the most comprehensive as far as the users’ role is concerned. This situational view of relevance assessment is based on a cognitive view of the information seeking process, in which the user perceives a gap in their knowledge, leading to the formation of an information need (Belkin et al. 1982, Ingwersen 1992), which is the starting point for the IR process. In this view, the relevance of a document may be expressed as “a piece of new knowledge constructed by the requester in the light of some information need or deficit” (Swanson 1977, page 139). However, attempts to integrate cognitive theory into real IR evaluation have been limited, with the recent exception of Borlund and Ingwersen (1997).

As early as 1975, Sparck Jones and van Rijsbergen recognised that test collections were far from perfect. In their report for the British Library (Sparck Jones and van Rijsbergen 1975), they identified the characteristics which an ‘ideal’ test collection (or test collections) should possess. Some of these criteria have now been met in different individual collections. For example, the issue of scale has been effectively addressed by TREC (Harman 1995), which has also recently recognised the inherent interactivity of IR in its interactive track (Beaulieu et al. 1995). However, some issues highlighted in the report remain unexplored. For example, there seems to have been no work on incorporating user relevance judgements into a test collection, although user queries were collected and used for a large-scale medical collection in the OHSUMED project (Hersh et al. 1994). There is also a continuing assumption in test collection design that relevance is a static concept, implying that time is not a factor in relevance judgements. This is clearly untrue, as has been acknowledged in the research on presentation order of documents (Eisenberg and Barry 1988, Janes 1991). On the same theme, Harter (1996) summarises the findings of the large body of research on variations among relevance assessments, and points out the lack of influence that this has so far exercised on IR evaluation methodologies. The long-standing theoretical concern with a broad range of evaluation issues (Sparck Jones 1981) has clearly failed to have much practical impact on IR evaluation.

This paper presents a framework which is based on one of the primary elements of the situational framework, the task. Information retrieval is inherently task-oriented. Users initiate IR sessions for many different reasons. However, they each have a purpose for seeking information, and this purpose may be considered as a task.

3. Task-oriented IR

The concept of “task” used in this paper is an intuitive one, based on the notion of a goal or aim, and the necessity for certain actions in order to fulfil this goal. The approach adopted in this paper is an “object-oriented” one, perhaps closer to the HCI notion of “activity” rather than task.2 This contrasts with much of the task analysis work carried out in HCI (Diaper 1989), which focuses on the actions involved in executing tasks.

It is clear that there is a huge variety of potential IR tasks. However, it is possible to identify some features which are common across all tasks and others which can be used to make a basic distinction between different types of task. In order to identify those features which are common across all tasks, a general IR task framework may be developed (figure 1).

Figure 1. The IR task framework [adapted from Reid 1999].


The starting-point of this framework is the task set by the task setter, who produces a task representation in some medium. The task performer interprets the task representation and forms a task model in his mind,3 which contributes to the formation of an information need (or multiple information needs). The task performer then conducts an IR session with the aim of satisfying this information need, in the course of which the information need, including the task model, is continuously refined. Finally, the task performer uses the information gained during the IR session to produce a task outcome. The outcome is subject to task assessment by the task setter, usually leading to task performance feedback to the task performer, which allows him to revise his idea of the suitability of the task outcome, i.e. learn. This complete process takes place in an external context, i.e. against a background of social and environmental factors, which may have an impact on any stage of the process. Task setter and task performer will experience different external contexts, although there may also be some shared experience.

A basic distinction may be made between those tasks which are internally generated and those which are externally generated. Externally generated tasks are those where the task performer, i.e. the person executing the task, is a different person from the task setter, i.e. the person conceiving the task. Internally generated tasks are those where the task performer is the same person as the task setter, i.e. the task is conceived and executed by one person, with no external influence.

For internally generated tasks, the complete process, including task outcome assessment, will be internal, i.e. there will be no physical task representation created. For example, a researcher reading a paper may realise that he has an incomplete grasp of one aspect of the subject matter, and may decide to supplement his knowledge by looking for further papers in this related field. He will judge the success of the task outcome (the information he gains from his IR session) by whether, and to what extent, it enables him to reach a better understanding of the paper he was reading. This process is completely internally driven, but nonetheless constitutes a task to be performed.

For externally generated tasks, the task setter will normally give a statement in some medium (e.g. verbal or written) of the task which is to be performed, usually with some guidance about the form of the task outcome. Some additional background information may also be provided. Once the task has been performed, the task outcome is submitted in the appropriate medium to the task setter. The feedback in this case may be of various forms (e.g. verbal or written) and degrees (e.g. mark, grade, suggestions for improvements), and may concern different aspects, e.g. content, presentation. For example, an employee working in the marketing section of a large company may be asked by his department head to research the current competition which would be faced by a potential new product, and present his findings to the group. At the same time, the standard of his presentation may contribute to his annual employee review. The success of the outcome will therefore be assessed according to two different criteria, with the content of his work perhaps receiving verbal feedback and comments during the course of the presentation, and more formal feedback, including assessment of the quality of his presentation style, being given in the form of a structured report from his department head. Clearly, both of these styles of external feedback will help the employee to adjust his view of the suitability of his work, but they focus on different aspects of performance and use different criteria.


4. Properties of a task-oriented test collection

A task-oriented test collection will consist of:

• a collection of documents
• a set of task descriptions
• a set of queries per task
• a set of relevance judgements per task

4.1. Collection

Although current available test collections are textual, a task-oriented test collection could contain multi-media or mixed media documents.

4.2. Task descriptions

A task description is a description of the task itself and the task context, and can include:

• task performer information
• task outcome information
• task completion information

Task performer information includes a description of the role which the task performer fulfils when executing the task, along with any situational or contextual information. Task outcome information includes guidance from the task setter on format and content of the task outcome. Task completion information includes guidance on any resources or methods recommended by the task setter. Not every task description will include all of these types of information. Figure 2 shows an example task description.

Figure 2. An example task description.


A task description may be considered as analogous to a TREC topic (Harman 1995), but with an important distinction. A TREC topic is constructed for the benefit of a relevance assessor and therefore includes a description of the relevance criteria which should be applied. The purpose of this is to provide objective guidelines for relevance assessment which will eliminate differences between individual judges. A task description, on the other hand, acknowledges the dynamic and subjective nature of relevance assessment by providing only the starting-point for the information need in the form of task outcome and task completion information. In this way, the concept of task description is much closer to Borlund’s simulated information need situation (Borlund and Ingwersen 1997).

4.3. Queries

Queries take the form of brief, natural language statements, created by the task performers, which may be submitted directly to a query-based IR system.

4.4. Relevance judgements

Swanson (1986) makes a distinction between the “objective” view of relevance as an association between a query and a document, and the “subjective” view of relevance as an association between a query and the end-user’s information need. In the task-oriented IR evaluation model, relevance may be defined as the relationship of a document to the task performer’s task model. Each task therefore has a related set of relevance judgements.

In standard test collections, relevance judgements are made by experts without reference to use of an IR system. There have been some studies which have examined relevance judgements made by non-expert judges (Cuadra and Katter 1967, Rees and Schulz 1967, Regazzi 1988), but these have focussed on judgements made outwith the context of an IR system. Even within the context of an IR system, the relevance judgements made during an IR session may not be good indicators of the documents which really provide task-relevant information. Other factors, e.g. the success of the task performer’s current search strategy (Bates 1979, Marchionini 1995) and the technical expertise of the task performer (Borgman 1989), may have an influence. For example, a task performer may mark a particular document as relevant because he did not formulate a very effective query and needs to guide the search in a more appropriate direction. It is also clear that a task performer’s relevance judgements will change over time, as more information is assimilated and his understanding of the task is improved. This is clearly demonstrated in the case study, problem-based approach advocated by Kuhlthau (1991) and Park (1994) and used by Smithson (1994) and Tang and Solomon (1998). However, even this broad approach defines the end of the task as the completion of the task outcome. The approach adopted in this paper goes further, in taking into account the feedback and learning stage, thus including the wider context of knowledge acquisition, where knowledge is implicitly modified through interaction with the surrounding environment.

The task performer’s notion of what constitutes a task-relevant document will thus be implicitly modified as the session progresses, as the task is performed, and finally as he receives feedback on the outcome. As the task performer’s comprehension of the task improves, his ability to state which documents contain task-relevant information improves too. Only at the post-task feedback stage of the process will he be able to state clearly which documents provided task-relevant information, because only at this stage will he be able to judge the success of the task outcome. This is really claiming that a task performer with recent retrospective knowledge will be able to make a more accurate assessment of task-oriented relevance, and will use this assessment to make decisions about future actions. This intuitive view is supported by Smithson (1994), who found that even post-task relevance assessment is not always accurate. For these reasons, post-feedback relevance judgements should be used.

Intuitively, a document which is chosen as being task-relevant by more than one task performer should be weighted in some way. For example, if 10 task performers have completed the same task and have all marked document d1 as being relevant, the prior probability that task performer 11 will also choose document d1 as being relevant is clearly higher than if only 2 previous task performers had chosen it. This prior probability may also be viewed as a certainty or confidence value in the relevance of the document, and becomes more important as the field of task performers is widened to include non-experts. It has already been shown in several studies that there continues to be variation among expert judges, but this is smaller than the variation which can be seen among non-expert judges, and considerably smaller than the variation which can be observed across mixed expert and non-expert judges, i.e. intra-group consistency is higher than inter-group consistency (Regazzi 1988).

The great majority of existing test collections use binary relevance judgements, with two exceptions being the Cystic Fibrosis collection used by Smithson (1994), and the STAIRS collection (Blair 1990). Task-oriented test collection relevance judgements should be weighted in order to provide a better match with our intuitive understanding, and a finer granularity of assessment of the IR system performance. In order to provide compatibility with current statistical measures, weights should be between 0 and 1. An explanation of how recall and precision measures can be adapted to deal with weighted relevance judgements can be found in Section 7.

5. Using a task-oriented test collection

The test collection is used in exactly the same way as standard test collections, with the provided queries being run non-interactively on an IR system. The resulting list of ranked documents can be evaluated with respect to the test collection relevance judgements, as normal.

Since test collection evaluation yields only quantitative and statistical data, end-evaluators may wish to perform interactive evaluation, using the task descriptions, to provide complementary qualitative data. For example, the task descriptions could provide a suitable basis for the formation of ‘simulated information needs’ (Borlund and Ingwersen 1997) for use in interactive evaluation with experimental subjects.

6. Creating a task-oriented test collection

There are many issues to be considered in the construction of any test collection, but these are multiplied in this method by the experimental nature of the relevance judgement process.


6.1. Task descriptions

The first factor to be considered in the choice of tasks is whether the tasks should be real, i.e. collected from past or present users of the collection, or simulated, i.e. created artificially. Clearly, real tasks will be more difficult to identify and the associated experimental data may be difficult to gather. Conversely, a greater degree of control can be exerted over a laboratory experiment using experimental subjects. However, the latter strategy may lead to much less “realistic” data, more closely approaching the traditional nature of relevance judgements. For this reason, only real tasks should be employed in a task-oriented test collection.

Although a representative set of tasks can be obtained by simple observation over a long period of time, one method of speeding up this process is to perform some classification or categorisation, and ensure that the tasks in the collection are evenly distributed across the categories. Variation in the components of the task description, e.g. the type of task outcome, allows a simple classification to be made. A more comprehensive classification can be achieved if the test collection creator also ensures a spread of tasks across the following dimensions:

• tasks generated by domain experts/novices: this dimension is indicated by standard HCI texts, e.g. Preece et al. 1994.
• tasks generated externally/internally: this dimension is indicated by the task framework presented in Section 3.
• simple (well-defined)/complex (poorly defined) tasks: this dimension is indicated by the work on task complexity in IR (Byström and Järvelin 1995).

In addition, tasks which cover a range of the topics treated in the collection should be used (Sparck Jones and van Rijsbergen 1975). For each externally generated task, multiple task performers with a range of system experience should be used. This dimension is indicated by standard HCI texts, e.g. Preece et al. 1994. Clearly, this is not possible for internally generated tasks, since only one task performer is involved.

6.2. Queries

Queries are created by the task performers. Where there is more than one task performer per task, all the queries should be included in the test collection.

6.3. Relevance judgements

It has already been stated that weighted relevance judgements should be used in a task-oriented test collection. In the case of internally generated tasks, weighted relevance judgements can only be generated by means of obtaining individual weighted relevance judgements. In the case of externally generated tasks, there are two possible methods of generating weighted relevance judgements:


• by obtaining binary relevance judgements from more than one task performer, and combining these
• by obtaining weighted relevance judgements from more than one task performer, and combining these

The concept of binary relevance has recently been heavily criticised (Schamber et al. 1990), on the grounds that our intuitive understanding of the notion is based on varying degrees of relevance rather than its presence or absence. For this reason, and to maintain consistency with internally generated tasks, individual weighted relevance judgements should always be used. These can be generated by several possible methods, e.g. through the tri-partite categories used by Saracevic et al. (1988a, b, c) and Borlund and Ingwersen (1997), or by simply asking task performers for relevance judgements on a continuous scale of 0–1.
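The paper does not prescribe a particular mapping from categories onto the 0–1 scale; as a purely hypothetical illustration, the following Python sketch maps tri-partite category labels onto example weights (the labels and the values 0, 0.5 and 1 are assumptions made here for illustration) and accepts judgements already expressed on a continuous scale.

```python
# Hypothetical mapping of tri-partite relevance categories onto the 0-1 scale;
# the category names and numeric values are illustrative design choices only.
CATEGORY_WEIGHTS = {
    "not relevant": 0.0,
    "partially relevant": 0.5,
    "highly relevant": 1.0,
}

def to_weight(judgement):
    """Accept either a category label or a number already on the 0-1 scale."""
    if isinstance(judgement, str):
        return CATEGORY_WEIGHTS[judgement]
    if not 0.0 <= judgement <= 1.0:
        raise ValueError("weighted judgements must lie between 0 and 1")
    return float(judgement)

print(to_weight("partially relevant"))  # 0.5
print(to_weight(0.8))                   # 0.8
```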

7. Statistical measures

The existing measures of recall and precision are compatible with the task-oriented test collection approach, although it could be argued that they are not the most appropriate measures. Use of recall and precision requires a reinterpretation of the standard definitions of these measures, however, since the test collection relevance judgements will be non-binary.

Recall and precision are normally viewed as two measures based on the proportion of relevant documents retrieved. This involves a count of the number of documents. However, this clearly only applies with a binary interpretation of relevance, i.e. relevance is either absent (relevance judgement = 0) or present (relevance judgement = 1). If we view relevance as a partial concept, however, binary relevance judgements may be viewed simply as special cases. In effect, recall and precision may then be viewed as two measures based on the accumulated amount of relevance found within the retrieved documents.

So, whereas recall is traditionally viewed as

    recall = (number of relevant documents retrieved) / (total number of relevant documents)

we view it instead as

    recall = (relevance weight of documents retrieved) / (total relevance weight of all documents)

As an example, consider the following. Suppose we have 5 documents, d1, d2, . . ., d5, and 5 users, u1, u2, . . ., u5, who are all performing the same task. Our grid of task performer relevance judgements might look like Table 1.

Clearly, it is possible to apply a threshold to these judgements to produce a binary relevance judgement, i.e. if a certain number or proportion of task performers choose a document as being relevant, that document is labelled as relevant.


Table 1. An example grid of task performer relevance judgements.

         u1   u2   u3   u4   u5
    d1   x    o    o    o    o
    d2   o    o    x    o    x
    d3   o    o    x    x    o
    d4   o    o    o    o    o
    d5   x    o    x    x    o

where x indicates that a task performer has marked a document as relevant and o indicates that a task performer has not marked a document as relevant.

Table 2. Possible methods for combining relevance judgements from Table 1.

         Method 1   Method 2   Method 3
    d1   0.25       0          0.125
    d2   0.5        0          0.375
    d3   0.75       0          0.625
    d4   0.75       0          0.625
    d5   1.0        1.0        1.0

In Table 2, method 1 demonstrates the case where the threshold is set to 1, and method 2 demonstrates the case where the threshold is set to 3 (the majority). Method 3 shows the weighted relevance judgements obtained by the simple and intuitive principle of accumulating the amount of relevance for each document across all the task performers.

It can be seen from the above table that these different methods produce vastly differing results, with method 3 giving an intuitively better representation of degrees of relevance.
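The following Python sketch illustrates the two combination strategies discussed above on a grid of binary marks like Table 1. It is an illustration only: simple averaging (the proportion of task performers marking a document relevant) is assumed here for the accumulation step, and the exact weighting scheme behind the published Table 2 values is not reproduced.

```python
# True means "marked relevant by that task performer", one entry per performer.
judgements = {
    "d1": [True, False, False, False, False],
    "d2": [False, False, True, False, True],
    "d3": [False, False, True, True, False],
    "d4": [False, False, False, False, False],
    "d5": [True, False, True, True, False],
}

def threshold_combine(marks, threshold):
    """Binary combined judgement: relevant iff at least `threshold` performers agree."""
    return 1.0 if sum(marks) >= threshold else 0.0

def average_combine(marks):
    """Weighted combined judgement: proportion of performers marking the document relevant."""
    return sum(marks) / len(marks)

for doc, marks in judgements.items():
    print(doc,
          threshold_combine(marks, 1),   # threshold of 1
          threshold_combine(marks, 3),   # threshold of 3 (the majority)
          round(average_combine(marks), 2))  # accumulated relevance
```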

The same process is carried out with precision to give a re-interpretation of the current definition of precision:

    precision = (number of relevant documents retrieved) / (total number of documents retrieved)

as

    precision = (relevance weight of documents retrieved) / (potential relevance weight of documents retrieved)

Again, the standard precision measure can be interpreted as a special case of this general rule.
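A minimal sketch of the re-interpreted measures is given below, assuming weighted relevance judgements on a 0–1 scale and interpreting the “potential relevance weight” of each retrieved document as the maximum possible weight of 1 (so the denominator reduces to the number of documents retrieved, and the binary case falls out as the special case). The example weights are illustrative only.

```python
def weighted_recall(retrieved, weights):
    """Accumulated relevance retrieved, divided by the total relevance in the collection."""
    total = sum(weights.values())
    found = sum(weights.get(d, 0.0) for d in retrieved)
    return found / total if total else 0.0

def weighted_precision(retrieved, weights):
    """Accumulated relevance retrieved, divided by the potential relevance of the retrieved set."""
    if not retrieved:
        return 0.0
    found = sum(weights.get(d, 0.0) for d in retrieved)
    return found / len(retrieved)

# Illustrative weighted judgements and a retrieved set of three documents.
weights = {"d1": 0.2, "d2": 0.4, "d3": 0.4, "d4": 0.0, "d5": 0.6}
retrieved = ["d5", "d2", "d4"]
print(weighted_recall(retrieved, weights))     # 1.0 / 1.6 = 0.625
print(weighted_precision(retrieved, weights))  # 1.0 / 3  ≈ 0.333
```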

To get an overall picture of the IR system’s performance, the normal two-stage process may be followed:


• For each query, the precision score may be calculated at each recall point, i.e. each point at which another relevant document is retrieved.
• The recall and precision points can then be averaged across all queries, and these values used to produce a precision-recall graph (a sketch of this process follows).
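One way to realise this two-stage process is sketched below in Python. Interpolation of precision to a fixed set of recall levels is assumed here so that scores from different queries can be averaged, as is usual practice for precision-recall graphs; the methodology itself does not prescribe a particular interpolation scheme.

```python
RECALL_LEVELS = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

def recall_precision_points(ranking, weights):
    """(recall, precision) after each rank at which relevance weight is gained."""
    total = sum(weights.values())
    if total == 0:
        return []
    found, points = 0.0, []
    for rank, doc in enumerate(ranking, start=1):
        w = weights.get(doc, 0.0)
        if w > 0:
            found += w
            points.append((found / total, found / rank))
    return points

def interpolated_precision(points, level):
    """Highest precision achieved at or beyond the given recall level."""
    candidates = [p for r, p in points if r >= level]
    return max(candidates) if candidates else 0.0

def average_pr_curve(runs):
    """Average interpolated precision over all (ranking, weights) pairs, per recall level."""
    curves = [recall_precision_points(ranking, weights) for ranking, weights in runs]
    return [(level,
             sum(interpolated_precision(c, level) for c in curves) / len(curves))
            for level in RECALL_LEVELS]
```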

8. Conclusions

This paper has outlined a new evaluation methodology for IR systems, based on the concept of a task-oriented test collection. The methodology retains the traditional test collection advantages of ease of use and comparability of results, but incorporates elements of interaction into the relevance assessment process. In addition to the usual functionality of a test collection, the task descriptions may be used as the basis for complementary interactive evaluation.

A task-oriented test collection is a more realistic method of evaluating IR systems than a traditional test collection, because it acknowledges the primary importance of the task in user motivation. The boundaries of the task are widened to include the feedback and learning stages of the task cycle. The dynamic nature of relevance assessment during this cycle is acknowledged, and the most realistic source of relevance judgements identified as the post-feedback stage. The subjectivity of user relevance judgements is reflected in the combination of several task performers’ relevance judgements into one weighted relevance judgement, and the associated redefinition of recall and precision as measures of the amount of relevance found in retrieved documents.

9. Implications for future research

There has been a recent explosion in the amount of multi-media data available. This has caused considerable speculation about how the potential problems of multi-media material compare to those of text (Draper and Dunlop 1997). There has already been some interest in the issues involved in creating a multi-media test collection (MIRA 1998). While it is possible (although not desirable) to ignore the task context in textual retrieval, and concentrate on the query, it seems less easy to do so in multi-media retrieval. The reason for this is not clear, although it may be that the very act of translating a multi-media information need into a (usually) textual query adds a further layer of complexity and variation to the whole process. If this proves to be so, it would seem that the inherent subjectivity of the task-oriented test collection methodology may provide an answer to at least some of the questions posed by the issue of multi-media evaluation.

Although the justification for using weighted relevance judgements is clear, it is less clear how these should be formed from individual task performer relevance judgements. In this paper, a simple and intuitive method was used to illustrate the potential advantages of this technique. However, with the recent increase in the application of logical models of information retrieval (Lalmas 1998), it may prove possible to formulate a theoretical framework for the combination of individual judgements.

It has been shown that variations between relevance judgements make no difference to the comparative evaluation of IR systems (Burgin 1992, Voorhees 1998). However, it is unclear whether this result, using binary and topical relevance judgements, will extend to weighted and task-oriented judgements.

Recall and precision are very limited measures of the performance of an IR system (Su 1994). Many other measures have been proposed, such as utility or the subjective satisfaction of the user (Cooper 1973, Belkin and Vickery 1985), the use of the information gained (Belkin and Vickery 1985) and informativeness, based on comparison of the system’s ranking with the user’s retrospective optimal ordering of documents (Tague and Schultz 1989). One objection to recall and precision implicit in all these proposals is that they assume an objective, static and topical view of relevance. However, it has been demonstrated that a task performer’s view of relevance is dynamic, and task-related criteria become more important later in the search process (Smithson 1994, Tang and Solomon 1998). While the ultimate aim of an IR system must be to provide task-relevant documents, it should also support the task performer through the process of information seeking. New measures are needed to evaluate this factor in the context of the task framework outlined in this paper.

The task-oriented test collection methodology outlined in this paper is intended as a working compromise between the traditional, system-oriented view of IR evaluation and the HCI-influenced, user-oriented view. More work, such as that of Dunlop (1997), is needed in this area in order to draw together these two different, and often conflicting viewpoints. This synthesis of expertise provides the opportunity to progress gradually towards the ultimate aim: to provide a set of evaluation tools to allow IR researchers to perform comprehensive evaluation of all aspects of IR systems.

Acknowledgments

We would like to thank Mark Dunlop and Steve Draper for reading earlier drafts of this paper and offering many helpful suggestions. Phil Gray helped clarify thoughts on the task material. We would also like to thank the anonymous reviewers for their constructive comments.

Notes

1. For a comprehensive overview of relevance research, see Mizzaro (1997).
2. For a general introduction to the concepts of task and activity, see any standard HCI text, e.g. Preece et al. (1994).
3. A task model is based on the concept of a ‘mental model’. For a general introduction to the concept of mental models, see Gentner and Stevens (1983) or Johnson-Laird (1983).

References

Barry CL and Schamber L (1998) Users’ criteria for relevance evaluation: a cross-situational comparison. Information Processing and Management, 34:219–236.

Bates M (1979) Information search tactics. Journal of the American Society for Information Science, 30:205–214.

Beaulieu M, Robertson SE and Rasmussen EM (1995) Evaluating interactive systems in TREC. Journal of the American Society for Information Science, 47:85–94.

Belkin NJ, Oddy RN and Brooks HM (1982) ASK for information retrieval: part 1, background and theory. Journal of Documentation, 38:61–71.

Belkin NJ and Vickery A (1985) Interaction in information systems: a review of research from document retrieval to knowledge-based systems. Library and Information Research Report 35, The British Library.

Blair DC (1990) Language and Representation in Information Retrieval. Elsevier, New York.

Borgman C (1989) All users of information retrieval systems are not created equal: an exploration into individual differences. Information Processing and Management, 25:237–251.

Borlund P and Ingwersen P (1997) The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation, 53:225–250.

Burgin R (1992) Variations in relevance judgments and the evaluation of retrieval performance. Information Processing and Management, 28:619–627.

Byström K and Järvelin K (1995) Task complexity affects information seeking and use. Information Processing and Management, 31:191–213.

Cleverdon CW, Mills J and Keen M (1966) Factors determining the performance of indexing systems. ASLIB Cranfield project, Cranfield.

Cooper WS (1973) On selecting a measure of retrieval effectiveness. Part 1. Journal of the American Society for Information Science, 24:87–100.

Cuadra CA and Katter RV (1967) Opening the black box of “relevance”. Journal of Documentation, 23:291–303.

Diaper D, Ed. (1989) Task Analysis for Human-Computer Interaction. Ellis Horwood Limited, Chichester, England.

Draper SW and Dunlop MD (1997) New IR—new evaluation: the impact of interactive multimedia on information retrieval and its evaluation. The New Review of Hypermedia and Multimedia, 3:107–122.

Dunlop MD (1997) Time relevance and interaction modelling for information retrieval. In: Belkin NJ, Narasimhalu AD and Willett P, Eds. SIGIR ’97, Proceedings of the 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. ACM, Philadelphia, pp. 206–213.

Eisenberg M and Barry C (1988) Order effects: a study of the possible influence of presentation order on user judgments of document relevance. Journal of the American Society for Information Science, 39:293–300.

Gentner D and Stevens L (1983) Mental Models. Lawrence Erlbaum Associates, Hillsdale, N.J., USA.

Harman DK (1995) The TREC conferences. In: Kuhlen R and Rittberger M, Eds. Hypertext—Information Retrieval—Multimedia: Proceedings of HIM 95. Konstanz, Germany, pp. 9–28.

Harter SP (1992) Psychological relevance and information science. Journal of the American Society for Information Science, 43:602–615.

Harter SP (1996) Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47:37–49.

Hersh WR, Buckley C, Leone TJ and Hickman DH (1994) OHSUMED: an interactive retrieval evaluation and new large test collection for research. In: Croft WB and van Rijsbergen CJ, Eds. SIGIR ’94, Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, Dublin, pp. 192–201.

Ingwersen P (1992) Information Retrieval Interaction. Taylor Graham, London.

Janes J (1991) Relevance judgements and the incremental presentation of document representations. Information Processing and Management, 27:629–646.

Johnson-Laird PN (1983) Mental Models. Cambridge University Press, Cambridge.

Kuhlthau CC (1991) Inside the search process: information seeking from the user’s perspective. Journal of the American Society for Information Science, 42:361–371.

Lalmas M (1998) Logical models in information retrieval: introduction and overview. Information Processing and Management, 34:19–33.

Marchionini G (1995) Information Seeking in Electronic Environments. Cambridge Series on Human-Computer Interaction, Cambridge University Press.

MIRA (1998) Evaluation frameworks for interactive multi-media information retrieval applications. http://www.dcs.gla.ac.uk/mira.

Mizzaro S (1997) Relevance: the whole history. Journal of the American Society for Information Science, 48:810–832.

Mizzaro S (1998) How many relevances in information retrieval? Interacting with Computers, 10:305–322.

Park TK (1993) The nature of relevance in information retrieval: an empirical study. Library Quarterly, 63:318–351.

Park TK (1994) Toward a theory of user-based relevance: a call for a new paradigm of enquiry. Journal of the American Society for Information Science, 45:135–141.

Pejtersen AM (1996) Empirical work place evaluation of complex systems. In: ICAE ’96, Proceedings of the 1st International Conference on Applied Ergonomics. Istanbul, Turkey, pp. 21–24.

Preece J, Rogers Y, Sharp H, Benyon D, Holland S and Carey T (1994) Human-Computer Interaction. Addison-Wesley, England.

Rees AM and Schulz DG (1967) A field experimental approach to the study of relevance assessments in relation to document searching. I: Final report. NSF contract no. C-423, Case Western Reserve University, Cleveland.

Regazzi JJ (1988) Performance measures for information retrieval systems—an experimental approach. Journal of the American Society for Information Science, 39:235–251.

Reid J (1999) A new, task-oriented paradigm for information retrieval: implications for evaluation of information retrieval systems. In: Aparac T, Saracevic T, Ingwersen P and Vakkari P, Eds. Proceedings of the Third International Conference on Conceptions of Library and Information Science. Dubrovnik, Croatia, pp. 97–108.

Robertson SE and Hancock-Beaulieu M (1992) On the evaluation of IR systems. Information Processing and Management, 28:457–466.

Saracevic T (1975) Relevance: a review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science, 26:321–343.

Saracevic T (1996) Relevance reconsidered ’96. In: Ingwersen P and Pors NO, Eds. Proceedings of CoLIS 2, Second International Conference on Conceptions of Library and Information Science: Integration in Perspective. The Royal School of Librarianship, Copenhagen, pp. 201–218.

Saracevic T and Kantor P (1988a) A study of information seeking and retrieving. II. Users, questions and effectiveness. Journal of the American Society for Information Science, 39:177–196.

Saracevic T and Kantor P (1988b) A study of information seeking and retrieving. III. Searchers, searches and overlap. Journal of the American Society for Information Science, 39:197–216.

Saracevic T, Kantor P, Chamis AY and Trivison D (1988c) A study of information seeking and retrieving. I. Background and methodology. Journal of the American Society for Information Science, 39:161–176.

Schamber L, Eisenberg MB and Nilan MS (1990) A re-examination of relevance: toward a dynamic, situational definition. Information Processing and Management, 26:755–776.

Smithson S (1994) Information retrieval evaluation in practice: a case study approach. Information Processing and Management, 30:205–221.

Soergel D (1976) Is user satisfaction a hobgoblin? Journal of the American Society for Information Science, 27:256–259.

Sparck Jones K and van Rijsbergen CJ (1975) Report on the Need for and Provision of an ‘Ideal’ Information Retrieval Test Collection. Report number 5266, University Computer Laboratory, Cambridge.

Sparck Jones K, Ed. (1981) Information Retrieval Experiment. Butterworths, London.

Su LT (1994) The relevance of recall and precision in user evaluation. Journal of the American Society for Information Science, 45:207–217.

Swanson DR (1977) Information retrieval as a trial-and-error process. Library Quarterly, 47:128–148.

Swanson DR (1986) Subjective versus objective relevance in bibliographic retrieval systems. Library Quarterly, 56:389–398.

Tague J and Schultz R (1989) Evaluation of the user interface in an information retrieval system: a model. Information Processing and Management, 25:377–389.

Tang R and Solomon P (1998) Towards an understanding of the dynamics of relevance judgment: an analysis of one person’s search behaviour. Information Processing and Management, 34:237–256.

van Rijsbergen CJ (1979) Information Retrieval, 2nd ed. Butterworths, London.

Voorhees EM (1998) Variations in relevance judgments and the measurement of retrieval effectiveness. In: Croft WB, Moffat A, van Rijsbergen CJ, Wilkinson R and Zobel J, Eds. SIGIR ’98, Proceedings of the 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. ACM Press, Melbourne, pp. 315–323.

Wilson P (1973) Situational relevance. Information Storage and Retrieval, 9:457–471.