Introduction to the Special Issue: Overview of the TREC Routing and Filtering Tasks


Information Retrieval, 5, 127–137, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Introduction to the Special Issue: Overview of the TREC Routing and Filtering Tasks

STEPHEN ROBERTSON
Microsoft Research, 7 JJ Thomson Avenue, Cambridge CB3 0FB, UK

Received May 5, 2001; Revised February 16, 2002; Accepted February 16, 2002

Abstract. This paper introduces the special issue, and reviews the routing and filtering tasks as defined and evaluated at TREC. The tasks attempt to simulate a specific service situation: the system is assumed to process an incoming stream of documents against profiles of user interest, strictly in the time order in which they arrive, and immediately refer any matching document to the user. In the adaptive filtering version of the task, the user is assumed to provide a relevance judgement instantly. The rationale for the task definitions and the evaluation measures used is discussed.

Keywords: filtering, routing, TREC

1. Introduction to the issue

This issue of Information Retrieval is devoted to the filtering task, and takes as its starting point the approach to filtering of the Text REtrieval Conference (TREC). Over successive rounds of TREC, the TREC routing and filtering tasks have tried to simulate on-line time-critical text filtering applications, where the value of a document decays rapidly with time. The issue is not, however, confined to TREC; another source is the closely-related topic tracking task defined and evaluated in TDT.

Further information about TDT, and about its relation to TREC, is given in other papers in this issue, specifically Allan (2002) and Ault and Yang (2002), who make comparisons between the TREC and TDT tasks. In other papers in the issue, Soboroff and Nicholas (2002) use the TREC tasks to investigate the combination of content-based and collaborative filtering; Eichmann and Srinivasan (2002) apply a clustering approach to filtering; Robertson (2002a) considers the problem of threshold setting, and (2002b) compares filtering and ranked-output performance.

The remainder of this introduction is concerned with setting the scene for the other papers in the issue by defining the TREC approach to routing and filtering in some detail, including the specification of the tasks and evaluation measures used, and by discussing the rationale behind the decisions.

2. Introduction to filtering

A text filtering system sifts through a stream of incoming information to find documents relevant to a set of user needs represented by profiles. Filtering differs from traditional ‘adhoc’ or retrospective search in that documents arrive sequentially over time. This means that potentially relevant documents must be presented immediately to the user. There is no time to accumulate and rank a set of documents. On the other hand, user profiles are persistent, and tend to reflect a long-term information need. With feedback from the user, the system can learn a better profile, and improve its performance over time.

2.1. Initial description

The Routing task, which was one of the two core tasks in earlier rounds and remains a subtask of the current Filtering track, takes a very simple view of the situation. In effect the simulation is located at a particular, fixed point in time, with some history (the training set) and some future incoming documents (the test set). The history includes some original text form of the topic, together with all available relevance judgements on the training-set documents. These relevance judgements are generally assumed to be more-or-less complete, although in practice they were made on all those documents retrieved for this topic by any system in some previous TREC experiment. The test involves representing each topic as a profile, improving or optimizing this profile by making use of the relevance judgements on the training set, and running it against the complete test set of documents. In fact, despite the above comment about ranking, evaluation of this routing task has relied on the usual ranked-output-based measures of retrieval performance.

The Filtering track, introduced at TREC-4, attempts a more realistic simulation. After some initialization stage (discussed in more detail below), the primary Adaptive Filtering task (from TREC-7 on) involves the simulation of the passage of time, processing a test set of documents in date order. For each profile and each document, a binary decision is taken by the system: to refer the document to the user (owner of the profile) or not. When a document is referred to the user, s/he is assumed to provide (instantly!) a relevance judgement. A document not referred to the user is assumed to be unjudged (as far as the system is concerned, though not of course for the eventual evaluation of the system). No back-tracking or revisiting of rejected documents is allowed. However, the system can make use of any kind of information derived from previously processed documents, in (for example) modifying a profile for new documents.
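To make this simulated loop concrete, the sketch below (Python, purely illustrative) steps through a date-ordered document stream and applies the rules just described. The profile interface with score, threshold and update methods, and the use of "canned" judgements, are assumptions of the sketch rather than part of the TREC specification.

```python
def run_adaptive_filtering(profile, documents, judgements):
    """Illustrative simulation of the TREC adaptive filtering loop for one profile.

    `documents` must already be in strict date order; `judgements` maps a
    document id to True/False and stands in for the 'canned' relevance
    assessments.  The system only ever sees judgements for documents it refers.
    """
    referred = []
    for doc in documents:                          # strict time order, no backtracking
        if profile.score(doc) >= profile.threshold:
            referred.append(doc.id)                # refer the document to the user
            relevance = judgements.get(doc.id)     # instant feedback; None if unjudged
            profile.update(doc, relevance)         # adapt terms, weights and/or threshold
        # a rejected document is never revisited and yields no feedback
    return referred
```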

This time-based simulation is clearly more realistic than the Routing task, though equally clearly it has several over-simplifications or unrealistic assumptions. The one that stands out most clearly is the assumption of instant feedback, and perhaps also the assumption of no backtracking and no batching of documents or ranking of small sets. One could perfectly well imagine a situation (at least for some document streams and/or user groups) where the system maintains a ranked list of the documents matched over the past (say) week, which might be modified on feedback, with the user potentially providing feedback at any time (or not at all). However, the complications of simulating such a situation are considerable; the TREC Filtering assumptions combine the advantages of relative simplicity and implementability in a laboratory, with some reasonable degree of realism.

They do however introduce (in contrast to the Routing task) one substantial complication: the need to evaluate on the basis of binary retrieval decisions rather than ranking. This matter is discussed further below.


The history and development of the TREC Routing and Filtering Tasks can be traced by reading the yearly final reports:

• TREC-9 http://trec.nist.gov/pubs/trec9/t9_proceedings.html (#3) (Robertson and Hull 2001)
• TREC-8 http://trec.nist.gov/pubs/trec8/t8_proceedings.html (#3, 2 files) (Hull and Robertson 2000)
• TREC-7 http://trec.nist.gov/pubs/trec7/t7_proceedings.html (#3, 2 files) (Hull 1999)
• TREC-6 http://trec.nist.gov/pubs/trec6/t6_proceedings.html (#1, Overview, for routing; #4 and #5, filtering) (Voorhees and Harman 1998, Hull 1998)
• TREC-5 http://trec.nist.gov/pubs/trec5/t5_proceedings.html (#1, Overview, for routing; #5, filtering) (Voorhees and Harman 1997, Lewis 1997)
• TREC-4 http://trec.nist.gov/pubs/trec4/t4_proceedings.html (#1, Overview, for routing; #11, filtering) (Harman 1996, Lewis 1996)
• TREC-3 http://trec.nist.gov/pubs/trec3/t3_proceedings.html (#1, Overview, for routing only) (Harman 1995)
• TREC-2 http://trec.nist.gov/pubs/trec2/t2_proceedings.html (#1, Overview, for routing only) (Harman 1994)
• TREC-1 http://trec.nist.gov/pubs/trec1/t1_proceedings.html (#1, Overview, for routing only) (Harman 1993)

In the sections that follow, we specify the tasks as defined for recent TRECs. We concentrate on TREC-9 and TREC-8.

3. Data

The particular set of documents and topics used is not central to this discussion. However, it is useful to give a brief description as a basis for the task definitions.

3.1. TREC-8

The TREC-8 filtering experiments used the Financial Times (FT) document collection, from TREC disk 4 (TREC no date), which consists of slightly more than three years of newspaper articles covering part of 1991 and most of 1992–1994. The 210,000 documents were ordered roughly as a function of time, and all systems were required to process the collection (or a subset) strictly in the same order. The documents average 412 words in length and cover a wide variety of subject matter.

All tasks used TREC topics 351–400, which were constructed for the TREC-7 adhoc experiments. The topics contain Title, Description, and Narrative fields and have an average length of 58 words. Relevance judgements were available from TREC-7; however, a small number of additional documents, which had not been judged for TREC-7 but were retrieved by participants in the TREC-8 filtering track, were judged for TREC-8.

For the main adaptive filtering task, the entire FT collection was the test set, but for other tasks (including routing), the 1992 data were treated as a training set and the 93/94 data as the test set.


3.2. TREC-9

The TREC-9 filtering experiments went outside the usual TREC collections and used the OHSUMED test collection compiled by, and available from, William Hersh (Hersh et al. 1994). This consists of Medline documents from the years 1987–1991 and a set of requests (topics) and relevance judgements. A slightly modified version of this dataset was put together for the task.

The entire collection contains about 350,000 documents. Actually these are bibliographic records containing the usual fields including abstract, although only about two thirds of the records contain abstracts. They also have a field containing MeSH headings, that is human-assigned index terms. These are assumed to arrive in identifier order, at a rate of approximately 6000 documents per month. The 1987 data (equivalent to about 9 months' worth) was extracted from the dataset to provide training material, as discussed below; the test set is therefore the 1988–91 data.

Sixty-three of the original OHSUMED topics were selected for filtering (they were selected to have a minimum of 2 definitely relevant documents in the training set).1 These 63 topics form the OHSU set. In addition, the MeSH headings were treated as if they were topics: the text of the topic was taken from the scope notes available for MeSH headings, and assignments of headings to documents were regarded as relevance judgements. Again they were selected, to have a minimum of 4 relevant documents in the training set and to have at least one in the final year; also very rare and very frequent headings were excluded.2

The remaining 4903 MeSH headings formed the MSH topic set. Finally, because of the size of this topic set, which made it difficult to process in its entirety, a random sample of 500 of these was made, to form the MSH-SMP set.

For obvious reasons, the MeSH field of the records could not be used for the MSH or MSH-SMP topic sets.

3.3. Relevance judgements

A characteristic of the methods described is that they make use of datasets where the relevance judgements have already been made. In the early TREC routing experiments, it was possible to work with a dataset in which judgements had been made on the training set but not on the test set; these last could be made after participants had submitted their runs to NIST, as part of the evaluation stage. However, this is no longer a possible scenario for adaptive filtering (see below), where the system is expected to modify the profile on the fly as documents are retrieved and judged for relevance. Such an experiment requires ‘canned’ relevance judgements to be available.

In the usual TREC style, relevance judgements for each topic have been made in past TRECs on the pooled output of a number of searches. Attempts have been made to ensure a wide range of different systems are represented, to maximize the chance that most relevant documents have been found, but of course some relevant documents will have been missed. It may well be that the routing or filtering searches throw up documents which were not originally judged because of not appearing in any output in the original pooled searches.


For the TREC-8 experiment, some of the newly retrieved documents for each topic were judged for relevance. This process did indeed throw up some (a small number of) previously unknown relevant documents, which were then treated as relevant for evaluation purposes. The process was not repeated in TREC-9, because of lack of resources at NIST and because there were no suitable judges available for the medical topics (it would in any case have been out of the question to attempt such judgements on the topics based on MeSH headings). This may be seen as a limitation of this form of experiment.

4. Tasks

Careful decisions need to be made about what data can be made available to a system, for whatever purpose, at any stage in the process. Clearly no system should see any aspect of the “future” documents before they are processed; therefore (for example) the test documents should not be used to set initial idf-based weights. These must be based on some other collection: in the context of TREC-9, this could include the complete training set of 1987 OHSUMED documents. Nor could anything relating to the test topics be used, except as precisely specified, but other topics (together with their relevance judgements on other documents) could be used to tune the system.

4.1. Adaptive filtering

The adaptive filtering task is designed to model the text filtering process from the moment of profile construction, with at least moderately realistic restrictions on the use of relevance information.

In earlier TRECs, including TREC-8, the assumption was that the user would initiate the profile with a text topic only. In TREC-9, in contrast, we assumed that the user arrives with a small number (two or four) of known positive examples (relevant documents). Subsequently, once a document is retrieved, the relevance assessment (when one exists) is immediately made available to the system. Judgements for unretrieved documents are never revealed to the system. Once the system makes a decision about whether or not to retrieve a document, that decision is final. No back-tracking or temporary caching of documents is allowed.
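Systems were free to choose how to exploit this feedback; as one hypothetical illustration (not the method of any particular TREC participant), a simple incremental Rocchio-style adjustment of a term-weight profile on each retrieved, judged document might look as follows. The mixing constants are arbitrary example values.

```python
def update_profile(profile_weights, doc_terms, relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Hypothetical incremental Rocchio-style update of a term-weight profile.

    `profile_weights` and `doc_terms` map terms to weights.  Called only for
    retrieved documents, since those are the only ones with feedback; `relevant`
    may be None for a retrieved but unjudged document, in which case nothing changes.
    """
    if relevant is None:
        return profile_weights
    coeff = beta if relevant else -gamma           # reward or penalise the document's terms
    for term, weight in doc_terms.items():
        profile_weights[term] = alpha * profile_weights.get(term, 0.0) + coeff * weight
    return profile_weights
```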

Evaluation is based on set retrieval, as below.

4.2. Batch-adaptive filtering

In this task, the initialization of the profile is allowed to use the complete relevance judgements for that topic from some training set of documents. In TREC-8, this was the 1992 FT collection; in TREC-9, the 1987 OHSUMED collection. When searching the test set (1993/94 FT or 1988–91 OHSUMED), the same rules applied as for adaptive filtering: the relevance judgement on any document retrieved for a topic could be used to modify the profile for matching against future documents.

Evaluation is based on set retrieval, as below.


4.3. Batch filtering (non-adaptive)

Initialization is the same as for Batch-adaptive filtering. No adaptation is performed as the test set is searched, however. Evaluation is again based on set retrieval.

4.4. Routing

Initialization is the same as for Batch and Batch-adaptive filtering. However, the complete test set is searched in one go, the output is ranked, and the top 1000 documents are returned for evaluation. Evaluation is based on the measures used for TREC adhoc tasks, as implemented in the trec_eval package, such as mean average precision, RPrec (precision when the number of documents retrieved is equal to the number relevant in the collection), and precision at n documents for various n. Thus no threshold or other mechanism for binary retrieval is required, and no adaptation is possible.
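For readers less familiar with these ranked measures, the following minimal sketch (illustrative Python, not the trec_eval implementation) shows how average precision and RPrec are computed for a single topic from a ranked list of document ids and the set of relevant ids.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision over a ranked list for one topic."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank          # precision at each relevant document
    return total / len(relevant_ids) if relevant_ids else 0.0


def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant_ids) / r
```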

Clearly, these last three tasks (Batch-adaptive, Batch and Routing) make various kinds of retreat from even the level of realism attempted in Adaptive filtering. However, they remain of interest to some TREC participants, and continue to generate interesting results.

5. Evaluation measures

As discussed, filtering systems are expected to make a binary decision to accept or reject a document for each profile. Therefore, the retrieved set consists of an unranked list of documents. This fact has implications for evaluation, in that it demands a measure of effectiveness which can be applied to such an unranked set. Many of the standard measures used in the evaluation of ranked retrieval (such as average precision, or precision at a fixed document cutoff) are not applicable.

Through most of the TREC filtering experiments, the main measures of performance have been based on the concept of utility. In TREC-9, a different measure, described as precision-oriented, was introduced.

5.1. Utility

Utility measures essentially assume that the desirable outcomes of a binary retrieval decision may be given credits which accumulate in some fashion, while the undesirable outcomes are given corresponding debits, which count against the credits. For this purpose the outcomes are normally classified into a 2 × 2 table as follows:

                    Relevant    Not relevant
    Retrieved         R+            N+
    Not retrieved     R−            N−
    Total             R             N


A general linear utility measure takes the form

    Utility = A′R+ + B′N+ + C′R− + D′N−

(A′ and D′ would normally be positive, B′ and C′ negative). From the point of view of defining a retrieval rule based on maximising linear utility, this definition can be simplified to reduce the number of parameters. Since R− = R − R+ and N− = N − N+, the above can be rewritten as

    Utility = (A′ − C′)R+ + (B′ − D′)N+ + C′R + D′N,

but the last two components are independent of the actual results. The zero point of the utility measure is in some sense arbitrary, so we can remove these two components. Now redefining A = A′ − C′ and B = B′ − D′, we have:

    Utility = AR+ + BN+

In effect, this means that we can without loss of generality set C′ and D′ to zero.3 Again, from the point of view of defining the retrieval rule, the unit of measurement of utility is also arbitrary, so what matters is the ratio of the chosen A and B, not their absolute values.

Thus the TREC linear utility measures have all taken the form

    Utility = AR+ + BN+    (1)

with positive A and negative B. The exact values of A and B have varied from year to year, but for example the measure used in TREC-9 had A = 2 and B = −1.

For TREC-8, we experimented with a non-linear utility measure. The idea is that successive relevant documents retrieved lose value: later documents (after many have already been retrieved) are of less value than earlier ones. This is achieved by raising R+ to a fractional power in Eq. (1). However, the non-linear measure was thought to be difficult to interpret, and was abandoned for TREC-9.

One problem with a utility measure is that for any given topic it has a maximum (AR, where R is the total relevant), but it can go negative and is effectively unbounded below. This has some consequences for averaging. More particularly, a straightforward mean of utility across topics can easily be dominated by a particularly poor performance on a single topic. In earlier TRECs, no average was calculated; systems were compared on the basis of counting topics for which each system did well. In TRECs 7–9, various attempts were made to average utilities fairly. A somewhat complex scheme was tried in TREC-8, in which utilities for each topic were scaled between the maximum and some notional minimum (which could be varied). A much simpler method was used in TREC-9: large negative utilities were simply truncated to some minimum value. In fact, many of the systems were sufficiently well adapted and successfully avoided large negative utilities; for all the top-scoring systems, this truncation rule was never activated.

However, it remains the case that mean utility is likely to be affected more by topics with higher numbers of relevant documents in total. This is seen as a disadvantage of the utility measure.

The TREC-9 utility measure is known as T9U; it has A = 2, B = −1, and a minimum of −100 (OHSU topics) or −400 (MSH topics).
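Putting the pieces together, T9U for a single topic can be computed as in the sketch below; this is a reconstruction from the description above, not NIST's scoring code.

```python
def t9u(r_plus, n_plus, min_utility):
    """TREC-9 linear utility for one topic, truncated below at a minimum value.

    r_plus: relevant documents retrieved; n_plus: non-relevant documents retrieved.
    min_utility is -100 for the OHSU topics and -400 for the MSH topics.
    """
    return max(2 * r_plus - n_plus, min_utility)


# Example: retrieving 30 relevant and 55 non-relevant documents for an OHSU topic
# gives max(2*30 - 55, -100) = 5; retrieving nothing at all scores 0.
```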


5.2. Precision-oriented measure

The basic idea behind this measure is that the user may set a target number of documents to be retrieved over the period of the simulation. This situation might be said to correspond roughly with cases where the user indicates what sort of volume of material they expect/are prepared for/are able to deal with/would like to see. In TREC-9 a fixed target was used (50 documents over the period of the test). Clearly a fixed target is a simplification of such cases (each of which is a little different from the others), but may be seen as an acceptable simplification for experimental purposes.

The measure is essentially precision, but with a penalty for not reaching the target:

    Target Precision = (Number of relevant retrieved documents) / max(Target, Number of retrieved documents)    (2)

This may be regarded as something akin to a “precision at [Target] documents” measure. The TREC-9 version of this measure is known as T9P, and has a target of 50 documents.
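Again as an illustrative sketch (not the official scoring code), T9P for one topic follows directly from Eq. (2):

```python
def t9p(relevant_retrieved, total_retrieved, target=50):
    """TREC-9 precision-oriented measure: precision, penalised for falling
    short of the target number of retrieved documents."""
    return relevant_retrieved / max(target, total_retrieved)


# Retrieving 40 documents of which 25 are relevant gives 25 / max(50, 40) = 0.5,
# whereas retrieving 80 documents of which 25 are relevant gives 25 / 80 = 0.3125.
```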

6. Thresholding and optimisation

In traditional ranked retrieval for an adhoc task, there is a range of measures that are commonly used, particularly those evaluated by the trec_eval program used in the TREC experiments. These measures include precision at fixed document cutoffs (5, 10, 20 . . . documents), precision at fixed recall levels (10%, 20%, 30% recall), various forms of uninterpolated or interpolated average precision, RPrec (precision at the point at which the number of documents retrieved equals the number relevant for this topic), etc. It is of course possible to tune an adhoc retrieval system to give good results on a specific measure; however, in general, it appears that ranking systems are relatively robust across these measures. Some measures appear more stable than others (Buckley and Voorhees 2000), but a system that is tuned to perform well on one of the more stable measures is likely to perform well on all the other measures in the list above.

This is not at all the case for the set-based filtering measures. Tuning of a filtering system has to be based very closely on the specific measure it is intended to optimise. Thus typically a filtering system should know the measure under which it will be evaluated, and the methods both for initialising a profile and for adapting it to feedback need to be explicitly aimed at this measure.

The necessity for this measure-specific tuning is perhaps most in evidence in filtering systems which are based on traditional ranked-retrieval systems. In such a system, the profile would typically consist of a traditional query formulation (e.g. terms and weights), to be used in a traditional matching method (resulting in a score for each document), and a non-traditional threshold. This threshold would be used to turn the document score into a binary decision: a document whose score exceeds the threshold is retrieved, any other is rejected for this profile.

Initial query formulation and scoring in such a system can be identical to those functions in adhoc retrieval: a scoring method which produces adhoc rankings which perform well on the above ranking measures is in general good for filtering as well. Subsequent modification of the query formulation based on relevance feedback may also be similar to that used in feedback on an adhoc search. However, the non-traditional part of the process, the thresholding, is quite different: the threshold must be set (and adapted) with the optimisation measure in mind. A good threshold for a certain utility measure will very likely be bad for a different utility measure (that is, a different pair of credit-debit parameters), let alone another measure altogether.
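One standard way to see this dependence (an illustration, not the only thresholding strategy used at TREC) assumes the system's score for a document can be calibrated as a probability of relevance p. The expected contribution of retrieving the document to a linear utility AR+ + BN+ is then Ap + B(1 − p), so the optimal rule is to retrieve when p ≥ −B/(A − B); with A = 2 and B = −1 this gives p ≥ 1/3, and the cut-off moves as soon as A or B changes. A minimal sketch:

```python
def retrieve_for_linear_utility(p_relevance, a=2.0, b=-1.0):
    """Retrieve iff the expected gain in linear utility, a*p + b*(1-p), is non-negative.

    Equivalent to retrieving when p >= -b / (a - b); for a=2, b=-1 that is p >= 1/3.
    Assumes the score p_relevance is a calibrated probability of relevance.
    """
    return a * p_relevance + b * (1.0 - p_relevance) >= 0.0
```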

Thus an experimental evaluation of filtering systems needs to be performed on the basis of a specific evaluation measure. Typically, each run being compared will be tuned to that measure.

7. Utility, target precision and overall performance

If the measure chosen is one of the utility ones, a system which retrieves nothing in response to any given topic scores zero on that topic. Since utility may in general be positive or negative, scoring zero may be preferable to scoring negatively, so a very conservative strategy (don’t retrieve anything) may be a (relatively) good strategy for a particular measure.

One of the conclusions of some of the TREC experiments in TREC-7 and TREC-8 was indeed that a conservative strategy was a good one to follow. On one of the utility measures in TREC-7, the best “system” overall was a notional system which retrieved nothing for any topic. This was seen as a rather discouraging conclusion. To put it another way, the task as set (with the specific utility measure) seemed to be, simply, too hard.

The hardness of the task may be due to, or alleviated by, a number of factors. One response in TREC-9 was to provide a few positive examples as well as the text topic for query initialisation. It is also the case that the number of relevant documents per topic in the TREC-7 and 8 datasets was low; systems did not have much opportunity to adapt.

However, another response was to introduce the Target Precision measure as an alternative to utility. This turned out to be a very informative measure, because it allowed some performance comparisons between ranked retrieval and set retrieval (Robertson 2002b). It appears that the better filtering systems are indeed performing comparably with the better ranked retrieval systems.

Also, some of the systems using the utility measure in TREC-9 achieved average utilities substantially above zero.

Nevertheless, thresholding is crucial. The performance of a system can be completely destroyed by a poor thresholding mechanism, however good its weighting and scoring. In this sense filtering is a harder task than ranked retrieval.

8. Adaptation

The filtering task presents, in principle, opportunities for adaptation based on feedback which are far greater than those normally open in adhoc retrieval.

In the old routing task, without the thresholding complication, the best-performing TREC systems on the whole were ones that did fairly heavy-duty query optimisation. That is, given a substantial training set for a given topic, an almost unlimited number of queries could be tried out, by successively reducing or expanding the term set and/or adjusting weights up or down. Although the space of possible queries is far too large to be explored completely, nevertheless a reasonably well-designed heuristic program could be expected to find, given time, a very good query.

However, this is a computationally demanding approach, and does not translate well to the filtering environment. Most of the filtering systems used at TREC do not attempt such iterative query optimisation. Query term selection and weighting is normally done by rule. Some limited iterative threshold optimisation might be used (this is, after all, a problem in only one dimension!).4
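As a minimal illustration of such one-dimensional threshold optimisation (assuming a scored, judged training set and the linear utility of Eq. (1); the details are a sketch, not any particular TREC system's method):

```python
def tune_threshold(scored_training, utility=lambda r_plus, n_plus: 2 * r_plus - n_plus):
    """Choose the score threshold that maximises utility on a judged training set.

    `scored_training` is a list of (score, is_relevant) pairs.  Only thresholds
    equal to observed scores need be considered, plus the option of retrieving
    nothing at all (threshold above every score).
    """
    ordered = sorted(scored_training, key=lambda pair: pair[0], reverse=True)
    best_threshold, best_utility = float("inf"), utility(0, 0)   # retrieve nothing
    r_plus = n_plus = 0
    for score, is_relevant in ordered:
        if is_relevant:
            r_plus += 1
        else:
            n_plus += 1
        candidate = utility(r_plus, n_plus)   # utility of retrieving this and all higher-scoring documents
        if candidate > best_utility:
            best_threshold, best_utility = score, candidate
    return best_threshold
```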

Some of the systems represented at TREC do in fact show substantial benefit from adaptation.

9. Final remarks

The evolution of the TREC filtering and routing tasks, and the work done by participants in these tasks, have pushed forward the state of the art by a substantial amount. Although no actual results have been reported in this paper (some of the other papers in this issue report specific results), it should be clear that our understanding of these tasks and of how to design filtering systems has made considerable progress over the period of, and in some measure because of, TREC.

Acknowledgments

I am very grateful to David Lewis and David Hull for comments on a draft of this overview, and for being able to borrow unashamedly from their various TREC filtering track overview papers.

Notes

1. Relevance judgements for OHSUMED topics were made on a 3-point scale: not relevant, possibly relevant and definitely relevant. The training documents for adaptive filtering were definitely relevant. Systems were free to make use of the graded relevance judgements in any way they saw fit, but the final evaluation was based on treating both possibly relevant and definitely relevant as relevant.

2. The reason for excluding those MeSH headings not represented in the final year was to avoid headings which had been dropped out of MeSH (which undergoes continual modification) during the period.

3. There is an obvious reason to set D′ to zero anyway: it seems a little strange from a user viewpoint to give credit to the system for not showing a non-relevant document, particularly since there are likely to be many of these. Setting C′ to zero is less obvious, but as indicated, there is no loss of generality in doing so.

4. Even without heavy query optimisation, it is apparent that the filtering task is still potentially computationally heavy.

References

Allan J (2002) Tracking events through time in broadcast news. Information Retrieval, 5:139–157.
Ault T and Yang Y (2002) Information filtering in TREC and TDT: A comparative analysis. Information Retrieval, 5:159–187.
Buckley C and Voorhees E (2000) Evaluating evaluation measure stability. In: Belkin NJ, Ingwersen P and Leong M-K, Eds., SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, pp. 33–40.
Eichmann D and Srinivasan P (2002) Adaptive filtering of newswire stories using two-level clustering. Information Retrieval, 5:209–238.
Harman D (1993) Overview of the first Text Retrieval conference, TREC-1. In: Harman D, Ed., The First Text Retrieval Conference (TREC-1), NIST SP 500-207, pp. 1–20.
Harman D (1994) Overview of the second Text Retrieval conference, TREC-2. In: Harman D, Ed., The Second Text Retrieval Conference (TREC-2), NIST SP 500-215, pp. 1–20.
Harman D (1995) Overview of the third Text Retrieval conference, TREC-3. In: Harman DK, Ed., The Third Text Retrieval Conference (TREC-3), NIST SP 500-226, pp. 1–20.
Harman D (1996) Overview of the fourth Text Retrieval conference, TREC-4. In: Harman DK, Ed., The Fourth Text Retrieval Conference (TREC-4), NIST SP 500-236, pp. 1–24.
Hersh WR, Buckley C, Leone TJ and Hickam DH (1994) OHSUMED: An interactive retrieval evaluation and new large test collection for research. In: Croft WB and van Rijsbergen CJ, Eds., SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, pp. 192–201.
Hull DA (1998) The TREC-6 filtering track: Description and analysis. In: Voorhees EM and Harman DK, Eds., The 6th Text Retrieval Conference (TREC-6), NIST SP 500-240, pp. 45–68.
Hull DA (1999) The TREC-7 filtering track: Description and analysis. In: Voorhees EM and Harman DK, Eds., The 7th Text Retrieval Conference (TREC-7), NIST SP 500-242, pp. 33–56.
Hull DA and Robertson S (2000) The TREC-8 filtering track final report. In: Voorhees EM and Harman DK, Eds., The 8th Text Retrieval Conference (TREC-8), NIST SP 500-246, pp. 35–56.
Lewis D (1996) The TREC-4 filtering track. In: Harman DK, Ed., The 4th Text Retrieval Conference (TREC-4), NIST SP 500-236, pp. 165–180.
Lewis D (1997) The TREC-5 filtering track. In: Voorhees EM and Harman DK, Eds., The 5th Text Retrieval Conference (TREC-5), NIST SP 500-238, pp. 75–96.
NIST. http://www.nist.gov/srd/nistsd22.htm (visited 9 August 2001).
Robertson SE (2002a) Threshold setting and performance optimization in adaptive filtering. Information Retrieval, 5:239–256.
Robertson SE (2002b) Comparing the performance of adaptive filtering and ranked output systems. Information Retrieval, 5:257–268.
Robertson SE and Hull DA (2001) The TREC-9 filtering track final report. In: Voorhees EM and Harman DK, Eds., The 9th Text Retrieval Conference (TREC-9), NIST SP 500-249, pp. 25–40.
Soboroff I and Nicholas C (2002) Collaborative content-based filtering in TREC-8. Information Retrieval, 5:189–208.
Voorhees E and Harman D (1997) Overview of the fifth Text Retrieval conference, TREC-5. In: Voorhees EM and Harman DK, Eds., The Fifth Text Retrieval Conference (TREC-5), NIST SP 500-238, pp. 1–28.
Voorhees E and Harman D (1998) Overview of the sixth Text Retrieval conference, TREC-6. In: Voorhees EM and Harman DK, Eds., The Sixth Text Retrieval Conference (TREC-6), NIST SP 500-240, pp. 1–28.