Pergamon Information Processing & Management, Vol. 32, No. 5, pp. 635--637, 1996

Published by Elsevier Science Ltd. Printed in Great Britain 0306-4573/96 $15 + 0.00

LETTER TO THE EDITOR

The paper on term relevance weights by Shaw (1995) was, for obvious reasons, of much interest to me. However, I believe his main arguments to be considerably weakened by a limitation in his methodology which he does not mention. Essentially this is that the test he performed was a retrospective test: that is, he evaluated the retrieval performance of his relevance weights on the same test collection that he used to calculate them. The alternative is a predictive test, involving two separate sets of test data, one for calculating or estimating the weights, and one for evaluating the performance. There are various ways in which this may be done: we can simulate a routing environment (relevance judgements on items retrieved from last month's documents can be used to formulate searches on this month's documents), or an ad hoc environment (relevance judgements made on the results of an initial search are used to formulate a new search and retrieve new, unjudged documents from the same collection).
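
To make the two test designs concrete, here is a minimal sketch in Python. The documents, the judgements and the stand-in weighting function are all invented for illustration; this is not the code used in Shaw's experiments or in our own.

```python
# Sketch of retrospective vs. predictive evaluation. Data and names are
# hypothetical; the weighting function is a crude stand-in, not the RSJ weight.
import math

def train_weights(docs, relevant):
    """Estimate one weight per term from a set of judged documents."""
    weights = {}
    for term in {t for d in docs.values() for t in d}:
        n = sum(1 for d in docs.values() if term in d)       # docs with term
        r = sum(1 for i in relevant if term in docs[i])      # relevant docs with term
        # stand-in: smoothed log ratio of relevant to non-relevant occurrences
        weights[term] = math.log((r + 0.5) / (n - r + 0.5))
    return weights

def rank(weights, docs):
    """Score each document as the sum of its term weights, best first."""
    scores = {i: sum(weights.get(t, 0.0) for t in d) for i, d in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

last_month = {"d1": {"heart", "valve"}, "d2": {"valve", "leak"}, "d3": {"lung"}}
this_month = {"d4": {"heart", "valve"}, "d5": {"leak"}}
w = train_weights(last_month, relevant={"d1"})

# Retrospective test (Shaw's design): weights estimated on last_month are
# evaluated on last_month itself, against the very judgements used above.
print(rank(w, last_month))
# Predictive (routing-style) test: the same weights rank this_month, whose
# relevance judgements played no part in the estimation.
print(rank(w, this_month))
```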

While retrospective evaluation has a role to play in understanding relevance weighting, it will be clear below that at least some of Shaw's results and conclusions are compromised, perhaps fatally, by his reliance on this method.

In our original paper, RSJ (Robertson & Sparck Jones, 1976), we conducted both retrospective and predictive evaluations. For the retrospective tests, we had a different method of dealing with the problem which Shaw addresses, based on logic rather than on probability estimates. For example, if a term occurred only in relevant documents, we simply retrieved any document containing it, without calculating a weight for the term or a score for the document. Shaw's method just gives the term a very large weight, and therefore the document a very large score. In a retrospective test, there is no possibility of any conflict in either of these methods (e.g. if one term occurs only in relevant documents, and another only in non-relevant documents, they cannot co-occur). The situation is quite different in a predictive test: no term is going to carry that kind of guarantee. The exact probabilities we can measure in the retrospective test (including the zero or one probabilities which are the reason for the problem) no longer apply; what we require now is a method of estimating the probabilities based on partial data.
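
Schematically, the two retrospective treatments of such a term look as follows. This is an invented illustration: Shaw's actual computing equations are not reproduced here, and the fixed cap simply mimics "a very large weight".

```python
# Two retrospective treatments of a term occurring only in relevant documents
# (r == n). Hypothetical sketch; Shaw's actual equations are not reproduced.
import math

def rsj_retrospective_rule(term, docs, relevant):
    """RSJ (1976) retrospective handling: if every document containing the
    term is relevant, retrieve those documents outright, with no weight for
    the term and no score for the documents."""
    containing = {i for i, d in docs.items() if term in d}
    if containing and containing <= relevant:    # term only in relevant docs
        return ("retrieve unconditionally", containing)
    return ("weight and score as usual", None)

def large_weight_rule(term, docs, relevant, cap=20.0):
    """Alternative treatment: stay within the scoring formula and give such a
    term a very large (here, arbitrarily capped) weight, so that any document
    containing it goes to the top of the ranking anyway."""
    n = sum(1 for d in docs.values() if term in d)
    r = sum(1 for i in relevant if term in docs[i])
    if r == n:                      # the exact log-odds would be unbounded
        return cap
    return math.log((r + 0.5) / (n - r + 0.5))   # crude stand-in otherwise

docs = {"d1": {"stenosis"}, "d2": {"stenosis", "cough"}, "d3": {"cough"}}
print(rsj_retrospective_rule("stenosis", docs, relevant={"d1", "d2"}))
print(large_weight_rule("stenosis", docs, relevant={"d1", "d2"}))
```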

For these reasons, different formulae should be used in retrospective and predictive relevance weighting. The c=0.5 formula which Shaw quotes was specifically intended for the predictive situation, not for the retrospective situation to which he applies it. This is made quite clear in RSJ.
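
For reference, the formula in question has the form w = log[(r + c)(N - n - R + r + c) / ((n - r + c)(R - r + c))], where N is the number of documents, R the number known to be relevant, n the number containing the term and r the number of those that are relevant (the notation here is simplified from RSJ). A direct transcription, as a sketch only, with variable names that are not those of either paper:

```python
# The c-corrected relevance weight in the form just quoted; c = 0.5 is the
# value intended for the predictive case.
import math

def relevance_weight(r, n, R, N, c=0.5):
    """w = log[(r+c)(N-n-R+r+c) / ((n-r+c)(R-r+c))]
    N: documents in the collection (or training set)
    R: relevant documents among them
    n: documents containing the term
    r: relevant documents containing the term"""
    return math.log(((r + c) * (N - n - R + r + c)) /
                    ((n - r + c) * (R - r + c)))

# With c = 0.5 the weight stays finite even when r = 0 or r = n, which is
# what makes it usable predictively, on partial data.
print(relevance_weight(r=5, n=100, R=10, N=1239))
```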

Retrospective tests always show better performance than the corresponding predictive tests. This may in part be attributed simply to the fact that the retrospective test has access to the exact probabilities; in this sense, a retrospective test may be said to indicate some optimum performance level, and the more predictive information we have, the closer we may expect to approach this optimum. However, it is also the case that a retrospective test can take advantage of any property of the test set, including those that are not even in principle predictable. So a retrospective test will actually overestimate the optimum.

In RSJ, we dealt only with query terms. The emphasis in recent years has been on query expansion, and in this vein Shaw includes all terms available in the system. Unfortunately this has the effect of grossly inflating this tendency to overestimate performance. For example, any term that occurs only in one relevant document will be given a very high weight, irrespective of whether it has any real predictive value for relevance. As an extreme example, a typographical error (text databases usually contain many such, and most of them occur only once) might generate such a term. Then the (relevant) document containing it will necessarily be retrieved, and the test will show good "performance". Thus a more than sufficient condition for a retrospective method such as Shaw's or ours to get perfect performance on expanding the query is for each relevant document to contain at least one typographical error! Treating some unique document identifier as a searchable term would have precisely the same effect.
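
To put illustrative numbers on this, take the c = 0.5 weight sketched above, a collection of 1239 documents (the size of the collection Shaw uses) and, say, 10 known relevant documents; the remaining figures are invented, and the effect would presumably be at least as sharp under the retrospective equations Shaw actually uses.

```python
# Illustration of the overfitting effect: a term occurring in exactly one
# (relevant) document, such as a typographical error, gets a far larger
# retrospective weight than a genuinely useful query term. Only the
# collection size (1239) comes from Shaw's data; the rest is invented.
import math

def relevance_weight(r, n, R, N, c=0.5):
    return math.log(((r + c) * (N - n - R + r + c)) /
                    ((n - r + c) * (R - r + c)))

N, R = 1239, 10                                      # collection, known relevant
typo = relevance_weight(r=1, n=1, R=R, N=N)          # occurs once, in a relevant doc
good_term = relevance_weight(r=5, n=100, R=R, N=N)   # a genuinely predictive term
print(f"typo term {typo:.2f} vs good query term {good_term:.2f}")  # ~5.96 vs ~2.47
# The typo term alone outweighs two good query terms put together, so the
# relevant document containing it is pulled to the top regardless of content.
```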

To put it another way, Shaw has clearly overfitted his sample. Buckley and Salton (1995) give a good account of this problem.

Several different groups of researchers have investigated the prediction question in some depth, particularly in recent years in the context of the routing experiment in TREC (Harman, 1995). There is still considerable argument about the extent to which query expansion continues to be valuable beyond a certain point. For example, we (Robertson et al., 1995) get our best results by doing very limited expansion, while some other groups generate queries of hundreds of terms. But all groups find that adding in very many more terms either produces little or no extra benefit, or actually reduces performance. Improvements are no doubt possible, but Shaw's "perfect performance" based on including all terms, in a predictive environment, is not even remotely plausible.

Centre for Interactive Systems Research
Department of Information Science
City University, London, U.K.

STEPHEN ROBERTSON

REFERENCES

Buckley, C. & Salton, G. (1995). Optimization of relevance feedback weights. In E. A. Fox, P. Ingwersen & R. Fidel (Eds), SIGIR-95 (Special issue of SIGIR Forum) (pp. 351-357). New York: ACM.

Harman, D. (Ed.) (1995). The Third Text Retrieval Conference (TREC-3). Gaithersburg, MD: National Institute of Standards and Technology.

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M. & Gatford, M. (1995). Okapi at TREC-3. In D. K. Harman (Ed.), Overview of the Third Text Retrieval Conference (TREC-3) (pp. 109-126). Gaithersburg, MD: National Institute of Standards and Technology.

Robertson, S.E. & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146.

Shaw, W. M. (1995). Term-relevance computations and perfect retrieval performance. Information Processing & Management, 31(4), 491-498.

REPLY

As indicated in Robertson's letter, retrospective tests of probabilistic retrieval models can be expected to establish optimal levels of performance. "Term-Relevance Computations and Perfect Retrieval Performance" (Shaw, 1995) presents optimal levels of performance for word-stem representations of the 1239 documents in the CF database (Shaw et al., 1991) as a function of computing equations for probabilities in the binary independence model introduced by Robertson and Sparck Jones (1976). Results show a wide disparity between optimal levels of performance for "conventional" [Robertson & Sparck Jones, 1976; Shaw, 1995, p. 494, eqn 2(a) and (b), with c=0.5] and "alternative" [Shaw, 1995, p. 494, eqn 3(a) and (b)] computing equations, when all terms in the database are used; alternative equations produce essentially perfect results. Robertson contends that comparisons of conventional and alternative computing equations are inappropriate in a retrospective study, in part, because the former equations are intended for predictive studies (Robertson & Sparck Jones, 1976). In his letter, Robertson also asserts that "'perfect performance' based on including all terms, in a predictive environment, is not even remotely plausible." Preliminary results of an investigation in progress provide some insight into these issues.

The effectiveness of predictive tests is currently being investigated as a function of document representations in the CF database. Initial searches are based on word stems taken from query statements, which are compared to word-stem representations of documents, and retrieved documents are ranked according to inverse document frequency weights. The first 30 documents in the ranked outcome constitute the "training set," and define the universe of documents from which term relevance weights are estimated. Based on the assumption that the most important events in an adaptive retrieval system are the relevance judgments of end users and that no information associated with these evaluations should be ignored, relevance weights are computed for all word stems appearing in the training set. The complete set of terms, together with their relevance weights, defines the query vector in the next iteration of the search. In subsequent iterations of the feedback process, the training set includes 30 documents plus the number known to be relevant at that stage of the process, and term relevance weights are again computed for the complete set of word stems in the training set. Documents identified as relevant in one iteration can contribute to term relevance weight estimates in the next iteration. The feedback process is continued until no further improvement in retrieval effectiveness is detected. For many queries the most effective result is found after several iterations; a few queries require many iterations. If a unique typographical error is indexed and contributes to the retrieval of a relevant document, the outcome is considered a success.
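
In outline, the feedback loop can be sketched as follows. The weighting, ranking, and effectiveness functions below are simple stand-ins chosen to keep the sketch self-contained and runnable; only the control flow follows the procedure just described, and none of the names correspond to the actual implementation.

```python
# Schematic sketch of the iterative feedback procedure described above.
# The idf ranking, term-relevance weight and effectiveness measure are
# stand-ins; only the control flow follows the description in the text.
import math

def idf_rank(query_stems, collection):
    """Initial search: rank documents by summed inverse document frequency
    of the query stems they contain."""
    N = len(collection)
    df = {s: sum(1 for d in collection.values() if s in d) for s in query_stems}
    idf = {s: math.log(N / df[s]) for s in query_stems if df[s]}
    scores = {i: sum(idf.get(s, 0.0) for s in d) for i, d in collection.items()}
    return sorted(scores, key=scores.get, reverse=True)

def relevance_weight(r, n, R, N, c=0.5):
    """Stand-in term-relevance weight (a c-corrected estimate)."""
    return math.log(((r + c) * (N - n - R + r + c)) /
                    ((n - r + c) * (R - r + c)))

def feedback_search(query_stems, collection, judge, initial_size=30):
    """collection: doc_id -> set of word stems; judge(doc_id) -> bool is the
    end user's relevance judgement (judgements would be cached in practice).
    Iterates until retrieval effectiveness (here, simply the number of
    relevant documents in the top 10) stops improving."""
    ranking = idf_rank(query_stems, collection)
    known_relevant, best, best_ranking = set(), -1, ranking
    while True:
        # Training set: the first 30 documents of the current ranking plus
        # every document already known to be relevant at this stage.
        training = set(ranking[:initial_size]) | known_relevant
        known_relevant |= {d for d in training if judge(d)}
        N, R = len(training), len(known_relevant)
        # Relevance weights are computed for ALL word stems appearing in the
        # training set, not just the stems of the original query statement.
        weights = {}
        for s in {s for d in training for s in collection[d]}:
            n = sum(1 for d in training if s in collection[d])
            r = sum(1 for d in known_relevant if s in collection[d])
            weights[s] = relevance_weight(r, n, R, N)
        # The full weighted term set is the query for the next iteration.
        scores = {i: sum(weights.get(s, 0.0) for s in d)
                  for i, d in collection.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        found = sum(1 for d in ranking[:10] if d in known_relevant)
        if found <= best:            # no further improvement detected: stop
            return best_ranking
        best, best_ranking = found, ranking
```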

Two results of the predictive tests support Robertson's contention regarding comparisons of conventional and alternative computing equations. First, in the retrospective investigations, the alternative computing equations yield average values of recall and precision that are 53 to 153% higher than