International Journal of Forecasting 11 (1995) 501-502

Editorial

Positive or negative?

Chris Chatfield, School of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK

According to some recent 'hype', neural networks (NNs) could provide the answer to all the forecaster's needs. Other commentators (e.g. Chatfield, 1993) are more cautious, and published empirical studies have, as usual, given a mixture of encouraging and discouraging results (e.g. Hill et al., 1994; Gorr et al., 1994). I was recently talking to someone who has been working for a large international company which has been trying out NN models. Sadly, they found that the resulting forecasts were less accurate than those produced by simple exponential smoothing. When I enquired if the results were going to be published or could be referred to, the answer was: "Of course not!". The company knew that their analysts could easily be criticised by the NN experts (e.g. for trying the wrong architecture) and did not want to be seen to have 'failed' at implementing a new 'state-of-the-art' procedure.

News like this is rather disturbing. Of course it is more satisfying to publish a positive result, such as "This new treatment works better than existing treatments", or "This new forecasting method works better than current alternatives". However, it can be just as important to publish a negative result, such as "This new treatment appears to be no better than (or perhaps is even worse than) existing alternatives" or "This new hypothesis is not supported by empirical evidence". It is therefore unfortunate that the prevailing culture may conspire to suppress negative results.

The publication bias against non-significant results in hypothesis-testing is well-known (e.g. Begg and Berlin, 1988). This bias is particularly unfortunate in medical statistics, where it is important to know if a new treatment is working or not. One particularly worrying case I heard about concerned a drug which was found to have a 'significant' effect in an initial historical retrospective study but which gave non-significant results when tested in later clinical trials. The latter results were hard to publish and, when last heard of, the drug was still being prescribed. In some journals it is virtually impossible to publish a non-significant result, as this is thought not to be interesting. Of course there is a sense in which positive results are more interesting than negative ones, and it would become rather tedious if every published result was negative. But this should not prevent negative results being published, especially when they contradict earlier findings or current thinking.

Similar remarks should apply when a new statistical method is proposed. I think particularly of the many computationally intensive methods which have been proposed in recent years. There is no doubt that techniques like bootstrapping (Efron and Tibshirani, 1993) have been able to solve problems which were difficult or impossible to tackle with analytic methods. A forecasting example is given by Masarotto (1990). It has become a feature of recent statistical conferences for enthusiastic speakers to present convincing examples of their use. However, one hears 'on the grapevine' of disturbing cases where the methods have not worked well, perhaps because they give results which are intuitively unacceptable or which are inferior to those given by alternative methods.
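
To make concrete the kind of calculation such methods involve, the Python sketch below produces a one-step-ahead bootstrap prediction interval for a fitted AR(1) model by resampling its residuals. Everything in it is an assumption made for illustration (simulated data, a least-squares fit, 999 replications, a 90% level); it is a minimal sketch of the general idea, not a reproduction of Masarotto's (1990) procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated AR(1) series standing in for real data (an assumption of this sketch).
    n, phi_true = 200, 0.6
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi_true * y[t - 1] + rng.normal()

    # Fit an AR(1) by least squares and collect centred residuals.
    phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
    resid = y[1:] - phi_hat * y[:-1]
    resid = resid - resid.mean()

    # Residual bootstrap: redraw residuals, rebuild a series, refit the model,
    # and simulate the one-step-ahead value; repeat to obtain an interval.
    B, forecasts = 999, []
    for _ in range(B):
        e_star = rng.choice(resid, size=n - 1, replace=True)
        y_star = np.zeros(n)
        y_star[0] = y[0]
        for t in range(1, n):
            y_star[t] = phi_hat * y_star[t - 1] + e_star[t - 1]
        phi_star = np.dot(y_star[:-1], y_star[1:]) / np.dot(y_star[:-1], y_star[:-1])
        forecasts.append(phi_star * y[-1] + rng.choice(resid))

    lower, upper = np.percentile(forecasts, [5, 95])
    print(f"Approximate 90% bootstrap prediction interval: ({lower:.2f}, {upper:.2f})")

The attraction is that no analytic formula for the forecast distribution is needed; the corresponding danger, as noted above, is that the answer depends on choices (model, resampling scheme, number of replications) whose effects are rarely reported.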

These examples are rarely reported formally, presumably because people do not like to admit that they have 'failed' or because they think the results are not of general interest. (Another problem with computationally intensive methods is that one can rarely be sure that the 'answers' are correct, but that is another story.) A few cautionary remarks are beginning to appear in the literature (e.g. Young, 1994) and it is clearly to be hoped that both positive and negative results will be reported in future so that the analyst can better judge when and how these techniques should be used.

Forecasters should also note that the use of resampling methods on time-series data is particularly problematic because of the time-ordering of the data and the lack of independence. Some specialized methods have been developed (e.g. see Hjorth, 1994) but great care is needed. I attended one lecture where three different methods of resampling were tried on the same time-series data. Only one of the three sets of results was thought to be intuitively reasonable, and the resampling method corresponding to that (most plausible) set of results was therefore chosen. Such an approach does not fill me with confidence!
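
To illustrate the time-ordering problem just mentioned, the Python sketch below compares a naive i.i.d. bootstrap of an autocorrelated series with a simple moving-block bootstrap. The AR(1) data, the block length of 20 and the lag-1 diagnostic are illustrative assumptions only, not a prescription from Hjorth (1994) or from the lecture described above.

    import numpy as np

    rng = np.random.default_rng(1)

    def lag1_autocorr(x):
        """Rough lag-1 autocorrelation, used here purely as a diagnostic."""
        x = x - x.mean()
        return np.dot(x[:-1], x[1:]) / np.dot(x, x)

    # Simulated AR(1) series standing in for real data (an assumption of this sketch).
    n, phi = 300, 0.8
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + rng.normal()

    # Naive i.i.d. bootstrap: resampling single observations scrambles the time order.
    naive = rng.choice(y, size=n, replace=True)

    # Moving-block bootstrap: resample blocks of consecutive observations so that
    # short-range dependence within each block survives (block length is a tuning choice).
    block_len = 20
    starts = rng.integers(0, n - block_len + 1, size=n // block_len)
    blocked = np.concatenate([y[s:s + block_len] for s in starts])

    print(f"lag-1 autocorrelation, original series: {lag1_autocorr(y):.2f}")
    print(f"lag-1 autocorrelation, naive resample : {lag1_autocorr(naive):.2f}")
    print(f"lag-1 autocorrelation, block resample : {lag1_autocorr(blocked):.2f}")

The naive resample typically shows a lag-1 autocorrelation near zero, while the block resample retains much of the original dependence; this is exactly why specialized resampling schemes, and great care in choosing between them, are needed for time series.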

The importance of reporting both positive and negative results also applies when a new forecasting method is proposed. The person who develops a new method will naturally apply it with great care and will typically choose to try it on suitable data sets where it works well. The good predictions which result may lead to the method being forcefully advocated and, over the years, we have seen many examples of novel forecasting methods which have been heralded as the answer to all our prayers. Yet when these methods are tried on different data sets with different characteristics by different people (perhaps with less skill), the methods are usually found to give much less persuasive results. Rather, empirical studies may show that the method is 'good' under some conditions but not others. Thus, in the course of time a new method may take its place in the forecaster's toolkit as just one of a number of procedures, each of which is known to be useful under particular well-defined circumstances.

In a sense, this editorial is just re-emphasizing the fundamental importance of replication in scientific work, and it is heartening that "IJF encourages replication studies" (see guidelines on the inside front cover). Replication is a key requirement of genuine scientific progress, and its importance has been emphasized, for example, by Ehrenberg (1993) and Chatfield (1995, especially Sections 5.2 and 7.2) in statistical work, and by Fildes and Makridakis (1995) in forecasting competitions. The point I am making can alternatively be expressed as saying that replication may result in a confirmation of earlier positive results but may produce contradictory negative results which must not be suppressed.

So by all means let us be enthusiastic about new procedures, but let us also be 'critical' (in the best senses of the word), and this means that all relevant evidence should be published, whether positive or negative.

References

Begg, C.B. and J.A. Berlin, 1988, Publication bias: a problem in interpreting medical data (with discussion), Journal of the Royal Statistical Society, Series A, 151, 419-463.

Chatfield, C., 1993, Neural networks: Forecasting breakthrough or passing fad? International Journal of Forecasting, 9, 1-3.

Chatfield, C., 1995, Problem Solving: A Statistician's Guide, 2nd edn. (Chapman and Hall, London).

Efron, B. and R.J. Tibshirani, 1993, An Introduction to the Bootstrap (Chapman and Hall, New York).

Ehrenberg, A.S.C., 1993, Predictability and prediction (with discussion), Journal of the Royal Statistical Society, Series A, 156, 167-206.

Fildes, R. and S. Makridakis, 1995, The impact of empirical accuracy studies on time series analysis and forecasting, International Statistical Review (in press).

Gorr, W.L., D. Nagin and J. Szczypula, 1994, Comparative study of artificial neural network and statistical models for predicting student grade point averages, International Journal of Forecasting, 10, 17-34.

Hill, T., L. Marquez, M. O'Connor and W. Remus, 1994, Artificial neural network models for forecasting and decision making, International Journal of Forecasting, 10, 5-15.

Hjorth, J.S.U., 1994, Computer Intensive Statistical Methods (Chapman and Hall, London).

Masarotto, G., 1990, Bootstrap prediction intervals for autoregressions, International Journal of Forecasting, 6, 229-239.

Young, G.A., 1994, Bootstrap: More than a stab in the dark? (with discussion), Statistical Science, 9, 382-415.