A commentary on error measures


than smaller ones, and percentage better, which does not take account of error size. I would place less weight on consensus.

A particularly sobering finding is the poor performance of the RMSE, the most commonly used error measure. Its use is related to its popularity with statisticians and its interpretability in relation to business decisions, and not to its efficiency in choosing accurate forecasting methods. The reliability of RMSE is poor and it is scale dependent. Another favorite, the MAPE, does reasonably well in the Armstrong-Collopy comparisons (better, I think, than they acknowledge) but is strongly rejected by Fildes on statistical grounds.
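To make the scale-dependence point concrete, here is a minimal Python sketch (with purely hypothetical numbers, not taken from either paper) that computes the RMSE and the MAPE for the same forecasts before and after a change of units; the RMSE changes with the scale, while the MAPE does not.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error: scale-dependent."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    """Mean absolute percentage error: scale-free (assumes non-zero actuals)."""
    return 100 * np.mean(np.abs((actual - forecast) / actual))

# Hypothetical sales series and one-step-ahead forecasts
actual = np.array([120.0, 135.0, 150.0, 160.0])
forecast = np.array([115.0, 140.0, 145.0, 170.0])

print(rmse(actual, forecast), mape(actual, forecast))
# Re-expressing the same series in different units (e.g. pounds rather than
# dollars) changes the RMSE but leaves the MAPE unchanged.
print(rmse(0.55 * actual, 0.55 * forecast), mape(0.55 * actual, 0.55 * forecast))
```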

While it is difficult to find a single measure of accuracy, all is not lost because both papers find support for the usefulness of modifications of these well known methods. The RGRMSE overcomes the problems of the RMSE, and the Relative Median Absolute Percentage Error (RMdAPE) those of the MAPE. Both papers find the RMdAPE useful, although Fildes shows a preference for the RGRMSE based on interpretability. These measures are related to Theil's U2 but use levels of the variables, not changes as in Theil's original measure. I think a move to relative accuracy measures is overdue, and the random walk is a good choice for the comparator since it is often in the minds of forecasters. Other comparators may be relevant in other applications; for example, Keilman (1991) used a modified U2 for evaluating population forecasts, with the population in the year before the jump-off year as the alternative forecast.
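As an illustration of the relative-measure idea, the sketch below compares a method's median absolute percentage error with that of a no-change (random walk) forecast. The exact definitions of the RMdAPE and RGRMSE in the two papers differ in detail, so this is only one plausible reading, with made-up numbers.

```python
import numpy as np

def median_ape(actual, forecast):
    """Median absolute percentage error (assumes non-zero actuals)."""
    return np.median(np.abs((actual - forecast) / actual))

def relative_median_ape(actual, forecast):
    """One reading of a relative measure: the method's median APE divided by
    that of a no-change (random walk) forecast, i.e. last period's actual.
    Values below 1 mean the method beats the naive benchmark."""
    rw_forecast = actual[:-1]          # random walk: forecast = previous actual
    return median_ape(actual[1:], forecast[1:]) / median_ape(actual[1:], rw_forecast)

actual = np.array([100.0, 104.0, 109.0, 115.0, 118.0])
forecast = np.array([np.nan, 103.0, 108.0, 112.0, 120.0])  # one-step-ahead forecasts

print(relative_median_ape(actual, forecast))
```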

Interpretability seems to be an important issue in the use of accuracy measures. Armstrong and Collopy like the RAE because of its ease of interpretation relative to Theil's U2, and Fildes likes the interpretability of the RGRMSE. I find Theil's U2 more appealing on interpretability than Armstrong and Collopy do (see Ahlburg, 1982), but agree that our ability to communicate the meaning of accuracy statistics is a key to spreading the use of better statistics among practitioners.

The conclusion reached by Armstrong and Collopy that no single accuracy measure is appropriate in most situations is true but perhaps too pessimistic. These papers clearly reject some frequently used measures and provide others that have desirable statistical properties and perform well in applications. The next step is to get forecasters to use ‘[appropriate] accuracy results to guide our choices among methods’ [Beaumont and Isserman (1987: 1005)].

References

Ahlburg, D.A., 1982, "How accurate are the US Bureau of the Census projections of total live births?", Journal of Forecasting, 1, 365-374.

Ahlburg, D.A., 1987, "Population forecasting", in: S. Makridakis and S. Wheelwright, eds., The Handbook of Forecasting: A Manager's Guide, second ed. (Wiley, New York) 135-149.

Beaumont, P. and A. Isserman, 1987, "Comment", Journal of the American Statistical Association, 82, 1004-1009.

Keilman, N.W., 1991, Uncertainty in national population forecasting: Issues, backgrounds, analyses, recommendations (Swets and Zeitlinger, Amsterdam).

Land, K., 1986, "Methods for national population forecasts: A review", Journal of the American Statistical Association, 81, 888-901.

Smith, S., 1987, "Tests of forecast accuracy and bias for county population projections", Journal of the American Statistical Association, 82, 991-1003.

Smith, S. and T. Sincich, 1991, "An empirical analysis of the effect of length of forecast horizon on population forecast errors", Demography, 28, 261-274.

Biography: Dennis A. AHLBURG is Associate Professor of Industrial Relations in the Carlson School of Management and Deputy Director of the Center for Population Analysis and Policy, Humphrey Institute for Public Affairs, at the University of Minnesota. He holds a PhD in economics from the University of Pennsylvania. His current research interests include the accuracy of population forecasts, determinants and effects of migrants’ remittances, and labor supply in health care labor markets. He has published papers in Demography, International Journal of Forecasting, Journal of Forecasting, Journal of Labor Research, Research in Population Economics, and Research in Personnel and Human Resource Management. He wrote the chapter on population forecasting in Makridakis and Wheelwright’s The Handbook of Forecasting.

A commentary on error measures, Chris Chatfield, School of Mathematical Sciences, University of Bath, Bath, Avon, BA2 7AY, UK.

These two papers address important issues in the evaluation of forecasting methods. I welcome the main thrust of both papers and would like to briefly summarise the results for the benefit of the general reader. I am just a little worried that the ‘message’ may get rather lost in these longish papers with so many tables to digest.

I suppose the average forecaster (is there such an animal?) wants to know (a) which forecasting method to use, and (b) which error measure should be used to assess forecasting accuracy. The answers to these questions are of course situation-dependent. For example, the choice of method depends on the type of data and the expertise available. Here I concentrate on the following dichotomy of situations. At one extreme there is a single series to forecast, an appropriate probability model is fitted and optimal forecasts are then made. At the other extreme, the practitioner with a large number of series to forecast may decide to use the same all-purpose procedure whatever the individual series look like. The former situation is perhaps more familiar to the statistician, and the latter to the operational researcher.

Error measures

For a single series it is perfectly reasonable to fit a model by least squares (as statisticians customarily do) and evaluate forecasts from different models and/or methods by the mean square error (MSE) of the forecasts. However, once you apply the same method to a group of series, it can be disastrous to average raw MSE across series as MSE is scale-dependent. Unfortunately this was done with the M-competition results, so that the MSE results therefrom should be disregarded. Various alternatives are explored by Armstrong and Collopy and detailed recommendations are made. They seem reasonable enough but involve three different measures and so are perhaps somewhat overly complicated. Fildes, on the other hand, plumps for the geometric root MSE (GRMSE). This may be unfamiliar to many readers, in which case I recommend reading Section 1.2 closely. The key point is that the measure should be scale-independent, so that multiplying all the numbers in one series by the same constant (e.g. expressing sales in pounds rather than dollars) should have no effect on the overall comparison of methods for a group of series. It is also valuable if the error measure has the property of not being unduly affected by outliers, as demonstrated by Fildes for the GRMSE.
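The following sketch shows how geometric averaging makes the cross-series comparison of two methods unaffected by the units of any one series. The errors are hypothetical and this is not necessarily Fildes' exact formulation, only an illustration of the principle.

```python
import numpy as np

def grmse(errors):
    """Geometric root mean squared error of one series' forecast errors."""
    return np.exp(np.mean(np.log(errors ** 2))) ** 0.5

def compare(methods_errors):
    """Compare two methods over a group of series by the geometric mean,
    across series, of the ratio of their GRMSEs (a sketch of the idea)."""
    ratios = [grmse(a) / grmse(b) for a, b in methods_errors]
    return np.exp(np.mean(np.log(ratios)))

# Hypothetical forecast errors of methods A and B on two series
series = [
    (np.array([1.0, -2.0, 1.5]), np.array([2.0, -1.0, 2.5])),    # series 1
    (np.array([10.0, -5.0, 8.0]), np.array([12.0, -9.0, 6.0])),  # series 2
]
print(compare(series))

# Rescaling one series (e.g. pounds instead of dollars) affects both methods'
# errors equally, so the GRMSE-based comparison is unchanged.
rescaled = [series[0], (0.1 * series[1][0], 0.1 * series[1][1])]
print(compare(rescaled))
```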

Another important point demonstrated by Fildes is that forecasting comparisons should not rely on a single time origin for each series, but that the error measure should be averaged across time in some way, as well as across different series.
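A rolling-origin evaluation along these lines might look as follows; the no-change forecasting rule and the data are purely illustrative and not taken from Fildes' study.

```python
import numpy as np

def rolling_origin_errors(series, forecast_fn, first_origin, horizon=1):
    """Collect h-step-ahead forecast errors over successive time origins,
    so the accuracy summary is not tied to a single origin."""
    errors = []
    for origin in range(first_origin, len(series) - horizon):
        history = series[: origin + 1]
        forecast = forecast_fn(history, horizon)
        errors.append(series[origin + horizon] - forecast)
    return np.array(errors)

def naive_forecast(history, horizon):
    """No-change (random walk) forecast, used purely for illustration."""
    return history[-1]

series = np.array([112.0, 118.0, 132.0, 129.0, 121.0,
                   135.0, 148.0, 148.0, 136.0, 119.0])
errors = rolling_origin_errors(series, naive_forecast, first_origin=4, horizon=1)
print(np.sqrt(np.mean(errors ** 2)))   # RMSE averaged across time origins
```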

Choice of forecasting method

As regards choice of forecasting method, the results of the M-competition and similar studies have long been controversial. Is it really true that simple smoothing methods can give forecasts which are as good as Box-Jenkins forecasts, for example? Do the results generalise to other groups of series and/or single series?

The value, or otherwise, of forecasting competitions has been discussed by various authors [e.g. Chatfield (1988a, section 4.1)]. They do tell us something, but only part of the story. They have been mainly helpful in comparing automatic forecasting methods for large disparate groups of series. The series in the M-competition were about as varied as one can get! In contrast, Fildes examines a large group of series from the same company for the same variable. Naturally the data are much more homogeneous than the series in the M-competition. In particular, they all show a negative trend, for example, which renders simple exponential smoothing inappropriate at a stroke. Fildes uses background knowledge, an exploratory examination of the data, plus a comparative evaluation of different methods to select a forecasting method which is designed to cope with the particular situation, including the presence of trend and of outliers. This case study makes it clear that the results of one forecasting competition need not necessarily generalise to another. Furthermore, the results from forecasting competitions certainly do not generalise to the single-series situation, where the analyst will still have the difficult task of identifying the model appropriate to the given data. Perhaps the greatest benefit of the M-competition has been, not the results as such, but the ‘by-products’ in making us think more clearly about such issues as error measures and replicability.

Tables and graphs

For my final comments, I switch to a completely different topic. One of my ‘hobby-horses’ is lamenting the generally poor standard of tables and graphs, which the computer age seems to have done little to improve.


The general rules for presenting clear tables and graphs are ‘well-known’ [e.g. Chatfield (1988b, section 6.5)] yet are often disregarded. For example, there should be a clear self-explanatory title, the units of measurement should be stated, axes should be labelled, the number of significant figures in tables should be carefully chosen, and so on. The overriding rule is that the exhibit should be clear, easy to understand and preferably self-explanatory without looking at the text.
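As a small illustration of these rules, a graph with a self-explanatory title, labelled axes and stated units might be produced as below; the data are hypothetical and matplotlib is used only as an example of one way to do the labelling.

```python
import matplotlib.pyplot as plt

months = list(range(1, 13))
sales = [120, 115, 130, 128, 140, 150, 155, 149, 160, 170, 168, 180]  # hypothetical

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
# A clear, self-explanatory title and labelled axes with the units stated.
ax.set_title("Monthly sales of product X, 1991 (thousands of dollars)")
ax.set_xlabel("Month (1 = January)")
ax.set_ylabel("Sales (thousands of dollars)")
plt.show()
```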

While I have seen many worse examples than those given in these two papers (which have improved during the refereeing process), I have also seen better, and I suspect that many readers will be bemused by some of them. For example, in Fildes’ Exhibit 2, the horizontal axis, time, is measured in months (I think) and the vertical axis is the ‘number of circuits’ not ‘circuits’. In Armstrong-Collopy’s Exhibit 6, I can guess what RW is, I can look up RAE and F,,,, yet still end up vaguely confused. Likewise Fildes’ Exhibit 5 is likely to bring on a headache rather than a dawning of light. If the results at lead 12 are like those at lead 6, is it necessary to give them? If the geometric mean is better than the arithmetic mean, why give the latter? Ah well, I could give far worse examples from other recent published papers, including figures with no title, unlabelled axes and so on. I urge all readers, whether acting as author, referee or editor, to give this topic the attention it deserves. It really can ‘make’ a paper if the tables and graphs are clear, but this can take much more effort than many people realise. It is not obvious, for example, how to present Fildes’ Exhibit 5 in the ‘best’ way. In my experience it can take several iterations to get tables and graphs to a publishable state, and even then further improvements are often possible, which can sometimes be seen more easily by a fresh eye.

References

Chatfield, C., 1988a, "What is the best method of forecasting?", Journal of Applied Statistics, 15, 19-38.

Chatfield, C., 1988b, Problem-Solving: A Statistician's Guide (Chapman and Hall, London).

Biography: Chris CHATFIELD is Reader in Statistics at the University of Bath, U.K. He has a Ph.D. in Statistics and is the author of four textbooks, including The Analysis of Time Series (4th edn., 1989, Chapman and Hall), as well as numerous research papers. He is a Fellow of the Royal Statistical Society, an elected member of the International Statistical Institute, and a member of the International Institute of Forecasters. His research interests cover time-series analysis, with emphasis on forecasting, as well as the more general questions involved in tackling real-life statistical problems (see Problem-Solving: A Statistician's Guide, 1988, Chapman and Hall).

Computing forecasts in finance, Stephen J. Taylor, Department of Accounting and Finance, Lancaster University, UK.

The two papers by Armstrong, Collopy and Fildes draw attention to the important problem of deciding how to combine forecasting results for a set of time series. Their contributions will, I hope, stimulate further work in this area. In these comments I reflect on how the issues raised by the authors could relate to forecasting market prices, and I conclude by noting some lessons that can be learnt from the papers.

Armstrong and Collopy give results for economic and demographic data, and Fildes studies telecommunications data with references to inventory and production control. One way to evaluate the methodological contributions of these two papers is to consider their implications for forecasting methodology in a different area. I will attempt to do this for the application area I know best, namely Finance. Comparisons of forecasting methods in Finance are frequently made for several series, but relatively rarely is an effort made to combine results across series. The following examples are therefore hypothetical, but all are based on important contemporary problems in Finance.

Consider three forecasting tasks involving the market prices of financial assets where the researcher could have access to a sample of time series from a large, homogeneous population. The researcher might work for a bank or might be an academic; both scenarios are plausible and a distinction between applied and theoretical forecasting should not be sought.

First, we might have an interest in predicting the future values of prices and the data could be the weekly prices of several hundred U.S. stocks for several years. Second, we might want to predict the volatility of prices, defined to be the standard deviation of changes in the logarithms