
International Journal of Forecasting 4 (1988) 515-518 North-Holland


EDITORIAL

Apples, Oranges and Mean Square Error

It is important to compare like with like. The table of mean square errors reported in the M-competition results does not do this. So be warned!

Children usually learn at an early age the dangers of adding apples to oranges (or apples to bananas or whatever). Yet in later life, we often fail to remember this maxim. Total book sales may mistakenly give equal weight to one small paperback and one enormous reference text. A statistic averaged over all American states may give equal weight to California and to Rhode Island (when, if ever, would this be sensible?). My concern here is the potentially misleading use of mean square error (abbreviated MSE) when obtained by averaging across different time series or different forecasting horizons.

MSE is a widely used criterion for assessing the fit of a model to a set of data, and for assessing the accuracy of forecasts produced by the model. It gives extra weight to large errors, is mathematically easy to handle, can be linked to a quadratic loss function, and is generally regarded as a sensible measure of accuracy for evaluating an individual time series. Indeed it can be argued theoretically that if a model is fitted by least squares, then forecasts should be evaluated using the same criterion [Zellner (1986)].
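
For definiteness, the quantity in question for a single series is just the average squared forecast error; in symbols (notation introduced only for this display), for forecasts of the observations y at horizons h = 1, ..., H from forecast origin n,

    \mathrm{MSE} \;=\; \frac{1}{H}\sum_{h=1}^{H}\bigl(y_{n+h}-\hat{y}_{n+h}\bigr)^{2}.

Multiplying a series (and hence its forecast errors) throughout by a constant c multiplies its MSE by c^2, which is the root of the scaling problem discussed next.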

The use of MSE averaged across different time series is much more controversial because the resulting measure will not be scale-invariant. If, for example, one series is multiplied throughout by a constant, then the relative rankings of average MSE produced by different forecasting methods may change. This is of particular interest in regard to the M-competition [Makridakis et al. (1982), hereafter abbreviated M82], which compares 24 forecasting procedures on 1001 time series. A variety of criteria were reported including MSE averaged across all series. The MSE results are tabulated in Table 3 of M82 and part of this table is given in Table 1.
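
A tiny numerical sketch (in Python, with invented numbers that have nothing to do with the M-competition data) shows how a change of units in a single series can reverse the ranking of two methods under averaged MSE:

    # Hypothetical per-series MSEs for two competing methods (invented numbers).
    # Method A is better on series 1, method B is better on series 2.
    mse = {
        "A": {"series1": 1.0, "series2": 9.0},
        "B": {"series1": 4.0, "series2": 4.0},
    }

    def average_mse(per_series):
        # Unweighted average of the per-series MSEs, as in Table 3 of M82.
        return sum(per_series.values()) / len(per_series)

    print({m: average_mse(s) for m, s in mse.items()})
    # {'A': 5.0, 'B': 4.0} -> method B ranks first

    # Now suppose series 1 is re-recorded in units ten times smaller, so its
    # observations, and hence its errors, are multiplied by 10: every MSE for
    # that series is multiplied by 10**2, while series 2 is untouched.
    for m in mse:
        mse[m]["series1"] *= 10**2

    print({m: average_mse(s) for m, s in mse.items()})
    # {'A': 54.5, 'B': 202.0} -> method A now ranks first

Nothing about either method has changed; only the units of one series have.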

I suspect the reader will look at Table 1 and shudder. It is very difficult to compare the results, partly because of the hideous E-format (which often appears in computer output to perplex the user) and partly because there are too many significant digits. Fortunately, this is the only table in M82 in this confusing format. I understand this format was used because the table was only included at the last minute at the insistence of some participants, although I also understand that other participants did not want it included at all. Indeed, reading M82 it is clear that the co-authors could not agree on what, if anything, this table demonstrates, except that MSE fluctuates much more than other measures.

Subsequent writers have also found difficulties with the table. In the M82 commentary [Armstrong and Lusk (1983)], Geurts, Newbold and Fildes all remark that the MSE results are not scale-invariant. However, Gardner suggests that, although difficult to interpret, the results should not be overlooked. Nevertheless, Gardner goes on to say that the series should be sorted into groups with similar levels to make the MSE results easier to interpret. Zellner (1986) deplores the lack of emphasis given to the fact that the Bayesian forecasting procedure performs 'best' according to the MSE criterion when averaged over all forecasting horizons.



Table 1
Average MSE: all data. Part of Table 3 of M82.

Method           Forecasting horizon
                      1             8
NAIVE1           0.9217E+11    0.3051E+09
Mov. Averag.     0.2702E+12    0.2022E+09
Single EXP       0.8553E+11    0.1251E+09
ARR EXP          0.8594E+11    0.1196E+09
Holt EXP         0.5117E+11    0.3248E+09
Brown EXP        0.4004E+11    0.3134E+09
Quad. EXP        0.3242E+11    0.1255E+10
Regression       0.7966E+11    0.1958E+09
NAIVE2           0.9316E+11    0.3565E+09
D. Mov. Avg      0.2701E+12    0.3338E+09
D Sing EXP       0.8551E+11    0.2496E+09
D ARR EXP        0.8586E+11    0.2001E+09
D Holt EXP       0.5113E+11    0.5644E+09
D Brown EXP      0.4001E+11    0.4874E+09
D Quad. EXP      0.323?E+11    0.1514E+10
D Regress        0.7962E+11    0.3994E+09
Winters          0.5113E+11    0.4847E+09
Autom. AEP       0.4893E+11    0.1141E+10
Bayesian F.      0.2173E+11    0.3286E+09
Combining A      0.5759E+11    0.4191E+09
Combining B      0.8525E+11    0.2319E+09

Average          0.8287E+11    0.4548E+09

The scaling problem is further elucidated by Fildes and Makridakis (1988), who suggest ways of standardizing the errors to overcome it. The latter treatment may satisfy many readers, but I suggest that a careful re-examination of Table 3 of M82 provides an alternative, simpler way of clarifying the situation which exposes the seriously misleading nature of the table.

The table of MSE’s

One of my pet ‘hobby-horses’ is the need to improve the presentation of graphs and tables. This is vital to interpret and communicate results efficiently and yet is often ignored (as in the offending table!) with potentially damaging consequences (as here!). The general requirements for a good table are: (a) give a clear, self-contained title, (b) give row and column averages where appropriate, (c) round the numbers to an appropriate number of significant digits, (d) use a clear layout.

How can we make sense of Table 1? Table 2 shows the values at horizon one rounded and re-ordered. The table is now much easier to 'read'. While the Bayesian method is indeed 'best', another conspicuous feature is the large number of equal rankings, which at first sight is very puzzling. For each exponential smoothing method, the corresponding deseasonalized method gives virtually identical results. Yet there is no apparent reason why, for example, Holt's exponential smoothing method (Holt EXP), the deseasonalized version (D Holt EXP) and the Holt-Winters method (Winters) should give identical results except for yearly series (e.g. see Table 12 in M82). These equal rankings do not occur for other criteria when averaged over all series, and could not be expected with seasonal series involved.
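
The re-ordering involves nothing more than scaling, rounding and sorting. A minimal sketch in Python, using a few of the horizon-one values transcribed in Table 1, reproduces the corresponding figures in Table 2:

    # A few of the horizon-one MSEs from Table 1, in their original units.
    horizon1_mse = {
        "Bayesian F.": 0.2173e11, "Quad. EXP": 0.3242e11, "Brown EXP": 0.4004e11,
        "Autom. AEP": 0.4893e11, "Holt EXP": 0.5117e11, "Winters": 0.5113e11,
        "Combining A": 0.5759e11, "Regression": 0.7966e11, "Combining B": 0.8525e11,
        "NAIVE2": 0.9316e11, "Mov. Averag.": 0.2702e12,
    }

    # Express each value in units of 10^10, round to one decimal place, and sort.
    for value, method in sorted((round(v / 1e10, 1), m) for m, v in horizon1_mse.items()):
        print(f"{method:<13} {value:5.1f}")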


Table 2
Average MSE's of different forecasting methods for all 1001 series for a forecasting horizon of one period.

Method                                         MSE/10^10
Bayesian F.                                        2.2
Quad. EXP, D Quad. EXP                             3.2
Brown EXP, D Brown EXP                             4.0
Autom. AEP                                         4.9
Holt EXP, D Holt EXP, Winters                      5.1
Combining A                                        5.8
Regression, D Regress                              8.0
Combining B                                        8.5
Single EXP, D Sing EXP, ARR EXP, D ARR EXP         8.6
NAIVE1, NAIVE2                                     9.3
Mov. Averag., D. Mov. Avg                         27.0

One obvious explanation is that the values have been given incorrectly. However, I am assured that they have been checked by several researchers. An alternative explanation, which turns out to be the true one, reveals the inadequacy of the whole table.

Examination of the individual series analysed in the M-competition reveals that the five series with the largest variability are all yearly (and hence non-seasonal) series. Indeed seven of the eight most variable series are yearly. The yearly series are generally at a higher level than quarterly and monthly series, perhaps because some of them are obtained by aggregating monthly data. The ranges of the three most variable series are all about 24 x 10^6 (an enormous number), while other series have a range as low as one. If we simply average MSE across all time series, the series with large variability will 'swamp' the series with low variability. The non-yearly series with largest variability is a monthly series whose range is about 19 x 10^5. Assuming that the order of magnitude of MSE is roughly proportional to (range)^2, then the average MSE produced by the three most variable (yearly) series would be about (24 x 10/19)^2, or roughly 160, times the MSE appropriate to the most variable seasonal series. In fact there are two yearly series which are particularly irregular and give MSE values which are relatively even larger. It is therefore clear that the results in Table 2 are completely dominated by the results for a few yearly series. The table says nothing about the forecasting behaviour in the remaining 990+ series!
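
The arithmetic is easy to check. The sketch below uses only the orders of magnitude quoted above; the final 'seasonal-series equivalents' figure is simply my rough way of expressing the imbalance:

    # Orders of magnitude quoted above, not exact figures from M82.
    yearly_range = 24e6    # range of each of the three most variable (yearly) series
    monthly_range = 19e5   # range of the most variable seasonal (monthly) series

    # If MSE is roughly proportional to the square of the range, each of the
    # three yearly series contributes roughly this many times the MSE of the
    # most variable seasonal series:
    ratio = (yearly_range / monthly_range) ** 2
    print(round(ratio))        # about 160

    # Three such series therefore carry the weight of roughly 3 * 160 = 480
    # 'most variable seasonal series' in an unweighted average over 1001 series,
    # which is why they dominate the averages in Tables 1 and 2.
    print(3 * round(ratio))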

It is also worth noting that as well as averaging across (very different) time series, the results at horizon one also average across very different lead periods ranging from one month (for monthly data) to one year (for yearly data). The results are not averaging like with like in any sense.

There are several other features of note in Table 3 of M82. One unexpected result is that forecasts for long horizons appear to be more accurate than those for short horizons (MSE's of order 10^9 or 10^10 rather than 10^11 or 10^12). The reason for this can now be readily deduced. Forecasts for yearly data were only produced for up to 6 periods ahead. The values at horizon 8 therefore pertain only to quarterly or monthly series, which are generally less variable. When the MSE's at horizon 8 are rounded and re-ordered, it is found that the 'equal rankings' syndrome has disappeared and that Bayesian forecasting is no longer 'best' (but rather is ranked as low as 11th!). Indeed Bayesian forecasting is not 'best' at horizons 12 or 18 or for the subset of 111 series at any horizon. Of course all these results involve averaging across many series and are no more meaningful than the results in Table 2, but they do at least show that Bayesian forecasts are not necessarily 'best' when yearly series are excluded.
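
The re-ranking is easy to reproduce from the horizon-8 column of Table 1, using the values as transcribed there:

    # Horizon-8 MSEs from Table 1, as transcribed above.
    h8 = {
        "NAIVE1": 0.3051e9, "Mov. Averag.": 0.2022e9, "Single EXP": 0.1251e9,
        "ARR EXP": 0.1196e9, "Holt EXP": 0.3248e9, "Brown EXP": 0.3134e9,
        "Quad. EXP": 0.1255e10, "Regression": 0.1958e9, "NAIVE2": 0.3565e9,
        "D. Mov. Avg": 0.3338e9, "D Sing EXP": 0.2496e9, "D ARR EXP": 0.2001e9,
        "D Holt EXP": 0.5644e9, "D Brown EXP": 0.4874e9, "D Quad. EXP": 0.1514e10,
        "D Regress": 0.3994e9, "Winters": 0.4847e9, "Autom. AEP": 0.1141e10,
        "Bayesian F.": 0.3286e9, "Combining A": 0.4191e9, "Combining B": 0.2319e9,
    }

    ranking = sorted(h8, key=h8.get)
    print(ranking[:3])                       # the ARR and single exponential methods now lead
    print(ranking.index("Bayesian F.") + 1)  # 11: no longer 'best'
    # Note also that Holt EXP and D Holt EXP are no longer tied at this horizon.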

Zellner's (1986) evaluation of the results goes further by concentrating on the results which are averaged, not only across all time series, but also across different forecasting horizons. As the MSE's at horizons 8, 12 and 18 are relatively small, the overall average MSE's are dominated by the values at horizons 1 to 6, which as before are dominated by a small number of yearly series. Thus, this 'double' averaging does not avoid the problem.

Conclusions

The example discussed here re-emphasizes the importance of comparing like with like and demonstrates how easy it is for this maxim to be overlooked or obscured. If MSE is averaged across different time series or different forecasting horizons, it must be ensured that the quantities are of a comparable order, perhaps by standardizing or scaling in some way, or perhaps by partitioning into homogeneous subgroups. The MSE results in M82 are obtained by averaging across time series with widely different variabilities. The results do not demonstrate that Bayesian forecasting is 'best' for all 1001 series but rather that this method happens to give the best forecasts for the 3 or 4 yearly series with the largest variability. In my view the MSE results should not have been published at all!
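
To make the last point concrete, one simple form of standardization (a sketch of my own with toy figures, not the scheme proposed by Fildes and Makridakis (1988)) is to divide each series' MSE by the square of a scale measure for that series before averaging, so that no series can dominate merely by being recorded on a larger scale:

    # Toy figures: three series whose scales differ by orders of magnitude.
    series_mse   = [2.5e13, 4.0e5, 9.0e1]   # per-series mean square errors
    series_scale = [24e6,   6.0e2, 1.0e1]   # a scale measure, e.g. the range of each series

    # Raw average: dominated entirely by the first (large-scale) series.
    raw_average = sum(series_mse) / len(series_mse)

    # Scale-adjusted average: each MSE divided by (scale)^2 before averaging.
    adjusted = [mse / scale**2 for mse, scale in zip(series_mse, series_scale)]
    adjusted_average = sum(adjusted) / len(adjusted)

    print(raw_average)        # about 8.3e12, i.e. essentially the first series alone
    print(adjusted_average)   # each series now contributes on a comparable footing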

Chris Chatfield School of Mathematical Sciences, University of Bath, Bath, UK

References

Armstrong, J.S. and E.J. Lusk, 1983, Commentary on the Makridakis time series competition (M-competition), Journal of Forecasting 2, 259-311.

Fildes, R. and S. Makridakis, 1988, Forecasting and loss functions, International Journal of Forecasting 4, 545-550.

Makridakis, S. et al., 1982, The accuracy of extrapolative (time series) methods: Results of a forecasting competition, Journal of Forecasting 1, 111-153.

Zellner, A., 1986, A tale of forecasting 1001 series: The Bayesian Knight strikes again, International Journal of Forecasting 2, 491-494.