Journal of Forecasting, Vol. 8, 141-155 (1989)

An Examination of the Accuracy of Judgemental Confidence Intervals in Time Series Forecasting

MARCUS O’CONNOR and MICHAEL LAWRENCE University of New South Wales, Australia

ABSTRACT

This paper examines the accuracy (calibration) of judgemental and selected statistical confidence intervals in time series forecasting. Using the forecasts and forecast errors produced by the deseasonalized single exponential smoothing method, three statistical intervals were produced utilizing three assumptions about the distribution of errors: the normal distribution, the empirical distribution and the distribution based on the Chebyshev inequality. Using 33 real-life time series, results indicated that the judgemental confidence intervals were initially excessively overconfident, thus confirming the finding of Lichtenstein et al. (1982). However, calibration of the judgemental intervals improved considerably with feedback, and was found to be influenced by the degree of forecasting difficulty of the series.

KEY WORDS Judgement Calibration Confidence intervals

INTRODUCTION

This paper seeks to examine the empirical validity of judgemental expressions of confidence in the accuracy of forecasts based on time series data. A large number of studies have previously examined the calibration of expressions of confidence (see reviews by Lichtenstein et al. 1977, 1982; Wallsten and Budescu, 1983; O’Connor, 1988). While the studies have been from wide and varied sources, a number of conclusions have emerged:

(1) Many of the studies have found that people are overconfident in the accuracy of their predictions. This tendency has been demonstrated for both discrete and continuous distributions.

(2) Familiarity of the person with the topic of interest is a major determinant of calibration, i.e. the extent of a person’s knowledge of the behaviour of the forecast variable significantly determines the ‘goodness’ of the expressions of confidence (see also Petersen and Pitz, 1986).

(3) Familiarity of the person with the task of probabilistic forecasting significantly determines the ‘goodness’ of the expressions of confidence.

0277-6693/89/020141-15$07.50 © 1989 by John Wiley & Sons, Ltd.

Received February 1988 Revised October 1988

(4) The difficulty in forecasting has a strong effect on the extent of over/underconfidence. For tasks that are difficult to forecast, overconfidence abounds, whereas for those that are easy to forecast, underconfidence is exhibited. It should be emphasized that it is the difficulty in forecasting the event and not that of establishing the confidence levels that is important.

(5) Training at probabilistic forecasting has had mixed results, the most effective being the requirement that the subject also state the reasons for the choices made (see Koriat et al., 1980).

Most of the experimental tasks examined in this literature require the subject to recall relevant information from memory. Thus performance is dependent, in part, on the efficiency of storage and recall from memory. Furthermore, the salience or vividness of recent events (see Tversky and Kahneman, 1974) and the social utility of responses have been found to be important determinants of levels of confidence (see Solomon et al., 1985).

Very little research has examined the ability of people to establish confidence intervals (CIs) in a time series setting. Brown (1973) provided graphs of selected series and asked subjects to forecast the next period, and Stael von Holstein (1971) conducted a similar study. However, the nature of the series was revealed to subjects. This may invoke recall from memory of non-time series information, implying that the confidence estimates are based on both time series and non-time series information. For example, a revelation that a graph represented the Dow Jones index from June 1979 to January 1986 will invoke not only a recollection of the movement of the index for that period but also associated macro-economic information held in memory. Using this approach, one is unsure whether (for example) poor performance is caused by an inadequate appreciation of the noise in the series, an inadequate understanding of the meaning or implications of the macro-economic information, bad recall from memory, and so on. The importance of the amount of information to performance in probabilistic forecasting was examined by Pitz (1974) and Petersen and Pitz (1986). They found that, although overconfidence was still in evidence, it was significantly reduced when more information about the topic of interest was available to the subjects. Furthermore, the accuracy of the forecasts also improved. Thus, since the majority of the previous studies have not controlled for the amount of information used in the probabilistic assessments, little can be said about the possible causes of the overconfidence.

This study seeks to investigate the ability of people to establish CIs using only time series data. By not revealing either the nature of the series or the time period to which it relates, it is argued that the only information on which the CIs can be based is that presented on the graph. Using this approach, we are able to examine directly the ability of people to establish CIs from the time series data only. Furthermore, since this is the same information set as that available to the statistical methods, statements can be made about the comparative ability of people to establish CIs and to appreciate noise in a series.

RESEARCH HYPOTHESES

In relation to the literature reviewed above we wish to examine the ability of research subjects to establish CIs from time series data. The hypotheses to be tested in this study are:

H1: The judgemental CIs established around a forecast tend to be too narrow (overconfident).

H2: Where the task of forecasting a given time series is more difficult than for another series, overconfidence will result. Conversely, where the task is comparatively easy, underconfidence will abound.

H3: The provision of outcome feedback will improve performance over a number of trials.

EXPERIMENTAL PROCEDURES

Experimental task
A database of 33 monthly time series was randomly selected from the M-competition (Makridakis et al., 1982) data. The series used for this study represented a wide cross-section of different characteristics and included macro-economic and micro-economic series, seasonal and non-seasonal series and some demographic series. Each series had at least 3.5 years of monthly data.

Subjects were presented with a computer-produced graph of the data points in the time series. This did not reveal to them the nature of the series, nor the period of history to which it referred. The subjects for the experiment were required to establish a forecast for the next period as well as the 50% and 75% CIs around the forecast. This process was repeated six times, with the actual value corresponding to the last forecast revealed to the subjects at each iteration. Thus seven point forecasts and CIs were included in the study.

Subjects
Subjects for the experiment were 33 postgraduate students undertaking masters and PhD programmes at the University of New South Wales. Each subject forecast a single time series for seven iterations. They participated voluntarily in the study and received no course credit or monetary rewards. The study was held in normal class times in one sitting. Typically it took students 45 min to complete. The students had no previous experience in the forecasting of time series, nor had they any in establishing CIs, although all had completed an introductory course in statistics.¹

¹ There was no evidence that statistical training affected performance in the task of setting the CIs.

ESTABLISHING STATISTICAL CIs

The deseasonalized single exponential smoothing (DSE) method of forecasting was selected for comparison against the judgemental method; this technique performed very well for monthly time series in the M-competition. The procedures used to calculate the DSE forecasts and the resultant forecast errors were as follows (a code sketch follows the list):

(1) The historical data points for each of the 33 monthly time series were forecast by the DSE technique. The optimal smoothing factor was obtained for each series by a trial-and-error approach.

(2) Using the optimal smoothing factor selected above, each series was reforecast on a step-by-step basis and the percentage errors noted (for all historical data points other than points 1-6, to permit the errors to settle down).

(3) In determining the CIs for the forecast periods, the errors for the last 24 periods only were used. It should be noted that, because the study was conducted over seven iterations, the immediate last 24 errors were always used, on a rolling basis. The last 24 were chosen to eliminate the effect of earlier errors that may now be unrepresentative of the current behaviour of the series.
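As an illustration of the mechanics described above, the sketch below shows one way the DSE one-step forecasts, the rolling percentage errors and a trial-and-error smoothing factor could be computed. It is not the authors' implementation: the function names, the ratio-to-moving-average deseasonalizing scheme, the definition of percentage error relative to the actual value and the 0.05-step grid for the smoothing factor are all assumptions made purely for illustration.

```python
import numpy as np

def seasonal_indices(y, period=12):
    """Ratio-to-moving-average seasonal indices (an assumed deseasonalizing scheme)."""
    y = np.asarray(y, dtype=float)
    ma = np.convolve(y, np.ones(period) / period, mode="valid")
    ma = (ma[:-1] + ma[1:]) / 2                       # centre the moving average (even period)
    ratios = y[period // 2: period // 2 + len(ma)] / ma
    idx = np.array([np.mean(ratios[i::period]) for i in range(period)])
    idx = np.roll(idx, period // 2)                   # align index 0 with the first month
    return idx / idx.mean()                           # normalize indices to average 1

def dse_one_step_forecasts(y, alpha, period=12):
    """Deseasonalize, apply single exponential smoothing, reseasonalize each one-step forecast."""
    y = np.asarray(y, dtype=float)
    s = seasonal_indices(y, period)
    d = y / s[np.arange(len(y)) % period]             # deseasonalized series
    level = d[0]
    fcst = np.empty(len(y))
    fcst[0] = y[0]
    for t in range(1, len(y)):
        level = alpha * d[t - 1] + (1 - alpha) * level
        fcst[t] = level * s[t % period]               # reseasonalized one-step forecast
    return fcst

def percentage_errors(y, fcst, settle=6):
    """Percentage errors, ignoring the first `settle` points to let the errors settle down."""
    y = np.asarray(y, dtype=float)
    return (y[settle:] - fcst[settle:]) / y[settle:]

def best_alpha(y, grid=np.arange(0.05, 1.0, 0.05)):
    """Trial-and-error choice of the smoothing factor by minimizing the MAPE."""
    mapes = [np.mean(np.abs(percentage_errors(y, dse_one_step_forecasts(y, a)))) for a in grid]
    return grid[int(np.argmin(mapes))]
```

In the study only the most recent 24 of these errors would be retained, on a rolling basis, when an interval is constructed for each new forecast period.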

Because the error distribution is unknown it was decided that three analytical approaches would be considered. These represent broad perspectives that can be taken in relation to establishing CIs, and were:

(1) A parametric approach: the distribution of the forecast errors was assumed to be normal.

(2) A non-parametric approach: in this case the distribution of the forecast errors was assumed to be the same as the empirical distribution for the 24 immediate past forecast errors. This approach has been found to be best calibrated in the applied setting of forecasting telephone requirements at Bell Laboratories (Williams and Goodman, 1971). It is based on the somewhat unrealistic assumption that the sample distribution is the population distribution.

(3) A non-parametric conservative approach was taken by using the Chebyshev inequality. This inequality provides a conservative bound such that the true CI will always be smaller than that calculated. The advantage of the Chebyshev approach is that it provides a conservative limit against which other CIs may be compared.

All three approaches are considered in this study, thus generating three DSE-based statistical CIs for each iteration of each series (termed DSE-normal, DSE-empirical and DSE-Chebyshev). While these three approaches do not represent all those possible, it is believed that they contain a wide spectrum of assumptions that can be made regarding the future distribution of forecast errors.

The procedures used in calculating CIs under each of the above approaches are presented in the Appendix.

A COMPARISON OF THE CONFIDENCE INTERVALS

We would like to compare the CIs developed using the various approaches across all 33 series in the database. However, since each series has a different numerical scale, comparisons based on the numerical width of the CIs can only be made within a series.

One solution to this problem of generalizability is to transform each CI by dividing it by the forecast of that period. The descaled CIs then represent a measure of percentage variation about the mean forecast and, as such, are similar to the coefficient of variation.
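A minimal sketch of this descaling follows, assuming (since the paper does not spell it out) that the descaled value is the full interval width divided by the point forecast; the function name and example figures are hypothetical.

```python
def descale_interval(lower, upper, forecast):
    """Express a CI as a fraction of the point forecast, analogous to a coefficient of variation."""
    return (upper - lower) / forecast

# e.g. an interval of (90, 110) around a forecast of 100 descales to 0.20
assert abs(descale_interval(90, 110, 100) - 0.20) < 1e-12
```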

RESULTS

Accuracy in forecasting
In the Lawrence et al. (1985) study we found that, in general, there was no difference in forecasting accuracy between the graphical judgement and DSE techniques. However, it was noted that DSE performed very well in the first few periods. As this project has been confined to one-step-ahead forecasting for seven iterations (unlike the Lawrence et al. study), we would expect DSE to perform well in comparison to the judgemental technique. The average MAPEs for each period (and their standard deviations) are contained in Table I.


Table I. MAPEs (and standard deviations) for judgemental and DSE techniques over seven periods

Period    Judgement        DSE
1         0.130 (0.129)    0.115 (0.160)
2         0.130 (0.146)    0.123 (0.176)
3         0.080 (0.081)    0.075 (0.083)
4         0.067 (0.081)    0.126 (0.191)
5         0.132 (0.141)    0.113 (0.183)
6         0.110 (0.118)    0.127 (0.163)
7         0.152 (0.202)    0.111 (0.160)
Average   0.114 (0.135)    0.100 (0.145)

These results indicate that there are very few differences in MAPEs. In four out of seven cases DSE errors were lower than the judgemental errors, but paired t-tests revealed that there were no significant differences in errors.

An interesting observation concerns the comparison of the standard deviations of the MAPEs. In six out of seven periods the standard deviation of the MAPEs for judgement is lower than for DSE. This result was also observed in Lawrence et al. (1985), and its replication in this study is worth emphasizing. The lower standard deviations imply that the judgemental technique avoids very low and very high errors. In a practical business setting it would be preferable to choose a technique whose MAPEs were not significantly different from others, but which could be guaranteed to avoid large errors. It seems from this (as well as the previous study) that the judgemental technique may well provide this facility, and as such would be very useful to management. However, such statements depend on the shape of the user’s loss function. Furthermore, this function also determines whether MAPE is the correct metric to use for measuring forecasting error.
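As a hedged illustration of how figures like those in Table I and the accompanying paired t-tests could be computed, the sketch below assumes series-by-period arrays of actuals and forecasts; the array layout and function names are assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats

def mape_by_period(actuals, forecasts):
    """Mean and standard deviation of absolute percentage errors for each period across series.

    `actuals` and `forecasts` are arrays of shape (n_series, n_periods)."""
    ape = np.abs((actuals - forecasts) / actuals)
    return ape.mean(axis=0), ape.std(axis=0, ddof=1)

def compare_methods(actuals, judgemental, dse):
    """Paired t-test on the absolute percentage errors of the two methods, period by period."""
    ape_j = np.abs((actuals - judgemental) / actuals)
    ape_d = np.abs((actuals - dse) / actuals)
    return [stats.ttest_rel(ape_j[:, p], ape_d[:, p]) for p in range(actuals.shape[1])]
```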

Comparison of judgemental and statistical CIs
The judgemental CIs were descaled as described above, and Table II presents the means and standard deviations of the descaled CIs (across series) for the judgemental and DSE-based methods.

Table II. Mean (and standard deviation) descaled CIs for each period across series

Period   Judgemental        DSE-normal         DSE-empirical      DSE-Chebyshev
         Mean    Std dev.   Mean    Std dev.   Mean    Std dev.   Mean    Std dev.
1        0.090   0.069      0.125   0.103      0.103   0.088      0.263   0.216
2        0.096   0.076      0.130   0.115      0.107   0.102      0.273   0.242
3        0.111   0.088      0.140   0.126      0.113   0.111      0.293   0.264
4        0.115   0.110      0.138   0.119      0.114   0.117      0.290   0.249
5        0.113   0.116      0.136   0.110      0.113   0.108      0.285   0.231
6        0.113   0.104      0.139   0.133      0.120   0.133      0.292   0.271
7        0.135   0.143      0.167   0.162      0.132   0.136      0.350   0.341
All      0.110   0.103      0.139   0.124      0.115   0.113      0.292   0.260


Several comments on this table are necessary:

(1) A consideration of the mean descaled CIs for a given technique revealed that there is a general increase over time:

(a) Judgemental: 50% increase;
(b) DSE-normal and DSE-Chebyshev: 33.6% increase;
(c) DSE-empirical: 29.4% increase.

The increases displayed by the DSE-based methods indicate that, for the 33 series in the database, the forecast errors are increasing. This may be due either to increasing noise in the series or to the inappropriateness of the smoothing factor. The increase in the judgemental CIs is much larger. However, most of this increase occurred in the first few periods. Increases in CIs over the first four periods were:

(d) Judgemental: 28%;
(e) DSE-normal and DSE-Chebyshev: 10.4%;
(f) DSE-empirical: 10.7%.

This movement over the seven periods can be seen in Figure 1. The figure shows there is a considerable amount of adjustment made by people in establishing their CIs for the first few periods. Of course, this effect may be overcome with training sessions prior to the task. However, in an applied business setting it is likely that there is limited opportunity for pre-task training.

[Figure 1. Percentage change in average CIs (legend: Judgemental, DSE-normal; x-axis: changes between periods 1-2 to 6-7)]

(2) Analysis of the standard deviations of the descaled CIs also reveals considerable adjustments over time. Since the descaled CIs represent a measure of the relative variability about the mean forecast, the standard deviation of the CIs in a particular period indicates the extent of variation in the CIs that have been assigned to the 33 different series. Some series have a low noise element, and so demand narrow CIs; others have a high noise element, requiring wider CIs. The standard deviation of the descaled CIs then gives an indication of how variable the CIs were in response to the different levels of noise in the different series. An examination of the standard deviations of the descaled CIs in Table II for the different methods indicates that the judgemental method has a considerably lower standard deviation in the first few periods than the DSE-based methods. This suggests that people may be initially insensitive to different levels of noise in different series. However, Table II and Figure 2 indicate that the standard deviation of the CIs for the judgemental method increased markedly over the seven iterations, i.e.:

(a) Judgemental: 107.2%;
(b) DSE-normal and DSE-Chebyshev: 57.3%;
(c) DSE-empirical: 54.5%.

The movements in standard deviations over time can be seen in Figure 2. This figure reveals that the standard deviations of the CIs for the DSE-based methods are changing in the same direction and (apart from period 6-7) to a similar extent (as expected, since they are based on the same error data). However, the important point illustrated by Figure 2 is that the standard deviations of judgemental CIs are often changing in directions opposite to those of the DSE methods (in four out of the six changes the standard deviations of judgemental CIs have moved in directions opposite to the DSE CIs). This is especially apparent in the first three changes.

[Figure 2. Percentage change in the standard deviation of the CIs (legend: Judgemental, DSE-normal, DSE-empirical; x-axis: changes between periods 1-2 to 6-7)]

[Figure 3. Histograms of the magnitudes of differences between judgemental and DSE-based CIs (panels: comparison against DSE-normal, DSE-empirical and DSE-Chebyshev; x-axis runs from 'judgemental wider than DSE' to 'judgemental narrower than DSE', up to a factor of more than six)]

There is some evidence that people have inadequate conceptions of variability (see Hogarth, 1975), and this could be a reason why the range of CIs is increasing over the periods. In early periods it is possible that, due to misconceptions about noise, subjects have been very insensitive to different levels of noise in different series, thus ascribing CIs that do not vary greatly for series that may differ considerably in their noise component. This tendency may reverse itself as feedback is received and an understanding of the noise element is attained.

(3) An examination of Table II suggests that, in the aggregate, judgemental CIs tended to be narrower than the CIs of the DSE-based methods. However, paired t-tests on the descaled CIs between the judgemental, DSE-normal and DSE-empirical methods revealed no significant differences. These results are also illustrated by the histograms in Figure 3, which tabulate the magnitudes of the differences of the judgemental CIs from the DSE-based CIs. These results indicate that there are almost as many judgemental CIs that are wider as narrower than the DSE-normal CIs, with a slight tendency for them to be narrower. On the other hand, there is a slight tendency for judgemental CIs to be wider than the DSE-empirical CIs. Although the majority of the differences were between one and two times, the fact that there were cases where the differences were of the order of six times has significant implications for practical business forecasting. If this result represents a consistent response of people in judgemental CI estimation, the usefulness of these CIs may be questionable, especially in the budgeting environment. If we can assume that the DSE-empirical and DSE-normal methods represent rational and consistent approaches to CI estimation, the extent of the differences would tend to cast doubt on the degree to which we can rely on those generated judgementally. This doubt is further supported by the fact that about 15% of the judgemental CIs were wider than the DSE-Chebyshev CIs. Since these CIs represent the conservative bound, it is disturbing that such a relatively large percentage were wider than this conservative limit.

In summary, we have found that, in comparison with the DSE-based methods, judgemental CIs tend to be narrower (especially for the early periods). However, they increased over the course of the experiment. The standard deviation of the CIs also indicated that people were insensitive to different levels of noise in the series; but this also tended to increase over the first few trials. Finally, the tabulation of the differences between the judgemental and DSE-based methods suggested that, since the differences in many cases were large, the usefulness of the CIs in a business setting may be questionable (assuming that the DSE methods represented realistic and practical methods for constructing CIs).

Calibration of judgemental and DSE-based CIs
Calibration of probabilistic assessments for a continuous variable refers to the extent of correspondence between the assessed frequency of occurrence and the actual frequency for any given interval. Thus, if a CI has been created for the condition that 50% of the future actual values will fall within this interval, the intervals so created are said to be perfectly calibrated if 50% of the actual values in the future occur within this interval.
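A short sketch of this calibration measure follows; it simply counts how often the actual values fall inside the stated intervals. The array names are hypothetical and the flat series-by-period layout is an assumption for illustration.

```python
import numpy as np

def calibration_table(actuals, lo50, hi50, lo75, hi75):
    """Percentage of actual values falling within the 50% interval, within the 75% interval
    and outside the 75% interval. All inputs are flat arrays over series x periods."""
    actuals = np.asarray(actuals, dtype=float)
    in50 = (actuals >= np.asarray(lo50)) & (actuals <= np.asarray(hi50))
    in75 = (actuals >= np.asarray(lo75)) & (actuals <= np.asarray(hi75))
    return {"within 50%": 100 * in50.mean(),
            "within 75%": 100 * in75.mean(),
            "outside 75%": 100 * (~in75).mean()}
```

Perfect calibration would yield roughly 50%, 75% and 25% for the three entries respectively.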

Table III presents the calibration of the judgemental and statistical approaches, and indicates the number (and percentage) of times the actual value fell within the 50% interval, within the 75% interval and outside the 75% interval.

*While these results may seem to conflict with the results in Table II, it should be noted that there are a number of cases where judgemental CIs were more than six times the size of the DSE-empirical CIs. This tends to inflate the mean descaled CIs presented in Table II.


Table III. Percentage calibration of judgemental and DSE-based methods

Method          Within 50% interval   Within 75% interval   Outside 75% interval
Judgemental     37.2                  62.3                  37.7
DSE-empirical   40.7                  63.6                  36.4
DSE-normal      44.6                  67.5                  32.5
DSE-Chebyshev   75.8                  86.1                  13.9
Ideal           50.0                  74.9                  25.1

This table reveals that the judgemental, DSE-normal and DSE-empirical intervals were overconfident (in that less than 50% of the actual values occurred within the middle interval), but the limits based on the Chebyshev inequality were too wide (i.e. underconfident). DSE-normal and DSE-empirical were better calibrated than the judgemental method; however, the DSE-Chebyshev limits were possibly worse than the judgemental limits, but in the opposite direction. Of course, the underconfidence of the DSE-Chebyshev CIs was to be expected, as they represent conservative bounds. DSE-normal was perhaps best calibrated, although still overconfident. Apart from comparisons against DSE-Chebyshev, all paired t-tests on calibration revealed no significant differences. Similar results were obtained using the continuous ranked probability score (see Stael von Holstein, 1977).

To determine whether performance improved over the trials, calibration in the earlier periods was compared with that of the later periods, and Table IV presents this comparison. This table reveals that there is a definite improvement in calibration for the judgemental method in the later periods, especially for the 75% intervals. It seems that the upper limits of the CIs have been moved outwards to improve calibration for the upper intervals. While this has also occurred for the 50% intervals, it has been to a lesser extent.

Thus we can conclude that there has been some improvement in calibration in response to outcome feedback, and these results point to the acceptance of H3: that calibration improved over the seven trials of the experiment.

Table IV. Percentage calibration for different methods over different horizons

Method / periods       Within 50% interval   Within 75% interval   Outside 75% interval
Judgemental
  Periods 1-3          30.3                  56.6                  43.4
  Periods 5-7          40.4                  80.9                  19.1
DSE-normal
  Periods 1-3          45.5                  63.6                  36.4
  Periods 5-7          42.4                  63.6                  36.4
DSE-empirical
  Periods 1-3          37.4                  62.6                  37.4
  Periods 5-7          40.4                  60.6                  39.4
DSE-Chebyshev
  Periods 1-3          74                    85                    15
  Periods 5-7          76                    84                    16


Table V. Comparison of calibration over time by subject groups

                                  Calibration in periods 5-7
                                  Group A    Group B    Total
Calibration in     Group A        15         4          19
periods 1-3        Group B        9          5          14
                   Total          24         9          33

This seems to lead to a questioning of the generality of the conclusions of the previous literature that people do not learn from outcome feedback.

However, have the subjects who performed best in the initial periods also done so in the later periods? To test this proposition, a calibration score (defined as the number of times the actual value fell within the CIs) was computed for each subject for periods 1-3 and 5-7. Subjects were classified into two groups on the basis of their calibration score in periods 1-3: those who performed best, group A (with either one or two values falling within the CIs), and those who performed worst, group B (with none or all three actual values falling within the CIs). The performance of each group was then compared with the performance of the same group in periods 5-7, and Table V presents the results.

This table shows that for periods 1-3, 19 of the 33 subjects were well calibrated (members of group A). For periods 5-7, 15 of these 19 were also well calibrated. This suggests some consistency in performance over time. On the other hand, of those subjects who were badly calibrated in periods 1-3, nine had become better calibrated in the later periods. Thus there seems to be a general tendency for subjects to improve their calibration (see also Table IV). On the basis of this, we are unable to conclude that subjects were consistent over time in their calibration.

The influence of forecasting difficulty on CI widths
As mentioned earlier, one of the important conclusions from the literature on calibration of judgemental probabilistic assessments was that task difficulty in forecasting has an important effect on the level of confidence. Previous studies (especially Lichtenstein and Fischhoff, 1978) have found that where the task is easy to forecast, underconfidence abounds; conversely, where forecasting is difficult, overconfidence is prevalent.³

³ It should be emphasized that the time series forecasting task described in this paper is substantially different from those reviewed in Lichtenstein et al. (1982). In the time series task the nature of the variable is not revealed and the full data on which estimation can be based are available to the judge. In the experiments reviewed in Lichtenstein et al., subjects were aware of the nature of the variable and were often required to recall from memory the relevant information about it.

There can be two approaches to the problem of obtaining a metric of task difficulty in time series forecasting. On the one hand, we can examine the forecasting errors of the current experiment and stratify the series into those with high errors and those with low ones. The stratification is then done on an ex post basis. On the other hand, we can examine the results of previous experiments using the same technique and use these as the basis for stratification. Thus the stratification is done on an ex ante basis. The problem of the former approach is that the inherent variability of judgemental approaches directly influences the measure used for the level of difficulty. This problem is somewhat overcome if the latter approach is adopted, where the errors from several experiments with different people are considered. The averaging of the errors from different people will tend to eliminate some of this bias created by the variability of errors. Thus the latter approach has been adopted.

In determining the measures of forecasting difficulty, the experimental data referred to in Lawrence et al. (1985) were re-analysed. In this study, forecasts using the graphical technique were produced by two groups of subjects. The errors for both sets of forecasts were then combined to produce an average error for graphical forecasts. The 33 monthly series were then stratified into three groups based on the average error (a code sketch of this ex ante stratification follows the list):

(1) the eight series with the highest average error (the high-difficulty group);
(2) the seven series with the lowest average error (the low-difficulty group);
(3) the remaining 18 series that fell in between (the medium-difficulty group).⁴
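A minimal sketch of this ex ante stratification, assuming a hypothetical array of average graphical-forecast errors per series; the function name and group sizes passed as parameters are illustrative only.

```python
import numpy as np

def stratify_by_difficulty(avg_errors, n_high=8, n_low=7):
    """Rank the series by their average graphical-forecast error and label the highest-error
    series 'high', the lowest-error series 'low' and the remainder 'medium'."""
    avg_errors = np.asarray(avg_errors, dtype=float)
    order = np.argsort(avg_errors)                     # ascending: easiest series first
    groups = np.array(["medium"] * len(avg_errors), dtype=object)
    groups[order[:n_low]] = "low"
    groups[order[-n_high:]] = "high"
    return groups
```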

Table VI presents calibration for the three difficulty groups. This table reveals that the interquartile index for the judgemental method for series of both high and medium difficulty is below 50, thus indicating substantial overconfidence. However, for series with low difficulty in forecasting, the interquartile index is 57.1, indicating some degree of underconfidence. On the basis of this, there is some indication that difficulty in forecasting may have some effect on the widths of the CIs.

Group t-tests on the calibration scores reveal that those for the low-difficulty group were significantly different from those for both the medium (p < 0.03) and the high (p < 0.04) groups.

Table VI. Percentage calibration of methods for different difficulty groups

                    Within 50% interval   Within 75% interval   Outside 75% interval
Low difficulty:
  Judgemental       57.1                  81.6                  18.4
  DSE-empirical     44.9                  65.3                  34.7
  DSE-normal        46.9                  67.3                  32.7
  DSE-Chebyshev     69.4                  79.6                  20.4
High difficulty:
  Judgemental       32.1                  58.9                  41.1
  DSE-empirical     33.9                  57.1                  42.9
  DSE-normal        33.9                  60.7                  39.3
  DSE-Chebyshev     71.4                  85.7                  14.3
Medium difficulty:
  Judgemental       31.7                  56.3                  43.7
  DSE-empirical     42.1                  65.9                  34.1
  DSE-normal        48.4                  68.3                  31.7
  DSE-Chebyshev     80.2                  88.9                  11.1

⁴ For a more complete description of this methodology, see Lawrence et al. (1986).


Similar results were obtained by examining the ratios of the judgemental CIs to the DSE-normal and DSE-empirical CIs and by examining the histograms of the magnitudes of the differences between the judgemental and DSE-based methods. For the low-difficulty group, the frequency histograms reveal that the judgemental CIs were wider than the DSE-normal CIs on 76% of occasions, 80% for the DSE-empirical and 70% for the DSE-Chebyshev. All these results indicate that the calibration of the judgemental CIs for the low-difficulty group was significantly different from that of the other groups. However, as this result did not occur for the high-difficulty group, only limited support can be given to H2: that task difficulty affects calibration.

CONCLUSIONS

This study has demonstrated that, in comparison with DSE-normal and DSE-empirical CIs, judgemental intervals tended to be too narrow initially, but significant widening of the intervals occurred within the first few periods. However, the calibration of the judgemental intervals over all trials was not significantly different from that of DSE-normal and DSE-empirical. There was also some indication that people were not able to discriminate between different levels of noise in the specification of their CIs. Furthermore, the difficulty of forecasting the series was found to affect calibration, but only for those series that were easy to forecast.

APPENDIX: PROCEDURES FOR CALCULATING STATISTICAL CIs

The procedures used to calculate each of the DSE-based CIs are described below.

CIs based on the empirical distribution
The empirical distribution is a probability distribution built up from past observations (in this case forecast errors). The x% CIs are obtained by establishing cut-off values such that x% of the past errors occurred within the interval. Thus the 50% CI for 24 historical errors is obtained by determining the upper and lower values such that 12 (or 50%) of the past errors occurred within the interval. Similarly, the 75% CI is obtained by determining the upper and lower limits such that 18 of the 24 past errors occurred within the CI. From the viewpoint of this study, the advantage of the empirical distribution is that it is (by definition) perfectly calibrated for past errors. Of course, there is no guarantee that there will be perfect calibration for the forecast period. If, however, the series continues in the same pattern as before, excellent calibration is likely to continue. It should be noted that, by using the past 24 observations, the 75% intervals are based on only the upper and lower three observations. If these three observations are unrepresentative of the population, the 75% intervals are likely to be less accurate than the 50% intervals. It should also be emphasized that, by basing the CIs on the empirical distribution of the past 24 periods only, we are assuming that the sample of errors for the 24 prior periods represents the entire population of errors. This assumption was made because it is possible (indeed, highly probable) that the time series changes shape over the time period considered. It is also possible that the level of noise may change over time. For these reasons, it seems imperative that an analysis of past errors for the calculation of CIs be based on the most current level of noise in the series.⁵

⁵ To investigate whether the limitation of the sample to 24 past errors made a difference to the limits, the errors for the whole series were calculated and used to generate the CIs. These new CIs were then compared with the previous CIs in a paired t-test. This test revealed no differences.
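A sketch of how such empirical-distribution intervals could be computed from the last 24 errors follows. The percentile call and the assumption that the errors are expressed relative to the forecast are illustrative choices, not the authors' stated procedure.

```python
import numpy as np

def empirical_ci(forecast, past_pct_errors, coverage=0.50):
    """CI whose cut-offs bound the central `coverage` fraction of the past percentage errors.

    For 24 errors, coverage=0.50 keeps the middle 12 errors and coverage=0.75 the middle 18."""
    e = np.asarray(past_pct_errors, dtype=float)
    tail = 100 * (1 - coverage) / 2
    lo_e, hi_e = np.percentile(e, [tail, 100 - tail])
    return forecast * (1 + lo_e), forecast * (1 + hi_e)
```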


CIs based on the normal distribution
This approach assumes that the errors are distributed according to the normal distribution⁶ and that the CIs are accordingly calculated using the t-distribution⁷ (with 23 degrees of freedom).

⁶ To test for this condition, the distributions of DSE errors for nine selected time series were analysed using the non-parametric Kolmogorov-Smirnov two-sample test for normality. The forecast errors of all nine series passed this test for normality.
⁷ Since there are only 24 observations used in calculating the CI, the t-statistic should be used rather than the z-statistic of the normal distribution (see Winkler and Hays, 1975).
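A corresponding sketch for the t-based intervals is given below, again assuming percentage errors relative to the forecast and an interval centred on the point forecast; both choices are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def t_based_ci(forecast, past_pct_errors, coverage=0.50):
    """CI assuming normally distributed percentage errors; the half-width is a t-multiple
    (23 degrees of freedom for 24 past errors) of the errors' standard deviation."""
    e = np.asarray(past_pct_errors, dtype=float)
    k = stats.t.ppf(0.5 + coverage / 2, df=len(e) - 1)   # ~0.685 for 50%, ~1.18 for 75% (23 d.f.)
    half = k * e.std(ddof=1) * forecast
    return forecast - half, forecast + half
```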

CIs based on the Chebyshev inequality
The previous procedures have made various assumptions about the shape of the distribution of forecast errors. The Chebyshev inequality represents a procedure for determining CIs that makes no assumptions about the distribution. In that respect it can be classed as a non-parametric method of determining CIs.

This inequality can also be stated as:

P( |(X − μ)/σ| ≤ k ) ≥ 1 − 1/k²

i.e. the probability that a standardized random variable has absolute magnitude less than or equal to k is always greater than or equal to 1 − (1/k²). This means, for example, that the probability of an event having a standardized value of three or fewer units is always greater than or equal to [1 − (1/9)], or 0.89.

For our task of establishing CIs in a time series setting, the Chebyshev inequality specifies the following CIs:

50% interval: F ± √2σ (approximately F ± 1.41σ)
75% interval: F ± 2σ

where F is the point forecast and σ is the standard deviation of the past forecast errors.

Thus, for example, the 50% CI implies that there is at least a 50% chance that the actual value will fall within the specified limits. The importance of this procedure is that these limits do not make any assumptions about the shape of the distribution of forecast errors.

It should be noted that there is a constant relationship between the DSE-Chebyshev CIs and the DSE-normal ones. On the basis of the formulae specified above, the DSE-Chebyshev CIs are always 2.064 times those of the DSE-normal ones. This again reflects the conservatism of the Chebyshev inequality.
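A sketch of the Chebyshev intervals under the same illustrative assumptions as the earlier blocks, together with a numerical check of the 2.06 ratio quoted above for the 50% intervals:

```python
import numpy as np
from scipy import stats

def chebyshev_ci(forecast, past_pct_errors, coverage=0.50):
    """Distribution-free CI from the Chebyshev inequality: 1 - 1/k**2 = coverage,
    so k = sqrt(2) for the 50% interval and k = 2 for the 75% interval."""
    e = np.asarray(past_pct_errors, dtype=float)
    k = 1.0 / np.sqrt(1.0 - coverage)
    half = k * e.std(ddof=1) * forecast
    return forecast - half, forecast + half

# For the 50% intervals, the ratio of the Chebyshev to the t-based half-width is
# sqrt(2) / t(0.75, 23 d.f.), which is approximately 2.06.
print(np.sqrt(2) / stats.t.ppf(0.75, 23))
```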

REFERENCES

Brown, T. A., ‘An experiment in probabilistic forecasting’ (1973), as referenced in Lichtenstein et al. (1982).

Hogarth, R. M., ‘Cognitive processes and the assessment of subjective probability distributions’, Journal of the American Statistical Association, 70 (June 1975), 271-94.

Koriat, A., Lichtenstein, S. and Fischhoff, B., ‘Reasons for confidence’, Journal of Experimental Psychology: Learning and Memory, 6 (March 1980), 107-18.

Lawrence, M. J., Edmundson, R. H. and O’Connor, M. J., ‘An examination of judgemental time series extrapolation’, International Journal of Forecasting, 1 (1985), 25-36.

Lawrence, M. J., Edmundson, R. H. and O’Connor, M. J., ‘The accuracy of combining judgemental and statistical forecasts’, Management Science, December (1986).



Lichtenstein, S. and Fischhoff, B., ‘How well do probability experts assess probability?’, unpublished study reported in Lichtenstein et al. (1982).

Lichtenstein, S., Fischhoff, B. and Phillips, L., ‘Calibration of probabilities: the state of the art’, in Jungermann and de Zeeuw (eds), Decision Making and Change in Human Affairs, Dordrecht: Reidel (1977).

Lichtenstein, S., Fischhoff, B. and Phillips, L., ‘Calibration of probabilities: the state of the art to 1980’, in Kahneman, Slovic and Tversky (eds), Judgement Under Uncertainty: Heuristics and Biases, Cambridge: Cambridge University Press (1982).

Makridakis, S., Andersen, A., Carbone, A., Fildes, R., Hibon, M., Lewandowski, R., Newton, J., Parzen, E. and Winkler, R., ‘The accuracy of extrapolation (time series) methods: results of a forecasting competition’, Journal of Forecasting, 1 (1982), 111-53.

O’Connor, M. J., ‘Models of behaviour and confidence in judgement: a review’, accepted for publication in International Journal of Forecasting (1988).

Petersen, D. K. and Pitz, G. F., ‘Effects of amount of information on predictions of uncertain quantities’, Acta Psychologica (1986).

Pitz, G. F., ‘Subjective probability distributions for imperfectly known quantities’, in Gregg, L. W. (ed.), Knowledge and Cognition, New York: Wiley (1974).

Solomon, I., Ariyo, A. and Tomassini, L. A., ‘Contextual effects on the calibration of probabilistic judgements’, Journal of Applied Psychology, 70 (1985), 528-32.

Stael von Holstein, C. S., ‘The effect of learning on the assessment of subjective probability distributions’, Organisational Behavior and Human Performance, 6 (1971), 304-15.

Stael von Holstein, C. S., ‘Probabilistic forecasting: an experiment related to the stock market’, Organisational Behavior and Human Performance, 8 (1972), 139-58.

Stael von Holstein, C. S., ‘The continuous ranked probability score in practice’, in Jungermann, H. and de Zeeuw, G. (eds), Decision Making and Change in Human Affairs, Dordrecht: Reidel (1977).

Tversky, A. and Kahneman, D., ‘Judgement under uncertainty: heuristics and biases’, Science, 185 (1974), 1124-31.

Wallsten, T. S. and Budescu, D. V., ‘Encoding subjective probabilities: a psychological and psychometric review’, Management Science, 29 (February 1983), 151-73.

Williams, W. H. and Goodman, M. L., ‘A simple method for the construction of empirical confidence limits for economic forecasts’, Journal of the American Statistical Association, 66 (December 1971), 752-4.

Winkler, R. L. and Hays, W. L., Statistics: Probability, Inference and Decision, New York: Holt, Rinehart and Winston (1975).

Authors’ biographies: Marcus O’Connor (PhD, University of New South Wales) is a Senior Lecturer in Information Systems. He has completed undergraduate and postgraduate degrees at the University of New South Wales. His research interests centre on examining the efficacy of judgemental techniques in forecasting from time series data, the contribution of non-time series data to forecasting accuracy, the processes by which forecasting is undertaken in industry and the development of decision support systems to aid judgemental forecasters. He is also interested in the application of decision aids to organizational decision making. Articles written by him have appeared in Management Science, Journal of Forecasting and the International Journal of Forecasting.

Michael Lawrence teaches in the School of Information Systems at the University of New South Wales. He holds a PhD from the University of California, Berkeley, in Operations Research. His current research interests are in decision support systems, especially forecasting decision support systems focusing on the role of human judgement and how best to aid it. Articles published by him have appeared in Management Science, Journal of Forecasting, International Journal of Forecasting, Organisational Behaviour and Human Decision Processes and Journal of Systems and Software.