USING RECURSIVE RESIDUALS, CALCULATED ON ADAPTIVELY-ORDERED
OBSERVATIONS, TO IDENTIFY OUTLIERS IN LINEAR REGRESSION
Farid Kianifard and William H. Swallow
Institute of Statistics Mimeo Series No. 19051{
September 1987
Department of Statistics
North Carolina State University
Raleigh, NC 27695-8203
SUMMARY
A new procedure for identifying outliers or influential
observations is proposed. The procedure uses recursive residuals,
calculated on observations which have been ordered according to
their studentized residuals, values of Cook's D, or another regression
diagnostic of the user's choice. These recursive residuals,
appropriately standardized, have approximate Student's t
distributions. Thus, convenient critical values are available for
deciding which observations merit scrutiny and, perhaps, special
treatment. The power of the test procedure to identify one or more
outliers is investigated through simulation. Power is generally high,
but depends on the number and configuration of the outliers, that is,
their placement with respect to the main body of the data. The use
of adaptive ordering increases power and helps to combat the
masking of one outlier by another when multiple outliers are present.
Key Words: Cook's D; Influential observations; Regression diagnostics;
Studentized residual
1. Introduction
Data from many fields are commonly analyzed using linear
regression. These data sets often contain outliers or influential
observations, and it is important that such observations be identified
in the course of a thorough statistical analysis. In some cases,
observations which fall outside the pattern seen in the bulk of the
data, and which are thus known as outliers, are important and of
interest in their own right. Indeed, they may be the most
interesting and important observations in the entire data set, and
their identification a matter of high priority. Identification of such
an outlier may direct future research effort to collecting additional
data in the region (treatment combination) where the interesting
outlier was observed.
Furthermore, it is always important to identify aberrant
observations, either valid outliers or erroneous data points, with an
eye to removing them from the data set or at least down-weighting
them in the analysis of the rest of the data. Clearly, erroneous data
should be corrected, if possible, or deleted. But even valid outliers
often should be set aside lest they have undue impact or influence
on the analysis, seriously distorting conclusions about relationships
between variables in the main body of the data. Such data points
are often called influential observations. Of course, removing
observations from data sets should not be undertaken lightly;
objective methods are required for identifying candidates for deletion
or other special treatment.
The study of outliers and influential observations in linear models
has attracted considerable interest in the past decade. A number of
books have been published devoted largely or exclusively to this
subject: Belsley, Kuh, and Welsch (1980), Hawkins (1980), Cook and
Weisberg (1982), Barnett and Lewis (1984), Atkinson (1985), Rousseeuw
and Leroy (1987), and Chatterjee and Hadi (1988). A recent review
article by Chatterjee and Hadi (1986) summarizes many of the
well-known outlier-identification statistics and their interrelationships.
We consider here the usual linear regression model:
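$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{1.1}$$
where $\mathbf{Y} = (y_1, \ldots, y_n)'$ is an $n \times 1$ vector of values of the response
variable, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)'$ is a $p \times 1$ vector of unknown parameters,
$\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)'$ is an $n \times p$ matrix of explanatory variables with
$\mathrm{rank}(\mathbf{X}) = p$, and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)'$ is an $n \times 1$ vector of independent
normal random variables with mean 0 and (unknown) variance $\sigma^2$.
For $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, the ordinary least squares (OLS) estimator of $\boldsymbol{\beta}$,
the vector of OLS residuals is given by
$$\mathbf{e} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}} = (\mathbf{I} - \mathbf{H})\mathbf{Y},$$
where $\mathbf{H} = (h_{ij}) = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. The residual mean square estimate of
$\sigma^2$ is then
$$s^2 = \mathbf{e}'\mathbf{e}/(n - p).$$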
The OLS residuals $e_i$ are correlated and their variances differ, as
is evident from their variance-covariance matrix:
$$\mathrm{Var}(\mathbf{e}) = \sigma^2(\mathbf{I} - \mathbf{H}). \tag{1.2}$$
As equation (1.2) implies, the distribution of the OLS residuals even
depends, through $\mathbf{H}$, on the particular design matrix $\mathbf{X}$ being
considered.
A scaled version of the $e_i$ can be defined as
$$e_{si} = e_i / \{s(1 - h_{ii})^{1/2}\}, \qquad i = 1, \ldots, n. \tag{1.3}$$
The $e_{si}$ are usually called the studentized residuals (STUDENT in
SAS), the name we will use. Other authors have called them the
standardized residuals or internally studentized residuals. Another
scaled version of the $e_i$, advocated by Belsley et al. (1980) and
others, is often called the jackknifed or externally studentized
residual (RSTUDENT in SAS), and is defined as
$$t_i = e_i / \{s_{(i)}(1 - h_{ii})^{1/2}\}, \qquad i = 1, \ldots, n, \tag{1.4}$$
where $s^2_{(i)}$ is the residual mean square estimate of $\sigma^2$ obtained with
the $i$th observation omitted. Atkinson (1981) called $t_i$ the
"cross-validatory" residual and, noting that
$t_i = e_{si}\{(n-p-1)/(n-p-e_{si}^2)\}^{1/2}$, pointed out that the $t_i$ are a monotone
transformation of the $e_{si}$. While the above scaled versions of the
residuals $e_i$ have approximately unit variance, they are still
correlated.
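As a minimal sketch (not part of the original report), the scaled residuals of (1.3) and (1.4) can be computed with NumPy as below. The function name `studentized_residuals`, the simulated data, and the use of the standard leave-one-out identity $s^2_{(i)} = \{(n-p)s^2 - e_i^2/(1-h_{ii})\}/(n-p-1)$ in place of $n$ separate refits are assumptions made for illustration.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (e, e_s, t): OLS, internally and externally studentized residuals."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix H = X (X'X)^{-1} X'
    h = np.diag(H)                             # leverages h_ii
    e = y - H @ y                              # OLS residuals e = (I - H) y
    s2 = e @ e / (n - p)                       # residual mean square s^2
    e_s = e / np.sqrt(s2 * (1.0 - h))          # equation (1.3)
    # s_(i)^2 from the leave-one-out identity (assumed shortcut, no refitting)
    s2_del = ((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_del * (1.0 - h))        # equation (1.4)
    return e, e_s, t

# Hypothetical data: a straight line with unit-variance normal errors
rng = np.random.default_rng(0)
x = 15 * rng.uniform(size=25)
X = np.column_stack([np.ones(25), x])
y = x + rng.normal(size=25)

e, e_s, t = studentized_residuals(X, y)
n, p = X.shape
# Atkinson's relation: t_i is a monotone transformation of e_si
t_from_es = e_s * np.sqrt((n - p - 1) / (n - p - e_s**2))
print(np.allclose(t, t_from_es))
```

The final check confirms numerically that the two routes to $t_i$ agree, which is why ordering by $|e_{si}|$ and ordering by $|t_i|$ give the same sequence.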
A common practice is to plot the least squares residuals or $e_{si}$
or $t_i$ against variables such as $\hat{y}_i$ or one or more of the explanatory
variables, or in serial order, to detect outliers. These plots suffer
from the fact that the impact of an outlier is not confined to inflating
only its own $e_i$, $e_{si}$, or $t_i$; it may inflate or deflate the $e_i$, $e_{si}$, or $t_i$
of other observations too, perhaps making itself more or less
conspicuous in the process. In simple linear regression, for example,
an outlier that is influential in determining the slope of the fitted
line draws the line toward itself, tending to inflate residuals
associated with other observations, while giving itself a smaller
residual than one might expect. Outliers also inflate the $s_{(i)}$ or $s$
used to scale the $t_i$ or $e_{si}$, respectively. When a single outlier is
present, $s_{(i)}$ will be unaffected for the outlier, but inflated for all
other observations; the $t_i$ for nonoutliers will then shrink, leaving
the outlier more exposed. When multiple outliers are present, concern
about masking (outliers hiding each other) and swamping (making
nonoutliers appear to be outliers) is much greater. Multiple outliers
can reinforce or cancel each other's influence, presenting the data
analyst with a difficult and potentially confusing outlier-identification
problem. Concerns about masking and swamping are by no means
specific to simple graphical procedures; attempts to use the $e_i$, $e_{si}$,
or $t_i$ nongraphically are affected too.
Detecting outliers which are influential is of particular interest.
Of course, different observations may be influential in different
calculations; the estimation of the parameter vector $\boldsymbol{\beta}$ is generally a
calculation of prime interest. Cook's (1977) D is a well-known measure
of influence of the $i$th observation on $\hat{\boldsymbol{\beta}}$. It is defined as
$$D_i = (\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)}) / (p s^2), \tag{1.5}$$
where $\hat{\boldsymbol{\beta}}_{(i)}$ is the ordinary least squares estimate of $\boldsymbol{\beta}$ obtained after
deleting the $i$th observation. Chatterjee and Hadi (1986) give some
alternative measures of influence, preferred by some data analysts,
but closely akin to Cook's D.
Measures of the influence of the $i$th observation on $\hat{\boldsymbol{\beta}}$ through
$\mathrm{Var}(\hat{\boldsymbol{\beta}})$ can be based on the change in the volume of confidence
ellipsoids when the $i$th observation is removed. One such measure,
introduced by Belsley et al. (1980), is
$$\mathrm{COVRATIO}_i = \det\{s^2_{(i)}(\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\} / \det\{s^2(\mathbf{X}'\mathbf{X})^{-1}\}, \tag{1.6}$$
where $\mathbf{X}_{(i)}$ denotes $\mathbf{X}$ with its $i$th row deleted.
$\mathrm{COVRATIO}_i$ is a ratio of the estimated generalized variances (see, e.g.,
Theil, 1971, p. 124) of the regression coefficients with and without the
$i$th observation deleted from the data, and, therefore, it can be
interpreted as a measure of the effect of the $i$th observation on the
efficiency of estimating $\boldsymbol{\beta}$. A value of $\mathrm{COVRATIO}_i$ greater than one
indicates that deleting the $i$th observation impairs efficiency, whereas
a value less than one indicates increased efficiency of estimation.
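The two influence measures can be illustrated with a brute-force sketch that literally deletes each observation and refits. This is an illustrative assumption about how one might compute them, not the authors' code; the function name `cooks_d_and_covratio` and the simulated data are hypothetical. The closing lines cross-check the results against the well-known closed forms $D_i = e_{si}^2 h_{ii}/\{p(1-h_{ii})\}$ and $\mathrm{COVRATIO}_i = (s^2_{(i)}/s^2)^p/(1-h_{ii})$.

```python
import numpy as np

def cooks_d_and_covratio(X, y):
    """Compute D_i of (1.5) and COVRATIO_i of (1.6) by refitting without case i."""
    n, p = X.shape
    XtX = X.T @ X
    beta = np.linalg.solve(XtX, X.T @ y)             # full-data OLS estimate
    s2 = np.sum((y - X @ beta) ** 2) / (n - p)       # residual mean square s^2
    D = np.empty(n)
    covratio = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                     # drop the i-th observation
        Xi, yi = X[keep], y[keep]
        XtX_i = Xi.T @ Xi
        beta_i = np.linalg.solve(XtX_i, Xi.T @ yi)   # estimate without case i
        s2_i = np.sum((yi - Xi @ beta_i) ** 2) / (n - p - 1)
        d = beta - beta_i
        D[i] = d @ XtX @ d / (p * s2)
        # ratio of estimated generalized variances: case i deleted / all cases
        covratio[i] = (np.linalg.det(s2_i * np.linalg.inv(XtX_i))
                       / np.linalg.det(s2 * np.linalg.inv(XtX)))
    return D, covratio

# Hypothetical data
rng = np.random.default_rng(0)
x = 15 * rng.uniform(size=25)
X = np.column_stack([np.ones(25), x])
y = x + rng.normal(size=25)
D, covratio = cooks_d_and_covratio(X, y)

# Cross-check against closed forms in terms of e_si, s_(i)^2 and h_ii
n, p = X.shape
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
s2 = e @ e / (n - p)
e_s = e / np.sqrt(s2 * (1 - h))
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
d_ok = np.allclose(D, e_s**2 * h / (p * (1 - h)))
c_ok = np.allclose(covratio, (s2_del / s2) ** p / (1 - h))
print(d_ok, c_ok)
```

The closed forms avoid the $n$ refits, which is presumably why packaged programs can report these diagnostics cheaply.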
Atkinson (1981) has suggested using half-normal plots of $t_i$ or of
a modified version of Cook's D with simulated envelopes. This
approach seems quite effective in identifying outliers and influential
observations, but poses a substantial computational burden.
Obtaining the envelopes (bounds) requires simulating and analyzing
some 20 or more samples, a serious drawback for application to large
samples or routine screening. For a more complete discussion of this
approach, see Atkinson (1985).
Packaged programs [e.g., BMDP(09R), Minitab, SAS(PROC REG),
SPSS(New Regression)] nowadays provide a selection of the above
and/or other regression diagnostics and measures of influence.
Multiple-case diagnostics to identify groups (subsets) of outlying
and/or influential observations also exist, including
algebraically-straightforward generalizations of single-case diagnostics
like Cook's D of (1.5). These too are likely to require a formidable
computational effort for even moderate-sized data sets and have not
sparked much interest among practitioners; they have not been
implemented in packaged programs. In practice, data analysts
generally rely on single-case diagnostics, despite their vulnerability
to masking and swamping when multiple outliers are present.
In this article, we develop a procedure based on recursive
residuals (Brown, Durbin, and Evans, 1975) for identifying (one or
more) outliers or influential observations in a linear regression
analysis. The procedure is described in Section 3, and its properties
investigated in a simulation study described in Section 4. An example
of application to a well-known data set is given in Section 5.
Although by no means do we advocate automatic deletion of
observations by a packaged program using this or any other
diagnostic or procedure, this procedure is suitable for routine
screening to identify points that deserve scrutiny and perhaps
special treatment.
2. Recursive Residuals
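Consider the regression model (1.1) with independent
identically-distributed (iid) normal errors $\boldsymbol{\varepsilon}$. Let $\mathbf{X}_{j-1}$ denote the
$(j-1) \times p$ matrix consisting of the first $j-1$ rows (observations) of $\mathbf{X}$.
Provided $(j-1) \ge p$ and assuming $(\mathbf{X}_{j-1}'\mathbf{X}_{j-1})$ to be nonsingular, $\boldsymbol{\beta}$ can
be estimated by
$$\hat{\boldsymbol{\beta}}_{j-1} = (\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1}\mathbf{X}_{j-1}'\mathbf{Y}_{j-1}, \tag{2.1}$$
where $\mathbf{Y}_{j-1}$ denotes the subvector consisting of the first $j-1$ elements
of $\mathbf{Y}$. Using $\hat{\boldsymbol{\beta}}_{j-1}$, one can "forecast" $y_j$ to be $\mathbf{x}_j'\hat{\boldsymbol{\beta}}_{j-1}$. The forecast
error is the difference $(y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}}_{j-1})$, and the variance of this
forecast error is $\sigma^2[1 + \mathbf{x}_j'(\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1}\mathbf{x}_j]$. The recursive residuals
are defined as
$$w_j = \frac{y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}}_{j-1}}{[1 + \mathbf{x}_j'(\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1}\mathbf{x}_j]^{1/2}}, \qquad j = p+1, \ldots, n. \tag{2.2}$$
Brown et al. (1975, Lemma 1) show that, under the model, and
assuming the inverses $(\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1}$ exist for all $j-1 \ge p$, $w_{p+1}, \ldots, w_n$
are independent $N(0, \sigma^2)$. This will be true for recursive residuals
calculated on randomly-ordered observations, or on observations
which have been ordered by any variable which is statistically
independent of the $w_j$ [e.g., values of an x variable or of
$\hat{\mathbf{Y}}_j = \mathbf{X}\hat{\boldsymbol{\beta}}_{j-1}$ for any of the $\hat{\boldsymbol{\beta}}_{j-1}$ of (2.1)]; the values of the recursive
residuals will depend on the order in which they are calculated, but
their distribution will not. Recursive residuals cannot be calculated
for the first $p$ observations. Hedayat and Robson (1970) defined the
recursive residuals in an alternative form and called them "stepwise
residuals".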
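A direct, deliberately unoptimized sketch of definition (2.2) follows: each recursive residual is obtained by refitting to the preceding observations. The function name `recursive_residuals` and the simulated data are assumptions for illustration, and the first $p$ rows are assumed to give a nonsingular $\mathbf{X}_p'\mathbf{X}_p$.

```python
import numpy as np

def recursive_residuals(X, y):
    """Recursive residuals w_j of (2.2), computed in the given row order.

    Entries for the first p observations are NaN, since no recursive
    residual exists for them.
    """
    n, p = X.shape
    w = np.full(n, np.nan)
    for j in range(p, n):                       # 0-based row j is observation j+1
        X_prev, y_prev = X[:j], y[:j]           # the preceding observations
        A_inv = np.linalg.inv(X_prev.T @ X_prev)
        beta_prev = A_inv @ X_prev.T @ y_prev   # equation (2.1)
        x_j = X[j]
        forecast_error = y[j] - x_j @ beta_prev
        w[j] = forecast_error / np.sqrt(1.0 + x_j @ A_inv @ x_j)
    return w

# Hypothetical data
rng = np.random.default_rng(0)
x = 15 * rng.uniform(size=25)
X = np.column_stack([np.ones(25), x])
y = x + rng.normal(size=25)

w = recursive_residuals(X, y)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_full) ** 2)
# The n - p recursive residuals carry the full residual sum of squares
print(np.isnan(w[:2]).all(), np.isclose(np.sum(w[2:] ** 2), rss))
```

Permuting the rows of `X` and `y` changes the individual $w_j$ but not their sum of squares, which illustrates the remark that the values depend on the ordering.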
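The $w_j$ could be calculated using the conventional least squares
formula repeatedly to compute each $\hat{\boldsymbol{\beta}}$ vector in the sequence
$\hat{\boldsymbol{\beta}}_p, \ldots, \hat{\boldsymbol{\beta}}_{n-1}$. However, the computations are made much more efficient
using the following updating formulae (Plackett, 1950; Phillips and
Harvey, 1974; Brown et al., 1975):
$$(\mathbf{X}_j'\mathbf{X}_j)^{-1} = (\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1} - \frac{(\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1}\mathbf{x}_j\mathbf{x}_j'(\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1}}{1 + \mathbf{x}_j'(\mathbf{X}_{j-1}'\mathbf{X}_{j-1})^{-1}\mathbf{x}_j}, \tag{2.3}$$
$$\hat{\boldsymbol{\beta}}_j = \hat{\boldsymbol{\beta}}_{j-1} + (\mathbf{X}_j'\mathbf{X}_j)^{-1}\mathbf{x}_j(y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}}_{j-1}), \tag{2.4}$$
$$S_j = S_{j-1} + w_j^2, \tag{2.5}$$
where $S_j = (\mathbf{Y}_j - \mathbf{X}_j\hat{\boldsymbol{\beta}}_j)'(\mathbf{Y}_j - \mathbf{X}_j\hat{\boldsymbol{\beta}}_j)$.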
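The updating scheme can be sketched as follows, assuming the usual rank-one (Plackett-type) update of the inverse, the corresponding update of the coefficient vector, and the running residual sum of squares $S_j = S_{j-1} + w_j^2$. The function name `recursive_residuals_updating` and the data are hypothetical; only one matrix inversion, for the initial basis of $p$ observations, is performed.

```python
import numpy as np

def recursive_residuals_updating(X, y):
    """Recursive residuals via rank-one updates instead of repeated refits."""
    n, p = X.shape
    A_inv = np.linalg.inv(X[:p].T @ X[:p])     # (X_p' X_p)^{-1} for the basis
    beta = A_inv @ X[:p].T @ y[:p]             # exact fit to the first p rows
    S = 0.0                                    # residual sum of squares S_p = 0
    w = np.full(n, np.nan)
    for j in range(p, n):
        x_j = X[j]
        f = 1.0 + x_j @ A_inv @ x_j            # forecast variance factor
        err = y[j] - x_j @ beta                # forecast error
        w[j] = err / np.sqrt(f)
        Ax = A_inv @ x_j
        A_inv = A_inv - np.outer(Ax, Ax) / f   # update of the inverse
        beta = beta + (A_inv @ x_j) * err      # update of the estimate
        S += w[j] ** 2                         # update of the sum of squares
    return w, beta, S

# Hypothetical data
rng = np.random.default_rng(0)
x = 15 * rng.uniform(size=25)
X = np.column_stack([np.ones(25), x])
y = x + rng.normal(size=25)

w, beta_n, S_n = recursive_residuals_updating(X, y)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_full) ** 2)
print(np.allclose(beta_n, beta_full), np.isclose(S_n, rss))
```

After the last update, the running estimate and sum of squares coincide with the full-sample OLS fit, which is a convenient correctness check for any implementation of the recursion.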
BLUS residuals (see Theil, 1971, Chapter 5) have distributional
properties similar to recursive residuals but they are not as easy to
compute. Computational considerations, however, are not the main
advantage of the recursive residuals. Unlike BLUS residuals,
recursive residuals are in one-to-one correspondence with the $n-p$
observations for which they were calculated, an important property
when the goal is outlier detection.
Recursive residuals have been used by Hedayat and Robson
(1970) and Harvey and Phillips (1974) in testing for heteroscedasticity,
by Phillips and Harvey (1974) in constructing an exact test for first
order autocorrelation using the von Neumann ratio, by Brown et al.
(1975) in testing for structural change over time, and by Harvey and
Collier (1977) in testing for possible model misspecifications. Galpin
and Hawkins (1984) proposed the use of recursive residuals in
graphical procedures to check the model assumptions. Du Toit,
Steyn, and Stumpf (1986) provided a program for calculating
recursive residuals and using them in normal probability and
cumulative sum plotting; their program uses PROC MATRIX in SAS.
3. The Test Procedure
To motivate our approach, we quote some remarks about recursive
residuals from Barnett and Lewis (1984, p. 294):
"These would seem to have potential for the study of
outliers, although no progress on this front is evident.
There is a major difficulty in that the labelling of the
observations is usually done at random, or in relation to
some concomitant variable, rather than 'adaptively' in
response to the observed sample values (which might be a
desirable prospect from the outlier standpoint)."
Accordingly, we propose the following strategy for labelling or
ordering the observations, and calculating recursive residuals and
test statistics:
1) Fit the regression model to the data.
2) Compute values of an appropriate regression diagnostic
(e.g., the studentized residual or Cook's D) for each of the
n observations.
3) Order the observations according to the chosen diagnostic
measure.
4) Use the first p observations in the ordered data set to form
the "basis" for computing recursive residuals.
5) Compute recursive residuals, $w_j$, for the remaining (n-p)
ordered observations.
6) Calculate the statistics $w_j/s_{(i)}$, $j = p+1, \ldots, n$, comparing
the computed values against values of Student's t with
n-p-1 df.
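The six steps above can be sketched in code as follows, ordering by $|e_{si}|$. This is an illustrative reading of the procedure, not the authors' program: the function name `recursive_outlier_test`, the planted outlier, and the seed are hypothetical, and the critical value 2.074 (the tabulated .975 percentile of Student's t with 22 df, appropriate for $n = 25$, $p = 2$) is passed in by hand rather than computed.

```python
import numpy as np

def recursive_outlier_test(X, y, crit):
    """Return (stat, flagged): w_j / s_(i) per original observation, and the
    indices whose |statistic| exceeds the critical value `crit`."""
    n, p = X.shape
    # Steps 1-2: fit by OLS; compute e_si and the deleted-case estimates s_(i)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)
    e_s = e / np.sqrt(s2 * (1.0 - h))
    s_del = np.sqrt(((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1))
    # Step 3: order the observations by ascending |e_si|
    order = np.argsort(np.abs(e_s))
    Xo, yo = X[order], y[order]
    # Steps 4-5: the first p ordered observations form the basis; compute
    # recursive residuals for the remaining n - p ordered observations
    stat = np.full(n, np.nan)
    for j in range(p, n):
        A_inv = np.linalg.inv(Xo[:j].T @ Xo[:j])
        beta_prev = A_inv @ Xo[:j].T @ yo[:j]
        w_j = (yo[j] - Xo[j] @ beta_prev) / np.sqrt(1.0 + Xo[j] @ A_inv @ Xo[j])
        # Step 6: scale by s_(i) of the observation occupying ordered slot j
        stat[order[j]] = w_j / s_del[order[j]]
    flagged = np.flatnonzero(np.abs(stat) > crit)
    return stat, flagged

# Hypothetical example: one outlier planted 10 error standard deviations
# above the true line y = x, at observation index 7
rng = np.random.default_rng(1)
x = 15 * rng.uniform(size=25)
y = x + rng.normal(size=25)
y[7] = x[7] + 10.0
X = np.column_stack([np.ones(25), x])

stat, flagged = recursive_outlier_test(X, y, crit=2.074)
print(flagged)
```

Statistics are stored against the original observation numbers, so that flagged cases can be inspected directly; the $p$ basis observations carry no statistic and are left as NaN.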
Under (1.1) with normality, the (unordered) recursive residuals $w_j$
are iid $N(0, \sigma^2)$ random variables. Hence, if we were to estimate $\sigma^2$ by
$\hat{\sigma}_j^2$, an estimate of $\sigma^2$ which was independent of $w_j$, then $w_j/\hat{\sigma}_j$ would
have an exact t distribution with the degrees of freedom (df) of $\hat{\sigma}_j^2$.
To test the null hypothesis that the $j$th observation is not an outlier,
we could then compare each $w_j/\hat{\sigma}_j$ to percentiles of the appropriate t
distribution and reject the null hypothesis whenever $|w_j/\hat{\sigma}_j|$
exceeded the critical value of t. The same would be true when
recursive residuals $w_j$ were calculated on observations which had
been ordered by any variable which was statistically independent of
those $w_j$. Such ordering variables include the $h_{ii}$, the diagonal
entries of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, which are sometimes used
as a measure of leverage. However, in general, when the
observations are adaptively ordered, the ordering variable will not be
independent of the recursive residuals, and the exact $N(0, \sigma^2)$
property of the recursive residuals will be voided. Furthermore,
there are compelling reasons to use estimates of $\sigma^2$ that may not be
independent of the $w_j$ (see Section 4); the estimate of $\sigma^2$ we adopt
and advocate is $s^2_{(i)}$ used for $t_i$ of (1.4). The test statistics $w_j/s_{(i)}$
have approximate t distributions under the null hypotheses.
When an appropriate diagnostic measure is used to order the
observations, outliers and/or influential observations can be expected
to appear late in the sequence of recursive residuals. The $w_j$ for
data points which precede them in the ordered set then, by
construction, will be unaffected by these outliers, reducing the
potential for masking and swamping. The adaptive ordering also
makes it highly unlikely that outliers will appear among the first $p$
ordered observations for which no recursive residuals are calculated
and no tests are possible; that is, the ordering yields a "clean" basis
set for calculating recursive residuals.
We consider here three diagnostic measures according to which
the observations could be ordered: arranging them in ascending
order of $|e_{si}|$ or $D_i$ or in descending order of $\mathrm{COVRATIO}_i$. These
measures represent different classes of regression diagnostics.
The studentized residual, $e_{si}$ of (1.3), was chosen over $t_i$ of (1.4)
because it is better known and more widely available through
packaged statistical programs. Because $t_i$ is a monotone function of
$e_{si}$, both give the same ordering. Similarly, Cook's D was chosen
because it is more widely known and available than other
closely-related measures: DFFITS and the Modified Cook's D
(Chatterjee and Hadi, 1986). The COVRATIO is less widely available
through standard packages, but represents a different class of
diagnostics. We did not use the $h_{ii}$, because they identify only
outliers with respect to the x range, but take no account of
observations outlying in the response variable y.
4. Properties of the Test Procedure
We now present simulation results, first to justify our choice of
$s_{(i)}$ as the most suitable estimate of $\sigma$, and then to evaluate the
performance of the test procedure suggested in Section 3 (referred
to hereafter as the "recursive method"). We used a simple linear
regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $n = 25$ in all simulations
discussed here. The residuals are unaffected by the particular
values of $\beta_0$ and $\beta_1$ used; we set $\beta_0 = 0$, $\beta_1 = 1$. The x's were
generated as uniform (0,1) variables multiplied by 15. The error
terms $\varepsilon_i$ were generated as $N(0,1)$ random variables. Necessary
modifications were made to introduce outliers as described below. All
results are based on 1000 simulated samples of $n = 25$ each, with the
x's regenerated after every 100 samples. The same 10 sets of x's
were used in all simulations. The nominal level of the test was
$\alpha = .05$ throughout.
Table 1 shows the $\alpha$-levels of the test observed in our simulations
using different estimates of $\sigma$ with and without ordering by the
studentized residual. The estimates $s$ and $s_{(i)}$ are as defined in
Section 1 and
$$s_{j-1} = \Big\{\sum_{i=3}^{j-1} w_i^2/(j-3)\Big\}^{1/2}, \qquad j = 4, \ldots, 25. \tag{4.1}$$
The estimates $s_{j-1}$ are paired with the $\hat{\boldsymbol{\beta}}_{j-1}$ of (2.1) used to obtain
the recursive residuals in (2.2); for a fixed value of $j-1$, $s^2_{j-1}$ is
nothing more than the usual error mean square estimate of $\sigma^2$
obtained by fitting a line to the first $j-1$ observations only. We
calculate $s_{j-1}$ in (4.1) using the relation that, for a given fitted line,
the sum of squares of ordinary residuals equals the sum of squares
of recursive residuals (the first $p$ being identically zero). The
column headings in Table 1 correspond to the cases where the
observations were arranged according to increasing values of $|e_{si}|$
and when there was no ordering. For the unordered case, $s_{j-1}$ is
clearly independent of $w_j$, and $w_j/s_{j-1}$ will have an exact t
distribution.
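The relation behind (4.1) can be written out as a short derivation. It assumes the updating relation $S_j = S_{j-1} + w_j^2$ for the residual sum of squares $S_j$ of the fit to the first $j$ observations, together with $S_p = 0$ (a fit of $p$ parameters to $p$ observations is exact); the symbol $S_j$ follows Section 2.

```latex
\begin{align*}
S_{j-1} &= S_p + \sum_{i=p+1}^{j-1} w_i^2 = \sum_{i=p+1}^{j-1} w_i^2, \\
s_{j-1}^2 &= \frac{S_{j-1}}{(j-1) - p}
           = \frac{1}{j-3} \sum_{i=3}^{j-1} w_i^2
           \qquad (p = 2), \\
\frac{w_j}{s_{j-1}} &\sim t_{j-3}
           \quad \text{for unordered observations, } j = 4, \ldots, 25 .
\end{align*}
```

The last line uses the independence of $w_j$ from $w_3, \ldots, w_{j-1}$ in the unordered case; the small degrees of freedom $j-3$ for early observations is one reason this estimate has little power there.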
Because recursive residuals, $w_j$, are calculated only for $j =
3, \ldots, 25$, only 23 are available for testing in each unordered sample.
When Sj-l is used, only 22 can be tested. When the sample is
ordered, we assume that outliers and influential observations will
appear late in the sequence and be tested, i.e., we assume that all 25
observations in each sample are effectively tested. The divisors used
in calculating the entries in Tables 1-3 were adjusted accordingly
(see footnotes to Tables 1, 2). The fact that an outlier could be
untested in the unordered case itself argues for preordering if
recursive residuals are to be used as the basis for an outlier
detection statistic.
Table 1 confirms that, for the unordered sample, using $s_{j-1}$ yields
test statistics having exact t distributions. That notwithstanding, for
our purposes, $s_{j-1}$ is unworkable. First, for unordered samples, an
outlier would be equally likely to appear anywhere in the sequence;
one could not test the first $p+1$ observations at all, and other
observations which appeared early in the unordered sequence would
be tested with few degrees of freedom. Second, for the ordered
sample, the variance is underestimated badly for many of the test
statistics $w_j/s_{j-1}$, leading to far too many rejections (Table 1).
----Insert Table 1 Here----
The variance estimates $s^2$ and $s^2_{(i)}$ can be used with good
results, although we prefer $s^2_{(i)}$. For either one, the size of the
test is essentially the same with and without ordering. When $s^2$ is
used, the test is somewhat conservative. When $s^2_{(i)}$ is used, the
observed size of the test (.048) is seen to be very close to the
nominal $\alpha = .05$. When $s^2_{(i)}$ was used and the observations ordered
by Cook's D or by the COVRATIO, the observed size was also .048
(results not shown). $s^2_{(i)}$ has the further advantage that it is more
robust; if an observation is indeed an outlier, that will inflate the
numerator, but not the denominator, of $w_j/s_{(i)}$. The results which
follow all use $s_{(i)}$.
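A rough Monte Carlo sketch of the size calculation, for the case of ordering by $|e_{si}|$ and scaling by $s_{(i)}$, is given below. It follows the simulation design described in this section (10 sets of x's, 100 samples each, $n = 25$) but is not the authors' program: the function name `recursive_test_stats`, the seed, and the hard-coded critical value 2.074 (the tabulated .975 percentile of Student's t with 22 df) are assumptions. Following the text, the divisor counts all 25 observations per sample as effectively tested.

```python
import numpy as np

def recursive_test_stats(x, y):
    """|w_j / s_(i)| per observation after ordering by |e_si| (simple linear
    regression, p = 2); NaN for the two basis observations."""
    n, p = len(x), 2
    X = np.column_stack([np.ones(n), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)
    e_s = e / np.sqrt(s2 * (1.0 - h))
    s_del = np.sqrt(((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1))
    order = np.argsort(np.abs(e_s))
    Xo, yo = X[order], y[order]
    stat = np.full(n, np.nan)
    for j in range(p, n):
        A_inv = np.linalg.inv(Xo[:j].T @ Xo[:j])
        beta_prev = A_inv @ Xo[:j].T @ yo[:j]
        w_j = (yo[j] - Xo[j] @ beta_prev) / np.sqrt(1.0 + Xo[j] @ A_inv @ Xo[j])
        stat[order[j]] = abs(w_j) / s_del[order[j]]
    return stat

rng = np.random.default_rng(1987)     # arbitrary seed
n, crit = 25, 2.074                   # .975 percentile of t with 22 df
rejections = 0
for _ in range(10):                   # 10 sets of x's
    x = 15 * rng.uniform(size=n)
    for _ in range(100):              # 100 samples per set of x's
        y = x + rng.normal(size=n)    # beta_0 = 0, beta_1 = 1, sigma = 1
        stat = recursive_test_stats(x, y)
        rejections += int(np.sum(stat > crit))   # NaN compares as False
size = rejections / (1000 * n)        # all 25 counted as effectively tested
print(round(size, 3))
```

With a different seed or random number generator the estimate will vary, so it should be compared with the tabulated value only loosely.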
Table 2 summarizes the performance of the recursive method
according to whether the observations are ordered by a diagnostic
measure or not. When the sample was to be "not ordered", a single
outlier was created as follows: A random observation number was
selected from numbers 3-25. This ensured that the outlier would be
tested, not be part of the basis set. An amount $\delta$ was then added to
the generated x value for that observation in place of a simulated
error term; recall $\beta_0 = 0$, $\beta_1 = 1$, and $\sigma = 1$, so using $\delta = 3$, for
example, is equivalent to placing the outlier 3 standard deviations
above the (true) line. In cases where the data would be ordered by
a diagnostic measure, the outlier was created by adding $\delta$ to the 25th
generated x value (without loss of generality as the observations are
reordered) in place of a simulated error. Obvious generalizations
were used in creating as many as 3 outliers in each sample. When
multiple outliers were introduced in a simulated sample, each
$|w_j/s_{(i)}|$, $j = 3, \ldots, 25$, was compared against the .975 percentile of
Student's t with 22 df. Of course, in practice, one would not know
in advance how many outliers were present in the data set. The
entries in Table 2 are the proportions of correctly identified outliers
(NOCORR) and of "good" observations incorrectly identified as outliers
(NOINC). The results show that when ordering is used, the power to
detect outliers (NOCORR) increases by an average of about 6.7
percentage points. Of course, had we not prevented outliers in unordered
samples from falling into the basis set where they would have been
untested and thus undetected, the increase in NOCORR with ordering
would have been far larger. The 6.7-point gain estimates the
increase in probability of detecting an outlier given that it is tested,
which it might not be in an unordered sample. NOINC is unaffected
by ordering, but is less than 5% in every case. This reflects
inflation by the outlier(s) of variance estimates $s^2_{(i)}$ used to scale
the "good" observations. The more outliers the sample contained, the
greater was this effect. When multiple outliers were present,
inflation of the $s_{(i)}$ also contributed to the masking of one outlier by
another, reducing the procedure's power (NOCORR) to detect all
outliers. This masking phenomenon is seen in more detail in Table 3.
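The planted-outlier experiment for ordered samples can be sketched along the same lines. The code below is an assumption-laden illustration, not the authors' program: the names `recursive_test_stats` and `simulate`, the seed, and the hard-coded critical value 2.074 are hypothetical, and the outliers are placed on the last generated x values with $\delta$ replacing the simulated error. It returns estimates in the spirit of NOCORR and NOINC for ordering by $|e_{si}|$.

```python
import numpy as np

def recursive_test_stats(x, y):
    """|w_j / s_(i)| per observation after ordering by |e_si| (p = 2);
    NaN for the two basis observations."""
    n, p = len(x), 2
    X = np.column_stack([np.ones(n), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)
    e_s = e / np.sqrt(s2 * (1.0 - h))
    s_del = np.sqrt(((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1))
    order = np.argsort(np.abs(e_s))
    Xo, yo = X[order], y[order]
    stat = np.full(n, np.nan)
    for j in range(p, n):
        A_inv = np.linalg.inv(Xo[:j].T @ Xo[:j])
        beta_prev = A_inv @ Xo[:j].T @ yo[:j]
        w_j = (yo[j] - Xo[j] @ beta_prev) / np.sqrt(1.0 + Xo[j] @ A_inv @ Xo[j])
        stat[order[j]] = abs(w_j) / s_del[order[j]]
    return stat

def simulate(deltas, n_xsets=10, n_per_set=100, n=25, crit=2.074, seed=0):
    """Estimate (NOCORR, NOINC) when outliers with shifts `deltas` are planted
    on the last len(deltas) generated observations."""
    rng = np.random.default_rng(seed)
    k = len(deltas)
    hits = false_alarms = 0
    for _ in range(n_xsets):
        x = 15 * rng.uniform(size=n)
        for _ in range(n_per_set):
            y = x + rng.normal(size=n)                  # true line y = x
            y[n - k:] = x[n - k:] + np.asarray(deltas)  # delta replaces the error
            flagged = recursive_test_stats(x, y) > crit # NaN compares as False
            hits += int(flagged[n - k:].sum())
            false_alarms += int(flagged[:n - k].sum())
    n_samples = n_xsets * n_per_set
    return hits / (k * n_samples), false_alarms / ((n - k) * n_samples)

# Hypothetical run: a single outlier 3 standard deviations above the line
nocorr, noinc = simulate(deltas=(3.0,))
print(round(nocorr, 3), round(noinc, 3))
```

Passing, for example, `deltas=(3.0, 3.0, -3.0)` gives a three-outlier pattern; because the x's and seed differ from the authors', the estimates should not be expected to reproduce the tabulated values exactly.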
----Insert Table 2 Here----
Figure 1 shows various outlier patterns that might be of
interest. The "good" observations, simulated as above, are assumed
to lie in a pattern suggested by the parallelogram, and the symbol x
represents an outlier. Table 3 summarizes the results of the
simulations for each configuration of data in Fig. 1 in turn. Each
outlier is created by adding a quantity $\delta$ to the x value(s) specified
in Fig. 1. The performance of the recursive method is then evaluated
in Table 3 for increasing $\delta$. The entries in the table are as defined
before, using divisors chosen as in Table 2.
----Insert Figure 1 Here----
In Fig. 1(a), the outlier occurs near the mean of the explanatory
variable. It can be seen that the power of the test gets very close
to 1 for $\delta$ as small as 3. The outlier in Fig. 1(b) is at the extreme of
the range of x. The power to detect this sort of outlier is less than
for the first type, reflecting the greater impact of uncertainty in
estimation of the slope at x values farther from the mean $\bar{x}$. In Fig.
1(c), both outliers are at one extreme of the range of the x values,
with $\delta$'s of opposite signs. Power in this case suffers as the outliers
become more exaggerated (larger $\delta$'s). This reflects the type of
masking discussed earlier wherein each outlier inflates the estimate of
variance, $s^2_{(i)}$, used in the denominator of the test statistic for the
other. In all the cases considered above [Figs. 1(a-c)], the choice of
the ordering variable does not make an appreciable difference in the
power of the recursive method. Ordering by $D_i$ appears to be
somewhat less effective in the first case, but this is the least
important case because (i) power is high and (ii) an undetected
outlier will have little influence on the estimate of the slope.
----Insert Table 3 Here----
Figure 1(d) displays two outliers, one at each end of the range of
the explanatory variable and having $\delta$'s with opposite signs. Figure
1(e) has both outliers at the same end of the range and with $\delta$'s of
the same sign. Figures 1(d) and 1(e) serve as examples where more
serious masking is found, that is, in which the presence of one
outlier is more likely to obscure the presence of another. Now, not
only does each outlier inflate the estimate of error used to test the
other, but also the first outlier in the ordered sample reduces the
size of the recursive residual for the second outlier through its
effect on $\hat{\boldsymbol{\beta}}_{j-1}$ of (2.1). In the case in Fig. 1(c), the first outlier
torqued the fitted line away from the second outlier, increasing its
recursive residual and thus "unmasking" it. In the cases in Figs.
1(d) and 1(e), the line will be torqued toward the second outlier,
reducing its recursive residual and thereby the likelihood that it
will be identified as an outlier. For these two cases, $D_i$ seems to be
superior to $|e_{si}|$, which in turn is preferable to $\mathrm{COVRATIO}_i$ for
ordering.
These simulation results suggest that the recursive method is
very effective in detecting outliers, although not equally so in all
cases. In practice, any one of $|e_{si}|$ or $D_i$ or $\mathrm{COVRATIO}_i$ that is
readily available can be used to rearrange the data before computing
the recursive residuals. The studentized residual, $e_{si}$, is most widely
available and best-known; COVRATIO is least available and least
well-known. What differences we did observe favor using Cook's D
as the ordering variable.
The above simulations explore the properties of the recursive
method per se. Kianifard and Swallow (North Carolina State
University, Institute of Statistics Mimeo Series No. 1906, 1988) compare
the recursive method with some competing outlier-detection
procedures. Two of these, testing the $\max|e_{si}|$ of (1.3) or the
$\max|t_i|$ of (1.4), are commonly seen in applications where there is no
a priori reason to suspect which observations, if any, might be
outliers. In each procedure, whenever the tested observation is
declared to be an outlier, that observation is deleted from the sample,
new $e_{si}$ or $t_i$ are computed, the maximum tested, and so on. Both
procedures are known to be vulnerable to masking and swamping, but
the required computations are very manageable. A third competitor
in the comparison is Marasinghe's (1985) multistage procedure,
particularly designed for multiple-outlier applications.
Head-to-head comparisons of diagnostics or tests for identifying
outliers or influential observations are generally problematic, since
the competitors are often designed for somewhat different purposes,
and may test (somewhat) different hypotheses. That notwithstanding,
our principal conclusion is that the recursive method is generally
superior to these competitors for detecting moderate outliers
($\delta$'s = 3 or 4, or even 5 for some outlier configurations). If one
wants a procedure that has high probability of identifying these
moderate outliers for scrutiny and perhaps special treatment, the
recursive method seems a good choice. For a more detailed reporting
of this comparison, see Kianifard and Swallow.
5. An Example
We illustrate the use of recursive residuals in detecting outliers
on a set of data from Brownlee (1965) that has been used extensively
in the literature. The data appear in Table 4. The observations
were ordered according to $|e_{si}|$ obtained by fitting the regression
model to the data. Recursive residuals $w_j$ were then computed and
scaled by $s_{(i)}$. The resulting $|w_j/s_{(i)}|$ shown in Table 4 are
compared to the percentiles of a t distribution with $n-p-1 = 16$ df.
The appropriate critical value for testing at the 5% level is 2.120. It
can readily be seen from Table 4 that observations 21 and 4 are
identified as data points deserving further scrutiny and, perhaps,
special treatment. Experienced data analysts have applied a variety
of tools to identify outliers in this data set; their conclusions have
generally been that observations 21 and 4, and perhaps 1 and 3,
should be viewed as outliers.
----Insert Table 4 Here----
ACKNOWLEDGEMENTS
The programming assistance of Sandra B. Donaghy is greatly
appreciated by the authors. This research was partially supported
by the North Carolina Agricultural Foundation.
REFERENCES
Atkinson, A. C. (1981). Two Graphical Displays for Outlying and
Influential Observations in Regression. Biometrika 68, 13-20.
Atkinson, A. C. (1985). Plots, Transformations, and Regression. Oxford:
University Press.
Barnett, V., and Lewis, T. (1984). Outliers in Statistical Data,
Second Edition. New York: Wiley.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression
Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: Wiley.
Brown, R. L., Durbin, J., and Evans, J. M. (1975). Techniques for
Testing the Constancy of Regression Relationships over Time.
Journal of the Royal Statistical Society, Series B, 37, 149-192.
Brownlee, K. A. (1965). Statistical Theory and Methodology in
Science and Engineering, Second Edition. New York: Wiley.
Chatterjee, S., and Hadi, A. S. (1986). Influential Observations,
High Leverage Points, and Outliers in Linear Regression.
Statistical Science 1, 379-416.
Chatterjee, S., and Hadi, A. S. (1988). Sensitivity Analysis in Linear
Regression. New York: Wiley.
Cook, R. D. (1977). Detection of Influential Observations in Linear
Regression. Technometrics 19, 15-18.
Cook, R. D., and Weisberg, S. (1982). Residuals and Influence in
Regression. New York: Chapman and Hall.
Du Toit, S. H. C., Steyn, A. G. W., and Stumpf, R. H. (1986). Graphical
Exploratory Data Analysis. New York: Springer-Verlag.
Galpin, J. S., and Hawkins, D. M. (1984). The Use of Recursive
Residuals in Checking Model Fit in Linear Regression. The
American Statistician 38, 94-105.
Harvey, A. C., and Collier, P. (1977). Testing for Functional
Misspecification in Regression Analysis. Journal of Econometrics
6, 103-109.
Harvey, A. C., and Phillips, G. D. A. (1974). A Comparison of the
Power of Some Tests for Heteroskedasticity in the General Linear
Model. Journal of Econometrics 2, 307-316.
Hawkins, D. M. (1980). Identification of Outliers. New York:
Chapman and Hall.
Hedayat, A., and Robson, D. S. (1970). Independent Stepwise Residuals
for Testing Homoscedasticity. Journal of the American Statistical
Association 65, 1573-1581.
Marasinghe, M. G. (1985). A Multistage Procedure for Detecting
Several Outliers in Linear Regression. Technometrics 27, 395-399.
Phillips, G. D. A., and Harvey, A. C. (1974). A Simple Test for Serial
Correlation in Regression Analysis. Journal of the American
Statistical Association 69, 935-939.
Plackett, R. L. (1950). Some Theorems in Least Squares. Biometrika
37, 149-157.
Rousseeuw, P. J., and Leroy, A. M. (1987). Robust Regression and
Outlier Detection. New York: Wiley.
Theil, H. (1971). Principles of Econometrics. New York: Wiley.
Table 1. Size of the test when using each of several estimates of σ, and
with observations ordered by their studentized residuals
or not before calculating recursive residuals.

Estimate of σ     Not Ordered     Ordered by |e_si|
s(i)              .049a           .048c
s                 .036a           .035c
s_{j-1}           .050b           .210c

a Proportion of rejections among 23,000 tested.
b Proportion of rejections among 22,000 tested.
c Proportion of rejections among 25,000 effectively tested.
Table 2. Proportions of outliers correctly identified (NOCORR) and nonoutliers
incorrectly declared to be outliers (NOINC) when up to three outliers
per sample were planted at randomly chosen x at (vertical) distances
δ_i from the true regression line, and the observations were not
ordered, or ordered by their studentized residuals, or values of Cook's D
or the covariance ratio, before calculating recursive residuals.

                                                                       Ordered by
                                                                       Descending
Outlier       Not Ordered      Ordered by |e_si|   Ordered by D_i     COVRATIO_i
Pattern       NOCORR  NOINC    NOCORR  NOINC       NOCORR  NOINC      NOCORR  NOINC
(2.5,0,0)     .662    .031a    .725    .029d       .737    .029d      .722    .030d
(3,0,0)       .848    .025a    .934    .022d       .940    .022d      .929    .022d
(4,0,0)       .935    .015a    .999    .011d       .999    .012d      .999    .011d
(3,3,0)       .703    .015b    .767    .015e       .747    .014e      .757    .016e
(3,3,-3)      .552    .007c    .610    .008f       .625    .007f      .609    .008f

a Proportion of rejections among 22,000 tested.
b Proportion of rejections among 21,000 tested.
c Proportion of rejections among 20,000 tested.
d Proportion of rejections among 24,000 effectively tested.
e Proportion of rejections among 23,000 effectively tested.
f Proportion of rejections among 22,000 effectively tested.
Table 3. Proportions of outliers correctly identified (NOCORR) and nonoutliers
incorrectly declared to be outliers (NOINC) under the outlier patterns
of Figures 1(a-e) with the outliers at (vertical) distances δ_i from
the true regression line, and the observations ordered by their
studentized residuals, or values of Cook's D or the covariance ratio,
before calculating recursive residuals.

(a)
                                                       Ordered by Descending
              Ordered by |e_si|    Ordered by D_i      COVRATIO_i
δ_1           NOCORR  NOINC        NOCORR  NOINC       NOCORR  NOINC
2             .400    .036         .380    .035        .405    .035
2.5           .841    .029         .808    .028        .849    .029
3             .987    .020         .979    .021        .986    .020
3.5           .999    .015         1       .015        .999    .015
4             1       .011         1       .011        1       .010

(b)
                                                       Ordered by Descending
              Ordered by |e_si|    Ordered by D_i      COVRATIO_i
δ_1           NOCORR  NOINC        NOCORR  NOINC       NOCORR  NOINC
2             .341    .037         .339    .036        .328    .037
2.5           .680    .030         .686    .029        .662    .030
3             .904    .023         .919    .022        .892    .023
3.5           .983    .017         .986    .016        .980    .017
4             .998    .012         1       .012        .998    .012

(c)
                                                       Ordered by Descending
              Ordered by |e_si|    Ordered by D_i      COVRATIO_i
(δ_1, δ_2)    NOCORR  NOINC        NOCORR  NOINC       NOCORR  NOINC
(2,-2)        .407    .024         .403    .024        .408    .024
(2.5,-2.5)    .599    .015         .597    .015        .600    .015
(3,-3)        .780    .008         .783    .008        .782    .008
(3.5,-3.5)    .923    .004         .924    .004        .925    .004
(4,-4)        .969    .002         .968    .002        .969    .002

(d)
                                                       Ordered by Descending
              Ordered by |e_si|    Ordered by D_i      COVRATIO_i
(δ_1, δ_2)    NOCORR  NOINC        NOCORR  NOINC       NOCORR  NOINC
(2,-2)        .153    .035         .171    .032        .142    .035
(2.5,-2.5)    .404    .027         .439    .023        .375    .027
(3,-3)        .694    .018         .714    .015        .655    .019
(3.5,-3.5)    .886    .012         .897    .009        .858    .013
(4,-4)        .966    .006         .972    .005        .954    .007

(e)
                                                       Ordered by Descending
              Ordered by |e_si|    Ordered by D_i      COVRATIO_i
(δ_1, δ_2)    NOCORR  NOINC        NOCORR  NOINC       NOCORR  NOINC
(2,2)         .099    .039         .119    .036        .090    .039
(2.5,2.5)     .320    .031         .359    .026        .300    .032
(3,3)         .606    .022         .643    .019        .577    .023
(3.5,3.5)     .808    .015         .848    .011        .787    .016
(4,4)         .931    .009         .955    .006        .917    .011
Table 4. Application of the recursive method to Brownlee's (1965) data

                                            Data Ordered by |e_si|
Observation                                             Observation
Number        X1    X2    X3    Y           w_j/s(i)    Number
 1            80    27    89    42            0         14
 2            80    27    88    37            0         18
 3            75    25    90    37            0         19
 4            62    24    87    28            0         16
 5            62    22    87    18           .1753      10
 6            62    23    87    18           .4278      20
 7            62    24    93    19          -.6332      13
 8            62    24    93    20          -.2210       8
 9            58    23    87    15          -.2533       5
10            58    18    80    14          -.3511      17
11            58    18    89    14           .1802       2
12            58    17    88    13           .5093      15
13            58    18    82    11          -.3775       7
14            58    19    93    12           .4507      11
15            50    18    89     8          -.4720       6
16            50    18    86     7           .2152      12
17            50    19    72     8          -.3596       9
18            50    19    79     8          1.5019       1
19            50    20    80     9          1.3020       3
20            56    20    82    15          2.2758       4
21            70    20    91    15         -3.3305      21

Y = stack loss                             X1 = air flow
X2 = cooling water inlet temperature       X3 = acid concentration
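
The calculation behind Table 4 can be illustrated with a minimal sketch, under assumptions that should be read carefully. The sketch orders the stack loss observations once by the absolute internally studentized residuals from the full least squares fit, uses the first p = 4 ordered observations as the basis, and then computes the usual recursive residuals w_j = (y_j - x_j'b_{j-1}) / sqrt(1 + x_j'(X_{j-1}'X_{j-1})^{-1} x_j). The function names are hypothetical, and the residuals are left unstandardized (no division by s(i)), so the output is not expected to reproduce the w_j/s(i) column of Table 4 exactly.

```python
import numpy as np

# Stack loss data as listed in Table 4: columns X1, X2, X3, Y (observations 1-21).
DATA = np.array([
    [80, 27, 89, 42], [80, 27, 88, 37], [75, 25, 90, 37], [62, 24, 87, 28],
    [62, 22, 87, 18], [62, 23, 87, 18], [62, 24, 93, 19], [62, 24, 93, 20],
    [58, 23, 87, 15], [58, 18, 80, 14], [58, 18, 89, 14], [58, 17, 88, 13],
    [58, 18, 82, 11], [58, 19, 93, 12], [50, 18, 89, 8], [50, 18, 86, 7],
    [50, 19, 72, 8], [50, 19, 79, 8], [50, 20, 80, 9], [56, 20, 82, 15],
    [70, 20, 91, 15],
], dtype=float)


def studentized_residuals(X, y):
    """Internally studentized OLS residuals e_i / (s * sqrt(1 - h_ii))."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages h_ii
    e = y - X @ (XtX_inv @ X.T @ y)              # ordinary residuals
    s = np.sqrt(e @ e / (n - p))
    return e / (s * np.sqrt(1.0 - h))


def recursive_residuals(X, y):
    """Recursive residuals w_{p+1}, ..., w_n for rows taken in the given order."""
    n, p = X.shape
    w = np.empty(n - p)
    for j in range(p, n):
        Xj, yj = X[:j], y[:j]                    # observations preceding row j
        XtX_inv = np.linalg.inv(Xj.T @ Xj)
        b = XtX_inv @ Xj.T @ yj                  # fit on the preceding rows
        x = X[j]
        w[j - p] = (y[j] - x @ b) / np.sqrt(1.0 + x @ XtX_inv @ x)
    return w


# Design matrix with an intercept column, so p = 4.
X = np.column_stack([np.ones(len(DATA)), DATA[:, :3]])
y = DATA[:, 3]

# Assumption: a single ordering by |e_si| from the full fit, smallest first.
order = np.argsort(np.abs(studentized_residuals(X, y)))
w = recursive_residuals(X[order], y[order])

for obs, wj in zip(order[4:] + 1, w):
    print(f"observation {obs:2d}: w = {wj: .4f}")
```

A useful check on any such implementation is that the sum of squared recursive residuals equals the residual sum of squares of the full fit, whatever ordering is used.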
Figure 1. Outlier Patterns for Table 3. [Five scatterplot panels.]
(a) Y1 = 7.5 + δ1
(b) Y1 = 15 + δ1
(c) Y1 = 15 + δ1, Y2 = 15 + δ2
(d) Y1 = 0 + δ1, Y2 = 15 + δ2
(e) Y1 = 15 + δ1, Y2 = 14.95 + δ2