Outliers and influential data points
The distinction
• An outlier is a data point whose response y does not follow the general trend of the rest of the data.
• A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.
No outliers? No influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12 x (without the blue data point) and y = 2.96 + 5.04 x (with the blue data point)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef  SE Coef      T      P
Constant     1.732    1.121   1.55  0.140
x           5.1169   0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.96 + 5.04 x

Predictor     Coef  SE Coef      T      P
Constant     2.958    2.009   1.47  0.157
x           5.0373   0.3633  13.86  0.000

S = 4.711   R-Sq = 91.0%   R-Sq(adj) = 90.5%
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12 x (without the blue data point) and y = 2.47 + 4.93 x (with the blue data point)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef  SE Coef      T      P
Constant     1.732    1.121   1.55  0.140
x           5.1169   0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.47 + 4.93 x

Predictor     Coef  SE Coef      T      P
Constant     2.468    1.076   2.29  0.033
x           4.9272   0.1719  28.66  0.000

S = 2.709   R-Sq = 97.7%   R-Sq(adj) = 97.6%
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12 x (without the blue data point) and y = 8.51 + 3.32 x (with the blue data point)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef  SE Coef      T      P
Constant     1.732    1.121   1.55  0.140
x           5.1169   0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 8.50 + 3.32 x

Predictor     Coef  SE Coef      T      P
Constant     8.505    4.222   2.01  0.058
x           3.3198   0.6862   4.84  0.000

S = 10.45   R-Sq = 55.2%   R-Sq(adj) = 52.8%
Impact on regression analyses
• Not every outlier strongly influences the regression analysis.
• Always determine if the regression analysis is unduly influenced by one or a few data points.
• Simple plots for simple linear regression.
• Summary measures for multiple linear regression.
The leverages hii

The predicted response can be written as a linear combination of the n observed values y1, y2, …, yn:

ŷ_i = h_i1 y_1 + h_i2 y_2 + … + h_ii y_i + … + h_in y_n,   for i = 1, …, n

where the weights h_i1, h_i2, …, h_ii, …, h_in depend only on the predictor values.

For example:

ŷ_1 = h_11 y_1 + h_12 y_2 + … + h_1n y_n
ŷ_2 = h_21 y_1 + h_22 y_2 + … + h_2n y_n
⋮
ŷ_n = h_n1 y_1 + h_n2 y_2 + … + h_nn y_n

Because the predicted response can be written this way, the leverage h_ii quantifies the influence that the observed response y_i has on its own predicted value ŷ_i.
Properties of the leverages hii
• The leverage hii is:
– a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
– a number between 0 and 1, inclusive.
• The sum of the hii equals p, the number of parameters.
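These properties can be checked numerically. Below is a minimal sketch (Python with numpy assumed, not part of the lecture's Minitab workflow); the four-point dataset is the small example used later in the residuals slides. The leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ.

```python
import numpy as np

# Leverages are the diagonal of the hat matrix H = X (X'X)^{-1} X'.
# Small worked example: simple linear regression, p = 2 parameters.
x = np.array([1.0, 2.0, 3.0, 4.0])

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverages h_ii

print(np.round(h, 3))      # [0.7 0.3 0.3 0.7] -- each between 0 and 1
print(round(h.sum(), 10))  # 2.0 -- the sum equals p, the number of parameters
```

Note that the leverages depend only on the x values; the responses never enter the calculation.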
Any high leverages hii?
[Figure: scatterplot of y versus x, and a dotplot of the x values; sample mean of x = 4.751]

h(1,1) = 0.176   h(11,11) = 0.048   h(21,21) = 0.163

HI1: 0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
     0.061276  0.050974  0.049628  0.048147  0.049313  0.051829  0.055760
     0.069311  0.072580  0.109616  0.127489  0.140453  0.141136  0.163492

Sum of HI1 = 2.0000
Any high leverages hii?
[Figure: scatterplot of y versus x, and a dotplot of the x values; sample mean of x = 5.227]

h(1,1) = 0.153   h(11,11) = 0.048   h(21,21) = 0.358

HI1: 0.153481  0.139367  0.116292  0.110382  0.084374  0.077557  0.066879
     0.063589  0.050033  0.052121  0.047632  0.048156  0.049557  0.055893
     0.057574  0.078121  0.088549  0.096634  0.096227  0.110048  0.357535

Sum of HI1 = 2.0000
Identifying data points whose x values are extreme … and therefore potentially influential
Using leverages to identify extreme x values
Minitab flags any observation whose leverage value, h_ii, is more than 3 times larger than the mean leverage value:

h̄ = (Σ h_ii) / n = p / n,   so flag any h_ii > 3(p/n)

… or whose leverage is greater than 0.99 (whichever cutoff is smaller).
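This flagging rule is easy to code directly. A sketch in Python/numpy (`flag_high_leverage` is a hypothetical helper name, not a Minitab function; the leverage values are the HI1 column from the n = 21 high-leverage example):

```python
import numpy as np

def flag_high_leverage(h, p):
    """Flag observations whose leverage exceeds min(3 * p/n, 0.99),
    mirroring the rule described above."""
    n = len(h)
    cutoff = min(3 * p / n, 0.99)
    return np.flatnonzero(np.asarray(h) > cutoff), cutoff

# Leverages from the n = 21, p = 2 example: cutoff = 3(2/21) = 0.286
h = [0.153481, 0.139367, 0.116292, 0.110382, 0.084374, 0.077557, 0.066879,
     0.063589, 0.050033, 0.052121, 0.047632, 0.048156, 0.049557, 0.055893,
     0.057574, 0.078121, 0.088549, 0.096634, 0.096227, 0.110048, 0.357535]
idx, cutoff = flag_high_leverage(h, p=2)
print(idx, round(cutoff, 3))   # [20] 0.286 -- only the point with h = 0.358
```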
[Figure: scatterplot of y versus x with the point (14, 68) far to the right]

3(p/n) = 3(2/21) = 0.286

Unusual Observations
Obs      x      y     Fit  SE Fit  Residual  St Resid
 21   14.0  68.00  71.449   1.620    -3.449     -1.59 X

X denotes an observation whose X value gives it large influence.

     x      y       HI1
 14.00  68.00  0.357535
[Figure: scatterplot of y versus x with the point (13, 15) far to the right and well below the trend]

3(p/n) = 3(2/21) = 0.286

     x      y       HI2
 13.00  15.00  0.311532

Unusual Observations
Obs      x      y    Fit  SE Fit  Residual  St Resid
 21   13.0  15.00  51.66    5.83    -36.66     -4.23 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Important distinction!
• The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.
• The leverage depends only on the predictor values.
• Whether the data point is influential or not depends on the observed value yi.
Identifying outliers (unusual y values)
Identifying outliers
• Residuals
• Standardized residuals
  – also called internally studentized residuals
Residuals
Ordinary residuals are defined for each observation, i = 1, …, n:

e_i = y_i − ŷ_i

x  y  FITS1  RESI1
1  2    2.2   -0.2
2  5    4.4    0.6
3  6    6.6   -0.6
4  9    8.8    0.2
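The FITS1 and RESI1 columns can be reproduced with an ordinary least-squares fit; a minimal numpy sketch:

```python
import numpy as np

# Fit y on x by least squares, then take e_i = y_i - yhat_i.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])

b1, b0 = np.polyfit(x, y, 1)   # slope, intercept
fits = b0 + b1 * x             # the FITS1 column
resid = y - fits               # the RESI1 column

print(np.round(fits, 1))    # [2.2 4.4 6.6 8.8]
print(np.round(resid, 1))   # [-0.2  0.6 -0.6  0.2]
```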
Standardized residuals
Standardized residuals are defined for each observation, i = 1, …, n:

r_i = e_i / s(e_i) = e_i / sqrt( MSE × (1 − h_ii) )

MSE1 = 0.400000

x  y  FITS1  RESI1  HI1     SRES1
1  2    2.2   -0.2  0.7  -0.57735
2  5    4.4    0.6  0.3   1.13389
3  6    6.6   -0.6  0.3  -1.13389
4  9    8.8    0.2  0.7   0.57735
Standardized residuals
• Standardized residuals quantify how large the residuals are in standard deviation units.
  – An observation with a standardized residual larger than 3 (in absolute value) is generally deemed an outlier.
  – Recall that Minitab flags any observation with a standardized residual larger than 2 (in absolute value).
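A sketch of the same calculation in Python (numpy assumed), reproducing the MSE, leverage, and SRES1 values from the four-point example:

```python
import numpy as np

# Internally studentized (standardized) residuals:
#   r_i = e_i / sqrt(MSE * (1 - h_ii))
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                      # leverages (HI1 column)

e = y - H @ y                       # ordinary residuals
mse = (e @ e) / (n - p)             # MSE = SSE / (n - p)
r = e / np.sqrt(mse * (1 - h))      # standardized residuals (SRES1 column)

print(round(mse, 1))   # 0.4
print(np.round(r, 5))  # [-0.57735  1.13389 -1.13389  0.57735]
flagged = np.abs(r) > 2             # Minitab's R-flag rule
```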
An outlier?
[Figure: scatterplot of y versus x with the outlier (4, 40) well above the trend line]

       x        y    FITS1       HI1     s(e)    RESI1     SRES1
 0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
 0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
 1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
 1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
 2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
 ...
 8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
 9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
 4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110

S = 4.711

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89      3.68 R

R denotes an observation with a large standardized residual.
Why should we care?(Regression of y on x with outlier)
The regression equation is y = 2.95763 + 5.03734 x
S = 4.71075   R-Sq = 91.0%   R-Sq(adj) = 90.5%

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  4265.82  4265.82  192.230  0.000
Error       19   421.63    22.19
Total       20  4687.46
Why should we care?(Regression of y on x without outlier)
The regression equation is y = 1.73217 + 5.11687 x
S = 2.5919   R-Sq = 97.3%   R-Sq(adj) = 97.2%

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  4386.07  4386.07  652.841  0.000
Error       18   120.93     6.72
Total       19  4507.00
Identifying influential data points
Identifying influential data points
• Deleted residuals
• Deleted t residuals
  – also called studentized deleted residuals
  – also called externally studentized residuals
• Difference in fits, DFITS
• Cook’s distance measure
Basic idea of these four measures
• Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.
• Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.
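The delete-refit-compare recipe can be sketched directly (Python/numpy assumed; the dataset is the four-point example with the extreme point at x = 10 used in the deleted-residual slides):

```python
import numpy as np

# Leave-one-out sketch: refit with each observation deleted, then
# compare the held-out response with its deleted-fit prediction.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])

d = np.empty(len(x))                  # deleted residuals d_i = y_i - yhat_(i)
for i in range(len(x)):
    keep = np.arange(len(x)) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)   # refit without observation i
    d[i] = y[i] - (b0 + b1 * x[i])

print(np.round(d, 1))   # the last entry is d_4 = 2.1 - 16.1 = -14
```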
Deleted residuals

The deleted residual is:

d_i = y_i − ŷ_(i)

where y_i is the observed response for the ith observation, and ŷ_(i) is the predicted response for the ith observation based on the model estimated with the ith observation deleted.
[Figure: scatterplot of y versus x with fitted lines y = 0.6 + 1.55 x (without the point at x = 10) and y = 3.82 − 0.13 x (with it)]

ŷ_(4) = 0.6 + 1.55(10) = 16.1
y_4 = 2.1
d_4 = 2.1 − 16.1 = −14
Deleted t residuals

A deleted t residual is just a standardized deleted residual:

t_i = d_i / s(d_i) = e_i / sqrt( MSE_(i) × (1 − h_ii) )

The deleted t residuals follow a t distribution with (n − 1) − p degrees of freedom.
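A sketch of the deleted t residuals, computed two ways (Python/numpy assumed): directly from each leave-one-out fit via t_i = e_i / sqrt(MSE_(i)(1 − h_ii)), and through the standard identity t_i = r_i × sqrt((n − p − 1)/(n − p − r_i²)), which avoids refitting:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                        # ordinary residuals from the full fit
mse = (e @ e) / (n - p)
r = e / np.sqrt(mse * (1 - h))       # standardized (internal) residuals

t = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)
    resid_i = y[keep] - (b0 + b1 * x[keep])
    mse_i = (resid_i @ resid_i) / ((n - 1) - p)   # MSE with obs i deleted
    t[i] = e[i] / np.sqrt(mse_i * (1 - h[i]))

# No-refit identity for the same quantity:
t_identity = r * np.sqrt((n - p - 1) / (n - p - r**2))

print(np.round(t, 4))   # matches TRES1: -1.7431, 0.1217, 1.6361, -19.7990
```

The extreme value for the point at x = 10 (about −19.8) shows how differently this observation behaves once it no longer influences its own fit.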
[Figure: scatterplot of y versus x with fitted lines y = 0.6 + 1.55 x and y = 3.82 − 0.13 x]

  x    y  RESI1     TRES1
  1  2.1  -1.59   -1.7431
  2  3.8   0.24    0.1217
  3  5.2   1.77    1.6361
 10  2.1  -0.42  -19.7990
[Figure: density plot of the t(1) distribution]
Do any of the deleted t residuals stick out like a sore thumb?
[Figure: scatterplot of y versus x with fitted lines y = 1.73 + 5.12 x and y = 2.96 + 5.04 x]

Row        x        y    RESI1     SRES1     TRES1
  1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
  2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
  3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
 19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
 20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
 21  4.00000  40.0000  16.8930   3.68110   6.69012
Do any of the deleted t residuals stick out like a sore thumb?
[Figure: density plot of the t(18) distribution]
DFITS

The difference in fits:

DFITS_i = ( ŷ_i − ŷ_(i) ) / sqrt( MSE_(i) × h_ii )

is the number of standard deviations that the fitted value ŷ_i changes when the ith case is omitted.
Using DFITS
An observation is deemed influential …
… if the absolute value of its DFITS value is greater than:

2 × sqrt( (p + 1) / (n − p − 1) )

… or if the absolute value of its DFITS value sticks out like a sore thumb from the other DFITS values.
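A sketch of the DFITS calculation and the flagging rule (Python/numpy assumed; same four-point example with the extreme point at x = 10):

```python
import numpy as np

# DFITS_i = (yhat_i - yhat_(i)) / sqrt(MSE_(i) * h_ii), computed from
# an explicit leave-one-out refit for each observation.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
fits = H @ y                                     # yhat from the full fit

dfits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)
    resid = y[keep] - (b0 + b1 * x[keep])
    mse_i = (resid @ resid) / ((n - 1) - p)      # MSE with obs i deleted
    fit_i = b0 + b1 * x[i]                       # yhat_(i)
    dfits[i] = (fits[i] - fit_i) / np.sqrt(mse_i * h[i])

cutoff = 2 * np.sqrt((p + 1) / (n - p - 1))      # flag rule from the slide
flagged = np.flatnonzero(np.abs(dfits) > cutoff)
print(flagged)   # [3] -- only the high-leverage point at x = 10
```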
[Figure: scatterplot of y versus x with the high-leverage point (14, 68)]

2 sqrt((p + 1)/(n − p − 1)) = 2 sqrt((2 + 1)/(21 − 2 − 1)) = 0.82

     x      y     DFIT1
 14.00  68.00  -1.23841

Row        x        y     DFIT1
  1   0.1000  -0.0716  -0.52503
  2   0.4540   4.1673  -0.08388
  3   1.0977   6.5703  -0.18232
  4   1.2794  13.8150   0.75898
  5   2.2061  11.4501  -0.21823
  6   2.5006  12.9554  -0.20155
  7   3.0403  20.1575   0.27774
  8   3.2358  17.5633  -0.08230
  9   4.4531  26.0317   0.13865
 10   4.1699  22.7573  -0.02221
 11   5.2847  26.3030  -0.18487
 12   5.5924  30.6885   0.05523
 13   5.9209  33.9402   0.19741
 14   6.6607  30.9228  -0.42449
 15   6.7995  34.1100  -0.17249
 16   7.9794  44.4536   0.29918
 17   8.4154  46.5022   0.30960
 18   8.7161  50.0568   0.63049
 19   8.7016  46.5475   0.14948
 20   9.1646  45.7762  -0.25094
 21  14.0000  68.0000  -1.23841
[Figure: scatterplot of y versus x with the point (13, 15)]

2 sqrt((p + 1)/(n − p − 1)) = 2 sqrt((2 + 1)/(21 − 2 − 1)) = 0.82

     x      y     DFIT2
 13.00  15.00  -11.4670

Row        x        y    DFIT2
  1   0.1000  -0.0716  -0.4028
  2   0.4540   4.1673  -0.2438
  3   1.0977   6.5703  -0.2058
  4   1.2794  13.8150   0.0376
  5   2.2061  11.4501  -0.1314
  6   2.5006  12.9554  -0.1096
  7   3.0403  20.1575   0.0405
  8   3.2358  17.5633  -0.0424
  9   4.4531  26.0317   0.0602
 10   4.1699  22.7573   0.0092
 11   5.2847  26.3030   0.0054
 12   5.5924  30.6885   0.0782
 13   5.9209  33.9402   0.1278
 14   6.6607  30.9228   0.0072
 15   6.7995  34.1100   0.0731
 16   7.9794  44.4536   0.2805
 17   8.4154  46.5022   0.3236
 18   8.7161  50.0568   0.4361
 19   8.7016  46.5475   0.3089
 20   9.1646  45.7762   0.2492
 21  13.0000  15.0000  -11.4670
[Figure: scatterplot of y versus x with the outlier (4, 40)]

2 sqrt((p + 1)/(n − p − 1)) = 2 sqrt((2 + 1)/(21 − 2 − 1)) = 0.82

    x      y    DFIT3
 4.00  40.00  1.55050

Row        x        y     DFIT3
  1  0.10000  -0.0716  -0.37897
  2  0.45401   4.1673  -0.10501
  3  1.09765   6.5703  -0.16248
  4  1.27936  13.8150   0.36737
  5  2.20611  11.4501  -0.17547
  6  2.50064  12.9554  -0.16377
  7  3.04030  20.1575   0.10670
  8  3.23583  17.5633  -0.09265
  9  4.45308  26.0317   0.03061
 10  4.16990  22.7573  -0.05850
 11  5.28474  26.3030  -0.16025
 12  5.59238  30.6885  -0.02183
 13  5.92091  33.9402   0.05988
 14  6.66066  30.9228  -0.34036
 15  6.79953  34.1100  -0.18835
 16  7.97943  44.4536   0.10017
 17  8.41536  46.5022   0.09771
 18  8.71607  50.0568   0.29275
 19  8.70156  46.5475  -0.02188
 20  9.16463  45.7762  -0.33969
 21  4.00000  40.0000   1.55050
Cook’s distance

Cook’s distance is:

D_i = ( (y_i − ŷ_i)² / (p × MSE) ) × ( h_ii / (1 − h_ii)² )
• Di depends on both residual ei and leverage hii.
• Di summarizes how much each of the estimated coefficients change when deleting the ith observation.
• A large Di indicates yi has a strong influence on the estimated coefficients.
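Cook's distance can be computed from the closed form above and cross-checked against its definition as the scaled shift in all n fitted values when observation i is deleted (Python/numpy sketch, same four-point example with the extreme point at x = 10):

```python
import numpy as np

# Closed form: D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2)
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
mse = (e @ e) / (n - p)

D = e**2 * h / (p * mse * (1 - h)**2)

# Definition: D_i = sum_j (yhat_j - yhat_j(i))^2 / (p * MSE),
# where yhat_j(i) comes from the fit with observation i deleted.
fits = H @ y
D_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)
    fits_i = b0 + b1 * x                 # fitted values from the LOO model
    D_def[i] = ((fits - fits_i) ** 2).sum() / (p * mse)

print(np.round(D, 3))
```

The two computations agree exactly; the extreme point at x = 10 dominates because its leverage term h_ii / (1 − h_ii)² is enormous even though its residual is modest.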
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of y versus x]
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of the estimated slope (b1) versus the estimated intercept (b0) from each leave-one-out fit, with the all-data estimate labeled "All data"]
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of y versus x]
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of the estimated slope (b1) versus the estimated intercept (b0) from each leave-one-out fit; the fit with (13, 15) removed stands apart from the "All data" estimate and the rest]
Using Cook’s distance
• If Di is greater than 1, then the ith data point is worthy of further investigation.
• If Di is greater than 4, then the ith data point is most certainly influential.
• Or, if Di sticks out like a sore thumb from the other Di values, it is most certainly influential.
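A sketch of these rules of thumb as code (Python/numpy assumed; `flag_cooks` is a hypothetical helper name, and the sample D values are illustrative, with one point near the COOK2 = 4.05 case):

```python
import numpy as np

def flag_cooks(D, investigate=1.0, influential=4.0):
    """Apply the rules of thumb above, assuming the cutoffs 1 and 4 given
    in these notes (other references use different thresholds)."""
    D = np.asarray(D)
    return np.flatnonzero(D > investigate), np.flatnonzero(D > influential)

check, certain = flag_cooks([0.134, 0.004, 0.242, 4.048])
print(check, certain)   # [3] [3]
```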
[Figure: scatterplot of y versus x with the high-leverage point (14, 68)]

     x      y     COOK1
 14.00  68.00  0.701960

Row        x        y     COOK1
  1   0.1000  -0.0716  0.134156
  2   0.4540   4.1673  0.003705
  3   1.0977   6.5703  0.017302
  4   1.2794  13.8150  0.241688
  5   2.2061  11.4501  0.024434
  6   2.5006  12.9554  0.020879
  7   3.0403  20.1575  0.038414
  8   3.2358  17.5633  0.003555
  9   4.4531  26.0317  0.009944
 10   4.1699  22.7573  0.000260
 11   5.2847  26.3030  0.017379
 12   5.5924  30.6885  0.001605
 13   5.9209  33.9402  0.019747
 14   6.6607  30.9228  0.081345
 15   6.7995  34.1100  0.015290
 16   7.9794  44.4536  0.044621
 17   8.4154  46.5022  0.047961
 18   8.7161  50.0568  0.173897
 19   8.7016  46.5475  0.011657
 20   9.1646  45.7762  0.032320
 21  14.0000  68.0000  0.701960
[Figure: scatterplot of y versus x with the point (13, 15)]

     x      y    COOK2
 13.00  15.00  4.04801

Row        x        y    COOK2
  1   0.1000  -0.0716  0.08172
  2   0.4540   4.1673  0.03076
  3   1.0977   6.5703  0.02198
  4   1.2794  13.8150  0.00075
  5   2.2061  11.4501  0.00901
  6   2.5006  12.9554  0.00629
  7   3.0403  20.1575  0.00086
  8   3.2358  17.5633  0.00095
  9   4.4531  26.0317  0.00191
 10   4.1699  22.7573  0.00004
 11   5.2847  26.3030  0.00002
 12   5.5924  30.6885  0.00320
 13   5.9209  33.9402  0.00848
 14   6.6607  30.9228  0.00003
 15   6.7995  34.1100  0.00280
 16   7.9794  44.4536  0.03958
 17   8.4154  46.5022  0.05229
 18   8.7161  50.0568  0.09180
 19   8.7016  46.5475  0.04809
 20   9.1646  45.7762  0.03194
 21  13.0000  15.0000  4.04801
[Figure: scatterplot of y versus x with the outlier (4, 40)]

    x      y    COOK3
 4.00  40.00  0.36391

Row        x        y     COOK3
  1  0.10000  -0.0716  0.073075
  2  0.45401   4.1673  0.005801
  3  1.09765   6.5703  0.013793
  4  1.27936  13.8150  0.067493
  5  2.20611  11.4501  0.015960
  6  2.50064  12.9554  0.013909
  7  3.04030  20.1575  0.005955
  8  3.23583  17.5633  0.004498
  9  4.45308  26.0317  0.000494
 10  4.16990  22.7573  0.001799
 11  5.28474  26.3030  0.013191
 12  5.59238  30.6885  0.000251
 13  5.92091  33.9402  0.001886
 14  6.66066  30.9228  0.056276
 15  6.79953  34.1100  0.018263
 16  7.97943  44.4536  0.005272
 17  8.41536  46.5022  0.005020
 18  8.71607  50.0568  0.043959
 19  8.70156  46.5475  0.000253
 20  9.16463  45.7762  0.058966
 21  4.00000  40.0000  0.363914
A strategy for dealing with problematic data points
• Don’t forget that the above methods are just statistical tools. It’s okay to use common sense and knowledge about the situation.
• First, check for obvious data errors.
  – If there is a data entry error, simply correct it.
  – If the observation is not representative of the population, delete it.
  – If a procedural error invalidates the measurement, delete it.
A comment about deleting data points
• Do not delete data just because they do not fit your preconceived regression model.
• You must have a good, objective reason for deleting data points.
• If you delete any data after collecting it, justify and describe the deletion in your reports.
• If you are not sure what to do about a data point, analyze the data twice — once with and once without it — and report both sets of results.
A strategy for dealing with problematic data points (cont’d)
• Then, consider model misspecification.
  – Are any important predictor variables missing?
  – Is there any nonlinearity that needs to be modeled?
  – Are any interaction terms missing?
• If nonlinearity is an issue, one possibility is to reduce the scope of the model and fit a linear model over the narrower range.