Outliers and influential data points
The distinction
• An outlier is a data point whose response y does not follow the general trend of the rest of the data.
• A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.
No outliers? No influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12 x (without the blue data point) and y = 2.96 + 5.04 x (with the blue data point)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef  SE Coef      T      P
Constant     1.732    1.121   1.55  0.140
x           5.1169   0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.96 + 5.04 x

Predictor     Coef  SE Coef      T      P
Constant     2.958    2.009   1.47  0.157
x           5.0373   0.3633  13.86  0.000

S = 4.711   R-Sq = 91.0%   R-Sq(adj) = 90.5%
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12 x (without the blue data point) and y = 2.47 + 4.93 x (with the blue data point)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef  SE Coef      T      P
Constant     1.732    1.121   1.55  0.140
x           5.1169   0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.47 + 4.93 x

Predictor     Coef  SE Coef      T      P
Constant     2.468    1.076   2.29  0.033
x           4.9272   0.1719  28.66  0.000

S = 2.709   R-Sq = 97.7%   R-Sq(adj) = 97.6%
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x]
Any outliers? Any influential data points?
[Figure: scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12 x (without the blue data point) and y = 8.51 + 3.32 x (with the blue data point)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef  SE Coef      T      P
Constant     1.732    1.121   1.55  0.140
x           5.1169   0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 8.50 + 3.32 x

Predictor     Coef  SE Coef      T      P
Constant     8.505    4.222   2.01  0.058
x           3.3198   0.6862   4.84  0.000

S = 10.45   R-Sq = 55.2%   R-Sq(adj) = 52.8%
Impact on regression analyses
• Not every outlier strongly influences the regression analysis.
• Always determine if the regression analysis is unduly influenced by one or a few data points.
• Simple plots for simple linear regression.
• Summary measures for multiple linear regression.
The leverages hii

The predicted response can be written as a linear combination of the n observed values y1, y2, …, yn:

ŷ_i = h_i1 y_1 + h_i2 y_2 + … + h_ii y_i + … + h_in y_n,   for i = 1, …, n

where the weights h_i1, h_i2, …, h_ii, …, h_in depend only on the predictor values.

For example:

ŷ_1 = h_11 y_1 + h_12 y_2 + … + h_1n y_n
ŷ_2 = h_21 y_1 + h_22 y_2 + … + h_2n y_n
⋮
ŷ_n = h_n1 y_1 + h_n2 y_2 + … + h_nn y_n

Because the predicted response can be written this way, the leverage h_ii quantifies the influence that the observed response y_i has on its own predicted value ŷ_i.
Properties of the leverages hii
• The leverage hii is:
– a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
– a number between 0 and 1, inclusive.
• The sum of the hii equals p, the number of parameters.
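These properties can be checked numerically. Below is a minimal sketch (Python with numpy assumed, not part of the lecture's Minitab workflow); the four-point dataset is the small example used later in the residuals slides. The leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ.

```python
import numpy as np

# Leverages are the diagonal of the hat matrix H = X (X'X)^{-1} X'.
# Small worked example: simple linear regression, p = 2 parameters.
x = np.array([1.0, 2.0, 3.0, 4.0])

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverages h_ii

print(np.round(h, 3))      # [0.7 0.3 0.3 0.7] -- each between 0 and 1
print(round(h.sum(), 10))  # 2.0 -- the sum equals p, the number of parameters
```

Note that the leverages depend only on the x values; the responses never enter the calculation.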
Any high leverages hii?
[Figure: scatterplot of y versus x, and a dotplot of the x values; sample mean of x = 4.751]

h(1,1) = 0.176   h(11,11) = 0.048   h(21,21) = 0.163

HI1: 0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
     0.061276  0.050974  0.049628  0.048147  0.049313  0.051829  0.055760
     0.069311  0.072580  0.109616  0.127489  0.140453  0.141136  0.163492

Sum of HI1 = 2.0000
Any high leverages hii?
[Figure: scatterplot of y versus x, and a dotplot of the x values; sample mean of x = 5.227]

h(1,1) = 0.153   h(11,11) = 0.048   h(21,21) = 0.358

HI1: 0.153481  0.139367  0.116292  0.110382  0.084374  0.077557  0.066879
     0.063589  0.050033  0.052121  0.047632  0.048156  0.049557  0.055893
     0.057574  0.078121  0.088549  0.096634  0.096227  0.110048  0.357535

Sum of HI1 = 2.0000
Identifying data points whose x values are extreme … and therefore potentially influential
Using leverages to identify extreme x values
Minitab flags any observation whose leverage value, h_ii, is more than 3 times larger than the mean leverage value:

h̄ = (Σ h_ii) / n = p / n,   so flag any h_ii > 3(p/n)

… or whose leverage is greater than 0.99 (whichever cutoff is smaller).
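This flagging rule is easy to code directly. A sketch in Python/numpy (`flag_high_leverage` is a hypothetical helper name, not a Minitab function; the leverage values are the HI1 column from the n = 21 high-leverage example):

```python
import numpy as np

def flag_high_leverage(h, p):
    """Flag observations whose leverage exceeds min(3 * p/n, 0.99),
    mirroring the rule described above."""
    n = len(h)
    cutoff = min(3 * p / n, 0.99)
    return np.flatnonzero(np.asarray(h) > cutoff), cutoff

# Leverages from the n = 21, p = 2 example: cutoff = 3(2/21) = 0.286
h = [0.153481, 0.139367, 0.116292, 0.110382, 0.084374, 0.077557, 0.066879,
     0.063589, 0.050033, 0.052121, 0.047632, 0.048156, 0.049557, 0.055893,
     0.057574, 0.078121, 0.088549, 0.096634, 0.096227, 0.110048, 0.357535]
idx, cutoff = flag_high_leverage(h, p=2)
print(idx, round(cutoff, 3))   # [20] 0.286 -- only the point with h = 0.358
```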
[Figure: scatterplot of y versus x with the point (14, 68) far to the right]

3(p/n) = 3(2/21) = 0.286

Unusual Observations
Obs      x      y     Fit  SE Fit  Residual  St Resid
 21   14.0  68.00  71.449   1.620    -3.449     -1.59 X

X denotes an observation whose X value gives it large influence.

     x      y       HI1
 14.00  68.00  0.357535
[Figure: scatterplot of y versus x with the point (13, 15) far to the right and well below the trend]

3(p/n) = 3(2/21) = 0.286

     x      y       HI2
 13.00  15.00  0.311532

Unusual Observations
Obs      x      y    Fit  SE Fit  Residual  St Resid
 21   13.0  15.00  51.66    5.83    -36.66     -4.23 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Important distinction!
• The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.
• The leverage depends only on the predictor values.
• Whether the data point is influential or not depends on the observed value yi.
Identifying outliers (unusual y values)
Identifying outliers
• Residuals
• Standardized residuals
  – also called internally studentized residuals
Residuals
Ordinary residuals are defined for each observation, i = 1, …, n:

e_i = y_i − ŷ_i

x  y  FITS1  RESI1
1  2    2.2   -0.2
2  5    4.4    0.6
3  6    6.6   -0.6
4  9    8.8    0.2
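The FITS1 and RESI1 columns can be reproduced with an ordinary least-squares fit; a minimal numpy sketch:

```python
import numpy as np

# Fit y on x by least squares, then take e_i = y_i - yhat_i.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])

b1, b0 = np.polyfit(x, y, 1)   # slope, intercept
fits = b0 + b1 * x             # the FITS1 column
resid = y - fits               # the RESI1 column

print(np.round(fits, 1))    # [2.2 4.4 6.6 8.8]
print(np.round(resid, 1))   # [-0.2  0.6 -0.6  0.2]
```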
Standardized residuals
Standardized residuals are defined for each observation, i = 1, …, n:

r_i = e_i / s(e_i) = e_i / sqrt( MSE × (1 − h_ii) )

MSE1 = 0.400000

x  y  FITS1  RESI1  HI1     SRES1
1  2    2.2   -0.2  0.7  -0.57735
2  5    4.4    0.6  0.3   1.13389
3  6    6.6   -0.6  0.3  -1.13389
4  9    8.8    0.2  0.7   0.57735
Standardized residuals
• Standardized residuals quantify how large the residuals are in standard deviation units.
  – An observation with a standardized residual larger than 3 (in absolute value) is generally deemed an outlier.
  – Recall that Minitab flags any observation with a standardized residual larger than 2 (in absolute value).
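A sketch of the same calculation in Python (numpy assumed), reproducing the MSE, leverage, and SRES1 values from the four-point example:

```python
import numpy as np

# Internally studentized (standardized) residuals:
#   r_i = e_i / sqrt(MSE * (1 - h_ii))
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                      # leverages (HI1 column)

e = y - H @ y                       # ordinary residuals
mse = (e @ e) / (n - p)             # MSE = SSE / (n - p)
r = e / np.sqrt(mse * (1 - h))      # standardized residuals (SRES1 column)

print(round(mse, 1))   # 0.4
print(np.round(r, 5))  # [-0.57735  1.13389 -1.13389  0.57735]
flagged = np.abs(r) > 2             # Minitab's R-flag rule
```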
An outlier?
[Figure: scatterplot of y versus x with the outlier (4, 40) well above the trend line]

       x        y    FITS1       HI1     s(e)    RESI1     SRES1
 0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
 0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
 1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
 1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
 2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
 ...
 8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
 9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
 4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110

S = 4.711

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89      3.68 R

R denotes an observation with a large standardized residual.
Why should we care?(Regression of y on x with outlier)
The regression equation is y = 2.95763 + 5.03734 x
S = 4.71075   R-Sq = 91.0%   R-Sq(adj) = 90.5%

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  4265.82  4265.82  192.230  0.000
Error       19   421.63    22.19
Total       20  4687.46
Why should we care?(Regression of y on x without outlier)
The regression equation is y = 1.73217 + 5.11687 x
S = 2.5919   R-Sq = 97.3%   R-Sq(adj) = 97.2%

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  4386.07  4386.07  652.841  0.000
Error       18   120.93     6.72
Total       19  4507.00
Identifying influential data points
Identifying influential data points
• Deleted residuals
• Deleted t residuals
  – also called studentized deleted residuals
  – also called externally studentized residuals
• Difference in fits, DFITS
• Cook’s distance measure
Basic idea of these four measures
• Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.
• Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.
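The delete-refit-compare recipe can be sketched directly (Python/numpy assumed; the dataset is the four-point example with the extreme point at x = 10 used in the deleted-residual slides):

```python
import numpy as np

# Leave-one-out sketch: refit with each observation deleted, then
# compare the held-out response with its deleted-fit prediction.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])

d = np.empty(len(x))                  # deleted residuals d_i = y_i - yhat_(i)
for i in range(len(x)):
    keep = np.arange(len(x)) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)   # refit without observation i
    d[i] = y[i] - (b0 + b1 * x[i])

print(np.round(d, 1))   # the last entry is d_4 = 2.1 - 16.1 = -14
```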
Deleted residuals

The deleted residual is:

d_i = y_i − ŷ_(i)

where y_i is the observed response for the ith observation, and ŷ_(i) is the predicted response for the ith observation based on the model estimated with the ith observation deleted.
[Figure: scatterplot of y versus x with fitted lines y = 0.6 + 1.55 x (without the point at x = 10) and y = 3.82 − 0.13 x (with it)]

ŷ_(4) = 0.6 + 1.55(10) = 16.1
y_4 = 2.1
d_4 = 2.1 − 16.1 = −14
Deleted t residuals

A deleted t residual is just a standardized deleted residual:

t_i = d_i / s(d_i) = e_i / sqrt( MSE_(i) × (1 − h_ii) )

The deleted t residuals follow a t distribution with (n − 1) − p degrees of freedom.
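A sketch of the deleted t residuals, computed two ways (Python/numpy assumed): directly from each leave-one-out fit via t_i = e_i / sqrt(MSE_(i)(1 − h_ii)), and through the standard identity t_i = r_i × sqrt((n − p − 1)/(n − p − r_i²)), which avoids refitting:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                        # ordinary residuals from the full fit
mse = (e @ e) / (n - p)
r = e / np.sqrt(mse * (1 - h))       # standardized (internal) residuals

t = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)
    resid_i = y[keep] - (b0 + b1 * x[keep])
    mse_i = (resid_i @ resid_i) / ((n - 1) - p)   # MSE with obs i deleted
    t[i] = e[i] / np.sqrt(mse_i * (1 - h[i]))

# No-refit identity for the same quantity:
t_identity = r * np.sqrt((n - p - 1) / (n - p - r**2))

print(np.round(t, 4))   # matches TRES1: -1.7431, 0.1217, 1.6361, -19.7990
```

The extreme value for the point at x = 10 (about −19.8) shows how differently this observation behaves once it no longer influences its own fit.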
[Figure: scatterplot of y versus x with fitted lines y = 0.6 + 1.55 x and y = 3.82 − 0.13 x]

  x    y  RESI1     TRES1
  1  2.1  -1.59   -1.7431
  2  3.8   0.24    0.1217
  3  5.2   1.77    1.6361
 10  2.1  -0.42  -19.7990
[Figure: density plot of the t(1) distribution]
Do any of the deleted t residuals stick out like a sore thumb?
[Figure: scatterplot of y versus x with fitted lines y = 1.73 + 5.12 x and y = 2.96 + 5.04 x]

Row        x        y    RESI1     SRES1     TRES1
  1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
  2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
  3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
 19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
 20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
 21  4.00000  40.0000  16.8930   3.68110   6.69012
Do any of the deleted t residuals stick out like a sore thumb?
[Figure: density plot of the t(18) distribution]
DFITS

The difference in fits:

DFITS_i = ( ŷ_i − ŷ_(i) ) / sqrt( MSE_(i) × h_ii )

is the number of standard deviations that the fitted value ŷ_i changes when the ith case is omitted.
Using DFITS
An observation is deemed influential …
… if the absolute value of its DFITS value is greater than:

2 × sqrt( (p + 1) / (n − p − 1) )

… or if the absolute value of its DFITS value sticks out like a sore thumb from the other DFITS values.
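A sketch of the DFITS calculation and the flagging rule (Python/numpy assumed; same four-point example with the extreme point at x = 10):

```python
import numpy as np

# DFITS_i = (yhat_i - yhat_(i)) / sqrt(MSE_(i) * h_ii), computed from
# an explicit leave-one-out refit for each observation.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
fits = H @ y                                     # yhat from the full fit

dfits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)
    resid = y[keep] - (b0 + b1 * x[keep])
    mse_i = (resid @ resid) / ((n - 1) - p)      # MSE with obs i deleted
    fit_i = b0 + b1 * x[i]                       # yhat_(i)
    dfits[i] = (fits[i] - fit_i) / np.sqrt(mse_i * h[i])

cutoff = 2 * np.sqrt((p + 1) / (n - p - 1))      # flag rule from the slide
flagged = np.flatnonzero(np.abs(dfits) > cutoff)
print(flagged)   # [3] -- only the high-leverage point at x = 10
```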
[Figure: scatterplot of y versus x with the high-leverage point (14, 68)]

2 sqrt((p + 1)/(n − p − 1)) = 2 sqrt((2 + 1)/(21 − 2 − 1)) = 0.82

     x      y     DFIT1
 14.00  68.00  -1.23841

Row        x        y     DFIT1
  1   0.1000  -0.0716  -0.52503
  2   0.4540   4.1673  -0.08388
  3   1.0977   6.5703  -0.18232
  4   1.2794  13.8150   0.75898
  5   2.2061  11.4501  -0.21823
  6   2.5006  12.9554  -0.20155
  7   3.0403  20.1575   0.27774
  8   3.2358  17.5633  -0.08230
  9   4.4531  26.0317   0.13865
 10   4.1699  22.7573  -0.02221
 11   5.2847  26.3030  -0.18487
 12   5.5924  30.6885   0.05523
 13   5.9209  33.9402   0.19741
 14   6.6607  30.9228  -0.42449
 15   6.7995  34.1100  -0.17249
 16   7.9794  44.4536   0.29918
 17   8.4154  46.5022   0.30960
 18   8.7161  50.0568   0.63049
 19   8.7016  46.5475   0.14948
 20   9.1646  45.7762  -0.25094
 21  14.0000  68.0000  -1.23841
[Figure: scatterplot of y versus x with the point (13, 15)]

2 sqrt((p + 1)/(n − p − 1)) = 2 sqrt((2 + 1)/(21 − 2 − 1)) = 0.82

     x      y     DFIT2
 13.00  15.00  -11.4670

Row        x        y    DFIT2
  1   0.1000  -0.0716  -0.4028
  2   0.4540   4.1673  -0.2438
  3   1.0977   6.5703  -0.2058
  4   1.2794  13.8150   0.0376
  5   2.2061  11.4501  -0.1314
  6   2.5006  12.9554  -0.1096
  7   3.0403  20.1575   0.0405
  8   3.2358  17.5633  -0.0424
  9   4.4531  26.0317   0.0602
 10   4.1699  22.7573   0.0092
 11   5.2847  26.3030   0.0054
 12   5.5924  30.6885   0.0782
 13   5.9209  33.9402   0.1278
 14   6.6607  30.9228   0.0072
 15   6.7995  34.1100   0.0731
 16   7.9794  44.4536   0.2805
 17   8.4154  46.5022   0.3236
 18   8.7161  50.0568   0.4361
 19   8.7016  46.5475   0.3089
 20   9.1646  45.7762   0.2492
 21  13.0000  15.0000  -11.4670
[Figure: scatterplot of y versus x with the outlier (4, 40)]

2 sqrt((p + 1)/(n − p − 1)) = 2 sqrt((2 + 1)/(21 − 2 − 1)) = 0.82

    x      y    DFIT3
 4.00  40.00  1.55050

Row        x        y     DFIT3
  1  0.10000  -0.0716  -0.37897
  2  0.45401   4.1673  -0.10501
  3  1.09765   6.5703  -0.16248
  4  1.27936  13.8150   0.36737
  5  2.20611  11.4501  -0.17547
  6  2.50064  12.9554  -0.16377
  7  3.04030  20.1575   0.10670
  8  3.23583  17.5633  -0.09265
  9  4.45308  26.0317   0.03061
 10  4.16990  22.7573  -0.05850
 11  5.28474  26.3030  -0.16025
 12  5.59238  30.6885  -0.02183
 13  5.92091  33.9402   0.05988
 14  6.66066  30.9228  -0.34036
 15  6.79953  34.1100  -0.18835
 16  7.97943  44.4536   0.10017
 17  8.41536  46.5022   0.09771
 18  8.71607  50.0568   0.29275
 19  8.70156  46.5475  -0.02188
 20  9.16463  45.7762  -0.33969
 21  4.00000  40.0000   1.55050
Cook’s distance

Cook’s distance is:

D_i = ( (y_i − ŷ_i)² / (p × MSE) ) × ( h_ii / (1 − h_ii)² )
• Di depends on both residual ei and leverage hii.
• Di summarizes how much each of the estimated coefficients change when deleting the ith observation.
• A large Di indicates yi has a strong influence on the estimated coefficients.
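Cook's distance can be computed from the closed form above and cross-checked against its definition as the scaled shift in all n fitted values when observation i is deleted (Python/numpy sketch, same four-point example with the extreme point at x = 10):

```python
import numpy as np

# Closed form: D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2)
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
mse = (e @ e) / (n - p)

D = e**2 * h / (p * mse * (1 - h)**2)

# Definition: D_i = sum_j (yhat_j - yhat_j(i))^2 / (p * MSE),
# where yhat_j(i) comes from the fit with observation i deleted.
fits = H @ y
D_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)
    fits_i = b0 + b1 * x                 # fitted values from the LOO model
    D_def[i] = ((fits - fits_i) ** 2).sum() / (p * mse)

print(np.round(D, 3))
```

The two computations agree exactly; the extreme point at x = 10 dominates because its leverage term h_ii / (1 − h_ii)² is enormous even though its residual is modest.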
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of y versus x]
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of the estimated slope (b1) versus the estimated intercept (b0) from each leave-one-out fit, with the all-data estimate labeled "All data"]
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of y versus x]
Effect on estimates of removing each data point one at a time?
[Figure: scatterplot of the estimated slope (b1) versus the estimated intercept (b0) from each leave-one-out fit; the fit with (13, 15) removed stands apart from the "All data" estimate and the rest]
Using Cook’s distance
• If Di is greater than 1, then the ith data point is worthy of further investigation.
• If Di is greater than 4, then the ith data point is most certainly influential.
• Or, if Di sticks out like a sore thumb from the other Di values, it is most certainly influential.
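A sketch of these rules of thumb as code (Python/numpy assumed; `flag_cooks` is a hypothetical helper name, and the sample D values are illustrative, with one point near the COOK2 = 4.05 case):

```python
import numpy as np

def flag_cooks(D, investigate=1.0, influential=4.0):
    """Apply the rules of thumb above, assuming the cutoffs 1 and 4 given
    in these notes (other references use different thresholds)."""
    D = np.asarray(D)
    return np.flatnonzero(D > investigate), np.flatnonzero(D > influential)

check, certain = flag_cooks([0.134, 0.004, 0.242, 4.048])
print(check, certain)   # [3] [3]
```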
[Figure: scatterplot of y versus x with the high-leverage point (14, 68)]

     x      y     COOK1
 14.00  68.00  0.701960

Row        x        y     COOK1
  1   0.1000  -0.0716  0.134156
  2   0.4540   4.1673  0.003705
  3   1.0977   6.5703  0.017302
  4   1.2794  13.8150  0.241688
  5   2.2061  11.4501  0.024434
  6   2.5006  12.9554  0.020879
  7   3.0403  20.1575  0.038414
  8   3.2358  17.5633  0.003555
  9   4.4531  26.0317  0.009944
 10   4.1699  22.7573  0.000260
 11   5.2847  26.3030  0.017379
 12   5.5924  30.6885  0.001605
 13   5.9209  33.9402  0.019747
 14   6.6607  30.9228  0.081345
 15   6.7995  34.1100  0.015290
 16   7.9794  44.4536  0.044621
 17   8.4154  46.5022  0.047961
 18   8.7161  50.0568  0.173897
 19   8.7016  46.5475  0.011657
 20   9.1646  45.7762  0.032320
 21  14.0000  68.0000  0.701960
[Figure: scatterplot of y versus x with the point (13, 15)]

     x      y    COOK2
 13.00  15.00  4.04801

Row        x        y    COOK2
  1   0.1000  -0.0716  0.08172
  2   0.4540   4.1673  0.03076
  3   1.0977   6.5703  0.02198
  4   1.2794  13.8150  0.00075
  5   2.2061  11.4501  0.00901
  6   2.5006  12.9554  0.00629
  7   3.0403  20.1575  0.00086
  8   3.2358  17.5633  0.00095
  9   4.4531  26.0317  0.00191
 10   4.1699  22.7573  0.00004
 11   5.2847  26.3030  0.00002
 12   5.5924  30.6885  0.00320
 13   5.9209  33.9402  0.00848
 14   6.6607  30.9228  0.00003
 15   6.7995  34.1100  0.00280
 16   7.9794  44.4536  0.03958
 17   8.4154  46.5022  0.05229
 18   8.7161  50.0568  0.09180
 19   8.7016  46.5475  0.04809
 20   9.1646  45.7762  0.03194
 21  13.0000  15.0000  4.04801
[Figure: scatterplot of y versus x with the outlier (4, 40)]

    x      y    COOK3
 4.00  40.00  0.36391

Row        x        y     COOK3
  1  0.10000  -0.0716  0.073075
  2  0.45401   4.1673  0.005801
  3  1.09765   6.5703  0.013793
  4  1.27936  13.8150  0.067493
  5  2.20611  11.4501  0.015960
  6  2.50064  12.9554  0.013909
  7  3.04030  20.1575  0.005955
  8  3.23583  17.5633  0.004498
  9  4.45308  26.0317  0.000494
 10  4.16990  22.7573  0.001799
 11  5.28474  26.3030  0.013191
 12  5.59238  30.6885  0.000251
 13  5.92091  33.9402  0.001886
 14  6.66066  30.9228  0.056276
 15  6.79953  34.1100  0.018263
 16  7.97943  44.4536  0.005272
 17  8.41536  46.5022  0.005020
 18  8.71607  50.0568  0.043959
 19  8.70156  46.5475  0.000253
 20  9.16463  45.7762  0.058966
 21  4.00000  40.0000  0.363914
A strategy for dealing with problematic data points
• Don’t forget that the above methods are just statistical tools. It’s okay to use common sense and knowledge about the situation.
• First, check for obvious data errors.
  – If there is a data entry error, simply correct it.
  – If the observation is not representative of the population, delete it.
  – If a procedural error invalidates the measurement, delete it.
A comment about deleting data points
• Do not delete data just because they do not fit your preconceived regression model.
• You must have a good, objective reason for deleting data points.
• If you delete any data after collecting it, justify and describe the deletion in your reports.
• If you are not sure what to do about a data point, analyze the data twice — once with and once without it — and report both sets of results.
A strategy for dealing with problematic data points (cont’d)
• Then, consider model misspecification.
  – Are any important predictor variables missing?
  – Is there any nonlinearity that needs to be modeled?
  – Are any interaction terms missing?
• If nonlinearity is an issue, one possibility is to reduce the scope of the model and fit a linear model over the narrower range.