Outliers and influential data points


Page 1: Outliers  and  influential  data points

Outliers and influential data points

Page 2: Outliers  and  influential  data points

The distinction

• An outlier is a data point whose response y does not follow the general trend of the rest of the data.

• A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.

Page 3: Outliers  and  influential  data points

No outliers? No influential data points?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 4: Outliers  and  influential  data points

Any outliers? Any influential data points?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 5: Outliers  and  influential  data points

Any outliers? Any influential data points?

[Figure: scatterplot of y versus x with two fitted lines; x axis 0 to 14, y axis 0 to 70]

y = 1.73 + 5.12 x

y = 2.96 + 5.04 x

Page 6: Outliers  and  influential  data points

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef   SE Coef      T      P
Constant     1.732     1.121   1.55  0.140
x           5.1169    0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.96 + 5.04 x

Predictor     Coef   SE Coef      T      P
Constant     2.958     2.009   1.47  0.157
x           5.0373    0.3633  13.86  0.000

S = 4.711   R-Sq = 91.0%   R-Sq(adj) = 90.5%

Page 7: Outliers  and  influential  data points

Any outliers? Any influential data points?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 8: Outliers  and  influential  data points

Any outliers? Any influential data points?

[Figure: scatterplot of y versus x with two fitted lines; x axis 0 to 14, y axis 0 to 70]

y = 1.73 + 5.12 x

y = 2.47 + 4.93 x

Page 9: Outliers  and  influential  data points

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef   SE Coef      T      P
Constant     1.732     1.121   1.55  0.140
x           5.1169    0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.47 + 4.93 x

Predictor     Coef   SE Coef      T      P
Constant     2.468     1.076   2.29  0.033
x           4.9272    0.1719  28.66  0.000

S = 2.709   R-Sq = 97.7%   R-Sq(adj) = 97.6%

Page 10: Outliers  and  influential  data points

Any outliers? Any influential data points?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 11: Outliers  and  influential  data points

Any outliers? Any influential data points?

[Figure: scatterplot of y versus x with two fitted lines; x axis 0 to 14, y axis 0 to 70]

y = 1.73 + 5.12 x

y = 8.51 + 3.32 x

Page 12: Outliers  and  influential  data points

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor     Coef   SE Coef      T      P
Constant     1.732     1.121   1.55  0.140
x           5.1169    0.2003  25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 8.50 + 3.32 x

Predictor     Coef   SE Coef      T      P
Constant     8.505     4.222   2.01  0.058
x           3.3198    0.6862   4.84  0.000

S = 10.45   R-Sq = 55.2%   R-Sq(adj) = 52.8%

Page 13: Outliers  and  influential  data points

Impact on regression analyses

• Not every outlier strongly influences the regression analysis.

• Always determine if the regression analysis is unduly influenced by one or a few data points.

• Simple plots for simple linear regression.

• Summary measures for multiple linear regression.

Page 14: Outliers  and  influential  data points

The leverages hii

Page 15: Outliers  and  influential  data points

The leverages hii

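The predicted response can be written as a linear combination of the n observed values y1, y2, …, yn:

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n \quad \text{for } i = 1, \ldots, n$$

where the weights hi1, hi2, …, hii, …, hin depend only on the predictor values.

For example:

$$\hat{y}_1 = h_{11} y_1 + h_{12} y_2 + \cdots + h_{1n} y_n$$

$$\hat{y}_2 = h_{21} y_1 + h_{22} y_2 + \cdots + h_{2n} y_n$$

$$\vdots$$

$$\hat{y}_n = h_{n1} y_1 + h_{n2} y_2 + \cdots + h_{nn} y_n$$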

Page 16: Outliers  and  influential  data points

The leverages hii

Because the predicted response can be written as:

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n \quad \text{for } i = 1, \ldots, n$$

the leverage, hii, quantifies the influence that the observed response yi has on its predicted value $\hat{y}_i$.

Page 17: Outliers  and  influential  data points

Properties of the leverages hii

• The leverage hii is:

– a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.

– a number between 0 and 1, inclusive.

• The sum of the hii equals p, the number of parameters.
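These properties can be checked numerically. The following is a minimal sketch, assuming NumPy and a simple linear regression with an intercept; the function name `leverages` and the four x values are illustrative choices, not part of the slides.

```python
import numpy as np

def leverages(x):
    """Diagonal of the hat matrix H = X (X'X)^-1 X' for a simple linear regression."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # design matrix: intercept and slope columns
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

x = [1.0, 2.0, 3.0, 4.0]  # illustrative predictor values (assumption)
h = leverages(x)

print(h)        # larger for x values far from the mean of x
print(h.sum())  # equals p = 2 (intercept and slope), up to rounding
```

The two end points, which lie farthest from the mean of x, receive the largest leverages, and every leverage falls between 0 and 1.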

Page 18: Outliers  and  influential  data points

Any high leverages hii?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 19: Outliers  and  influential  data points

[Figure: dotplot for x; x axis 0 to 9]

sample mean = 4.751

h(1,1) = 0.176   h(11,11) = 0.048   h(21,21) = 0.163

HI1
0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
0.061276  0.050974  0.049628  0.048147  0.049313  0.051829  0.055760
0.069311  0.072580  0.109616  0.127489  0.140453  0.141136  0.163492

Sum of HI1 = 2.0000

Page 20: Outliers  and  influential  data points

Any high leverages hii?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 21: Outliers  and  influential  data points

[Figure: dotplot for x; x axis 0 to 14]

sample mean = 5.227

h(1,1) = 0.153   h(11,11) = 0.048   h(21,21) = 0.358

HI1
0.153481  0.139367  0.116292  0.110382  0.084374  0.077557  0.066879
0.063589  0.050033  0.052121  0.047632  0.048156  0.049557  0.055893
0.057574  0.078121  0.088549  0.096634  0.096227  0.110048  0.357535

Sum of HI1 = 2.0000

Page 22: Outliers  and  influential  data points

Identifying data points whose x values are extreme … and therefore potentially influential

Page 23: Outliers  and  influential  data points

Using leverages to identify extreme x values

Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value

$$\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$$

or greater than 0.99, whichever is smaller.
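The flagging rule can be sketched in a few lines. This is only an illustration of the rule as stated on this slide, not Minitab's actual implementation; the function names are hypothetical, and the leverage values are the HI1 column listed earlier for the data set containing the point (14, 68).

```python
def leverage_cutoff(n, p, multiple=3.0, cap=0.99):
    """Cutoff from the slide: `multiple` times the mean leverage p/n, capped at `cap`."""
    return min(multiple * p / n, cap)

def flag_high_leverage(h, p):
    """Return 1-based observation numbers whose leverage exceeds the cutoff."""
    cutoff = leverage_cutoff(len(h), p)
    return [i for i, h_ii in enumerate(h, start=1) if h_ii > cutoff]

# HI1 values from the slides (n = 21, p = 2)
hi1 = [0.153481, 0.139367, 0.116292, 0.110382, 0.084374, 0.077557, 0.066879,
       0.063589, 0.050033, 0.052121, 0.047632, 0.048156, 0.049557, 0.055893,
       0.057574, 0.078121, 0.088549, 0.096634, 0.096227, 0.110048, 0.357535]

cutoff = leverage_cutoff(len(hi1), p=2)
flagged = flag_high_leverage(hi1, p=2)
print(round(cutoff, 3), flagged)
```

With n = 21 and p = 2 the cutoff is 3(2/21), about 0.286, so only observation 21 (leverage 0.357535) is flagged.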

Page 24: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

    x      y       HI1
14.00  68.00  0.357535

Unusual Observations

Obs     x      y     Fit  SE Fit  Residual  St Resid
 21  14.0  68.00  71.449   1.620    -3.449     -1.59 X

X denotes an observation whose X value gives it large influence.

Page 25: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

    x      y       HI2
13.00  15.00  0.311532

Unusual Observations

Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  13.0  15.00  51.66    5.83    -36.66     -4.23 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Page 26: Outliers  and  influential  data points

Important distinction!

• The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.

• The leverage depends only on the predictor values.

• Whether the data point is influential or not depends on the observed value yi.

Page 27: Outliers  and  influential  data points

Identifying outliers (unusual y values)

Page 28: Outliers  and  influential  data points

Identifying outliers

• Residuals

• Standardized residuals
– also called internally studentized residuals

Page 29: Outliers  and  influential  data points

Residuals

Ordinary residuals defined for each observation, i = 1, …, n:

$$e_i = y_i - \hat{y}_i$$

x  y  FITS1  RESI1
1  2    2.2   -0.2
2  5    4.4    0.6
3  6    6.6   -0.6
4  9    8.8    0.2

Page 30: Outliers  and  influential  data points

Standardized residuals

Standardized residuals defined for each observation, i = 1, …, n:

$$e_i^* = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$

MSE1 = 0.400000

x  y  FITS1  RESI1  HI1     SRES1
1  2    2.2   -0.2  0.7  -0.57735
2  5    4.4    0.6  0.3   1.13389
3  6    6.6   -0.6  0.3  -1.13389
4  9    8.8    0.2  0.7   0.57735
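The SRES1 column can be reproduced from the four (x, y) pairs in the table. The following is a minimal sketch, assuming NumPy and an ordinary least squares fit with an intercept; the variable names are illustrative.

```python
import numpy as np

# Four data points from the slide's table
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)      # least squares coefficients
fits = X @ b
e = y - fits                                   # ordinary residuals e_i
n, p = X.shape
mse = (e @ e) / (n - p)                        # mean squared error
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages h_ii
sres = e / np.sqrt(mse * (1.0 - h))            # standardized residuals

print(fits, e, mse, h, sres)
```

Each residual is divided by its own standard error, which is why the two end points (leverage 0.7) and the two middle points (leverage 0.3) are scaled differently even though MSE is the same for all four.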

Page 31: Outliers  and  influential  data points

Standardized residuals

• Standardized residuals quantify how large the residuals are in standard deviation units.

– An observation with a standardized residual that is larger than 3 (in absolute value) is generally deemed an outlier.

– Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).

Page 32: Outliers  and  influential  data points

An outlier?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 33: Outliers  and  influential  data points

      x        y    FITS1       HI1     s(e)    RESI1     SRES1
0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110

S = 4.711

Unusual Observations

Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89      3.68 R

R denotes an observation with a large standardized residual.

Page 34: Outliers  and  influential  data points

Why should we care? (Regression of y on x with outlier)

The regression equation is y = 2.95763 + 5.03734 x

S = 4.71075   R-Sq = 91.0 %   R-Sq(adj) = 90.5 %

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  4265.82  4265.82  192.230  0.000
Error       19   421.63    22.19
Total       20  4687.46

Page 35: Outliers  and  influential  data points

Why should we care? (Regression of y on x without outlier)

The regression equation is y = 1.73217 + 5.11687 x

S = 2.5919   R-Sq = 97.3 %   R-Sq(adj) = 97.2 %

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  4386.07  4386.07  652.841  0.000
Error       18   120.93     6.72
Total       19  4507.00

Page 36: Outliers  and  influential  data points

Identifying influential data points

Page 37: Outliers  and  influential  data points

Identifying influential data points

• Deleted residuals

• Deleted t residuals
– also called studentized deleted residuals
– also called externally studentized residuals

• Difference in fits, DFITS

• Cook’s distance measure

Page 38: Outliers  and  influential  data points

Basic idea of these four measures

• Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.

• Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.

Page 39: Outliers  and  influential  data points

Deleted residuals

Deleted residual:

$$d_i = y_i - \hat{y}_{(i)}$$

yi = the observed response for the ith observation

$\hat{y}_{(i)}$ = the predicted response for the ith observation based on the estimated model with the ith observation deleted

Page 40: Outliers  and  influential  data points

[Figure: scatterplot of y versus x with two fitted lines; x axis 0 to 10, y axis 0 to 15]

y = 0.6 + 1.55 x

y = 3.82 - 0.13 x

$$\hat{y}_{(4)} = 0.6 + 1.55(10) = 16.1$$

$$y_4 = 2.1$$

$$d_4 = 2.1 - 16.1 = -14$$
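The deleted residual for the fourth point can be checked by refitting without it. The following is a minimal sketch, assuming NumPy; the four data points are the ones tabulated on the deleted t residuals slide, and the helper name `deleted_residual` is hypothetical. The shortcut d_i = e_i / (1 - h_ii) is a standard identity for least squares rather than something stated on the slides.

```python
import numpy as np

# Four data points from the slides' small example
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])

def deleted_residual(x, y, i):
    """y_i minus the prediction at x_i from the line fitted without observation i."""
    keep = np.arange(len(x)) != i
    slope, intercept = np.polyfit(x[keep], y[keep], 1)  # refit on the other n-1 points
    return y[i] - (intercept + slope * x[i])

d4 = deleted_residual(x, y, 3)  # index 3 is the fourth observation

# Shortcut that avoids refitting: d_i = e_i / (1 - h_ii)
X = np.column_stack([np.ones_like(x), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
d_shortcut = e / (1.0 - h)

print(d4, d_shortcut[3])
```

Both routes give a deleted residual of about -14 for the fourth point, even though its ordinary residual is only -0.42: the point pulls the all-data line toward itself.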

Page 41: Outliers  and  influential  data points

Deleted t residuals

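A deleted t residual is just a standardized deleted residual:

$$t_i = \frac{d_i}{s(d_i)} = \frac{d_i}{\sqrt{MSE_{(i)} / (1 - h_{ii})}}$$

where MSE(i) is the mean squared error of the model fitted with the ith observation deleted. Equivalently, $t_i = e_i / \sqrt{MSE_{(i)}(1 - h_{ii})}$.

The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.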
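Deleted t residuals can be computed without n separate refits. The following is a minimal sketch, assuming NumPy; it relies on the standard least squares identity SSE(i) = SSE - e_i^2 / (1 - h_ii), which is not stated on the slides, and the function name is hypothetical. The data are the four points of the slides' small example.

```python
import numpy as np

def deleted_t_residuals(x, y):
    """Externally studentized residuals for a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary residuals
    sse = e @ e
    # MSE with observation i deleted, on (n - 1) - p degrees of freedom
    mse_del = (sse - e**2 / (1.0 - h)) / (n - 1 - p)
    return e / np.sqrt(mse_del * (1.0 - h))

tres = deleted_t_residuals([1, 2, 3, 10], [2.1, 3.8, 5.2, 2.1])
print(np.round(tres, 4))
```

For this example the fourth value is close to -19.8, far out in the tail of a t distribution with (4-1)-2 = 1 degree of freedom.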

Page 42: Outliers  and  influential  data points

[Figure: scatterplot of y versus x with two fitted lines; x axis 0 to 10, y axis 0 to 15]

y = 0.6 + 1.55 x

y = 3.82 - 0.13 x

 x    y  RESI1     TRES1
 1  2.1  -1.59   -1.7431
 2  3.8   0.24    0.1217
 3  5.2   1.77    1.6361
10  2.1  -0.42  -19.7990

Page 43: Outliers  and  influential  data points

[Figure: density of the t(1) distribution; horizontal axis t(1) from -4 to 4, vertical axis density from 0.0 to 0.3]

Do any of the deleted t residuals stick out like a sore thumb?

Page 44: Outliers  and  influential  data points

[Figure: scatterplot of y versus x with two fitted lines; x axis 0 to 14, y axis 0 to 70]

y = 1.73 + 5.12 x

y = 2.96 + 5.04 x

Row        x        y    RESI1     SRES1     TRES1
  1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
  2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
  3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
 19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
 20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
 21  4.00000  40.0000  16.8930   3.68110   6.69012

Page 45: Outliers  and  influential  data points

Do any of the deleted t residuals stick out like a sore thumb?

[Figure: density of the t(18) distribution; horizontal axis t(18) from -3 to 3, vertical axis density from 0.0 to 0.3]

Page 46: Outliers  and  influential  data points

DFITS

The difference in fits:

$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}$$

is the number of standard deviations that the fitted value changes when the ith case is omitted.

Page 47: Outliers  and  influential  data points

Using DFITS

An observation is deemed influential …

… if the absolute value of its DFITS value is greater than:

$$2\sqrt{\frac{p + 1}{n - p - 1}}$$

… or if the absolute value of its DFITS value sticks out like a sore thumb from the other DFITS values.
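The definition and the cutoff can be sketched directly, refitting the model once per deleted observation. This is a minimal illustration assuming NumPy; the function names are hypothetical, and the four-point data set from the deleted residuals example is reused only to keep the sketch small.

```python
import numpy as np

def dfits(x, y):
    """DFITS_i = (yhat_i - yhat_(i)) / sqrt(MSE_(i) * h_ii), by explicit refitting."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    p = X.shape[1]
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    fit_all = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_del = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # fit without case i
        resid_del = y[keep] - X[keep] @ b_del
        mse_del = (resid_del @ resid_del) / (n - 1 - p)
        out[i] = (fit_all[i] - X[i] @ b_del) / np.sqrt(mse_del * h[i])
    return out

def dfits_cutoff(n, p):
    """Cutoff from the slide: 2 * sqrt((p + 1) / (n - p - 1))."""
    return 2.0 * np.sqrt((p + 1) / (n - p - 1))

values = dfits([1, 2, 3, 10], [2.1, 3.8, 5.2, 2.1])
print(np.round(values, 2))
print(round(dfits_cutoff(21, 2), 2))  # cutoff for n = 21, p = 2
```

For n = 21 and p = 2 the cutoff works out to about 0.82. In the four-point example the last observation has by far the largest DFITS in absolute value.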

Page 48: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$2\sqrt{\frac{p + 1}{n - p - 1}} = 2\sqrt{\frac{2 + 1}{21 - 2 - 1}} = 0.82$$

    x      y     DFIT1
14.00  68.00  -1.23841

Page 49: Outliers  and  influential  data points

Row        x        y     DFIT1
  1   0.1000  -0.0716  -0.52503
  2   0.4540   4.1673  -0.08388
  3   1.0977   6.5703  -0.18232
  4   1.2794  13.8150   0.75898
  5   2.2061  11.4501  -0.21823
  6   2.5006  12.9554  -0.20155
  7   3.0403  20.1575   0.27774
  8   3.2358  17.5633  -0.08230
  9   4.4531  26.0317   0.13865
 10   4.1699  22.7573  -0.02221
 11   5.2847  26.3030  -0.18487
 12   5.5924  30.6885   0.05523
 13   5.9209  33.9402   0.19741
 14   6.6607  30.9228  -0.42449
 15   6.7995  34.1100  -0.17249
 16   7.9794  44.4536   0.29918
 17   8.4154  46.5022   0.30960
 18   8.7161  50.0568   0.63049
 19   8.7016  46.5475   0.14948
 20   9.1646  45.7762  -0.25094
 21  14.0000  68.0000  -1.23841

Page 50: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$2\sqrt{\frac{p + 1}{n - p - 1}} = 2\sqrt{\frac{2 + 1}{21 - 2 - 1}} = 0.82$$

    x      y     DFIT2
13.00  15.00  -11.4670

Page 51: Outliers  and  influential  data points

Row        x        y     DFIT2
  1   0.1000  -0.0716   -0.4028
  2   0.4540   4.1673   -0.2438
  3   1.0977   6.5703   -0.2058
  4   1.2794  13.8150    0.0376
  5   2.2061  11.4501   -0.1314
  6   2.5006  12.9554   -0.1096
  7   3.0403  20.1575    0.0405
  8   3.2358  17.5633   -0.0424
  9   4.4531  26.0317    0.0602
 10   4.1699  22.7573    0.0092
 11   5.2847  26.3030    0.0054
 12   5.5924  30.6885    0.0782
 13   5.9209  33.9402    0.1278
 14   6.6607  30.9228    0.0072
 15   6.7995  34.1100    0.0731
 16   7.9794  44.4536    0.2805
 17   8.4154  46.5022    0.3236
 18   8.7161  50.0568    0.4361
 19   8.7016  46.5475    0.3089
 20   9.1646  45.7762    0.2492
 21  13.0000  15.0000  -11.4670

Page 52: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$2\sqrt{\frac{p + 1}{n - p - 1}} = 2\sqrt{\frac{2 + 1}{21 - 2 - 1}} = 0.82$$

   x      y    DFIT3
4.00  40.00  1.55050

Page 53: Outliers  and  influential  data points

Row        x        y     DFIT3
  1  0.10000  -0.0716  -0.37897
  2  0.45401   4.1673  -0.10501
  3  1.09765   6.5703  -0.16248
  4  1.27936  13.8150   0.36737
  5  2.20611  11.4501  -0.17547
  6  2.50064  12.9554  -0.16377
  7  3.04030  20.1575   0.10670
  8  3.23583  17.5633  -0.09265
  9  4.45308  26.0317   0.03061
 10  4.16990  22.7573  -0.05850
 11  5.28474  26.3030  -0.16025
 12  5.59238  30.6885  -0.02183
 13  5.92091  33.9402   0.05988
 14  6.66066  30.9228  -0.34036
 15  6.79953  34.1100  -0.18835
 16  7.97943  44.4536   0.10017
 17  8.41536  46.5022   0.09771
 18  8.71607  50.0568   0.29275
 19  8.70156  46.5475  -0.02188
 20  9.16463  45.7762  -0.33969
 21  4.00000  40.0000   1.55050

Page 54: Outliers  and  influential  data points

Cook’s distance

Cook’s distance:

$$D_i = \frac{(y_i - \hat{y}_i)^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

• Di depends on both residual ei and leverage hii.

• Di summarizes how much all of the estimated coefficients change when the ith observation is deleted.

• A large Di indicates yi has a strong influence on the estimated coefficients.
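Because Di depends only on the residual, the leverage, MSE and p, it can be recomputed from the Minitab output shown on earlier slides. The following is a minimal sketch in plain Python; the function name is hypothetical, and the inputs are the rounded Residual, HI and S values reported in the deck for the three flagged points, so the results match the tabulated Cook's distances only approximately.

```python
def cooks_distance(e, h, mse, p):
    """D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2."""
    return (e ** 2) / (p * mse) * h / (1.0 - h) ** 2

p = 2  # intercept and slope

# (residual, leverage, S) taken from the slides; MSE = S^2
cook1 = cooks_distance(-3.449, 0.357535, 2.709 ** 2, p)     # point (14, 68)
cook2 = cooks_distance(-36.66, 0.311532, 10.45 ** 2, p)     # point (13, 15)
cook3 = cooks_distance(16.8930, 0.050974, 4.71075 ** 2, p)  # point (4, 40)

print(round(cook1, 2), round(cook2, 2), round(cook3, 2))
```

The point (13, 15) combines a large residual with high leverage and gets by far the largest distance, while (4, 40) has a large residual but low leverage, and (14, 68) has high leverage but a small residual.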

Page 55: Outliers  and  influential  data points

Effect on estimates of removing each data point one at a time?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 56: Outliers  and  influential  data points

Effect on estimates of removing each data point one at a time?

[Figure: estimated slope (b1) plotted against estimated intercept (b0); horizontal axis 0 to 10, vertical axis 0 to 6; the estimate from all data is labeled "All data"]

Page 57: Outliers  and  influential  data points

Effect on estimates of removing each data point one at a time?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

Page 58: Outliers  and  influential  data points

Effect on estimates of removing each data point one at a time?

[Figure: estimated slope (b1) plotted against estimated intercept (b0); horizontal axis 0 to 10, vertical axis 0 to 6; labeled points "All data" and "With (13,15) removed"]

Page 59: Outliers  and  influential  data points

Using Cook’s distance

• If Di is greater than 1, then the ith data point is worthy of further investigation.

• If Di is greater than 4, then the ith data point is most certainly influential.

• Or, if Di sticks out like a sore thumb from the other Di values, it is most certainly influential.

Page 60: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

    x      y     COOK1
14.00  68.00  0.701960

Page 61: Outliers  and  influential  data points

Row        x        y     COOK1
  1   0.1000  -0.0716  0.134156
  2   0.4540   4.1673  0.003705
  3   1.0977   6.5703  0.017302
  4   1.2794  13.8150  0.241688
  5   2.2061  11.4501  0.024434
  6   2.5006  12.9554  0.020879
  7   3.0403  20.1575  0.038414
  8   3.2358  17.5633  0.003555
  9   4.4531  26.0317  0.009944
 10   4.1699  22.7573  0.000260
 11   5.2847  26.3030  0.017379
 12   5.5924  30.6885  0.001605
 13   5.9209  33.9402  0.019747
 14   6.6607  30.9228  0.081345
 15   6.7995  34.1100  0.015290
 16   7.9794  44.4536  0.044621
 17   8.4154  46.5022  0.047961
 18   8.7161  50.0568  0.173897
 19   8.7016  46.5475  0.011657
 20   9.1646  45.7762  0.032320
 21  14.0000  68.0000  0.701960

Page 62: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

    x      y    COOK2
13.00  15.00  4.04801

Page 63: Outliers  and  influential  data points

Row        x        y    COOK2
  1   0.1000  -0.0716  0.08172
  2   0.4540   4.1673  0.03076
  3   1.0977   6.5703  0.02198
  4   1.2794  13.8150  0.00075
  5   2.2061  11.4501  0.00901
  6   2.5006  12.9554  0.00629
  7   3.0403  20.1575  0.00086
  8   3.2358  17.5633  0.00095
  9   4.4531  26.0317  0.00191
 10   4.1699  22.7573  0.00004
 11   5.2847  26.3030  0.00002
 12   5.5924  30.6885  0.00320
 13   5.9209  33.9402  0.00848
 14   6.6607  30.9228  0.00003
 15   6.7995  34.1100  0.00280
 16   7.9794  44.4536  0.03958
 17   8.4154  46.5022  0.05229
 18   8.7161  50.0568  0.09180
 19   8.7016  46.5475  0.04809
 20   9.1646  45.7762  0.03194
 21  13.0000  15.0000  4.04801

Page 64: Outliers  and  influential  data points

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

   x      y    COOK3
4.00  40.00  0.36391

Page 65: Outliers  and  influential  data points

Row        x        y     COOK3
  1  0.10000  -0.0716  0.073075
  2  0.45401   4.1673  0.005801
  3  1.09765   6.5703  0.013793
  4  1.27936  13.8150  0.067493
  5  2.20611  11.4501  0.015960
  6  2.50064  12.9554  0.013909
  7  3.04030  20.1575  0.005955
  8  3.23583  17.5633  0.004498
  9  4.45308  26.0317  0.000494
 10  4.16990  22.7573  0.001799
 11  5.28474  26.3030  0.013191
 12  5.59238  30.6885  0.000251
 13  5.92091  33.9402  0.001886
 14  6.66066  30.9228  0.056276
 15  6.79953  34.1100  0.018263
 16  7.97943  44.4536  0.005272
 17  8.41536  46.5022  0.005020
 18  8.71607  50.0568  0.043959
 19  8.70156  46.5475  0.000253
 20  9.16463  45.7762  0.058966
 21  4.00000  40.0000  0.363914

Page 66: Outliers  and  influential  data points

A strategy for dealing with problematic data points

• Don’t forget that the above methods are just statistical tools. It’s okay to use common sense and knowledge about the situation.

• First, check for obvious data errors.
– If a data entry error, simply correct it.
– If not representative of the population, delete it.
– If a procedural error invalidates the measurement, delete it.

Page 67: Outliers  and  influential  data points

A comment about deleting data points

• Do not delete data just because they do not fit your preconceived regression model.

• You must have a good, objective reason for deleting data points.

• If you delete any data after you’ve collected it, justify and describe it in your reports.

• If not sure what to do about a data point, analyze the data twice, once with and once without the point, and report both results.

Page 68: Outliers  and  influential  data points

A strategy for dealing with problematic data points (cont’d)

• Then, consider model misspecification.
– Any important variables missing?
– Any nonlinearity that needs to be modeled?
– Any missing interaction terms?

• If nonlinearity is an issue, one possibility is to reduce the scope of the model and fit a linear model.