

G89.2229 Multiple Regression Week 12 (Monday)

• Quality Control and Critical Evaluation of Regression Results

• An example

• Identifying Residuals

• Leverage: X(X’X)⁻¹X’

• Residuals (Discrepancies)


Quality Control and Critical Evaluation of Regression Results

• Multiple regression programs have the potential to hide details of the data
  » Regression output provides only summary information
  » When several variables are considered, bivariate plots may not be immediately revealing

• Regression diagnostic indicators can help with quality control
  » Which subjects are not well fit: Discrepancy or Residuals
  » Which subjects are affecting results: Influence Points


An example

• Suppose X were a measure of SES advantage, such as years of education, W is an indicator of social disadvantage, such as immigrant status, and Y is the number of stressful life events

• Nothing necessarily jumps out from these data as being amiss.

X   W   Y
6   0   8
7   1   10
7   1   14
8   1   11
8   1   15
9   0   9
9   0   11
9   1   12
9   1   13
10  0   8
10  0   20
11  0   5
11  0   7
11  0   7
11  0   8
11  1   9
11  1   10
12  1   10
15  0   6
16  1   11


Example, Continued

• Here are the regression results

• Here is the plot of the fit

Regression Statistics
Multiple R         0.465
R Square           0.216
Adjusted R Square  0.124
Observations       20

ANOVA
            df    SS       MS      F
Regression   2    49.56    24.78   2.34
Residual    17   179.64    10.57
Total       19   229.20

           Coefficients  Standard Error  t Stat  P-value
Intercept     12.70          3.28         3.87   0.0012
X             -0.37          0.30        -1.22   0.2387
W              2.42          1.46         1.65   0.1168

[Figure: X Line Fit Plot, showing Y and Predicted Y plotted against X]
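The fit above can be checked directly outside the regression program. Below is a minimal sketch in Python with NumPy (an illustration, not part of the original slides) that refits Y on X and W from the 20 rows listed earlier; the coefficients and standard errors should reproduce the table above up to rounding.

```python
import numpy as np

# The 20 observations from the example slide: columns X, W, Y
data = np.array([
    [6, 0, 8],   [7, 1, 10],  [7, 1, 14],  [8, 1, 11],  [8, 1, 15],
    [9, 0, 9],   [9, 0, 11],  [9, 1, 12],  [9, 1, 13],  [10, 0, 8],
    [10, 0, 20], [11, 0, 5],  [11, 0, 7],  [11, 0, 7],  [11, 0, 8],
    [11, 1, 9],  [11, 1, 10], [12, 1, 10], [15, 0, 6],  [16, 1, 11],
], dtype=float)

y = data[:, 2]
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])  # intercept, X, W

# OLS estimates: B = (X'X)^{-1} X'y
B = np.linalg.lstsq(X, y, rcond=None)[0]

resid = y - X @ B
df_resid = len(y) - X.shape[1]           # n - k - 1 = 17
mse = resid @ resid / df_resid           # residual mean square
se_b = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))

print("coefficients:", B)        # expect roughly 12.70, -0.37, 2.42
print("standard errors:", se_b)  # expect roughly 3.28, 0.30, 1.46
```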


Identifying Residuals

• Outlying residuals should be examined. They tend to stand out when the point is in the center of the X distribution. When the outlying point in the plot is deleted, we get the following results:

Regression Statistics
Multiple R         0.804
R Square           0.646
Adjusted R Square  0.602
Observations       19

ANOVA

            df    SS       MS      F
Regression   2    82.76    41.38   14.60
Residual    16    45.34     2.83
Total       18   128.11

           Coefficients  Standard Error  t Stat  P-value
Intercept     11.15          1.71         6.51   0.0000
X             -0.34          0.16        -2.15   0.0468
W              3.65          0.78         4.70   0.0002

An important question is whether it is proper to delete an outlying point.


Sometimes Data Errors Don’t Stand Out

• If an erroneous point in the Y data is associated with an extreme X value, it may not show up as a large residual
  » The OLS fit will be influenced by the bad point

• We can characterize how extreme a point is in the X space by its Leverage

• An observation has high leverage if it has a relatively large value of hᵢ, the ith diagonal element of the matrix X(X’X)⁻¹X’ (see the sketch below)
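A minimal sketch of the hᵢ computation, assuming a design matrix X that includes the intercept column (the helper name `leverage` is ours, not SPSS's):

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

# Toy design matrix: intercept plus one predictor with a single extreme value
X = np.column_stack([np.ones(6), [1.0, 2.0, 2.0, 3.0, 3.0, 12.0]])
print(leverage(X))  # the extreme observation gets by far the largest h_i
```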


Leverage: X(X’X)⁻¹X’

• X(X’X)⁻¹X’ is an n by n matrix that transforms Y into the fitted values, Ŷ
  » Ŷ = XB̂ = X[(X’X)⁻¹X’Y] = [X(X’X)⁻¹X’]Y

• It is a square, symmetric matrix with a special property: its square is itself!
  » [X(X’X)⁻¹X’][X(X’X)⁻¹X’] = [X(X’X)⁻¹X’]

• This so-called hat matrix can be thought of as a camera that projects an image of the data Y onto the regression plane.
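A quick numerical check of these properties, using a small simulated design (an illustration, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one predictor
Y = 2.0 + 0.5 * X[:, 1] + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # the hat matrix

print(np.allclose(H, H.T))               # symmetric
print(np.allclose(H @ H, H))             # idempotent: its square is itself
B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(H @ Y, X @ B_hat))     # H projects Y onto the fitted values X B_hat
```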


Standardized Residuals

• Assume that fitted values of Y have been obtained
  » Ŷ = XB̂

• Residuals are calculated
  » E = Y - Ŷ

• SPSS computes the standardized residual as the ratio of the unstandardized residual to the square root of the MSE
  » The square root of MSE is SE
  » The SPSS standardized residual is Eᵢ/SE
  » Values greater than 3 in absolute value are suspect
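A minimal sketch of this calculation (our own illustration of the formula, not SPSS code), with one gross error planted so that it shows up:

```python
import numpy as np

def standardized_residuals(X, Y):
    """Residuals divided by the square root of the MSE: E_i / S_E."""
    B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    E = Y - X @ B_hat
    df = len(Y) - X.shape[1]        # n - k - 1
    s_e = np.sqrt(E @ E / df)       # sqrt(MSE)
    return E / s_e

rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = 1.0 + X[:, 1] + rng.normal(size=n)
Y[5] += 8.0                          # plant a gross data error
print(np.round(standardized_residuals(X, Y), 2))  # observation 5 stands out
```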


Variance of Regression Estimates

• Let Y be an n×1 vector and X be an n×q matrix of predictors: Y = XB + e

B̂ = (X’X)⁻¹X’Y                 V(B̂) = (X’X)⁻¹σₑ²

Ŷ = X(X’X)⁻¹X’Y                V(Ŷ) = X(X’X)⁻¹X’σₑ²

ê = [I - X(X’X)⁻¹X’]Y          V(ê) = [I - X(X’X)⁻¹X’]σₑ²

V(Y|X) = Iσₑ²
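These expressions can be checked by simulation. A sketch (our illustration, with an arbitrary small design) comparing the empirical covariance of B̂ over repeated samples to (X’X)⁻¹σₑ²:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma_e = 50, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design: intercept + 1 predictor
B = np.array([1.0, 0.5])

XtX_inv = np.linalg.inv(X.T @ X)
B_hats = []
for _ in range(5000):
    e = rng.normal(scale=sigma_e, size=n)
    Y = X @ B + e
    B_hats.append(XtX_inv @ X.T @ Y)      # B_hat = (X'X)^{-1} X'Y

emp_cov = np.cov(np.array(B_hats).T)
theory = XtX_inv * sigma_e**2             # V(B_hat) = (X'X)^{-1} sigma_e^2
print(np.round(emp_cov, 3))
print(np.round(theory, 3))                # the two matrices should be close
```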


Studentized Residuals

• The precision of the estimate of the residual varies with the leverage of the observation

• Instead of comparing a residual to its distribution, compare it to its own standard error
  » Called the Studentized residual
  » Cohen et al. call this the Internally studentized residual

• The SPSS studentized residual is Eᵢ/[SE√(1 - hᵢ)]
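A sketch of the internally studentized residual just defined (our illustration; the function name is ours): each residual is scaled by SE√(1 - hᵢ) rather than by SE alone.

```python
import numpy as np

def internally_studentized(X, Y):
    """E_i / (S_E * sqrt(1 - h_i)), with h_i the hat-matrix diagonal."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    E = Y - H @ Y                       # residuals from the full fit
    df = len(Y) - X.shape[1]            # n - k - 1
    s_e = np.sqrt(E @ E / df)           # sqrt(MSE)
    return E / (s_e * np.sqrt(1.0 - h))

rng = np.random.default_rng(3)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = 1.0 + X[:, 1] + rng.normal(size=n)
print(np.round(internally_studentized(X, Y), 2))
```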


Externally Studentized Residuals

• If the ith residual reveals a mistake, then SE on (n-k-1) df will be too big
  » Externally studentized residuals adjust for this
  » The regression is re-estimated dropping the ith observation, and a new SE(i) is computed on (n-k-2) df
  » The discrepancy of the point is computed by comparing Yᵢ to the fitted value from the model that excludes the point in question
  » The externally studentized residual is eᵢ/[SE(i)√(1 - hᵢ)]
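A sketch of the leave-one-out computation described above (our illustration; a brute-force refit for each observation, which makes the definition explicit even though closed-form shortcuts exist). Here eᵢ is the ordinary residual from the full fit; dividing it by SE(i)√(1 - hᵢ) gives the same value as scaling the deleted residual by its own standard error.

```python
import numpy as np

def externally_studentized(X, Y):
    """e_i / [S_E(i) * sqrt(1 - h_i)], where S_E(i) comes from a fit that drops observation i."""
    n, p = X.shape                        # p = k + 1 parameters (intercept plus k predictors)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    E = Y - H @ Y                         # residuals from the full fit
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        B_i = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]   # refit without observation i
        resid_i = Y[keep] - X[keep] @ B_i
        s_e_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))         # SE(i) on n - k - 2 df
        out[i] = E[i] / (s_e_i * np.sqrt(1.0 - h[i]))
    return out

rng = np.random.default_rng(4)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = 1.0 + X[:, 1] + rng.normal(size=n)
Y[0] += 6.0                               # plant one bad value
print(np.round(externally_studentized(X, Y), 2))  # the first value is much larger than the rest
```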


Example Sorted by Discrepancy Measures

X   W   Y    RES_1   DRE_1   ZRE_1   SRE_1   SDR_1
11  0   5    -3.64   -4.07   -1.12   -1.18   -1.20
7   1   10   -2.53   -3.04   -0.78   -0.85   -0.85
6   0   8    -2.49   -3.36   -0.77   -0.89   -0.88
11  1   9    -2.06   -2.32   -0.63   -0.67   -0.66
11  0   7    -1.64   -1.83   -0.51   -0.53   -0.52
11  0   7    -1.64   -1.83   -0.51   -0.53   -0.52
15  0   6    -1.17   -1.64   -0.36   -0.43   -0.42
8   1   11   -1.16   -1.34   -0.36   -0.38   -0.37
11  1   10   -1.06   -1.19   -0.33   -0.35   -0.34
10  0   8    -1.01   -1.12   -0.31   -0.33   -0.32
12  1   10   -0.69   -0.80   -0.21   -0.23   -0.22
11  0   8    -0.64   -0.72   -0.20   -0.21   -0.20
9   0   9    -0.38   -0.43   -0.12   -0.12   -0.12
9   1   12    0.20    0.23    0.06    0.07    0.06
9   1   13    1.20    1.35    0.37    0.39    0.38
7   1   14    1.47    1.76    0.45    0.49    0.48
9   0   11    1.62    1.83    0.50    0.53    0.52
16  1   11    1.79    3.15    0.55    0.73    0.72
8   1   15    2.84    3.25    0.87    0.93    0.93
10  0   20   10.99   12.22    3.38    3.56    6.88

(SPSS saved variables: RES_1 = unstandardized residual, DRE_1 = deleted residual, ZRE_1 = standardized residual, SRE_1 = studentized residual, SDR_1 = studentized deleted residual)

• Both the studentizing and deleting operations tend to increase the size of the standardized residuals


Discrepancies and Quality Control of Data

• When carrying out data analysis, it is important to make sure the data are clean
  » Initially we look at distributions and scatterplots
  » Out-of-range values are especially important

• Discrepancy analysis is a second-order cleaning step
  » Some discrepancies may be clear errors
  » Other discrepancies may reveal special populations or circumstances