G89.2229 Multiple Regression Week 12 (Monday)
• Quality Control and Critical Evaluation of Regression Results
• An example
• Identifying Residuals
• Leverage: X(X'X)^-1 X'
• Residuals (Discrepancies)
Quality Control and Critical Evaluation of Regression Results
• Multiple regression programs have the potential to hide details of the data
  » Regression output provides only summary information
  » When several variables are considered, bivariate plots may not be immediately revealing
• Regression diagnostic indicators can help with quality control
  » Which subjects are not well fit: Discrepancy or Residuals
  » Which subjects are affecting results: Influence Points
An example
• Suppose X were a measure of SES advantage, such as years of education, W an indicator of social disadvantage, such as immigrant status, and Y the number of stressful life events
• Nothing necessarily jumps out from these data as being amiss.
 X  W   Y
 6  0   8
 7  1  10
 7  1  14
 8  1  11
 8  1  15
 9  0   9
 9  0  11
 9  1  12
 9  1  13
10  0   8
10  0  20
11  0   5
11  0   7
11  0   7
11  0   8
11  1   9
11  1  10
12  1  10
15  0   6
16  1  11
Example, Continued
• Here are the regression results
• Here is the plot of the fit
Regression Statistics
Multiple R           0.465
R Square             0.216
Adjusted R Square    0.124
Observations        20

ANOVA
             df      SS      MS     F
Regression    2   49.56   24.78  2.34
Residual     17  179.64   10.57
Total        19  229.20

            Coefficients  Standard Error  t Stat  P-value
Intercept       12.70          3.28        3.87    0.0012
X               -0.37          0.30       -1.22    0.2387
W                2.42          1.46        1.65    0.1168
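The fit in the table above can be reproduced with a short numpy sketch. The data are typed in from the example slide; the variable names are mine:

```python
import numpy as np

# Example data from the slides: X = years of education, W = immigrant status,
# Y = number of stressful life events (n = 20)
X = np.array([6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11, 12, 15, 16], float)
W = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1], float)
Y = np.array([8, 10, 14, 11, 15, 9, 11, 12, 13, 8, 20, 5, 7, 7, 8, 9, 10, 10, 6, 11], float)

# Design matrix with an intercept column: Y = XB + e
Xmat = np.column_stack([np.ones(len(Y)), X, W])

# Ordinary least squares: B-hat = (X'X)^-1 X'Y
b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
resid = Y - Xmat @ b

sse = resid @ resid                  # residual SS (about 179.64)
sst = ((Y - Y.mean()) ** 2).sum()    # total SS (229.20)
r_squared = 1 - sse / sst            # about 0.216

print(np.round(b, 2))       # intercept, X, and W coefficients
print(round(r_squared, 3))
```

The coefficients and R-square agree with the Excel-style output shown above, up to rounding.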
[Figure: X Line Fit Plot, showing Y and Predicted Y plotted against X]
Identifying Residuals
• Outlying residuals should be examined. They tend to stand out when they fall in the center of the X distribution. When the outlying point in the plot is deleted, we get the following results:

Regression Statistics
Multiple R           0.804
R Square             0.646
Adjusted R Square    0.602
Observations        19

ANOVA
             df      SS      MS      F
Regression    2   82.76   41.38  14.60
Residual     16   45.34    2.83
Total        18  128.11

            Coefficients  Standard Error  t Stat  P-value
Intercept       11.15          1.71        6.51    0.0000
X               -0.34          0.16       -2.15    0.0468
W                3.65          0.78        4.70    0.0002
An important question is whether it is proper to delete an outlying point.
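The deletion step can be sketched numerically. Here I assume, consistent with the discrepancy table on a later slide, that the outlier is simply the case with the largest absolute residual:

```python
import numpy as np

# Same example data as in the slides
X = np.array([6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11, 12, 15, 16], float)
W = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1], float)
Y = np.array([8, 10, 14, 11, 15, 9, 11, 12, 13, 8, 20, 5, 7, 7, 8, 9, 10, 10, 6, 11], float)
Xmat = np.column_stack([np.ones(len(Y)), X, W])

# Fit the full model and locate the worst-fit case
b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
resid = Y - Xmat @ b
keep = np.arange(len(Y)) != np.argmax(np.abs(resid))

# Refit without that observation (here the X=10, W=0, Y=20 case)
b2, *_ = np.linalg.lstsq(Xmat[keep], Y[keep], rcond=None)
resid2 = Y[keep] - Xmat[keep] @ b2
r2 = 1 - (resid2 @ resid2) / ((Y[keep] - Y[keep].mean()) ** 2).sum()

print(np.round(b2, 2))  # roughly [11.15, -0.34, 3.65], as in the table above
print(round(r2, 3))     # roughly 0.646
```

Note how R-square triples and the W coefficient becomes clearly significant once the single deviant point is removed, which is exactly why the deletion decision must be justified substantively.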
Sometimes Data Errors Don't Stand Out
• If an erroneous point in the Y data is associated with an extreme X value, it may not show up as a large residual
  » The OLS fit will be pulled toward the bad point
• An extreme point in the X space is said to have Leverage
• An observation has leverage if it has a relatively large value of h_i, the ith diagonal element of the matrix X(X'X)^-1 X'
Leverage: X(X'X)^-1 X'
• X(X'X)^-1 X' is an n by n matrix that transforms Y into the fitted values, Ŷ
  » Ŷ = XB̂ = X[(X'X)^-1 X'Y] = [X(X'X)^-1 X']Y
• It is a square, symmetric matrix with a special property: its square is itself (it is idempotent)
  » [X(X'X)^-1 X'][X(X'X)^-1 X'] = [X(X'X)^-1 X']
• This so-called hat matrix can be thought of as a camera that projects an image of the data Y onto the regression plane.
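A quick numeric check of these properties, using the example data's design matrix (variable names are mine):

```python
import numpy as np

# Design matrix from the running example (intercept, X, W)
X = np.array([6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11, 12, 15, 16], float)
W = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1], float)
Xmat = np.column_stack([np.ones(len(X)), X, W])

# Hat matrix H = X(X'X)^-1 X'
H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T

print(np.allclose(H, H.T))    # symmetric: True
print(np.allclose(H @ H, H))  # idempotent ("its square is itself"): True

# The diagonal elements h_i are the leverages; they sum to the number of
# columns of X (here 3), so the average leverage is p/n
leverage = np.diag(H)
print(round(leverage.sum(), 6))
```

A useful rule this illustrates: because the h_i sum to p, an observation with h_i much larger than the average p/n is a high-leverage point.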
Standardized Residuals
• Assume that fitted values of Y have been obtained
  » Ŷ = XB̂
• Residuals are calculated
  » E = Y − Ŷ
• SPSS computes the standardized residual as the ratio of the unstandardized residual to the square root of the MSE
  » The square root of MSE is S_E
  » The SPSS standardized residual is E_i / S_E
  » Absolute values greater than 3 are suspect
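A sketch of this computation for the running example (the largest value flags the Y = 20 case):

```python
import numpy as np

# Running example data
X = np.array([6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11, 12, 15, 16], float)
W = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1], float)
Y = np.array([8, 10, 14, 11, 15, 9, 11, 12, 13, 8, 20, 5, 7, 7, 8, 9, 10, 10, 6, 11], float)
Xmat = np.column_stack([np.ones(len(Y)), X, W])

b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
resid = Y - Xmat @ b

# S_E = sqrt(MSE), with MSE on n - k - 1 = 17 df (k = 2 predictors)
n, k = len(Y), 2
s_e = np.sqrt(resid @ resid / (n - k - 1))

zre = resid / s_e            # SPSS-style standardized residuals (ZRESID)
print(round(zre.max(), 2))   # about 3.38, matching ZRE_1 in the sorted table later
```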
Variance of Regression Estimates
• Let Y be an n×1 vector and X an n×q matrix of predictors: Y = XB + e
• Under the model assumption V(Y | X) = I σ_e², the key results are:
  » B̂ = (X'X)^-1 X'Y,  with V(B̂) = (X'X)^-1 σ_e²
  » Ŷ = X(X'X)^-1 X'Y,  with V(Ŷ) = X(X'X)^-1 X' σ_e²
  » ê = [I − X(X'X)^-1 X']Y,  with V(ê) = [I − X(X'X)^-1 X'] σ_e²
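These variance formulas can be checked by simulation: with errors drawn so that V(Y|X) = I σ_e², the residuals ê = (I − H)Y should have variance (1 − h_i)σ_e² at each observation. A sketch (my own simulation, with σ_e = 1):

```python
import numpy as np

# Design matrix from the running example
X = np.array([6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11, 12, 15, 16], float)
W = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1], float)
Xmat = np.column_stack([np.ones(len(X)), X, W])
n = len(X)

H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T
M = np.eye(n) - H   # residual-maker matrix I - X(X'X)^-1 X', also idempotent

# Simulate e ~ N(0, I) many times; e-hat = M e, so Var(e-hat_i) should be
# the diagonal element M_ii = 1 - h_i
rng = np.random.default_rng(0)
e = rng.standard_normal((100_000, n))
ehat = e @ M                        # M is symmetric, so each row is M applied to e
empirical = ehat.var(axis=0)

print(float(np.abs(empirical - np.diag(M)).max()))  # small: matches 1 - h_i
```

This is the fact the next slide exploits: residuals at high-leverage points have smaller variance, so a common yardstick S_E understates how unusual they are.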
Studentized Residuals
• The precision of the estimate of a residual varies with the leverage of the observation
• Instead of comparing a residual to the overall residual distribution, compare it to its own standard error
  » Called the Studentized residual
  » Cohen et al. call this the internally studentized residual
• The SPSS studentized residual is E_i / [S_E (1 − h_i)^1/2]
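In a sketch of the same running example, studentizing nudges the deviant case further from the rest:

```python
import numpy as np

# Running example data
X = np.array([6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11, 12, 15, 16], float)
W = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1], float)
Y = np.array([8, 10, 14, 11, 15, 9, 11, 12, 13, 8, 20, 5, 7, 7, 8, 9, 10, 10, 6, 11], float)
Xmat = np.column_stack([np.ones(len(Y)), X, W])

# Leverages h_i from the hat matrix
H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T
h = np.diag(H)

b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
resid = Y - Xmat @ b
s_e = np.sqrt(resid @ resid / (len(Y) - 3))   # sqrt(MSE) on n - k - 1 df

# Internally studentized residuals: each residual is scaled by its own SE
sre = resid / (s_e * np.sqrt(1 - h))
print(round(sre.max(), 2))   # about 3.56 (the SRE_1 value for the Y = 20 case)
```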
Externally Studentized Residuals
• If the ith residual reflects a mistake, then S_E on (n − k − 1) df will be too big
  » Externally studentized residuals adjust for this
  » The regression is re-estimated dropping the ith observation, and a new S_E(i) is computed on (n − k − 2) df
  » The discrepancy of the point is computed by comparing Y_i to the fitted Y from the model that excludes the point in question
  » The externally studentized residual is e_i / [S_E(i) (1 − h_i)^1/2]
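The leave-one-out step above can be sketched by literally refitting without each case (a shortcut formula exists, but the direct loop makes the definition concrete):

```python
import numpy as np

# Running example data
X = np.array([6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11, 12, 15, 16], float)
W = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1], float)
Y = np.array([8, 10, 14, 11, 15, 9, 11, 12, 13, 8, 20, 5, 7, 7, 8, 9, 10, 10, 6, 11], float)
Xmat = np.column_stack([np.ones(len(Y)), X, W])
n, p = Xmat.shape

H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T
h = np.diag(H)
b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
resid = Y - Xmat @ b

# For each i, refit without observation i to get S_E(i) on n - k - 2 = 16 df,
# then studentize e_i against that external scale estimate
ext = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi, *_ = np.linalg.lstsq(Xmat[keep], Y[keep], rcond=None)
    ri = Y[keep] - Xmat[keep] @ bi
    s_i = np.sqrt(ri @ ri / (n - 1 - p))
    ext[i] = resid[i] / (s_i * np.sqrt(1 - h[i]))

print(round(ext.max(), 2))   # about 6.88 (the SDR_1 value for the deviant case)
```

Because the deviant case no longer inflates its own S_E(i), its externally studentized residual nearly doubles relative to the internal version.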
Example Sorted by Discrepancy Measures
 X  W   Y  RES_1  DRE_1  ZRE_1  SRE_1  SDR_1
11  0   5  -3.64  -4.07  -1.12  -1.18  -1.20
 7  1  10  -2.53  -3.04  -0.78  -0.85  -0.85
 6  0   8  -2.49  -3.36  -0.77  -0.89  -0.88
11  1   9  -2.06  -2.32  -0.63  -0.67  -0.66
11  0   7  -1.64  -1.83  -0.51  -0.53  -0.52
11  0   7  -1.64  -1.83  -0.51  -0.53  -0.52
15  0   6  -1.17  -1.64  -0.36  -0.43  -0.42
 8  1  11  -1.16  -1.34  -0.36  -0.38  -0.37
11  1  10  -1.06  -1.19  -0.33  -0.35  -0.34
10  0   8  -1.01  -1.12  -0.31  -0.33  -0.32
12  1  10  -0.69  -0.80  -0.21  -0.23  -0.22
11  0   8  -0.64  -0.72  -0.20  -0.21  -0.20
 9  0   9  -0.38  -0.43  -0.12  -0.12  -0.12
 9  1  12   0.20   0.23   0.06   0.07   0.06
 9  1  13   1.20   1.35   0.37   0.39   0.38
 7  1  14   1.47   1.76   0.45   0.49   0.48
 9  0  11   1.62   1.83   0.50   0.53   0.52
16  1  11   1.79   3.15   0.55   0.73   0.72
 8  1  15   2.84   3.25   0.87   0.93   0.93
10  0  20  10.99  12.22   3.38   3.56   6.88
• Both the studentizing and deleting operations tend to increase the size of the standardized residuals
Discrepancies and Quality Control of Data
• When carrying out data analysis, it is important to make sure the data are clean
  » Initially we look at distributions and scatterplots
  » Out-of-range values are especially important
• Discrepancy analysis is a second-order cleaning step
  » Some discrepancies may be clear errors
  » Other discrepancies may reveal special populations or circumstances