6-3 Multiple Regression
6-3.1 Estimation of Parameters in Multiple Regression


• The least squares function is given by

L(β0, β1, …, βk) = Σ_{i=1..n} ( yi − β0 − Σ_{j=1..k} βj xij )²

• The least squares estimates must satisfy

∂L/∂βj = 0,  j = 0, 1, …, k

• The least squares normal equations are

(X′X) β̂ = X′y

• The solution to the normal equations is the least squares estimator of the regression coefficients, β̂ = (X′X)⁻¹X′y.
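As a minimal numerical sketch (not from the text), the normal equations can be solved directly and compared against a QR/SVD-based least squares routine. The five data rows are the first five observations of the wire bond data used later in the section:

```python
import numpy as np

# Solve the least squares normal equations (X'X) beta_hat = X'y directly
# and compare with numpy's built-in least squares solver.
X = np.array([[1.0,  2.0,  50.0],
              [1.0,  8.0, 110.0],
              [1.0, 11.0, 120.0],
              [1.0, 10.0, 550.0],
              [1.0,  8.0, 295.0]])   # first column of 1s for the intercept
y = np.array([9.95, 24.45, 31.75, 35.0, 25.02])

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # SVD-based solver

assert np.allclose(beta_normal, beta_lstsq)
print(beta_normal)
```

Solving the normal equations is fine for small, well-conditioned problems; in practice `lstsq` (QR/SVD) is preferred because it avoids squaring the condition number of X.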


X′X in Multiple Regression

As printed by the XPX option in SAS, the crossproducts matrix borders X′X with X′Y and Y′Y (all sums run over i = 1, …, n):

$$
(X'X) =
\begin{bmatrix}
n & \sum X_{1i} & \cdots & \sum X_{pi} & \sum Y_i \\
\sum X_{1i} & \sum X_{1i}^2 & \cdots & \sum X_{1i}X_{pi} & \sum X_{1i}Y_i \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\sum X_{pi} & \sum X_{1i}X_{pi} & \cdots & \sum X_{pi}^2 & \sum X_{pi}Y_i \\
\sum Y_i & \sum X_{1i}Y_i & \cdots & \sum X_{pi}Y_i & \sum Y_i^2
\end{bmatrix}
$$

$$
(X'X)^{-1}\sigma_\epsilon^2 =
\begin{bmatrix}
\mathrm{VAR}(\hat\beta_0) & \mathrm{COV}(\hat\beta_0,\hat\beta_1) & \cdots & \mathrm{COV}(\hat\beta_0,\hat\beta_p) \\
\mathrm{COV}(\hat\beta_0,\hat\beta_1) & \mathrm{VAR}(\hat\beta_1) & \cdots & \mathrm{COV}(\hat\beta_1,\hat\beta_p) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{COV}(\hat\beta_0,\hat\beta_p) & \mathrm{COV}(\hat\beta_1,\hat\beta_p) & \cdots & \mathrm{VAR}(\hat\beta_p)
\end{bmatrix}
$$


Adjusted R2

We can adjust R² to take into account the number of regressors, k, in the model:

(i) Unlike R², the adjusted R² does not always increase as k increases. The adjusted R² is especially preferred to R² when k/n is a large fraction (greater than 10%); if k/n is small, the two measures are almost identical.

(ii) Always: ADJ RSQ ≤ R².

(iii) R² = 1 − SSE/SS(TOTAL) and ADJ RSQ = 1 − MSE/MS(TOTAL), where MS(TOTAL) = SS(TOTAL)/(n − 1) = the sample variance of y.
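The two formulas in (iii) can be checked against the wire bond SAS output that appears later in this section (SSE = 115.17348, SS(TOTAL) = 6105.94470, n = 25, k = 2 regressors):

```python
# Verify R-Square and Adj R-Sq from the wire bond ANOVA table.
sse, sst, n, k = 115.17348, 6105.94470, 25, 2

r2 = 1 - sse / sst                                   # R2 = 1 - SSE/SS(TOTAL)
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # 1 - MSE/MS(TOTAL)

assert round(r2, 4) == 0.9811     # matches SAS: R-Square 0.9811
assert round(adj_r2, 4) == 0.9794  # matches SAS: Adj R-Sq 0.9794
```

Here k/n = 2/25 = 8%, which is why the two measures are nearly identical.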


6-3.2 Inferences in Multiple Regression

Test for Significance of Regression

Inference on Individual Regression Coefficients

• To test H0: βj = 0 versus H1: βj ≠ 0, use t = β̂j / se(β̂j). This is called a partial or marginal test, because it measures the contribution of xj given that the other regressors are already in the model.

Confidence Intervals on the Mean Response and Prediction Intervals

The response at the point of interest x0 = (x10, x20, …, xk0) is

Y0 = β0 + β1 x10 + β2 x20 + ⋯ + βk xk0 + ε

and the corresponding predicted value is

Ŷ0 = β̂0 + β̂1 x10 + β̂2 x20 + ⋯ + β̂k xk0

The prediction error is Y0 − Ŷ0, and the standard deviation of this prediction error is

√( σ̂² + [se( μ̂_Y | x10, x20, …, xk0 )]² )
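This standard deviation can be checked against observation 1 of the wire bond SAS output shown later (root MSE 2.28805, se of the mean prediction 0.9074, predicted value 8.3787, and t(0.025, 22) ≈ 2.0739):

```python
import math

# Prediction-interval standard error: sqrt(sigma_hat^2 + se(mean)^2),
# checked against the 95% CL Predict limits for observation 1.
root_mse, se_mean, y_hat, t = 2.28805, 0.9074, 8.3787, 2.0739

se_pred = math.sqrt(root_mse**2 + se_mean**2)
lower, upper = y_hat - t * se_pred, y_hat + t * se_pred

assert round(lower, 2) == 3.27    # SAS reports 3.2740
assert round(upper, 2) == 13.48   # SAS reports 13.4834
```

Note the prediction interval is always wider than the confidence interval on the mean response, because σ̂² is added under the square root.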


6-3.3 Checking Model Adequacy
Residual Analysis


0 < hii ≤ 1

The studentized residual is ri = ei / √( MSE(1 − hii) ). Because the hii's are always between zero and one, a studentized residual is always larger in magnitude than the corresponding standardized residual. Consequently, studentized residuals are a more sensitive diagnostic when looking for outliers.
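The studentized residual formula can be checked against observation 15 of the wire bond SAS output (residual 5.8409, MSE 5.23516, hat diagonal 0.0737):

```python
import math

# Internally studentized residual r_i = e_i / sqrt(MSE * (1 - h_ii)),
# checked against SAS's "Student Residual" column for observation 15.
e, mse, h = 5.8409, 5.23516, 0.0737

r = e / math.sqrt(mse * (1 - h))

assert round(r, 2) == 2.65   # SAS reports 2.652
```

SAS's RStudent column (3.1422 for this observation) is the externally studentized version, which deletes observation i before estimating σ and is even more sensitive to outliers.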

Influential Observations

• The disposition of the points in x-space is important in determining the properties of the fitted model: R², the regression coefficients, and the magnitude of the error mean square.

• hii is a measure of the distance of the point (xi1, xi2, …, xik) from the average of all the points in the data set.

• A rule of thumb is that values of hii greater than 2p/n should be investigated (leverage points).

• A large value of Cook's distance Di implies that the ith point is influential.

• A value of Di > 1 would indicate that the point is influential.
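Cook's distance can be computed from the studentized residual and the leverage, Di = ri² hii / (p(1 − hii)), checked here against observation 15 of the wire bond output (studentized residual 2.652, h = 0.0737, p = 3 parameters including the intercept):

```python
# Cook's distance from studentized residual and leverage, checked against
# the SAS Cook's D column for observation 15 of the wire bond data.
r, h, p, n = 2.652, 0.0737, 3, 25

d = r**2 * h / (p * (1 - h))

assert round(d, 2) == 0.19   # SAS reports 0.187; well below the D > 1 cutoff
print("leverage cutoff 2p/n =", 2 * p / n)
```

For this data set the leverage rule of thumb is 2p/n = 6/25 = 0.24, so observations 17 (h = 0.2593) and 18 (h = 0.2929) in the influence output are leverage points worth a look.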


OPTIONS NOOVP NODATE NONUMBER LS=140;
DATA EX67;
  INPUT STRENGTH LENGTH HEIGHT @@;
  LABEL STRENGTH='PULL STRENGTH' LENGTH='WIRE LENGTH' HEIGHT='DIE HEIGHT';
  CARDS;
 9.95  2  50   24.45  8 110   31.75 11 120   35    10 550
25.02  8 295   16.86  4 200   14.38  2 375    9.6   2  52
24.35  9 100   27.5   8 300   17.08  4 412   37    11 400
41.95 12 500   11.66  2 360   21.65  4 205   17.89  4 400
69    20 600   10.3   1 585   34.93 10 540   46.59 15 250
44.88 15 290   54.12 16 510   56.63 17 590   22.13  6 100
21.15  5 400
;
PROC SGSCATTER DATA=EX67;
  MATRIX STRENGTH LENGTH HEIGHT;
  TITLE 'SCATTER PLOT MATRIX FOR WIRE BOND DATA';
PROC REG DATA=EX67;
  MODEL STRENGTH=LENGTH HEIGHT/XPX R CLB CLM CLI INFLUENCE;
  TITLE 'MULTIPLE REGRESSION';

ODS GRAPHICS ON;
PROC REG DATA=EX67 PLOTS(LABEL)=(COOKSD RSTUDENTBYLEVERAGE DFFITS DFBETAS);
  MODEL STRENGTH=LENGTH HEIGHT;
RUN;
ODS GRAPHICS OFF;

DATA EX67N;
  INPUT LENGTH HEIGHT @@;
  DATALINES;
11 35   5 20
;
DATA EX67N1;
  SET EX67 EX67N;
PROC REG DATA=EX67N1;
  MODEL STRENGTH=LENGTH HEIGHT/CLM CLI;
  TITLE 'CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION';
RUN; QUIT;

Example 6-7


The DFBETAS statistics are scaled measures of the change in each parameter estimate, calculated by deleting the ith observation. In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch (1980) recommend 2 as a general cutoff value to indicate influential observations and 2/√n as a size-adjusted cutoff.

The DFFITS statistic is a scaled measure of the change in the predicted value for the ith observation, calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the x-space. A general cutoff to consider is 2; the size-adjusted cutoff recommended by Belsley, Kuh, and Welsch (1980) is 2√(p/n).


The REG Procedure Model: MODEL1

Model Crossproducts X'X X'Y Y'Y

Variable    Label          Intercept      length       height      strength
Intercept   Intercept           25           206         8294        725.82
length      Wire length        206          2396        77177       8008.47
height      Die Height        8294         77177      3531848     274816.71
strength                    725.82       8008.47    274816.71    27178.5316

The REG Procedure   Model: MODEL1   Dependent Variable: strength

Number of Observations Read 25 Number of Observations Used 25

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2       5990.77122    2995.38561    572.17   <.0001
Error              22        115.17348       5.23516
Corrected Total    24       6105.94470

Root MSE          2.28805    R-Square   0.9811
Dependent Mean   29.03280    Adj R-Sq   0.9794
Coeff Var         7.88090

Parameter Estimates

Variable    Label         DF   Estimate   Std Error   t Value   Pr > |t|   95% Confidence Limits
Intercept   Intercept      1    2.26379     1.06007      2.14     0.0441    0.06535    4.46223
length      Wire length    1    2.74427     0.09352     29.34     <.0001    2.55031    2.93823
height      Die Height     1    0.01253     0.00280      4.48     0.0002    0.00672    0.01833


The REG Procedure Model: MODEL1 Dependent Variable: strength

Output Statistics

Obs  Dependent  Predicted  Std Error       95% CL Mean        95% CL Predict     Residual  Std Error  Student                 Cook's
     Variable   Value      Mean Predict                                                    Residual   Residual   -2-1 0 1 2   D
1 9.9500 8.3787 0.9074 6.4968 10.2606 3.2740 13.4834 1.5713 2.100 0.748 | |* | 0.035
2 24.4500 25.5960 0.7645 24.0105 27.1815 20.5930 30.5990 -1.1460 2.157 -0.531 | *| | 0.012
3 31.7500 33.9541 0.8620 32.1665 35.7417 28.8834 39.0248 -2.2041 2.119 -1.040 | **| | 0.060
4 35.0000 36.5968 0.7303 35.0821 38.1114 31.6158 41.5778 -1.5968 2.168 -0.736 | *| | 0.021
5 25.0200 27.9137 0.4677 26.9437 28.8836 23.0704 32.7569 -2.8937 2.240 -1.292 | **| | 0.024
6 16.8600 15.7464 0.6261 14.4481 17.0448 10.8269 20.6660 1.1136 2.201 0.506 | |* | 0.007
7 14.3800 12.4503 0.7862 10.8198 14.0807 7.4328 17.4677 1.9297 2.149 0.898 | |* | 0.036
8 9.6000 8.4038 0.9039 6.5291 10.2784 3.3018 13.5058 1.1962 2.102 0.569 | |* | 0.020
9 24.3500 28.2150 0.8185 26.5175 29.9125 23.1754 33.2546 -3.8650 2.137 -1.809 | ***| | 0.160
10 27.5000 27.9763 0.4651 27.0118 28.9408 23.1341 32.8184 -0.4763 2.240 -0.213 | | | 0.001
11 17.0800 18.4023 0.6960 16.9588 19.8458 13.4425 23.3621 -1.3223 2.180 -0.607 | *| | 0.013
12 37.0000 37.4619 0.5246 36.3739 38.5498 32.5936 42.3301 -0.4619 2.227 -0.207 | | | 0.001
13 41.9500 41.4589 0.6553 40.0999 42.8179 36.5230 46.3948 0.4911 2.192 0.224 | | | 0.001
14 11.6600 12.2623 0.7689 10.6678 13.8568 7.2565 17.2682 -0.6023 2.155 -0.280 | | | 0.003
15 21.6500 15.8091 0.6213 14.5206 17.0976 10.8921 20.7260 5.8409 2.202 2.652 | |***** | 0.187
16 17.8900 18.2520 0.6785 16.8448 19.6592 13.3026 23.2014 -0.3620 2.185 -0.166 | | | 0.001
17 69.0000 64.6659 1.1652 62.2494 67.0824 59.3409 69.9909 4.3341 1.969 2.201 | |**** | 0.565
18 10.3000 12.3368 1.2383 9.7689 14.9048 6.9414 17.7323 -2.0368 1.924 -1.059 | **| | 0.155
19 34.9300 36.4715 0.7096 34.9999 37.9431 31.5034 41.4396 -1.5415 2.175 -0.709 | *| | 0.018
20 46.5900 46.5598 0.8780 44.7389 48.3807 41.4773 51.6423 0.0302 2.113 0.0143 | | | 0.000
21 44.8800 47.0609 0.8238 45.3524 48.7694 42.0176 52.1042 -2.1809 2.135 -1.022 | **| | 0.052
22 54.1200 52.5613 0.8432 50.8127 54.3099 47.5042 57.6183 1.5587 2.127 0.733 | |* | 0.028
23 56.6300 56.3078 0.9771 54.2814 58.3342 51.1481 61.4675 0.3222 2.069 0.156 | | | 0.002
24 22.1300 19.9822 0.7557 18.4149 21.5494 14.9850 24.9794 2.1478 2.160 0.995 | |* | 0.040
25 21.1500 20.9963 0.6176 19.7153 22.2772 16.0813 25.9112 0.1537 2.203 0.0698 | | | 0.000


The REG Procedure   Model: MODEL1   Dependent Variable: strength

Output Statistics

Obs  RStudent   Hat Diag H   Cov Ratio   DFFITS    ------------- DFBETAS -------------
                                                    Intercept     length      height
1 0.7404 0.1573 1.2629 0.3199 0.3179 -0.1005 -0.2001
2 -0.5226 0.1116 1.2451 -0.1853 -0.1403 -0.0515 0.1483
3 -1.0419 0.1419 1.1519 -0.4237 -0.2219 -0.2371 0.3393
4 -0.7285 0.1019 1.1879 -0.2454 0.0788 0.0223 -0.1843
5 -1.3131 0.0418 0.9470 -0.2742 -0.1572 -0.0097 0.0553
6 0.4973 0.0749 1.1999 0.1415 0.1301 -0.0581 -0.0494
7 0.8940 0.1181 1.1655 0.3271 0.1480 -0.2618 0.1422
8 0.5602 0.1561 1.3031 0.2409 0.2395 -0.0766 -0.1498
9 -1.9155 0.1280 0.8133 -0.7338 -0.5012 -0.2837 0.6056
10 -0.2079 0.0413 1.1919 -0.0432 -0.0241 -0.0010 0.0075
11 -0.5978 0.0925 1.2045 -0.1909 -0.0602 0.1321 -0.1027
12 -0.2028 0.0526 1.2065 -0.0478 0.0018 -0.0169 -0.0085
13 0.2191 0.0820 1.2440 0.0655 -0.0224 0.0173 0.0338
14 -0.2736 0.1129 1.2824 -0.0976 -0.0484 0.0779 -0.0381
15 3.1422 0.0737 0.3906 0.8866 0.8097 -0.3743 -0.2920
16 -0.1620 0.0879 1.2559 -0.0503 -0.0178 0.0347 -0.0253
17 2.4352 0.2593 0.7361 1.4410 -0.8514 1.0089 0.4136
18 -1.0617 0.2929 1.3899 -0.6833 -0.0219 0.5216 -0.5324
19 -0.7004 0.0962 1.1870 -0.2285 0.0701 0.0180 -0.1676
20 0.0140 0.1473 1.3483 0.0058 0.0006 0.0048 -0.0031
21 -1.0228 0.1296 1.1418 -0.3947 -0.0085 -0.3241 0.1706
22 0.7249 0.1358 1.2354 0.2873 -0.1326 0.1830 0.0764
23 0.1522 0.1824 1.4016 0.0719 -0.0413 0.0402 0.0304
24 0.9943 0.1091 1.1242 0.3479 0.3085 0.0165 -0.2621
25 0.0682 0.0729 1.2393 0.0191 0.0063 -0.0116 0.0095

Sum of Residuals 0 Sum of Squared Residuals 115.17348 Predicted Residual SS (PRESS) 156.16295



The REG Procedure Model: MODEL1 Dependent Variable: strength Pull Strength

Number of Observations Read 25 Number of Observations Used 25

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2       5990.77122    2995.38561    572.17   <.0001
Error              22        115.17348       5.23516
Corrected Total    24       6105.94470

Root MSE          2.28805    R-Square   0.9811
Dependent Mean   29.03280    Adj R-Sq   0.9794
Coeff Var         7.88090

Parameter Estimates

Variable    Label         DF   Estimate   Std Error   t Value   Pr > |t|
Intercept   Intercept      1    2.26379     1.06007      2.14     0.0441
length      Wire length    1    2.74427     0.09352     29.34     <.0001
height      Die Height     1    0.01253     0.00280      4.48     0.0002


CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION

The REG Procedure Model: MODEL1 Dependent Variable: strength Pull Strength

Number of Observations Read 27 Number of Observations Used 25 Number of Observations with Missing Values 2

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2       5990.77122    2995.38561    572.17   <.0001
Error              22        115.17348       5.23516
Corrected Total    24       6105.94470

Root MSE          2.28805    R-Square   0.9811
Dependent Mean   29.03280    Adj R-Sq   0.9794
Coeff Var         7.88090

Parameter Estimates

Variable    Label         DF   Estimate   Std Error   t Value   Pr > |t|
Intercept   Intercept      1    2.26379     1.06007      2.14     0.0441
length      Wire length    1    2.74427     0.09352     29.34     <.0001
height      Die Height     1    0.01253     0.00280      4.48     0.0002


The REG Procedure Model: MODEL1 Dependent Variable: strength Pull Strength Output Statistics

Obs  Dependent  Predicted  Std Error       95% CL Mean        95% CL Predict     Residual
     Variable   Value      Mean Predict
1 9.9500 8.3787 0.9074 6.4968 10.2606 3.2740 13.4834 1.5713
2 24.4500 25.5960 0.7645 24.0105 27.1815 20.5930 30.5990 -1.1460
3 31.7500 33.9541 0.8620 32.1665 35.7417 28.8834 39.0248 -2.2041
4 35.0000 36.5968 0.7303 35.0821 38.1114 31.6158 41.5778 -1.5968
5 25.0200 27.9137 0.4677 26.9437 28.8836 23.0704 32.7569 -2.8937
6 16.8600 15.7464 0.6261 14.4481 17.0448 10.8269 20.6660 1.1136
7 14.3800 12.4503 0.7862 10.8198 14.0807 7.4328 17.4677 1.9297
8 9.6000 8.4038 0.9039 6.5291 10.2784 3.3018 13.5058 1.1962
9 24.3500 28.2150 0.8185 26.5175 29.9125 23.1754 33.2546 -3.8650
10 27.5000 27.9763 0.4651 27.0118 28.9408 23.1341 32.8184 -0.4763
11 17.0800 18.4023 0.6960 16.9588 19.8458 13.4425 23.3621 -1.3223
12 37.0000 37.4619 0.5246 36.3739 38.5498 32.5936 42.3301 -0.4619
13 41.9500 41.4589 0.6553 40.0999 42.8179 36.5230 46.3948 0.4911
14 11.6600 12.2623 0.7689 10.6678 13.8568 7.2565 17.2682 -0.6023
15 21.6500 15.8091 0.6213 14.5206 17.0976 10.8921 20.7260 5.8409
16 17.8900 18.2520 0.6785 16.8448 19.6592 13.3026 23.2014 -0.3620
17 69.0000 64.6659 1.1652 62.2494 67.0824 59.3409 69.9909 4.3341
18 10.3000 12.3368 1.2383 9.7689 14.9048 6.9414 17.7323 -2.0368
19 34.9300 36.4715 0.7096 34.9999 37.9431 31.5034 41.4396 -1.5415
20 46.5900 46.5598 0.8780 44.7389 48.3807 41.4773 51.6423 0.0302
21 44.8800 47.0609 0.8238 45.3524 48.7694 42.0176 52.1042 -2.1809
22 54.1200 52.5613 0.8432 50.8127 54.3099 47.5042 57.6183 1.5587
23 56.6300 56.3078 0.9771 54.2814 58.3342 51.1481 61.4675 0.3222
24 22.1300 19.9822 0.7557 18.4149 21.5494 14.9850 24.9794 2.1478
25 21.1500 20.9963 0.6176 19.7153 22.2772 16.0813 25.9112 0.1537
26 . 32.8892 1.0620 30.6867 35.0918 27.6579 38.1206 .
27 . 16.2357 0.9286 14.3099 18.1615 11.1147 21.3567 .

Sum of Residuals 0 Sum of Squared Residuals 115.17348 Predicted Residual SS (PRESS) 156.16295


Multicollinearity

Multicollinearity is a catch-all phrase referring to problems caused by the independent variables being correlated with each other. This can cause a number of problems:

1) Individual t-tests can be non-significant for important variables, and the sign of a coefficient can be flipped. Recall that the partial slopes measure the change in Y for a unit change in the corresponding X, holding the other X's constant. If two X's are highly correlated, this interpretation doesn't do much good.

2) The MSE can be inflated. Also, the standard errors of the partial slopes are inflated.

3) Removing one X from the model may make another more significant or less significant.


Variance Inflation Factor

The quantity VIF(Xj) = 1/(1 − Rj²), where Rj² is the R² from regressing Xj on the other X's, is called the variance inflation factor. The larger the value of VIF(Xj), the more severe the multicollinearity and the larger the standard error of β̂j due to having Xj in the model. A common rule of thumb is that multicollinearity is high if VIF(Xj) > 5; 10 has also been proposed as a cutoff value.
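Equivalently, the VIFs are the diagonal of the inverse of the predictor correlation matrix. As a sketch, building that matrix from the appraisal-data correlations reported by PROC CORR below (order: units, age, size, parking, area) reproduces the VIF column of the SAS output (7.52, 1.30, 3.10, 1.31, 4.61) to within rounding:

```python
import numpy as np

# VIF(Xj) = 1/(1 - Rj^2) = j-th diagonal element of inv(R), where R is the
# correlation matrix of the predictors only (the response is excluded).
R = np.array([
    [ 1.0,     -0.00982,  0.79583,  0.21290,  0.87622],
    [-0.00982,  1.0,     -0.18563, -0.36141,  0.03090],
    [ 0.79583, -0.18563,  1.0,      0.15151,  0.66741],
    [ 0.21290, -0.36141,  0.15151,  1.0,      0.07830],
    [ 0.87622,  0.03090,  0.66741,  0.07830,  1.0],
])

vifs = np.diag(np.linalg.inv(R))
print(dict(zip(['units', 'age', 'size', 'parking', 'area'], vifs.round(2))))

# Only units crosses the VIF > 5 rule of thumb.
assert vifs[0] > 5 and all(v < 5 for v in vifs[1:])
```

This identity is why highly correlated predictors inflate standard errors: se(β̂j) grows proportionally to √VIF(Xj).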

Mallows' Cp

Another diagnostic, often used alongside the multicollinearity checks when choosing among candidate models, is Mallows' Cp.

Assume we have a total of r variables. Suppose we fit a model with only p of the r variables. Let SSEp be the error sum of squares from the p-variable model and MSE the mean square error from the model with all r variables. Then

Cp = SSEp/MSE − (n − 2(p + 1))

We want Cp to be near p + 1 for a good model.
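This formula can be checked on the appraisal example that follows: the reduced model keeps p = 3 variables (units, age, area) with SSEp = 24,111,264,632, while the full five-variable model has MSE = 1,164,401,375, with n = 24:

```python
# Mallows' Cp for the reduced appraisal model, using the SSE and MSE
# reported in the SAS output later in this section.
sse_p, mse_full, n, p = 24111264632, 1164401375, 24, 3

cp = sse_p / mse_full - (n - 2 * (p + 1))

assert round(cp, 1) == 4.7   # close to p + 1 = 4, so the reduced model looks good
```

A Cp much larger than p + 1 would indicate that the dropped variables were carrying real signal.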


A Test for the Significance of a Group of Regressors (Partial F-Test)

Suppose that the full model has k regressors and we are interested in testing whether the last k − r of them can be deleted from the model. The smaller model is called the reduced model. That is, the full model is

Y = β0 + β1 x1 + β2 x2 + ⋯ + βr xr + β(r+1) x(r+1) + ⋯ + βk xk + ε

and the reduced model has β(r+1) = β(r+2) = ⋯ = βk = 0, so the reduced model is

Y = β0 + β1 x1 + β2 x2 + ⋯ + βr xr + ε

Then we test the hypotheses H0: β(r+1) = ⋯ = βk = 0 versus H1: at least one of these β's ≠ 0.

F = [ (SSE_R − SSE_F) / (k − r) ] / MSE_F

where SSE_R = SSE for the reduced model, SSE_F = SSE for the full model, and k − r = the number of β's in H0.

For a given α, we reject H0 if the partial F exceeds the tabled F with k − r numerator and n − k − 1 denominator degrees of freedom.

OPTIONS NOOVP NODATE NONUMBER LS=100;
DATA appraise;
  INPUT price units age size parking area cond$ @@;
  CARDS;
 90300  4 82  4635  0  4266 F   384000 20 13 17798  0 14391 G
157500  5 66  5913  0  6615 G   676200 26 64  7750  6 34144 E
165000  5 55  5150  0  6120 G   300000 10 65 12506  0 14552 G
108750  4 82  7160  0  3040 G   276538 11 23  5120  0  7881 G
420000 20 18 11745 20 12600 G   950000 62 71 21000  3 39448 G
560000 26 74 11221  0 30000 G   268000 13 56  7818 13  8088 F
290000  9 76  4900  0 11315 E   173200  6 21  5424  6  4461 G
323650 11 24 11834  8  9000 G   162500  5 19  5246  5  3828 G
353500 20 62 11223  2 13680 F   134400  4 70  5834  0  4680 E
187000  8 19  9075  0  7392 G    93600  4 82  6864  0  3840 F
110000  4 50  4510  0  3092 G   573200 14 10 11192  0 23704 E
 79300  4 82  7425  0  3876 F   272000  5 82  7500  0  9542 E
;
PROC CORR DATA=APPRAISE;
  VAR PRICE UNITS AGE SIZE PARKING AREA;
  TITLE 'CORRELATIONS OF VARIABLES IN MODEL';
ODS GRAPHICS ON;
PROC REG DATA=APPRAISE;
  MODEL PRICE=UNITS AGE SIZE PARKING AREA/R VIF;
  TITLE 'ALL VARIABLES IN MODEL';
PROC REG DATA=APPRAISE;
  MODEL PRICE=UNITS AGE AREA/R INFLUENCE;
  TITLE 'REDUCED MODEL';
RUN; QUIT;

Example


CORRELATIONS OF VARIABLES IN MODEL

The CORR Procedure
6 Variables: price units age size parking area

Simple Statistics

Variable    N        Mean      Std Dev          Sum     Minimum     Maximum
price      24      296193       214164      7108638       79300      950000
units      24    12.50000     12.73475    300.00000     4.00000    62.00000
age        24    52.75000     26.43655         1266    10.00000    82.00000
size       24        8702         4221       208843        4510       21000
parking    24     2.62500      5.01140     63.00000           0    20.00000
area       24       11648        10170       279555        3040       39448

Pearson Correlation Coefficients, N = 24
Prob > |r| under H0: Rho=0

            price     units       age      size   parking      area
price     1.00000   0.92207  -0.11118   0.73582   0.21385   0.96784
                     <.0001    0.6050    <.0001    0.3157    <.0001
units     0.92207   1.00000  -0.00982   0.79583   0.21290   0.87622
           <.0001              0.9637    <.0001    0.3179    <.0001
age      -0.11118  -0.00982   1.00000  -0.18563  -0.36141   0.03090
           0.6050    0.9637              0.3852    0.0827    0.8860
size      0.73582   0.79583  -0.18563   1.00000   0.15151   0.66741
           <.0001    <.0001    0.3852              0.4797    0.0004
parking   0.21385   0.21290  -0.36141   0.15151   1.00000   0.07830
           0.3157    0.3179    0.0827    0.4797              0.7161
area      0.96784   0.87622   0.03090   0.66741   0.07830   1.00000
           <.0001    <.0001    0.8860    0.0004    0.7161


ALL VARIABLES IN MODEL

The REG Procedure Model: MODEL1 Dependent Variable: price

Number of Observations Read 24 Number of Observations Used 24

Analysis of Variance

Source             DF   Sum of Squares    Mean Square   F Value   Pr > F
Model               5      1.033962E12    2.067924E11    177.60   <.0001
Error              18      20959224743     1164401375
Corrected Total    23      1.054921E12

Root MSE            34123    R-Square   0.9801
Dependent Mean     296193    Adj R-Sq   0.9746
Coeff Var        11.52063

Parameter Estimates

Variable    DF     Estimate      Std Error   t Value   Pr > |t|   Variance Inflation
Intercept    1        93629          29874      3.13     0.0057            0
units        1   4156.17223     1532.28739      2.71     0.0143      7.52119
age          1   -856.06670      306.65871     -2.79     0.0121      1.29821
size         1      0.88901        2.96966      0.30     0.7681      3.10362
parking      1   2675.62291     1626.23661      1.65     0.1173      1.31193
area         1     15.53982        1.50259     10.34     <.0001      4.61289


ALL VARIABLES IN MODEL

The REG Procedure Model: MODEL1 Dependent Variable: price

Output Statistics

Obs  Dependent  Predicted  Std Error      Residual   Std Error   Student                 Cook's
     Variable   Value      Mean Predict              Residual    Residual   -2-1 0 1 2   D
1 90300 110470 12281 -20170 31837 -0.634 | *| | 0.010
2 384000 405080 23185 -21080 25037 -0.842 | *| | 0.101
3 157500 165962 9178 -8462 32866 -0.257 | | | 0.001
4 676200 700437 25152 -24237 23061 -1.051 | **| | 0.219
5 165000 167009 10095 -2009 32596 -0.0616 | | | 0.000
6 300000 316800 17858 -16800 29077 -0.578 | *| | 0.021
7 108750 93663 13018 15087 31543 0.478 | | | 0.006
8 276538 246679 19376 29859 28088 1.063 | |** | 0.090
9 420000 421099 25938 -1099 22173 -0.0496 | | | 0.001
10 950000 930242 31527 19758 13057 1.513 | |*** | 2.225
11 560000 614511 17207 -54511 29467 -1.850 | ***| | 0.194
12 268000 267139 18075 860.6816 28943 0.0297 | | | 0.000
13 290000 246163 11851 43837 31999 1.370 | |** | 0.043
14 173200 190788 14200 -17588 31028 -0.567 | *| | 0.011
15 323650 290586 14788 33064 30752 1.075 | |** | 0.045
16 162500 175673 14612 -13173 30836 -0.427 | | | 0.007
17 353500 351590 10164 1910 32575 0.0586 | | | 0.000
18 134400 128242 9951 6158 32640 0.189 | | | 0.001
19 187000 233552 13949 -46552 31142 -1.495 | **| | 0.075
20 93600 105832 12433 -12232 31778 -0.385 | | | 0.004
21 110000 119509 12404 -9509 31789 -0.299 | | | 0.002
22 573200 521561 22525 51639 25632 2.015 | |**** | 0.522
23 79300 106890 12957 -27590 31567 -0.874 | *| | 0.021
24 272000 199161 13080 72839 31517 2.311 | |**** | 0.153

Sum of Residuals 0 Sum of Squared Residuals 20959224743 Predicted Residual SS (PRESS) 56380131094


Consider the full model containing all five regressors (units, age, size, parking, area).

98.01% of the variability in the Y's is explained by the relation to the X's. The adjusted R² is 0.9746, which is very close to the R² value; this indicates no serious problem with the number of independent variables.

However, there is possible multicollinearity among units, area, and size, since they have large correlations with one another. Age and parking have low correlations with price, so they may not be needed.


We have some evidence of multicollinearity, thus we must consider dropping some of the variables. Let's look at the individual tests of

H0: βi = 0 vs. Ha: βi ≠ 0,  i = 1, 2, …, 5.

These tests are summarized in the SAS output of PROC REG. Size is very non-significant (p-value=0.7681) and parking is also not significant (p-value=0.1173). There is evidence from the correlations that size is related to both units and area, so removing this variable might remove much of the multicollinearity. Parking just doesn’t seem to explain much variability in price.

Let's look at a 95% confidence interval for β4 (parking):

2675.62 ± (2.101)(1626.24) ≈ (−741.1, 6092.4)

Since this interval contains zero, parking is not significantly different from zero.
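The interval arithmetic above can be replayed directly from the parameter-estimates table (estimate 2675.62291, standard error 1626.23661, t(0.025, 18) = 2.101):

```python
# 95% CI for the parking coefficient: estimate +/- t * se.
est, t, se = 2675.62291, 2.101, 1626.23661

lower, upper = est - t * se, est + t * se

assert round(lower) == -741 and round(upper) == 6092
# The interval contains zero, consistent with the p-value 0.1173.
```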

Testing H0: β_size = β_parking = 0

The full model is

price = β0 + β1·units + β2·age + β3·size + β4·parking + β5·area + ε

The reduced model is

price = β0 + β1·units + β2·age + β5·area + ε

From the SAS output we have

F = [(SSE_R − SSE_F)/2] / MSE_F = [(24,111,264,632 − 20,959,224,743)/2] / 1,164,401,375 ≈ 1.35

which is well below F(0.05; 2, 18) ≈ 3.55, so there is no evidence to reject the null hypothesis.
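The partial F statistic can be recomputed from the two SAS runs (reduced-model SSE from the three-variable fit, full-model SSE and MSE from the five-variable fit):

```python
# Partial F-test for H0: beta_size = beta_parking = 0, with k - r = 2
# deleted regressors, using the SSEs printed in the SAS ANOVA tables.
sse_r, sse_f, mse_f, df = 24111264632, 20959224743, 1164401375, 2

f = ((sse_r - sse_f) / df) / mse_f

assert round(f, 2) == 1.35   # well below F(0.05; 2, 18), approx. 3.55
```

Because F is small, dropping size and parking together costs essentially no explanatory power.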

Interpreting the β̂'s

For the apartment appraisal problem, the β̂'s from the reduced model are

β̂0 = 114,857.4
β̂1 = 5,012.6 (units)
β̂2 = −1,054.8 (age)
β̂3 = 14.96 (area)

If one extra unit is added (all other factors held constant), the value of the complex will increase by $5,012.6. If the complex ages one more year, it will lose $1,054.8 in value (all other factors held constant). If the area is increased by one square foot, the value of the complex will increase by $14.96 (all other factors held constant).

Notice the potential for multicollinearity. If one more unit is added, the number of square feet would also increase. Thus the interdependency of some of the variables makes the β̂'s harder to interpret.


Notes on the Reduced Model

The root MSE has increased in the reduced model (34,721) vs. the full model (34,123), but the standard errors of the individual β̂'s have all decreased. This is another indication that there was multicollinearity in the full model. We will be able to do more accurate inference in this reduced model.

The R² and adjusted R² have decreased by only a small amount (0.9801 → 0.9771 and 0.9746 → 0.9737). This also justifies dropping the two variables.

All the individual β̂'s are significantly different from zero (all p-values small). This indicates that we probably cannot remove further variables without losing some information about the Y's.

Examining the Final Model

Some final checks on the model are:

1) Residuals
2) Studentized (standardized) residuals

The studentized residuals should be between −2 and 2 around 95% of the time. If an excessive number of them are greater than 2 in absolute value, or if any one studentized residual is much greater than 2, you should investigate closer.

3) Hat diagonals, the main diagonal elements of the hat matrix H = X(X′X)⁻¹X′

We have already seen that the hat matrix is important. The diagonal elements, as well as the eigenvalues of this matrix, contain much information. Each diagonal element corresponds to a particular observation. Look for diagonal values greater than 2p/n.

One More Diagnostic

4) DFBETAS

This diagnostic investigates the influence of each observation on the value of the parameters. The parameters are first fit with all observations; call the ith parameter estimate β̂i.

Next the parameters are estimated using all but the jth observation; call these estimates β̂i(j). The DFBETAS for the ith parameter and jth observation is calculated as the scaled difference between β̂i and β̂i(j).

You look for values of DFBETAS that are much larger than the other values; this indicates that the observation is too influential in determining the value of the parameters. A combined statistic can also be calculated which looks at all the parameters at once.


REDUCED MODEL (NO SIZE AND PARKING)

The REG Procedure Model: MODEL1 Dependent Variable: price

Number of Observations Read 24 Number of Observations Used 24

Analysis of Variance

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              3    1.03081E12        3.436033E11     285.01    <.0001
Error             20    24111264632       1205563232
Corrected Total   23    1.054921E12

Root MSE          34721       R-Square    0.9771
Dependent Mean    296193      Adj R-Sq    0.9737
Coeff Var         11.72249
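As a quick arithmetic check, the summary statistics above follow directly from the ANOVA table (a sketch in Python):

```python
import math

# Values taken from the ANOVA table above.
ss_model, ss_error = 1.03081e12, 24111264632
df_model, df_error = 3, 20
dep_mean = 296193

ss_total = ss_model + ss_error            # Corrected Total = 1.054921E12
r_square = ss_model / ss_total
adj_r_sq = 1 - (ss_error / df_error) / (ss_total / (df_model + df_error))
root_mse = math.sqrt(ss_error / df_error)
f_value = (ss_model / df_model) / (ss_error / df_error)
coeff_var = 100 * root_mse / dep_mean

print(round(r_square, 4))   # 0.9771
print(round(adj_r_sq, 4))   # 0.9737
print(round(root_mse))      # 34721
print(round(f_value, 2))    # 285.01
print(round(coeff_var, 2))  # 11.72
```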

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1          114857               17919           6.41      <.0001
units         1      5012.58292          1183.19286           4.24      0.0004
age           1     -1054.84586           274.79652          -3.84      0.0010
area          1        14.96564             1.48218          10.10      <.0001
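Each t value in the table is simply the parameter estimate divided by its standard error, which is easy to verify:

```python
# Quick check of the Parameter Estimates table: t = estimate / std error.
table = {
    "Intercept": (114857, 17919),
    "units": (5012.58292, 1183.19286),
    "age": (-1054.84586, 274.79652),
    "area": (14.96564, 1.48218),
}
t_values = {name: round(est / se, 2) for name, (est, se) in table.items()}
print(t_values)
# → {'Intercept': 6.41, 'units': 4.24, 'age': -3.84, 'area': 10.1}
```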


Page 58: 6-3 Multiple Regression 6-3.1 Estimation of Parameters in Multiple Regression.

REDUCED MODEL

The REG Procedure Model: MODEL1 Dependent Variable: price

Output Statistics

 Obs   Dep Variable   Predicted Value   Std Error Mean Predict    Residual   Std Error Residual   Student Residual   Cook's D

   1      90300         112254            12030                   -21954        32570                -0.674            0.015
   2     384000         416767            13928                   -32767        31805                -1.030            0.051
   3     157500         169298             9015                   -11798        33530                -0.352            0.002
   4     676200         688661            21982                   -12461        26877                -0.464            0.036
   5     165000         173494             8304                    -8494        33714                -0.252            0.001
   6     300000         314198            10357                   -14198        33141                -0.428            0.004
   7     108750          93906            12575                    14844        32364                 0.459            0.008
   8     276538         263679            11347                    12859        32815                 0.392            0.005
   9     420000         384689            13763                    35311        31877                 1.108            0.057
  10     950000         941108            31332                     8892        14961                 0.594            0.387
  11     560000         616095            17479                   -56095        30001                -1.870            0.297
  12     268000         241992             9249                    26008        33467                 0.777            0.012
  13     290000         249139            10066                    40861        33230                 1.230            0.035
  14     173200         189543            12261                   -16343        32484                -0.503            0.009
  15     323650         279370            10773                    44280        33008                 1.341            0.048
  16     162500         177167            12803                   -14667        32275                -0.454            0.008
  17     353500         354439             9992                -938.6096        33252               -0.0282            0.000
  18     134400         131108             9953                     3292        33264                0.0990            0.000
  19     187000         245542            11977                   -58542        32590                -1.796            0.109
  20      93600         105878            12192                   -12278        32510                -0.378            0.005
  21     110000         128439             9416                   -18439        33420                -0.552            0.006
  22     573200         529231            22052                    43969        26819                 1.639            0.454
  23      79300         106417            12177                   -27117        32516                -0.834            0.024
  24     272000         196225            12163                    75775        32521                 2.330            0.190


Page 59: 6-3 Multiple Regression 6-3.1 Estimation of Parameters in Multiple Regression.

REDUCED MODEL The REG Procedure Output Statistics

                                                         ------------------DFBETAS-----------------
 Obs   RStudent   Hat Diag H   Cov Ratio    DFFITS      Intercept     units       age        area

   1    -0.6646     0.1201       1.2727    -0.2455       0.0255     -0.0031    -0.1666     0.0567
   2    -1.0319     0.1609       1.1764    -0.4519      -0.3370     -0.1451     0.3432     0.0915
   3    -0.3440     0.0674       1.2842    -0.0925      -0.0163      0.0211    -0.0367    -0.0002
   4    -0.4544     0.4008       1.9623    -0.3716       0.1032      0.2203    -0.0267    -0.3226
   5    -0.2459     0.0572       1.2858    -0.0606      -0.0300      0.0120    -0.0045     0.0034
   6    -0.4195     0.0890       1.2989    -0.1311       0.0062      0.0820    -0.0353    -0.0838
   7     0.4494     0.1312       1.3546     0.1746      -0.0130      0.0243     0.1154    -0.0638
   8     0.3834     0.1068       1.3328     0.1326       0.1207      0.0292    -0.0918    -0.0392
   9     1.1144     0.1571       1.1307     0.4812       0.3411      0.2414    -0.3141    -0.1954
  10     0.5845     0.8143       6.1574     1.2240      -0.4218      0.8913     0.2392    -0.4129
  11    -2.0062     0.2534       0.7625    -1.1689       0.4828      0.4972    -0.3232    -0.8502
  12     0.7692     0.0710       1.1690     0.2126       0.0686      0.1215     0.0315    -0.1349
  13     1.2466     0.0840       0.9787     0.3776      -0.0751     -0.1207     0.2293     0.0981
  14    -0.4935     0.1247       1.3330    -0.1863      -0.1807     -0.0149     0.1282     0.0485
  15     1.3706     0.0963       0.9317     0.4473       0.4080      0.0441    -0.3203    -0.0715
  16    -0.4452     0.1360       1.3632    -0.1766      -0.1731     -0.0080     0.1242     0.0421
  17    -0.0275     0.0828       1.3384    -0.0083       0.0001     -0.0053    -0.0025     0.0041
  18     0.0965     0.0822       1.3350     0.0289       0.0037     -0.0018     0.0140    -0.0055
  19    -1.9118     0.1190       0.6894    -0.7026      -0.6733      0.0295     0.5377     0.0515
  20    -0.3694     0.1233       1.3609    -0.1385       0.0130     -0.0080    -0.0934     0.0388
  21    -0.5419     0.0735       1.2463    -0.1527      -0.0977     -0.0163     0.0079     0.0616
  22     1.7175     0.4034       1.1553     1.4122       0.5731     -0.9475    -0.8374     1.1063
  23    -0.8274     0.1230       1.2151    -0.3098       0.0292     -0.0168    -0.2089     0.0855
  24     2.6607     0.1227       0.3943     0.9951      -0.2172     -0.4517     0.6229     0.3274


Sum of Residuals                        0
Sum of Squared Residuals                24111264632
Predicted Residual SS (PRESS)           37937505741

The DFBETAS statistics are scaled measures of the change in each parameter estimate and are calculated by deleting the ith observation. In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch (1980) recommend 2 as a general cutoff value to indicate influential observations and 2/√n as a size-adjusted cutoff.

The DFFITS statistic is a scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space. A general cutoff to consider is 2; the size-adjusted cutoff recommended by Belsley, Kuh, and Welsch (1980) is 2√(p/n), where p is the number of parameters.
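DFFITS has a convenient closed form, DFFITS_i = RStudent_i · √(h_i/(1−h_i)), so it needs no refitting. A sketch on synthetic stand-in data (not the course data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 24, 4                              # k columns including intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # hat diagonals
s2 = e @ e / (n - k)                              # MSE

# Externally studentized (RStudent) residual uses the deleted-case
# variance: s2_(i) = (SSE - e_i^2/(1-h_i)) / (n - k - 1).
s2_del = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)
rstudent = e / np.sqrt(s2_del * (1 - h))

dffits = rstudent * np.sqrt(h / (1 - h))
cutoff = 2 * np.sqrt(k / n)               # size-adjusted cutoff
print("cutoff:", round(cutoff, 3))
print("influential obs:", np.where(np.abs(dffits) > cutoff)[0])
```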

Page 60: 6-3 Multiple Regression 6-3.1 Estimation of Parameters in Multiple Regression.

6-3 Multiple Regression

Page 61: 6-3 Multiple Regression 6-3.1 Estimation of Parameters in Multiple Regression.

6-3 Multiple Regression