Stat 112 Notes 6
• Today: Chapter 4.1 (Introduction to Multiple Regression)
Multiple Regression
• In multiple regression analysis, we consider more than one explanatory variable, X1, …, XK. We are interested in the conditional mean of Y given X1, …, XK, E(Y | X1, …, XK).
• Two motivations for multiple regression:
– We can obtain better predictions of Y by using information on X1, …, XK rather than just X1.
– We can control for lurking variables.
Automobile Example
• A team charged with designing a new automobile is concerned about the gas mileage (gallons per 1000 miles on a highway) that can be achieved. The design team is interested in two things:
(1) Which characteristics of the design are likely to affect mileage?
(2) A new car is planned to have the following characteristics: weight 4000 lbs, horsepower 200, length 200 inches, seating 5 adults. Predict the new car's gas mileage.
• The team has available information about gallons per 1000 miles and four design characteristics (weight, horsepower, length, seating) for a sample of cars made in 2004. Data is in car04.JMP.
Multivariate Correlations

              GP1000M_Hwy  Weight(lb)  Horsepower  Length  Seating
GP1000M_Hwy        1.0000      0.8575      0.6120  0.3912   0.3993
Weight(lb)         0.8575      1.0000      0.6434  0.7023   0.5858
Horsepower         0.6120      0.6434      1.0000  0.4910   0.0642
Length             0.3912      0.7023      0.4910  1.0000   0.6010
Seating            0.3993      0.5858      0.0642  0.6010   1.0000

20 rows not used due to missing or excluded values or frequency or weight variables missing, negative or less than one.

Scatterplot Matrix
[Figure: pairwise scatterplots of GP1000M_Hwy, Weight(lb), Horsepower, Length, and Seating]
Best Single Predictor
• To obtain the correlation matrix and pairwise scatterplots, click Analyze, Multivariate Methods, Multivariate.
• If we use simple linear regression with each of the four explanatory variables, which provides the best predictions?
Best Single Predictor
• Answer: The simple linear regression that has the highest R² gives the best predictions, because R² = 1 - SSE/SST measures the proportion of the variability in Y explained by the regression.
• Weight gives the best predictions of GP1000M_Hwy based on simple linear regression.
• But we can obtain better predictions by using more than one of the independent variables.
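For simple linear regression, R² equals the squared correlation between Y and the single predictor, so the ranking can be read straight off the correlation matrix above. A minimal Python sketch (the correlations are copied from the Multivariate output; the variable names are just for illustration):

```python
# For simple linear regression, R^2 = r^2, the squared correlation
# between Y (GP1000M_Hwy) and the single predictor.
# Correlations are taken from the Multivariate Correlations table.
correlations = {
    "Weight(lb)": 0.8575,
    "Horsepower": 0.6120,
    "Length": 0.3912,
    "Seating": 0.3993,
}

r_squared = {x: r ** 2 for x, r in correlations.items()}
best = max(r_squared, key=r_squared.get)
print(best, round(r_squared[best], 4))  # Weight(lb) 0.7353
```

Weight's R² of about 0.735 is well above the others, matching the slide's conclusion.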
Multiple Linear Regression Model

For each possible value of (X1, …, XK), there is a subpopulation with mean
E(Y | X1, …, XK) = β0 + β1X1 + … + βKXK
and each observation follows
Yi = β0 + β1xi1 + … + βKxiK + ei.

Assumptions of the multiple linear regression model:
(1) Linearity: the means of the subpopulations are a linear function of (X1, …, XK), i.e., E(Y | X1, …, XK) = β0 + β1X1 + … + βKXK for some (β0, …, βK).
(2) Constant variance: the subpopulation standard deviations are all equal (to σe).
(3) Normality: The subpopulations are normally distributed.
(4) Independence: The observations are independent.
Point Estimates for Multiple Linear Regression Model
• We use the same least squares procedure as for simple linear regression.
• Our estimates of β0, …, βK are the coefficients b0, …, bK that minimize the sum of squared prediction errors:
(b0, …, bK) = argmin over (b0*, …, bK*) of Σi=1..n ( yi - (b0* + b1*xi1 + … + bK*xiK) )²
• The predicted value is ŷ = b0 + b1x1 + … + bKxK.
• Least Squares in JMP: Click Analyze, Fit Model, put dependent variable into Y and add independent variables to the construct model effects box.
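To make the minimization concrete, here is a minimal sketch of the least squares computation on toy data (not the car04 data), solving the normal equations (X'X)b = X'y with Gaussian elimination. JMP's Fit Model performs the same minimization internally; the function name and toy data are just for illustration:

```python
def least_squares(X, y):
    """Return coefficients (b0, ..., bK) minimizing the sum of squared
    prediction errors. X is a list of rows of predictor values; an
    intercept column is prepended automatically."""
    rows = [[1.0] + list(r) for r in X]          # prepend intercept column
    p = len(rows[0])
    # Build the normal equations A b = c, where A = X'X and c = X'y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for i in range(p):
        pivot = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[pivot] = A[pivot], A[i]
        c[i], c[pivot] = c[pivot], c[i]
        for k in range(i + 1, p):
            f = A[k][i] / A[i][i]
            for j in range(i, p):
                A[k][j] -= f * A[i][j]
            c[k] -= f * c[i]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, p))) / A[i][i]
    return b

# Toy data generated exactly from y = 2 + 3*x1 - 1*x2, so the least
# squares fit recovers those coefficients.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]
b = least_squares(X, y)
print([round(v, 6) for v in b])  # [2.0, 3.0, -1.0]
```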
Response GP1000M_Hwy

Summary of Fit
RSquare                      0.834148
RSquare Adj                  0.831091
Root Mean Square Error       3.082396
Mean of Response             39.75907
Observations (or Sum Wgts)   222

Parameter Estimates
Term        Estimate    Std Error  t Ratio  Prob>|t|
Intercept   42.198338   3.300533   12.79    <.0001
Weight(lb)  0.0102748   0.00052    19.77    <.0001
Seating     0.2748828   0.254288   1.08     0.2809
Horsepower  0.0189373   0.00524    3.61     0.0004
Length      -0.244818   0.02358    -10.38   <.0001
Root Mean Square Error
• Estimate of σe:
se = sqrt( Σi=1..n (yi - ŷi)² / (n - K - 1) )
• se = Root Mean Square Error in JMP
• For simple linear regression of GP1000M_Hwy on Weight, RMSE = 3.87. For multiple linear regression of GP1000M_Hwy on weight, horsepower, length, and seating, RMSE = 3.08.
• The multiple regression improves the predictions.
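The RMSE formula can be sketched in Python with made-up residuals (not the car04 data); note the denominator n - K - 1, which subtracts one degree of freedom for each of the K + 1 estimated coefficients:

```python
import math

def rmse(y, y_hat, K):
    """Root mean square error s_e = sqrt(SSE / (n - K - 1)),
    where K is the number of explanatory variables."""
    n = len(y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return math.sqrt(sse / (n - K - 1))

# Hypothetical actual and predicted values for a K = 4 regression.
y     = [40.0, 35.0, 50.0, 44.0, 38.0, 47.0, 41.0, 36.0]
y_hat = [41.0, 34.0, 49.0, 46.0, 37.0, 48.0, 40.0, 37.0]
print(round(rmse(y, y_hat, K=4), 3))  # 1.915
```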
Residuals and Root Mean Square Errors
• Predicted value for observation i:
ŷi = Ê(Y | X1 = xi1, …, XK = xiK) = b0 + b1xi1 + … + bKxiK
• Residual for observation i = prediction error for observation i = yi - ŷi
• Root mean square error = typical size of absolute value of prediction error.
• As with the simple linear regression model, if the multiple linear regression model holds:
– About 95% of the observations will be within two RMSEs of their predicted value.
• For the car data, about 95% of the time, the actual GP1000M will be within 2*3.08 = 6.16 GP1000M of the predicted GP1000M of the car based on the car's weight, horsepower, length and seating.
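A small sketch of the two-RMSE rule of thumb, using the multiple regression RMSE of 3.08 from the Fit Model output and a hypothetical predicted GP1000M value (the function name is just for illustration):

```python
# About 95% of observations fall within two RMSEs of their predicted value.
RMSE = 3.08  # from the multiple regression Summary of Fit

def rough_95_interval(predicted):
    """Rough 95% range for an actual value around its prediction."""
    return (predicted - 2 * RMSE, predicted + 2 * RMSE)

# Hypothetical predicted GP1000M of 40.0 gallons per 1000 miles.
lo, hi = rough_95_interval(40.0)
print(round(lo, 2), round(hi, 2))  # 33.84 46.16
```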
Residual Example: BMW 745i
Weight = 4376, Seating = 5, Horsepower = 325, Length = 198
ŷ = Ê(Y | X1, …, X4) = 42.19 + .01027*4376 + .2749*5 + .0189*325 - .2448*198 = 46.22
Actual Y (GP1000M) for BMW 745i = 38.46
Residual = 38.46 - 46.22 = -7.76
The BMW is more fuel efficient (lower GP1000M) than we would expect based on its weight, seating, horsepower and length.
The residuals and predicted values can be saved by clicking the red triangle next to Response after Fit Model, then clicking Save Columns and clicking Predicted Values and Residuals.
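The BMW 745i arithmetic can be rechecked with the full-precision coefficients from the Parameter Estimates table; a short Python sketch:

```python
# Coefficients from the Parameter Estimates table (full precision).
b = {"Intercept": 42.198338, "Weight(lb)": 0.0102748,
     "Seating": 0.2748828, "Horsepower": 0.0189373, "Length": -0.244818}
# BMW 745i design characteristics from the slide.
bmw = {"Weight(lb)": 4376, "Seating": 5, "Horsepower": 325, "Length": 198}

predicted = b["Intercept"] + sum(b[k] * v for k, v in bmw.items())
residual = 38.46 - predicted  # actual GP1000M_Hwy for the BMW 745i
print(round(predicted, 2), round(residual, 2))  # 46.22 -7.76
```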
Interpretation of Regression Coefficients
• Gas mileage regression from car04.JMP
Parameter Estimates
Term        Estimate    Std Error  t Ratio  Prob>|t|
Intercept   42.198338   3.300533   12.79    <.0001
Weight(lb)  0.0102748   0.00052    19.77    <.0001
Seating     0.2748828   0.254288   1.08     0.2809
Horsepower  0.0189373   0.00524    3.61     0.0004
Length      -0.244818   0.02358    -10.38   <.0001

Interpretation of coefficient bWeight = 0.0103: The mean of GP1000M_Hwy is estimated to increase by 0.0103 for a one-pound increase in weight, holding seating, horsepower and length fixed.
In general, the coefficient β1 is the change in the mean of Y when X1 increases by one unit, holding the other variables fixed:
E(Y | X1 = x1 + 1, X2 = x2, …, XK = xK) - E(Y | X1 = x1, …, XK = xK)
= (β0 + β1(x1 + 1) + β2x2 + … + βKxK) - (β0 + β1x1 + β2x2 + … + βKxK)
= β1
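This interpretation can be checked numerically with the fitted car04 coefficients: raising weight by one pound while holding the other variables fixed changes the predicted mean by exactly the weight coefficient. A sketch, using the planned new car's characteristics from the Automobile Example slide:

```python
# Fitted coefficients from the Parameter Estimates table.
b = {"Intercept": 42.198338, "Weight(lb)": 0.0102748,
     "Seating": 0.2748828, "Horsepower": 0.0189373, "Length": -0.244818}

def predict(x):
    """Predicted GP1000M_Hwy: b0 + b1*x1 + ... + bK*xK."""
    return b["Intercept"] + sum(b[k] * v for k, v in x.items())

# The planned new car: weight 4000 lbs, horsepower 200, length 200, seating 5.
car = {"Weight(lb)": 4000, "Seating": 5, "Horsepower": 200, "Length": 200}
heavier = dict(car, **{"Weight(lb)": 4001})  # one pound heavier, rest fixed

# The difference in predictions equals the weight coefficient.
print(round(predict(heavier) - predict(car), 7))  # 0.0102748
```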