Stat 112 Notes 6

• Today:
  – Chapter 4.1 (Introduction to Multiple Regression)

Multiple Regression

• In multiple regression analysis, we consider more than one explanatory variable, $X_1, \ldots, X_K$. We are interested in the conditional mean of Y given $X_1, \ldots, X_K$, written $E(Y \mid X_1, \ldots, X_K)$.

• Two motivations for multiple regression:
  – We can obtain better predictions of Y by using information on $X_1, \ldots, X_K$ rather than just $X_1$.
  – We can control for lurking variables.


Automobile Example

• A team charged with designing a new automobile is concerned about the gas mileage (gallons per 1000 miles on a highway) that can be achieved. The design team is interested in two things:

(1) Which characteristics of the design are likely to affect mileage?

(2) A new car is planned to have the following characteristics: weight – 4000 lbs, horsepower – 200, length – 200 inches, seating – 5 adults. Predict the new car’s gas mileage.

• The team has information available on gallons per 1000 miles and four design characteristics (weight, horsepower, length, seating) for a sample of cars made in 2004. The data are in car04.JMP.
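For readers who want to follow along outside JMP, here is a minimal loading sketch; it assumes the JMP table has been exported to a hypothetical CSV file car04.csv whose column names match those used below.

```python
import pandas as pd

# Hypothetical CSV export of car04.JMP; the file name and column
# names are assumptions, not part of the course materials.
cars = pd.read_csv("car04.csv")
cols = ["GP1000M_Hwy", "Weight(lb)", "Horsepower", "Length", "Seating"]

# Keep complete cases only, mirroring JMP's "rows not used" note.
cars = cars[cols].dropna()
print(cars.describe())
```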


Multivariate Correlations

              GP1000M_Hwy  Weight(lb)  Horsepower  Length  Seating
GP1000M_Hwy        1.0000      0.8575      0.6120  0.3912   0.3993
Weight(lb)         0.8575      1.0000      0.6434  0.7023   0.5858
Horsepower         0.6120      0.6434      1.0000  0.4910   0.0642
Length             0.3912      0.7023      0.4910  1.0000   0.6010
Seating            0.3993      0.5858      0.0642  0.6010   1.0000

20 rows not used due to missing or excluded values.

[Scatterplot matrix of GP1000M_Hwy, Weight(lb), Horsepower, Length, and Seating]
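Outside JMP, the same correlation matrix and scatterplot matrix can be sketched with pandas, continuing from the hypothetical `cars` DataFrame above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pairwise correlations, as in JMP's Multivariate report.
print(cars.corr().round(4))

# Pairwise scatterplots, analogous to JMP's scatterplot matrix.
pd.plotting.scatter_matrix(cars, figsize=(8, 8))
plt.show()
```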


Best Single Predictor

• To obtain the correlation matrix and pairwise scatterplots, click Analyze, Multivariate Methods, Multivariate.

• If we use simple linear regression with each of the four explanatory variables, which provides the best predictions?



• Answer: The simple linear regression with the highest R² gives the best predictions; recall that

  $R^2 = 1 - \frac{SSE}{SST}$

• Weight gives the best predictions of GP1000M_Hwy based on simple linear regression.

• But we can obtain better predictions by using more than one of the independent variables.
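As a sketch of this comparison in Python (continuing from the hypothetical `cars` DataFrame): for simple linear regression, R² equals the squared correlation between Y and the predictor, so ranking squared correlations ranks the four candidate regressions, and the identity R² = 1 − SSE/SST can be verified directly for the regression on weight.

```python
import numpy as np

# R^2 of each single-predictor regression = squared correlation with Y.
r2 = cars.corr()["GP1000M_Hwy"].drop("GP1000M_Hwy") ** 2
print(r2.sort_values(ascending=False))  # Weight(lb) first: 0.8575^2 ~ 0.735

# Verify R^2 = 1 - SSE/SST for the regression on Weight(lb).
y = cars["GP1000M_Hwy"].to_numpy()
x = cars["Weight(lb)"].to_numpy()
b1, b0 = np.polyfit(x, y, 1)             # least squares slope, intercept
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # sum of squared prediction errors
sst = np.sum((y - y.mean()) ** 2)        # total variation around the mean
print(1 - sse / sst)                     # matches the squared correlation
```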


Multiple Linear Regression Model

$E(Y \mid X_1, \ldots, X_K) = \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K$

$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK} + e_i$

For each possible value of $(X_1, \ldots, X_K)$, there is a subpopulation.

Assumptions of the multiple linear regression model:
(1) Linearity: the means of the subpopulations are a linear function of $(X_1, \ldots, X_K)$, i.e., $E(Y \mid X_1, \ldots, X_K) = \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K$ for some $(\beta_0, \ldots, \beta_K)$.
(2) Constant variance: the subpopulation standard deviations are all equal (to $\sigma_e$).
(3) Normality: the subpopulations are normally distributed.
(4) Independence: the observations are independent.
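To make the four assumptions concrete, here is a small simulation sketch with K = 2; all parameter values are made up for illustration. Each (X1, X2) combination defines a normal subpopulation whose mean is linear in (X1, X2) and whose standard deviation is the same σe everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
beta0, beta1, beta2, sigma_e = 10.0, 2.0, -1.5, 3.0  # made-up parameters

x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)

# Linearity: subpopulation means are beta0 + beta1*x1 + beta2*x2.
# Constant variance and normality: errors are N(0, sigma_e^2) everywhere.
# Independence: the n errors are drawn independently.
e = rng.normal(0, sigma_e, n)
y = beta0 + beta1 * x1 + beta2 * x2 + e
```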


Point Estimates for Multiple Linear Regression Model

• We use the same least squares procedure as for simple linear regression.

• Our estimates of $(\beta_0, \ldots, \beta_K)$ are the coefficients $(b_0, \ldots, b_K)$ that minimize the sum of squared prediction errors:

  $(b_0, \ldots, b_K) = \arg\min_{b_0^*, \ldots, b_K^*} \sum_{i=1}^{n} \left( y_i - b_0^* - b_1^* x_{i1} - \cdots - b_K^* x_{iK} \right)^2$

  The resulting prediction equation is $\hat{y} = b_0 + b_1 x_1 + \cdots + b_K x_K$.

• Least Squares in JMP: Click Analyze, Fit Model, put the dependent variable into Y and add the independent variables to the Construct Model Effects box.
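The same least squares fit can be sketched in Python with statsmodels, whose OLS routine minimizes exactly this sum of squared prediction errors (continuing from the hypothetical `cars` DataFrame):

```python
import statsmodels.api as sm

X = sm.add_constant(cars[["Weight(lb)", "Seating", "Horsepower", "Length"]])
y = cars["GP1000M_Hwy"]

fit = sm.OLS(y, X).fit()  # least squares estimates b0, ..., bK
print(fit.params)         # should match JMP's Parameter Estimates below
```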


Response GP1000M_Hwy

Summary of Fit
RSquare                      0.834148
RSquare Adj                  0.831091
Root Mean Square Error       3.082396
Mean of Response             39.75907
Observations (or Sum Wgts)   222

Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    42.198338   3.300533    12.79     <.0001
Weight(lb)    0.0102748  0.00052     19.77     <.0001
Seating       0.2748828  0.254288     1.08     0.2809
Horsepower    0.0189373  0.00524      3.61     0.0004
Length       -0.244818   0.02358    -10.38     <.0001


Root Mean Square Error

• Estimate of $\sigma_e$:

  $s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - K - 1}}$

• $s_e$ = Root Mean Square Error in JMP.

• For simple linear regression of GP1000M_Hwy on Weight, RMSE = 3.87. For multiple linear regression of GP1000M_Hwy on weight, horsepower, length, and seating, RMSE = 3.08.

• The multiple regression improves the predictions.
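A sketch of computing $s_e$ directly from the residuals of the `fit` object above; note the denominator n − K − 1 (here n = 222 and K = 4):

```python
import numpy as np

resid = fit.resid                        # y_i - y_hat_i for each car
n, K = len(resid), 4
rmse = np.sqrt(np.sum(resid ** 2) / (n - K - 1))
print(rmse)                              # ~3.08, JMP's Root Mean Square Error
```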


Residuals and Root Mean Square Errors

• Residual for observation i = prediction error for observation i:

  $e_i = Y_i - \hat{E}(Y \mid X_1 = x_{i1}, \ldots, X_K = x_{iK}) = Y_i - (b_0 + b_1 x_{i1} + \cdots + b_K x_{iK})$

  where the predicted value is $\hat{E}(Y \mid X_1 = x_1, \ldots, X_K = x_K) = b_0 + b_1 x_1 + \cdots + b_K x_K$.

• Root mean square error = typical size of the absolute value of the prediction error.

• As with the simple linear regression model, if the multiple linear regression model holds:
  – About 95% of the observations will be within two RMSEs of their predicted value.
  – For the car data, about 95% of the time, the actual GP1000M will be within 2 × 3.08 = 6.16 GP1000M of the predicted GP1000M of the car based on the car's weight, horsepower, length, and seating.
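A quick empirical check of the "about 95% within two RMSEs" rule, continuing from `fit` and `rmse` above:

```python
# Fraction of cars whose actual GP1000M_Hwy lies within two RMSEs of the
# value predicted from weight, horsepower, length, and seating.
within = (fit.resid.abs() <= 2 * rmse).mean()
print(within)  # roughly 0.95 if the multiple regression model holds
```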


Residual Example: BMW 745i

Weight = 4376, Seating = 5, Horsepower = 325, Length = 198

$\hat{E}(Y \mid X_1, \ldots, X_4) = 42.19 + 0.01027 \times 4376 + 0.2749 \times 5 + 0.0189 \times 325 - 0.2448 \times 198 = 46.22$

Actual Y (GP1000M) for the BMW 745i = 38.46
Residual = 38.46 − 46.22 = −7.76

The BMW is more fuel efficient (lower GP1000M) than we would expect based on its weight, seating, horsepower, and length.

The residuals and predicted values can be saved by clicking the red triangle next to Response after Fit Model, then clicking Save Columns and clicking Predicted Values and Residuals.
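The same arithmetic as a short Python sketch, plugging the BMW 745i's characteristics into the full-precision coefficients from the Parameter Estimates table:

```python
# Coefficients from the Parameter Estimates table.
b0, b_wt, b_seat, b_hp, b_len = 42.198338, 0.0102748, 0.2748828, 0.0189373, -0.244818

# BMW 745i: weight 4376, seating 5, horsepower 325, length 198.
pred = b0 + b_wt * 4376 + b_seat * 5 + b_hp * 325 + b_len * 198
resid = 38.46 - pred                      # actual minus predicted
print(round(pred, 2), round(resid, 2))    # 46.22 and -7.76
```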


Interpretation of Regression Coefficients

• Gas mileage regression from car04.JMP

Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    42.198338   3.300533    12.79     <.0001
Weight(lb)    0.0102748  0.00052     19.77     <.0001
Seating       0.2748828  0.254288     1.08     0.2809
Horsepower    0.0189373  0.00524      3.61     0.0004
Length       -0.244818   0.02358    -10.38     <.0001

Interpretation of the coefficient $b_{\text{weight}} = 0.0103$: the mean of GP1000M_Hwy is estimated to increase by 0.0103 for each one-pound increase in weight, holding seating, horsepower, and length fixed.

In general, the coefficient on $X_1$ is the difference in the mean of Y when $X_1$ increases by one unit and the other explanatory variables are held fixed:

$E(Y \mid X_1 = x_1 + 1, X_2 = x_2, \ldots, X_K = x_K) - E(Y \mid X_1 = x_1, \ldots, X_K = x_K)$
$= (\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \cdots + \beta_K x_K) - (\beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K) = \beta_1$
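A sketch making this interpretation concrete: using the fitted equation (a hypothetical helper function, not course code), a one-pound increase in weight with seating, horsepower, and length held fixed changes the prediction by exactly the weight coefficient.

```python
def predict(weight, seating, horsepower, length):
    # Fitted prediction equation from the Parameter Estimates table.
    return (42.198338 + 0.0102748 * weight + 0.2748828 * seating
            + 0.0189373 * horsepower - 0.244818 * length)

# One-pound increase in weight, everything else held fixed:
diff = predict(4001, 5, 200, 200) - predict(4000, 5, 200, 200)
print(diff)  # 0.0102748 (up to floating-point rounding) = b_weight
```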