Multiple Regression

Statistics: Unlocking the Power of Data STAT 101 Dr. Kari Lock Morgan Multiple Regression SECTIONS 9.2, 10.1, 10.2 Multiple explanatory variables (10.1) Partitioning variability – $R^2$, ANOVA (9.2) Conditions – residual plot (10.2)

Transcript of Multiple Regression

Page 1: Multiple Regression

Statistics: Unlocking the Power of Data Lock5

STAT 101
Dr. Kari Lock Morgan

Multiple Regression

SECTIONS 9.2, 10.1, 10.2
• Multiple explanatory variables (10.1)
• Partitioning variability – $R^2$, ANOVA (9.2)
• Conditions – residual plot (10.2)

Page 2: Multiple Regression

Exam 2 Grades: In-Class

Page 3: Multiple Regression

Exam 2 Re-grades

Re-grade requests due in writing by class on Monday, 4/15/14

Partial credit will not be altered – only submit a re-grade request if you think you have entirely the correct answer but got points off

Grades may go up or down

If points were added up incorrectly, just bring your exam to your TA (no need for an official re-grade)

Page 4: Multiple Regression

• Today we’ll finally learn a way to handle more than 2 variables!

More than 2 variables!

Page 5: Multiple Regression

• Multiple regression extends simple linear regression to include multiple explanatory variables:

Multiple Regression

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon_i$

Page 6: Multiple Regression

• We’ll use your current grades to predict final exam scores, based on a model from previous 101 students

• Response: final exam score

• Explanatory: hw average, clicker average, exam 1, exam 2

Grade on Final

$y = \beta_0 + \beta_1 \, \text{hw} + \beta_2 \, \text{clicker} + \beta_3 \, \text{exam1} + \beta_4 \, \text{exam2} + \epsilon$

Page 7: Multiple Regression

What variable is the most significant predictor of final exam score?

a) Homework average
b) Clicker average
c) Exam 1
d) Exam 2

Grade on Final

Page 8: Multiple Regression

Inference for Coefficients

The p-value for explanatory variable $x_i$ is associated with the hypotheses

$H_0: \beta_i = 0$
$H_a: \beta_i \neq 0$

For intervals and p-values of coefficients in multiple regression, use a t-distribution with degrees of freedom n – k – 1, where k is the number of explanatory variables included in the model
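As a rough sketch of the arithmetic behind this test, the snippet below computes the t statistic (estimate divided by its standard error) and the degrees of freedom n – k – 1. The function name, the estimate, and the standard error are hypothetical; n = 69 and k = 5 are chosen only because they give the 63 error degrees of freedom that appear in the ANOVA table later in these slides.

```python
def coefficient_t_stat(estimate, std_error, n, k):
    """Return (t statistic, degrees of freedom) for testing H0: beta_i = 0.

    n is the sample size and k is the number of explanatory variables.
    """
    df = n - k - 1
    if df <= 0:
        raise ValueError("need more observations than coefficients (n > k + 1)")
    return estimate / std_error, df


# Hypothetical estimate and standard error, for illustration only.
t_stat, df = coefficient_t_stat(estimate=0.5, std_error=0.2, n=69, k=5)
print(t_stat, df)
```

The p-value would then come from a t-distribution with `df` degrees of freedom, which statistical software reports directly.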

Page 9: Multiple Regression

Estimate your score on the final exam.

What type of interval do you want for this estimate?

a) Confidence interval
b) Prediction interval

Grade on Final

Page 10: Multiple Regression

Estimate your score on the final exam.

(For this data, hw average was out of 10 and clicker average was out of 2.)

Grade on Final
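A prediction from a multiple regression model is just the intercept plus each coefficient times the matching explanatory value. The sketch below shows that calculation; the coefficients are hypothetical placeholders (the fitted values from the regression output are not reproduced in this transcript), and the function name `predict_final` is illustrative.

```python
# Hypothetical coefficients, for illustration only. Replace them with the
# estimates from the actual regression output.
COEFFICIENTS = {
    "intercept": 30.0,
    "hw": 1.5,        # hw average is on a 0-10 scale
    "clicker": 2.0,   # clicker average is on a 0-2 scale
    "exam1": 0.25,
    "exam2": 0.25,
}


def predict_final(hw, clicker, exam1, exam2, coef=COEFFICIENTS):
    """Return b0 + b1*hw + b2*clicker + b3*exam1 + b4*exam2."""
    return (coef["intercept"]
            + coef["hw"] * hw
            + coef["clicker"] * clicker
            + coef["exam1"] * exam1
            + coef["exam2"] * exam2)


# A made-up student: 9/10 on homework, 1.5/2 on clickers, 80 and 88 on exams.
predicted = predict_final(hw=9.0, clicker=1.5, exam1=80.0, exam2=88.0)
print(predicted)  # 88.5 with the hypothetical coefficients above
```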

Page 11: Multiple Regression

Is the clicker coefficient really negative?!?

Grade on Final

Page 12: Multiple Regression

Is your score on exam 2 really not a significant predictor of your final exam score?!?

Grade on Final

Page 13: Multiple Regression

• The coefficient (and significance) for each explanatory variable depend on the other variables in the model!

Coefficients

Page 14: Multiple Regression

If you take Exam 1 out of the model…

Grade on Final

Model with Exam 1:

Now Exam 2 is significant!

Page 15: Multiple Regression

Multiple Regression

• The coefficient for each explanatory variable is the predicted change in y for one unit change in x, given the other explanatory variables in the model!

• The p-value for each coefficient indicates whether it is a significant predictor of y, given the other explanatory variables in the model!

• If explanatory variables are associated with each other, coefficients and p-values will change depending on what else is included in the model
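To see how a coefficient can change with what else is in the model, here is a small sketch on made-up data. Two exam scores are positively associated, and the response is built exactly from both. The data, the function names, and the closed-form two-predictor solution (from the centered normal equations) are all assumptions for illustration, not the course's data or method.

```python
from statistics import mean


def centered_sum(u, v):
    """Sum of products of deviations of u and v from their means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def slope_alone(x, y):
    """Slope from the simple regression of y on x."""
    return centered_sum(x, y) / centered_sum(x, x)


def slopes_together(x1, x2, y):
    """Slopes for y ~ x1 + x2, solved from the centered normal equations."""
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Made-up scores: students who do well on exam 1 tend to do well on exam 2.
exam1 = [60, 70, 75, 80, 90, 95]
exam2 = [65, 68, 80, 78, 88, 97]
# Made-up response, built exactly as 10 + 0.5*exam1 + 0.3*exam2 (no noise).
final = [10 + 0.5 * a + 0.3 * b for a, b in zip(exam1, exam2)]

b_exam2_alone = slope_alone(exam2, final)
b_exam1, b_exam2 = slopes_together(exam1, exam2, final)
print(round(b_exam2_alone, 3), round(b_exam1, 3), round(b_exam2, 3))
```

With exam 1 left out, the exam 2 slope absorbs part of exam 1's effect and is larger than 0.3; with both in the model, the slopes recover 0.5 and 0.3.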

Page 16: Multiple Regression

If you include Project 1 in the model…

Grade on Final

Model without Project 1:

Page 17: Multiple Regression

Grades

Page 18: Multiple Regression

Evaluating a Model

• How do we evaluate the success of a model?

• How do we determine the overall significance of a model?

• How do we choose between two competing models?

Page 19: Multiple Regression

Variability

• One way to evaluate a model is to partition variability

• A good model “explains” a lot of the variability in Y

Total Variability = Variability Explained by the Model + Error Variability

Page 20: Multiple Regression

Exam Scores

• Without knowing the explanatory variables, we can say that a person’s final exam score will probably be between 60 and 98 (the range of Y)

• Knowing hw average, clicker average, exam 1 and 2 grades, and project 1 grades, we can give a narrower prediction interval for final exam score

• We say that some of the variability in y is explained by the explanatory variables

• How do we quantify this?

Page 21: Multiple Regression

Variability

How do we quantify variability in Y?

a) Standard deviation of Y
b) Sum of squared deviations from the mean of Y
c) (a) or (b)
d) None of the above

Page 22: Multiple Regression

Sums of Squares

Total Variability = Variability Explained by the Model + Error Variability

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

SST = SSM + SSE

Page 23: Multiple Regression

Variability

Total Sum of Squares: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Model Sum of Squares: $SSM = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

Error Sum of Squares: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

• If SSM is much higher than SSE, then the model explains a lot of the variability in Y
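The three sums of squares can be computed directly from the observed and predicted values. The sketch below does this for a tiny made-up data set with one explanatory variable and a least-squares line; the data and variable names are illustrative only. It also checks that the total splits into the model and error pieces.

```python
from statistics import mean

# Made-up data with a single explanatory variable, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares line.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * a for a in x]

# Partition of variability.
sst = sum((b - y_bar) ** 2 for b in y)              # total
ssm = sum((p - y_bar) ** 2 for p in y_hat)          # explained by the model
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))   # error

print(round(sst, 3), round(ssm, 3), round(sse, 3))  # 6.0 3.6 2.4
```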

Page 24: Multiple Regression

$R^2$

• $R^2$ is the proportion of the variability in Y that is explained by the model

$R^2 = \frac{SSM}{SST} = \frac{\text{Variability in Y explained by the model}}{\text{Total variability in Y}}$

Page 25: Multiple Regression

$R^2$

• For simple linear regression, $R^2$ is just the squared correlation between X and Y

• For multiple regression, $R^2$ is the squared correlation between the actual values and the predicted values
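These equivalent ways of getting R-squared can be checked numerically. The sketch below uses a small made-up data set with one explanatory variable and computes R-squared three ways: as SSM/SST, as the squared correlation between x and y, and as the squared correlation between actual and predicted values. The data and the helper `correlation` are illustrative assumptions.

```python
from math import sqrt
from statistics import mean


def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / sqrt(suu * svv)


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares line and predicted values.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
y_hat = [y_bar + slope * (a - x_bar) for a in x]

r2_from_sums = (sum((p - y_bar) ** 2 for p in y_hat)
                / sum((b - y_bar) ** 2 for b in y))   # SSM / SST
r2_from_xy = correlation(x, y) ** 2
r2_from_predictions = correlation(y, y_hat) ** 2

print(round(r2_from_sums, 3), round(r2_from_xy, 3), round(r2_from_predictions, 3))
```

In multiple regression only the first and third calculations apply, since there is no single x to correlate with y.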

Page 26: Multiple Regression

$R^2$

$R^2 = 0.67$

$R^2 = 0.09$

Page 27: Multiple Regression

Final Exam Grade

Page 28: Multiple Regression

Is the model significant?

• If we want to test whether the model is significant (whether the model helps to predict y), we can test the hypotheses:

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a: \text{At least one } \beta_i \neq 0$

• We do this with ANOVA!

Page 29: Multiple Regression

ANOVA for Regression

k: number of explanatory variables
n: sample size

Source   df       Sum of Squares   Mean Square           F         p-value
Model    k        SSM              MSM = SSM/k           MSM/MSE   Use F(k, n-k-1)
Error    n-k-1    SSE              MSE = SSE/(n-k-1)
Total    n-1      SST

Page 30: Multiple Regression

ANOVA for Regression

Source   df   Sum of Squares   Mean Square   F       p-value
Model    5    3125.8           625.16        20.71   0
Error    63   1901.4           30.18
Total    68   5027.2
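The entries in this table can be checked with a few lines of arithmetic. The sketch below starts from the sums of squares and degrees of freedom on the slide and recomputes the mean squares, the F statistic, and the proportion of variability explained. Variable names are illustrative, and the p-value is not recomputed because that needs an F-distribution routine outside the Python standard library.

```python
# Values taken from the ANOVA table on this slide.
ssm = 3125.8       # model sum of squares
sse = 1901.4       # error sum of squares
df_model = 5       # k, the number of explanatory variables
df_error = 63      # n - k - 1

sst = ssm + sse                 # total sum of squares
n = df_model + df_error + 1     # sample size implied by the degrees of freedom

msm = ssm / df_model            # mean square for the model
mse = sse / df_error            # mean square error
f_stat = msm / mse              # F statistic, compared to F(k, n-k-1)
r_squared = ssm / sst           # proportion of variability explained

print(n, round(sst, 1))                               # 69 5027.2
print(round(msm, 2), round(mse, 2), round(f_stat, 2)) # 625.16 30.18 20.71
print(round(r_squared, 3))                            # 0.622
```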

Page 31: Multiple Regression

Final Exam Grade

Page 32: Multiple Regression

Simple Linear Regression

• For simple linear regression, the following tests will all give equivalent p-values:

• t-test for non-zero correlation

• t-test for non-zero slope

• ANOVA for regression

Page 33: Multiple Regression

Mean Square Error (MSE)

• Mean square error (MSE) measures the average variability in the errors (residuals)

• The square root of MSE gives the standard deviation of the residuals (giving a typical distance of points from the line)

• This number is also given in the R output as the residual standard error, and is known as s in the textbook
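As a quick check of the square-root relationship, the sketch below takes the error sum of squares (1901.4) and error degrees of freedom (63) from the ANOVA table earlier in these slides and computes the residual standard error. Variable names are illustrative.

```python
from math import sqrt

sse = 1901.4      # error sum of squares from the ANOVA table
df_error = 63     # n - k - 1 from the ANOVA table

mse = sse / df_error
residual_standard_error = sqrt(mse)

print(round(residual_standard_error, 3))  # 5.494
```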

Page 34: Multiple Regression

Final Exam Grade

Page 35: Multiple Regression

Simple Linear Model

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

$\epsilon_i \sim N(0, \sigma_\epsilon)$

Residual standard error = $\sqrt{MSE}$ = $s_\epsilon$ estimates the standard deviation of the residuals (the spread of the normal distributions around the predicted values)

Page 36: Multiple Regression

Residual Standard Error

• Use the fact that the residual standard error is 5.494 and your predicted final exam score to compute an approximate 95% prediction interval for your final exam score

• NOTE: This calculation only takes into account errors around the line, not uncertainty in the line itself, so your true prediction interval will be slightly wider
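One way to carry out this calculation is the usual rule of thumb of two standard errors on either side of the prediction. The sketch below assumes that multiplier of 2 and a hypothetical predicted score of 85; only the residual standard error of 5.494 comes from the slide, and the function name is illustrative.

```python
def approx_prediction_interval(predicted, residual_se, multiplier=2.0):
    """Rough 95% interval: prediction plus or minus 2 residual standard errors.

    This ignores uncertainty in the fitted coefficients, so the true
    prediction interval is slightly wider.
    """
    margin = multiplier * residual_se
    return predicted - margin, predicted + margin


# 5.494 is from the slide; 85.0 is a hypothetical predicted final exam score.
lower, upper = approx_prediction_interval(predicted=85.0, residual_se=5.494)
print(round(lower, 3), round(upper, 3))  # 74.012 95.988
```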

Page 37: Multiple Regression

Revisiting Conditions

• For simple linear regression, we learned that the following should hold for inferences to be valid:
  • Linearity
  • Constant variability of the residuals
  • Normality of the residuals

• How do we assess the first two conditions in multiple regression, when we can no longer visualize with a scatterplot?

Page 38: Multiple Regression

Residual Plot

• A residual plot is a scatterplot of the residuals against the predicted responses

Should have:
1) No obvious pattern
2) Constant variability
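The points in a residual plot are pairs of (predicted value, residual). The sketch below builds those pairs for a small made-up data set with one explanatory variable; the same pairs would be used with any number of explanatory variables, since only the predicted values are needed. The data and names are illustrative, and the plotting itself is left to whatever software is at hand.

```python
from statistics import mean

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares line.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * a for a in x]
residuals = [actual - fit for actual, fit in zip(y, predicted)]

# Each pair is one point in the residual plot:
# predicted value on the horizontal axis, residual on the vertical axis.
for fit, res in zip(predicted, residuals):
    print(f"predicted = {fit:.1f}, residual = {res:+.1f}")
```

Least-squares residuals always sum to zero when the model has an intercept, so the points scatter around the horizontal line at 0; the conditions concern whether that scatter shows a pattern or a changing spread.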

Page 39: Multiple Regression

Residual Plots

[Two example residual plots: one with an obvious pattern, one with variability not constant]

Page 40: Multiple Regression

Final Exam Score

Are the conditions satisfied?

(a) Yes
(b) No

Page 41: Multiple Regression

Conditions

• What if the conditions for inference aren’t met???

• Option 1 (best option): Take STAT 210 and learn more about modeling!

• Option 2: Try a transformation…

Page 42: Multiple Regression

Transformations

• If the conditions are not satisfied, there are some common transformations you can apply to the response variable

• You can take any function of y and use it as the response, but the most common are
  • $\log(y)$ (natural logarithm, ln)
  • $\sqrt{y}$ (square root)
  • $y^2$ (squared)
  • $e^y$ (exponential)
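The four common transformations are each one function call away. The sketch below applies them to a short made-up list of response values; the dictionary keys and variable names are illustrative. Note that log and square root only work for positive (or, for the square root, non-negative) responses.

```python
import math

# Made-up positive response values, for illustration only.
y = [1.0, 4.0, 9.0, 16.0]

transformed = {
    "log(y)": [math.log(v) for v in y],    # natural logarithm
    "sqrt(y)": [math.sqrt(v) for v in y],
    "y^2": [v ** 2 for v in y],
    "e^y": [math.exp(v) for v in y],
}

for name, values in transformed.items():
    print(name, [round(v, 3) for v in values])
```

The transformed list would then replace y as the response when refitting the model, and the residual plot would be checked again.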

Page 43: Multiple Regression

$\log(y)$

Original Response, y:

Logged Response, $\log(y)$:

Page 44: Multiple Regression

$\sqrt{y}$

Original Response, y:

Square root of Response, $\sqrt{y}$:

Page 45: Multiple Regression

$y^2$

Original Response, y:

Squared response, $y^2$:

Page 46: Multiple Regression

$e^y$

Original Response, y:

Exponentiated Response, $e^y$:

Page 47: Multiple Regression

Transformations

Interpretation becomes a bit more complicated if you transform the response – it should only be done if it clearly helps the conditions to be met

If you transform the response, be careful when interpreting coefficients and predictions

The slope will now have different meaning, and predictions and confidence/prediction intervals will be for the transformed response

Page 48: Multiple Regression

Transformations

• You do NOT need to know which transformation would be appropriate for given data on the final, but they may help if conditions are not met for Project 2 or for future data you may want to analyze

Page 49: Multiple Regression

• How do we decide which explanatory variables to include in the model?

• How do we use categorical explanatory variables?

• What if the coefficient of one explanatory variable depends on the value of another explanatory variable?

To Come…

Page 50: Multiple Regression

• Project done in your lab groups – one project per group

• 10 page (max) paper: due Wednesday, 4/23

• Choose one quantitative variable and answer questions about it and its relationship with other variables

• Use multiple regression and anything else we’ve learned in the course

• Project 2 Details here

Project 2

Page 51: Multiple Regression

• Data on college students:
  • Sleep data from a 2-week sleep diary
  • Gender
  • Class year
  • Early riser, night owl, or neither?
  • Early classes?
  • Missed classes
  • Score on a test of cognitive skills
  • GPA
  • Alcohol consumption
  • Depression, anxiety, stress, happiness

Project 2 Data

Page 52: Multiple Regression

To Do

Read 9.2, 10.1, 10.2

Do HW 8 (due Wednesday, 4/16)

Do Project 2 (due Wednesday, 4/23)