mitchell-french
Chapter 10, Part 2
Linear Regression
Predictions with Scatterplots
• Last Time: A scatterplot gives a picture of the relationship between two quantitative variables.
• One variable is explanatory, and the other is the response.
• Today: If we know the value of the explanatory variable, can we predict the value of the response variable?
The Regression Line
• To make predictions, we’ll find a straight line that is the “best fit” for the points in the scatterplot. This is not so simple…
[Figure: scatterplot of Exam 2 scores (vertical axis, 40 to 100) against Exam 1 scores (horizontal axis, 20 to 110)]
Regression Line in JMP
• Start by making a scatterplot.
• Red Triangle menu -> “Fit Line.”
• The equation of the regression line appears under the “Linear Fit” group.
• JMP uses column headings as variable names (instead of x and y).
• Example from the Cars 1993 file: MaxPrice = 2.3139014 + 1.1435971*MinPrice
Predicted Values
• We use the equation of the regression line to make predictions about:
– Individuals not in the original data set.
– Later measurements of the same individuals.
• Example: In 1994, a vehicle had a Min. Price of $15,000. Use the previous data to predict the Max. Price.
• You can do this by hand from the equation: MaxPrice = 2.3139014 + 1.1435971*MinPrice
• Prices are in thousands of dollars, so MinPrice = 15: 2.3139014 + 1.1435971*(15) = 19.4678579, a predicted Max. Price of about $19,468.
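The by-hand prediction can also be checked with a few lines of code. This is only a sketch: the coefficients are copied from the JMP output quoted above, the function name `predict_max_price` is made up for illustration, and prices are assumed to be in thousands of dollars (as the by-hand calculation uses 15 for $15,000).

```python
# Coefficients copied from the regression equation quoted from JMP (Cars 1993 file).
INTERCEPT = 2.3139014
SLOPE = 1.1435971


def predict_max_price(min_price_thousands):
    """Predicted Max. Price (in thousands of dollars) for a given Min. Price.

    Hypothetical helper for illustration; it just evaluates the regression line.
    """
    return INTERCEPT + SLOPE * min_price_thousands


# A vehicle with a Min. Price of $15,000 is entered as 15 (thousands of dollars).
prediction = predict_max_price(15)
print(f"Predicted Max. Price: {prediction:.7f} thousand dollars")
```

The printed value should agree with the by-hand result, 19.4678579.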
Are the Predictions Useful?
• The regression line is more useful for predicting values in some cases than in others. Consider the following examples (from Cars 1993):
Coefficient of Determination
• If the scatterplot is well-approximated by a straight line, the regression equation is more useful for making predictions.
• Correlation is one measure of this.
• The square of the correlation has a more intuitive meaning: What proportion of variation in the Response Variable is explained by variation in the Explanatory Variable?
• In JMP: “RSquare” appears under “Summary of Fit.”
Coefficient of Determination
• In predicting Max. Price from Min. Price, we had RSquare = 0.822202.
• About 82% of the variation in Max. Price is explained by variation in Min. Price.
• In predicting Highway MPG from Engine Size, we have RSquare = 0.392871.
• Only about 39% of the variation in Highway MPG is explained by variation in Engine Size.
Coefficient of Determination
• RSquare takes values from 0 to 1.
• For values close to 0, the regression line is not very useful for predictions.
• For values close to 1, the regression line is more useful for making predictions.
• RSquare makes no distinction between positive and negative association of variables.
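To see why RSquare ignores the direction of the association, here is a small sketch that computes the correlation r and its square for two made-up data sets, one with a positive association and one with the same pattern flipped to negative. The data and the helper name `correlation` are invented for illustration and are not from the Cars 1993 file.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: ys_up rises with xs, ys_down is the same pattern turned upside down.
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 5]
ys_down = [-y for y in ys_up]

r_up = correlation(xs, ys_up)
r_down = correlation(xs, ys_down)

print(f"positive association: r = {r_up:.4f}, r squared = {r_up ** 2:.4f}")
print(f"negative association: r = {r_down:.4f}, r squared = {r_down ** 2:.4f}")
```

The two correlations have opposite signs, but squaring removes the sign, so both data sets give the same RSquare.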
Residuals
• For each individual in the data set we can compute the difference (error) between the actual and predicted values of the response variable. This difference is called a residual:
Residual = (actual value) – (predicted value)
• In JMP: Click the red triangle by “Linear Fit” and select “Save Residuals” from the drop-down menu. You can also “Plot Residuals.”
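Outside JMP, the same calculation is a one-line subtraction per individual. The sketch below uses five made-up points and the line y = 2.2 + 0.6x, which is the least-squares line for those points; none of these numbers come from the course data.

```python
# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line for the made-up points above.
INTERCEPT = 2.2
SLOPE = 0.6

# Predicted value of the response for each individual.
predicted = [INTERCEPT + SLOPE * x for x in xs]

# Residual = (actual value) - (predicted value)
residuals = [actual - pred for actual, pred in zip(ys, predicted)]

for x, actual, pred, res in zip(xs, ys, predicted, residuals):
    print(f"x = {x}: actual = {actual}, predicted = {pred:.1f}, residual = {res:+.1f}")
```

A positive residual means the point lies above the line; a negative residual means it lies below.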
How does JMP find the Regression Line?
• JMP uses the most popular method, Ordinary Least Squares (OLS).
• To measure how well a given line fits the data:
– Compute all residuals and square each one.
– Add up the results to get a “total error.”
• The closer this total is to zero, the better the line fits the data. Choose the line with the smallest “total error.”
• (Thankfully) JMP takes care of the details.
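For the curious, the “total error” idea can be sketched in a few lines. The code below uses made-up data and the standard closed-form formulas for the least-squares slope and intercept; it is an illustration of the method, not a description of JMP's internal code, and the function names are invented.

```python
def total_squared_error(intercept, slope, xs, ys):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Intercept and slope of the line with the smallest total squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best_intercept, best_slope = least_squares_line(xs, ys)
best_error = total_squared_error(best_intercept, best_slope, xs, ys)

# Any other line, such as this arbitrary guess, has a larger total error.
guess_error = total_squared_error(2.0, 0.7, xs, ys)

print(f"least-squares line: y = {best_intercept:.2f} + {best_slope:.2f} x")
print(f"total error of least-squares line: {best_error:.2f}")
print(f"total error of the guess y = 2.0 + 0.7 x: {guess_error:.2f}")
```

Trying other guesses for the intercept and slope should never produce a total error below that of the least-squares line.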
Limitations of Correlation and Linear Regression:
• Both describe linear relationships only.
• Both are sensitive to outliers.
• Beware of extrapolation: predicting outside of the given range of the explanatory variable.
• Beware of lurking variables: other factors that may explain a strong correlation.
• Correlation does not imply causality!
Beware Extrapolation!
• A child’s height was plotted against her age...
• Can you predict her height at age 8 (96 months)?
• Can you predict her height at age 30 (360 months)?
[Figure: scatterplot of height (cm, vertical axis from 80 to 100) against age (months, horizontal axis from 30 to 65)]
Beware Extrapolation!
• Regression line: y = 71.95 + 0.383x
• Height at 96 months? y = 108.72 cm (about 3' 7'')
• Height at 360 months? y = 209.8 cm (6' 10'')
• Height at birth (x = 0)? y = 71.95 cm (2' 4'')
[Figure: the regression line extended over ages from 30 to 390 months, with height (cm) on the vertical axis from 70 to 210]
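The extrapolated heights can be reproduced with a short sketch that also flags any age outside the observed range. The coefficients come from the regression line on this slide; the observed range of 30 to 65 months is an assumption read off the horizontal axis of the original scatterplot, and the function names are made up for illustration.

```python
# Regression line from the slide: height (cm) = 71.95 + 0.383 * age (months).
INTERCEPT = 71.95
SLOPE = 0.383

# Assumption: ages in the data fall within the scatterplot's axis range.
OBSERVED_AGE_RANGE = (30, 65)


def predict_height(age_months):
    """Height in cm predicted by the regression line."""
    return INTERCEPT + SLOPE * age_months


def is_extrapolation(age_months):
    """True if the age lies outside the range of the observed data."""
    low, high = OBSERVED_AGE_RANGE
    return age_months < low or age_months > high


for age in (48, 96, 360, 0):
    note = "extrapolation, do not trust" if is_extrapolation(age) else "within observed range"
    print(f"age {age:3d} months: predicted height {predict_height(age):6.2f} cm ({note})")
```

The line gives an answer for any age, which is exactly the danger: the equation cannot tell you when it has stopped describing the child's growth.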
Beware Lurking Variables!
• Although there may be a strong correlation (statistical relationship) between two variables, there might not be a direct practical (cause-and-effect) relationship.
• A lurking variable is a third variable (not in the scatterplot) that might cause the apparent relationship between explanatory and response variables.
Example: Pizza vs. Subway Fare
The regression line to the right shows a strong correlation (0.9878) between the cost of:
• A slice of pizza
• Subway fare
Q: Does the price of pizza affect the price of the subway?
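A small simulation shows how a lurking variable can produce a strong correlation on its own. Everything here is hypothetical: the data are simulated, not the real pizza and subway prices, and the “general price level” that rises each year is just one plausible lurking variable, not something stated in the slides.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical lurking variable: a general price level that rises every year.
price_level = [1.0 + 0.05 * year for year in range(30)]

# Each simulated price follows the price level plus its own independent noise.
# Neither price is ever computed from the other.
pizza = [level + rng.gauss(0, 0.05) for level in price_level]
subway = [level + rng.gauss(0, 0.05) for level in price_level]

r = correlation(pizza, subway)
print(f"correlation between simulated pizza and subway prices: {r:.3f}")
```

In this simulation the two prices are strongly correlated even though, by construction, neither one affects the other; both simply follow the third variable.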
Caution: Correlation Does Not Imply Causation
• In a study of emergency services, it was noted that larger fires tend to have more firefighters present.
• Suppose we used:
– Explanatory Variable: Number of firefighters
– Response Variable: Size of the fire
• We would expect a strong correlation.
• But it’s ludicrous to conclude that having more firefighters present causes the fire to be larger.