Simple Linear Regression
Introduction
• In Chapters 17 to 19, we examine the relationship between interval variables via a mathematical equation.
• The motivation for using the technique:
  – Forecast the value of a dependent variable (Y) from the values of independent variables (X1, X2, …, Xk).
  – Analyze the specific relationships between the independent variables and the dependent variable.
The Model

[Plot: house cost vs. house size]
• Most lots sell for $25,000.
• Building a house costs about $75 per square foot.
• House cost = 25,000 + 75(Size)

The model has a deterministic component and a probabilistic component.
[Plot: house cost vs. house size, with the line House cost = 25,000 + 75(Size)]
• Most lots sell for $25,000.
• However, house costs vary even among houses of the same size!
• Since costs behave unpredictably, we add a random component to the model.
• The first-order linear model:

  Y = β₀ + β₁X + ε

  where
  Y = dependent variable
  X = independent variable
  β₀ = Y-intercept
  β₁ = slope of the line (= Rise/Run)
  ε = error variable

• β₀ and β₁ are unknown population parameters, and are therefore estimated from the data.
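The probabilistic component can be made concrete with a short simulation. This sketch uses the house-cost example above; the error standard deviation of $5,000 is an assumed value for illustration, not given in the text:

```python
import random

# First-order linear model: Y = beta0 + beta1 * X + epsilon
beta0 = 25_000.0   # lot price (from the example)
beta1 = 75.0       # building cost per square foot (from the example)
sigma = 5_000.0    # sd of the error variable -- an assumed value, for illustration

random.seed(1)

def simulate_cost(size_sqft):
    """Deterministic component plus a normally distributed random error."""
    epsilon = random.gauss(0.0, sigma)   # E(epsilon) = 0
    return beta0 + beta1 * size_sqft + epsilon

# Costs vary even among same-size houses, but average out to the line
costs = [simulate_cost(2_000) for _ in range(10_000)]
mean_cost = sum(costs) / len(costs)
print(round(mean_cost))  # close to 25000 + 75 * 2000 = 175000
```

The random component ε makes individual costs scatter around the line, while their mean follows the deterministic part β₀ + β₁X.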
Estimating the Coefficients

• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics,
  – producing a straight line that cuts into the data.
• Question: What should be considered a good line?

[Scatter plot of Y vs. X]
The Least Squares (Regression) Line
A good line is one that minimizes the sum of squared differences between the points and the line.
Let us compare two lines through the four points (1, 2), (2, 4), (3, 1.5), (4, 3.2). The second line is horizontal, at Y = 2.5.

For the first line:
Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89

For the horizontal line:
Sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
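These two sums can be checked in a few lines of Python. Note the arithmetic: the four squared differences for the first line total 7.89. The first line's predicted values (1, 2, 3, 4 at x = 1..4, i.e. the line Y = X) are inferred from the differences shown, not stated explicitly:

```python
# The four sample points from the slide
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(pts, predict):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

# First line: predicts 1, 2, 3, 4 at x = 1..4 (inferred to be Y = X)
sse_first = sum_sq_diff(points, lambda x: x)
# Second line: horizontal at Y = 2.5
sse_horizontal = sum_sq_diff(points, lambda x: 2.5)

print(round(sse_first, 2), round(sse_horizontal, 2))  # 7.89 3.99
```

The horizontal line has the smaller sum (3.99 < 7.89), so by this criterion it fits these four points better.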
The Estimated Coefficients

To calculate the estimates of the line coefficients that minimize the differences between the data points and the line, use the formulas:

  b₁ = cov(X, Y) / s_X² = s_XY / s_X²
  b₀ = Ȳ − b₁X̄

The regression equation that estimates the equation of the first-order linear model is:

  Ŷ = b₀ + b₁X
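A minimal Python implementation of these formulas, using the sample covariance and variance with the n − 1 divisor:

```python
from statistics import mean

def least_squares(xs, ys):
    """b1 = cov(X, Y) / s_X^2,  b0 = mean(Y) - b1 * mean(X)."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    b0 = ybar - b1 * xbar
    return b0, b1

# Sanity check on points that lie exactly on the line Y = 3 + 2X
b0, b1 = least_squares([0, 1, 2, 3], [3, 5, 7, 9])
print(b0, b1)  # approximately 3.0 and 2.0
```

When the points lie exactly on a line, the least squares estimates recover that line's intercept and slope.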
The Simple Linear Regression Line

• Example 17.2 (Xm17-02)
  – A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
  – A random sample of 100 cars is selected, and the data recorded.
  – Find the regression line.

Car  Odometer  Price
1    37388     14636
2    44758     14122
3    45833     14016
4    30862     15590
5    31705     15568
6    34010     14718
.    .         .

Independent variable: X = Odometer
Dependent variable: Y = Price
• Solution
  – Solving by hand: calculate a number of statistics, where n = 100:

  X̄ = 36,009.45;  Ȳ = 14,822.823

  s_X² = Σ(Xᵢ − X̄)² / (n − 1) = 43,528,690

  cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1) = −2,712,511

  – Then compute the coefficients:

  b₁ = cov(X, Y) / s_X² = −2,712,511 / 43,528,690 = −.06232
  b₀ = Ȳ − b₁X̄ = 14,822.82 − (−.06232)(36,009.45) = 17,067

  Ŷ = b₀ + b₁X = 17,067 − .0623X
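The hand calculation can be reproduced from the summary statistics alone:

```python
# Summary statistics from Example 17.2 (n = 100)
x_bar, y_bar = 36_009.45, 14_822.823
s2_x = 43_528_690        # sample variance of X
cov_xy = -2_712_511      # sample covariance of X and Y

# Least squares coefficients
b1 = cov_xy / s2_x
b0 = y_bar - b1 * x_bar

print(round(b1, 5), round(b0))  # -0.06232 and 17067
```

The negative slope reflects the negative covariance: higher odometer readings go with lower prices.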
• Solution – continued
  – Using the computer (Xm17-02):
    Tools > Data Analysis > Regression > [Shade the Y range and the X range] > OK
SUMMARY OUTPUT (Xm17-02)

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
            df   SS        MS        F       Significance F
Regression  1    16734111  16734111  182.11  0.0000
Residual    98   9005450   91892
Total       99   25739561

            Coefficients  Standard Error  t Stat  P-value
Intercept   17067         169             100.97  0.0000
Odometer    -0.0623       0.0046          -13.49  0.0000

Ŷ = 17,067 − .0623X
Interpreting the Linear Regression Equation

  Ŷ = 17,067 − .0623X

[Odometer line fit plot: Price vs. Odometer]

• The slope of the line is −.0623: for each additional mile on the odometer, the price decreases by an average of $0.0623.
• The intercept is b₀ = $17,067.
• Do not interpret the intercept as the "price of cars that have not been driven": there are no data near X = 0.
Error Variable: Required Conditions

• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is σ_ε for all values of X.
  – The errors associated with different values of Y are all independent.
The Normality of ε

From the first three assumptions we have: Y is normally distributed with mean E(Y) = β₀ + β₁X and a constant standard deviation σ_ε.

[Plot: normal curves centered at E(Y|X₁) = β₀ + β₁X₁, E(Y|X₂) = β₀ + β₁X₂, and E(Y|X₃) = β₀ + β₁X₃]

The standard deviation remains constant, but the mean value changes with X.
Assessing the Model

• The least squares method produces a regression line whether or not there is a linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits the data.
• Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.
Sum of Squares for Errors

– This is the sum of squared differences between the points and the regression line.
– It can serve as a measure of how well the line fits the data. SSE is defined by

  SSE = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

– A shortcut formula:

  SSE = (n − 1)[s_Y² − cov(X, Y)² / s_X²]
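A quick check that the shortcut formula matches the definition, on small made-up data (the xs/ys values below are hypothetical):

```python
from statistics import mean

# Hypothetical data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

xbar, ybar = mean(xs), mean(ys)
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s2_x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
s2_y = sum((y - ybar) ** 2 for y in ys) / (n - 1)

# Least squares line
b1 = cov_xy / s2_x
b0 = ybar - b1 * xbar

# Definition: sum of squared residuals about the regression line
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# Shortcut: SSE = (n - 1) * (s_Y^2 - cov(X, Y)^2 / s_X^2)
sse_shortcut = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)

print(abs(sse_direct - sse_shortcut) < 1e-9)  # the two formulas agree
```

The two expressions are algebraically identical for the least squares line, so either can be used.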
Standard Error of Estimate

– The mean error is equal to zero.
– If σ_ε is small, the errors tend to be close to zero (close to the mean error); then the model fits the data well.
– Therefore, we can use σ_ε as a measure of the suitability of using a linear model.
– An estimator of σ_ε is given by s_ε:

  s_ε = √(SSE / (n − 2))
• Example 17.3
  – Calculate the standard error of estimate for Example 17.2, and describe what it tells you about the model fit.
• Solution
  – Calculated before: s_Y² = Σ(Yᵢ − Ȳ)² / (n − 1) = 259,996

  SSE = (n − 1)[s_Y² − cov(X, Y)² / s_X²]
      = 99[259,996 − (−2,712,511)² / 43,528,690] = 9,005,450

  s_ε = √(SSE / (n − 2)) = √(9,005,450 / 98) = 303.13

  – It is hard to assess the model based on s_ε, even when compared with the mean value of Y:
    s_ε = 303.1,  Ȳ = 14,823
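These figures follow directly from the summary statistics of Example 17.2 (small discrepancies arise from the rounded covariance):

```python
from math import sqrt

# Summary statistics carried over from Example 17.2 (n = 100)
n = 100
s2_y = 259_996           # sample variance of Y
s2_x = 43_528_690        # sample variance of X
cov_xy = -2_712_511      # sample covariance of X and Y

# Shortcut formula: SSE = (n - 1) * (s_Y^2 - cov(X, Y)^2 / s_X^2)
sse = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)

# Standard error of estimate: s = sqrt(SSE / (n - 2))
s_eps = sqrt(sse / (n - 2))

print(round(sse), round(s_eps, 1))  # SSE near 9,005,450; s_eps near 303.1
```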
Testing the Slope

– When no linear relationship exists between two variables, the regression line should be horizontal.

• Linear relationship: different inputs (X) yield different outputs (Y). The slope is not equal to zero.
• No linear relationship: different inputs (X) yield the same output (Y). The slope is equal to zero.
• We can draw inferences about β₁ from b₁ by testing
  H₀: β₁ = 0
  H₁: β₁ ≠ 0 (or < 0, or > 0)
– The test statistic is

  t = (b₁ − β₁) / s_b₁

  where the standard error of b₁ is

  s_b₁ = s_ε / √((n − 1)s_X²)

– If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
• Example 17.4
  – Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 17.2. Use α = 5%.
• Solving by hand
  – To compute t we need the values of b₁ and s_b₁:

  b₁ = −.0623

  s_b₁ = s_ε / √((n − 1)s_X²) = 303.1 / √((99)(43,528,690)) = .00462

  t = (b₁ − β₁) / s_b₁ = (−.0623 − 0) / .00462 = −13.49

  – The rejection region is t > t.025 or t < −t.025 with d.f. = n − 2 = 98. Approximately, t.025 = 1.984.
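The t statistic for this example can be reproduced directly from the quantities above:

```python
from math import sqrt

# Values from Examples 17.2 and 17.4 (n = 100)
n = 100
b1 = -0.0623             # estimated slope
s_eps = 303.1            # standard error of estimate
s2_x = 43_528_690        # sample variance of X

# Standard error of b1: s_eps / sqrt((n - 1) * s_X^2)
s_b1 = s_eps / sqrt((n - 1) * s2_x)

# Test statistic for H0: beta1 = 0
t = (b1 - 0) / s_b1

print(round(s_b1, 5), round(t, 2))  # near 0.00462 and -13.49
```

Since |t| = 13.49 far exceeds t.025 ≈ 1.984, H₀ is rejected.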
• Using the computer (Xm17-02; output shown alongside the Price and Odometer data columns)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
            df   SS        MS        F       Significance F
Regression  1    16734111  16734111  182.11  0.0000
Residual    98   9005450   91892
Total       99   25739561

            Coefficients  Standard Error  t Stat  P-value
Intercept   17067         169             100.97  0.0000
Odometer    -0.0623       0.0046          -13.49  0.0000

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of Determination

– To measure the strength of the linear relationship we use the coefficient of determination:

  R² = cov(X, Y)² / (s_X² s_Y²)  (i.e., R² = r²_XY)

  or, equivalently,

  R² = 1 − SSE / Σ(Yᵢ − Ȳ)²  (see the Sum of Squares for Errors slide above)
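Both forms of R² are algebraically identical for the least squares line; a sketch on hypothetical data confirms it:

```python
from statistics import mean

# Hypothetical data for illustration
xs = [1, 2, 3, 4, 5, 6]
ys = [3.0, 4.8, 7.1, 9.2, 10.8, 13.1]
n = len(xs)

xbar, ybar = mean(xs), mean(ys)
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s2_x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
s2_y = sum((y - ybar) ** 2 for y in ys) / (n - 1)

# First form: squared correlation
r2_corr = cov_xy ** 2 / (s2_x * s2_y)

# Second form: 1 - SSE / total sum of squares
b1 = cov_xy / s2_x
b0 = ybar - b1 * xbar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
ss_total = sum((y - ybar) ** 2 for y in ys)
r2_sse = 1 - sse / ss_total

print(abs(r2_corr - r2_sse) < 1e-9)  # both forms give the same R^2
```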
• To understand the significance of this coefficient, note that the overall variability in Y is explained in part by the regression model; the part that remains unexplained is the error.
[Plot: two data points (X₁, Y₁) and (X₂, Y₂) of a certain sample, with the regression line and Ȳ]

Total variation in Y = Variation explained by the regression line + Unexplained variation (error).

For the two points shown:

  (Y₁ − Ȳ)² + (Y₂ − Ȳ)² = (Ŷ₁ − Ȳ)² + (Ŷ₂ − Ȳ)² + (Y₁ − Ŷ₁)² + (Y₂ − Ŷ₂)²

Variation in Y = SSR + SSE
• R² measures the proportion of the variation in Y that is explained by the variation in X:

  R² = 1 − SSE / Σ(Yᵢ − Ȳ)² = [Σ(Yᵢ − Ȳ)² − SSE] / Σ(Yᵢ − Ȳ)² = SSR / Σ(Yᵢ − Ȳ)²

• R² takes on any value between zero and one.
  R² = 1: perfect match between the line and the data points.
  R² = 0: there is no linear relationship between X and Y.
• Example 17.5
  – Find the coefficient of determination for Example 17.2; what does this statistic tell you about the model?
• Solution
  – Solving by hand:

  R² = [cov(X, Y)]² / (s_X² s_Y²) = [−2,712,511]² / [(43,528,690)(259,996)] = .6501
– Using the computer: from the regression output we have R Square = 0.6501.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
            df   SS        MS        F       Significance F
Regression  1    16734111  16734111  182.11  0.0000
Residual    98   9005450   91892
Total       99   25739561

            Coefficients  Standard Error  t Stat  P-value
Intercept   17067         169             100.97  0.0000
Odometer    -0.0623       0.0046          -13.49  0.0000

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.