Regression
• For the purposes of this class:
– Does Y depend on X?
– Does a change in X cause a change in Y?
– Can Y be predicted from X?
• Y = mX + b
[Scatterplot: Dependent Value vs. Independent Value, showing the actual values, the predicted values along the fitted line, and the overall mean]
When analyzing a regression-type data set, the first step is to plot the data:
 X     Y
35    114
45    120
55    150
65    140
75    166
–––––––––
55    138   (averages)
The next step is to determine the line that ‘best fits’ these points. It appears this line would be sloped upward and linear (straight).
[Scatterplot of the data: Dependent Value (Y) vs. Independent Value (X)]
The line of best fit is the sample regression of Y on X, and its position is fixed by two results:
1) The regression line passes through the point (Xavg, Yavg).
2) Its slope is at the rate of “m” units of Y per unit of X, where m is the regression coefficient (the slope in Y = mX + b).
[Scatterplot with the fitted regression line passing through (55, 138)]
Y = 1.24(X) + 69.8
slope = 1.24 (rise/run); Y-intercept = 69.8
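The slope and intercept above can be reproduced directly from the data with the standard least-squares formulas, m = Sxy / Sxx and b = Yavg – m·Xavg. A minimal sketch in Python (the course itself uses SAS):

```python
# Least-squares fit of the example data set.
X = [35, 45, 55, 65, 75]
Y = [114, 120, 150, 140, 166]

x_avg = sum(X) / len(X)  # 55
y_avg = sum(Y) / len(Y)  # 138

# Sums of cross-products and squares around the averages
Sxy = sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
Sxx = sum((x - x_avg) ** 2 for x in X)

m = Sxy / Sxx            # slope (regression coefficient)
b = y_avg - m * x_avg    # intercept; the line passes through (55, 138)

print(round(m, 2), round(b, 1))  # 1.24 69.8
```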
Testing the Regression Line for Significance
• An F-test is used, based on the Model, Error, and Total SOS (sums of squares).
– Very similar to ANOVA
• Basically, we are testing whether the regression line has a significantly different slope than a line formed by using just Y_avg.
– If there is no difference, then Y does not change as X changes (it stays around the average value).
• To begin, we must first find the regression line that has the smallest Error SOS.
[Scatterplot: candidate regression lines pivoting about the overall average point (55, 138)]
Error SOS
The regression line should pass through the overall average with the slope that gives the smallest Error SOS. (Error SOS is based on the distance between each point and the predicted line: it gives an index of the variability of the data points around the predicted line.) The overall average is the pivot point.
For each X, we can predict Y: Y = 1.24(X) + 69.8
 X   Y_Actual   Y_Pred   SOS_Error
35     114      113.2       0.64
45     120      125.6      31.36
55     150      138.0     144.00
65     140      150.4     108.16
75     166      162.8      10.24
                Total:    294.40
Error SOS is calculated as the sum of (Y_Actual – Y_Pred)².
This gives us an index of how scattered the actual observations are around the predicted line: the more scattered the points, the larger the Error SOS. This is like analysis of variance, except we are measuring deviations from the predicted line instead of from the mean value.
Total SOS
• Calculated as the sum of (Y – Y_avg)²
• Gives us an index of how scattered our data set is around the overall Y average.
[Scatterplot showing each point's distance from the overall Y average (regression line not shown)]
 X   Y_Actual   Y_Avg   SOS_Total
35     114       138       576
45     120       138       324
55     150       138       144
65     140       138         4
75     166       138       784
                Total:    1832
Total SOS gives us an index of how scattered the data points are around the overall average. It is calculated the same way as for a single treatment in ANOVA.
What happens to Total SOS when all of the points are close to the overall average? What happens when the points form a non-horizontal linear trend?
Model SOS
• Calculated as the sum of (Y_Pred – Y_avg)²
• Gives us an index of how far all of the predicted values are from the overall average.
[Scatterplot showing the distance between each predicted Y and the overall mean]
Model SOS
• Gives us an index of how far away the predicted values are from the overall average value.
• What happens to Model SOS when all of the predicted values are close to the average value?
 X   Y_Pred   Y_Avg   SOS_Model
35   113.2     138      615.04
45   125.6     138      153.76
55   138.0     138        0.00
65   150.4     138      153.76
75   162.8     138      615.04
              Total:   1537.60
All Together Now!!
 X   Y_Actual   Y_Pred   SOS_Error   Y_Avg   SOS_Total   SOS_Model
35     114      113.2       0.64      138       576        615.04
45     120      125.6      31.36      138       324        153.76
55     150      138.0     144.00      138       144          0.00
65     140      150.4     108.16      138         4        153.76
75     166      162.8      10.24      138       784        615.04
                Totals:   294.40               1832       1537.60
SOS_Error = sum of (Y_Actual – Y_Pred)²
SOS_Total = sum of (Y_Actual – Y_Avg)²
SOS_Model = sum of (Y_Pred – Y_Avg)²
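The three sums of squares in the table can be computed directly from the data and the fitted line. A sketch in Python (the course itself uses SAS):

```python
# Error, Total, and Model SOS for the example data set,
# using the fitted line Y = 1.24*X + 69.8.
X = [35, 45, 55, 65, 75]
Y = [114, 120, 150, 140, 166]
y_avg = sum(Y) / len(Y)                    # 138
y_pred = [1.24 * x + 69.8 for x in X]      # predicted values

sos_error = sum((ya - yp) ** 2 for ya, yp in zip(Y, y_pred))
sos_total = sum((ya - y_avg) ** 2 for ya in Y)
sos_model = sum((yp - y_avg) ** 2 for yp in y_pred)

print(round(sos_error, 1))   # 294.4
print(round(sos_total, 1))   # 1832.0
print(round(sos_model, 1))   # 1537.6
```

Note that Model SOS + Error SOS = Total SOS (1537.6 + 294.4 = 1832): the total scatter around the average splits into the part explained by the line and the part left over.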
Using SOS to Assess Regression Line
• Model SOS gives us an index of how ‘different’ the predicted values are from the average values.
– Bigger Model SOS = more different
– Tells us how different a sloped line is from a line made up only of Y_avg.
– Remember, the regression line will pass through the overall average point.
• Error SOS gives us an index of how different the predicted values are from the actual values.
– More variability = larger Error SOS = larger distances between predicted and actual values
Magic of the F-test
• The ratio of Mean Model SOS to Mean Error SOS gives us an overall index (the F statistic) used to indicate the relative ‘difference’ between the regression line and a line with a slope of zero (all values = Y_avg).
– A large Model SOS and a small Error SOS = a large F statistic. Why does this indicate a significant difference?
– A small Model SOS and a large Error SOS = a small F statistic. Why does this indicate no significant difference?
• Based on sample size and alpha level, each F statistic has an associated P-value.
– P < 0.05 (large F statistic): there is a significant difference between the regression line and the Y_avg line.
– P ≥ 0.05 (small F statistic): there is NO significant difference between the regression line and the Y_avg line.
[Two scatterplots illustrating the components of the F ratio]
F = Mean Model SOS / Mean Error SOS
Basically, this is an index that tells us how different the regression line is from Y_avg, relative to the scatter of the data around the predicted values.
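The F statistic can be computed by hand from the sums of squares. With one predictor, the Model has 1 degree of freedom and the Error has n – 2 = 3 for these five points. A sketch in Python:

```python
# F statistic from the sums of squares computed earlier.
sos_model, sos_error = 1537.6, 294.4
df_model, df_error = 1, 3            # one predictor, n - 2 error df

mean_sq_model = sos_model / df_model  # Mean Model SOS
mean_sq_error = sos_error / df_error  # Mean Error SOS (about 98.13)
F = mean_sq_model / mean_sq_error

print(round(F, 2))  # 15.67, matching the SAS output
```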
SAS Code for Regression

data production;
  input X Y;
  cards;
35 114
45 120
55 150
65 140
75 166
;
proc print;
run;
proc reg;        /* Tells SAS to do the regression procedure */
  model Y = X;   /* Y is the dependent value, X is the independent value */
run;
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 1537.60000 1537.60000 15.67 0.0288
Error 3 294.40000 98.13333
Corrected Total 4 1832.00000
Root MSE 9.90623 R-Square 0.8393
Dependent Mean 138.00000 Adj R-Sq 0.7857
Coeff Var 7.17843
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 69.80000 17.78988 3.92 0.0295
X 1 1.24000 0.31326 3.96 0.0288
Y = [mX + b] = [(1.24)(X) + 69.8]
Correlation (r): another measure of the mutual linear relationship between two variables.
• ‘r’ is a pure number without units or dimensions.
• ‘r’ is always between –1 and 1.
• Positive values indicate that Y increases when X does; negative values indicate that Y decreases when X increases.
– What does r = 0 mean?
• ‘r’ is a measure of the intensity of association observed between X and Y.
– ‘r’ does not predict – it only describes the association between variables.
[Three scatterplots illustrating r > 0 (upward trend), r < 0 (downward trend), and r = 0 (no trend)]
r is also called Pearson’s correlation coefficient.
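Pearson’s r can be computed from the same sums used for the slope, via the standard formula r = Sxy / √(Sxx·Syy). A sketch in Python (SAS’s PROC CORR reports the same value):

```python
# Pearson's correlation coefficient for the example data set.
from math import sqrt

X = [35, 45, 55, 65, 75]
Y = [114, 120, 150, 140, 166]
x_avg, y_avg = sum(X) / len(X), sum(Y) / len(Y)

Sxy = sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
Sxx = sum((x - x_avg) ** 2 for x in X)
Syy = sum((y - y_avg) ** 2 for y in Y)

r = Sxy / sqrt(Sxx * Syy)
print(round(r, 5))  # 0.91613, as in the SAS output
```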
SAS Code for Correlation
proc corr;   /* Tells SAS to do the correlation procedure */
  var y x;   /* Tells SAS to determine the correlation between these variables */
run;
The CORR Procedure
2 Variables: Y X

Simple Statistics
Variable   N       Mean     Std Dev         Sum     Minimum     Maximum
Y          5  138.00000    21.40093   690.00000   114.00000   166.00000
X          5   55.00000    15.81139   275.00000    35.00000    75.00000

Pearson Correlation Coefficients, N = 5
Prob > |r| under H0: Rho=0

          Y         X
Y   1.00000   0.91613
              0.0288
X   0.91613   1.00000
    0.0288
The correlation is significant (P = 0.0288) and high (r = 0.91613).
R-square
• If we square r, we get rid of the negative sign (if r is negative) and we get an index of how close the data points are to the regression line.
• Allows us to decide how much confidence we have in making a prediction based on our model.
• Calculated as: r² = Model SOS / Total SOS
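For our example, the ratio of Model SOS to Total SOS matches the square of Pearson’s r. A sketch in Python:

```python
# r-square two ways: as Model SOS / Total SOS, and as r squared.
sos_model, sos_total = 1537.6, 1832.0
r = 0.91613                         # from the correlation output

r_sq = sos_model / sos_total
print(round(r_sq, 4))               # 0.8393, matching SAS's R-Square
print(round(r ** 2, 4))             # same value
```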
[Two scatterplots comparing the Model SOS distances (predicted line to overall mean) and the Total SOS distances (points to overall mean)]
r² = Model SOS / Total SOS (numerator / denominator)
A small numerator over a big denominator gives a small r² (e.g., R² = 0.0144); for our data set, R² = 0.8393.
R-square and Prediction Confidence
[Four scatterplots of increasingly tight linear trends: R² = 0.0144, R² = 0.5537, R² = 0.7605, R² = 0.9683]
Finally…
• If we have a significant relationship (based on the p-value), we can use the r-square value to judge how sure we are in making a prediction.