Regression Analysis
Regression Analysis
Week 2: 19th to 23rd Sept, 2011
Course Map
Introduction to Quantitative Analysis, Ch1, RSH (1 Week)
Regression Models Ch4 (1week)
Decision Analysis, Ch3, RSH (2 Weeks)
Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2 Weeks)
Linear Programming Modeling Applications: With Computer Analyses in Excel, Ch8, RSH (2 Weeks)
Simulation Modeling, Ch15, RSH (2 Weeks)
Forecasting, Ch5, RSH. (2 Weeks)
Waiting Lines and Queuing Theory Models, Ch14, RSH. (2 Weeks)
regression analysis
A very valuable tool for today's manager. Regression analysis is used to:
Understand the relationship between variables.
Predict the value of one variable based on another variable.
A regression model has:
a dependent, or response, variable (Y axis)
an independent, or predictor, variable (X axis)
How to perform Regression analysis
regression analysis
Triple A Construction Company renovates old homes in Albany. It has found that its dollar volume of renovation work is dependent on the Albany area payroll.

Local Payroll ($100,000,000's)   Triple A Sales ($100,000's)
3      6
4      8
6      9
4      5
2      4.5
5      9.5
Scatter plot
Figure: scatter plot of Triple A Sales ($100,000's) against Local Payroll ($100,000,000's).
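The slides use an Excel chart here; as an illustrative alternative, the following is a minimal Python sketch (assuming matplotlib is installed) that recreates the scatter plot from the table above:

```python
import matplotlib.pyplot as plt

# Triple A Construction data from the table above
payroll = [3, 4, 6, 4, 2, 5]      # Local Payroll ($100,000,000's)
sales   = [6, 8, 9, 5, 4.5, 9.5]  # Triple A Sales ($100,000's)

plt.scatter(payroll, sales)
plt.xlabel("Local Payroll ($100,000,000's)")
plt.ylabel("Sales ($100,000's)")
plt.title("Triple A Construction: sales vs. payroll")
plt.show()
```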
regression analysis model
Create a scatter plot, then perform regression analysis.
The underlying regression model is
Y = β0 + β1X + ε
where
Y = dependent variable (response)
X = independent variable (predictor)
β0 = intercept (value of Y when X = 0)
β1 = slope
ε = random error that cannot be predicted
Regression: understand and predict.
regression analysis model
Sample data are used to estimate the true values for the intercept and slope.
Ŷ = b0 + b1X
where Ŷ = predicted value of Y.
The difference between the actual value of Y and the predicted value (using sample data) is known as the error:
Error = (actual value) – (predicted value)
e = Y – Ŷ
regression analysis model
Sales (Y)   Payroll (X)   (X – X̄)²   (X – X̄)(Y – Ȳ)
6           3             1           1
8           4             0           0
9           6             4           4
5           4             0           0
4.5         2             4           5
9.5         5             1           2.5
Summations for each column:   42   24   10   12.5

Ȳ = 42/6 = 7        X̄ = 24/6 = 4

Calculating the required parameters:
b1 = Σ(X – X̄)(Y – Ȳ) / Σ(X – X̄)² = 12.5 / 10 = 1.25
b0 = Ȳ – b1X̄ = 7 – (1.25)(4) = 2
So, Ŷ = 2 + 1.25X
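For readers who want to check the hand calculation, here is a minimal Python/NumPy sketch (variable names are illustrative) of the same least-squares formulas:

```python
import numpy as np

# Triple A Construction sample data from the table above
payroll = np.array([3, 4, 6, 4, 2, 5], dtype=float)   # X
sales   = np.array([6, 8, 9, 5, 4.5, 9.5])            # Y

x_bar, y_bar = payroll.mean(), sales.mean()            # 4 and 7

# Slope and intercept from the least-squares formulas
b1 = np.sum((payroll - x_bar) * (sales - y_bar)) / np.sum((payroll - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(b1, b0)   # 1.25 and 2.0, so Y-hat = 2 + 1.25 X
```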
Measuring the Fit of the Linear Regression Model
To understand how well X predicts Y, we evaluate:
Variability in the Y variable:
SSR –> regression variability that is explained by the relationship between X and Y
+ SSE –> unexplained variability, due to factors other than the regression
= SST –> total variability about the mean
Coefficient of Determination (R²) –> proportion of explained variation
Correlation Coefficient (r) –> strength of the relationship between the Y and X variables
Standard Error –> standard deviation of the error around the regression line
Residual Analysis –> validation of the model
Test for Linearity –> significance of the regression model, i.e. whether a linear relationship exists
Variability
Figure: scatter plot of sales against local payroll ($100,000,000's) with the fitted regression line y = 1.25x + 2 (R² = 0.6944), showing SST, SSE, and SSR (explained variability) as deviations about the mean Ȳ and the regression line.
Variability
Sum of Squares Total (SST) measures the total variability in Y.
Sum of the Squared Error (SSE) is less than the SST because the regression line reduces the variability.
Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.
Errors (deviations) may be positive or negative. Summing the errors would be misleading, so we square the terms prior to summing.
SST = Σ(Y – Ȳ)²
SSE = Σe² = Σ(Y – Ŷ)²
SSR = Σ(Ŷ – Ȳ)²
For Triple A Construction:
SST = Σ(Y – Ȳ)² = 22.5
SSE = Σe² = Σ(Y – Ŷ)² = 6.875 (unexplained variability)
SSR = Σ(Ŷ – Ȳ)² = 15.625 (explained variability)
Note: SST = SSR + SSE
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the variability in Y that is explained by the regression equation.
r² = SSR / SST = 1 – SSE / SST
For Triple A Construction:
r² = 15.625 / 22.5 = 0.6944
69% of the variability in sales is explained by the regression based on payroll.
Note: 0 ≤ r² ≤ 1
SST, SSR, and SSE by themselves provide little direct interpretation; r² measures the usefulness of the regression.
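As a quick numeric check of these sums of squares and of r², here is a short sketch (assuming NumPy; the fitted line Ŷ = 2 + 1.25X comes from the earlier slide):

```python
import numpy as np

payroll = np.array([3, 4, 6, 4, 2, 5], dtype=float)
sales   = np.array([6, 8, 9, 5, 4.5, 9.5])

y_hat = 2 + 1.25 * payroll                     # predictions from the fitted line
sst = np.sum((sales - sales.mean()) ** 2)      # 22.5
sse = np.sum((sales - y_hat) ** 2)             # 6.875
ssr = np.sum((y_hat - sales.mean()) ** 2)      # 15.625
r2  = ssr / sst                                # about 0.6944
print(sst, sse, ssr, r2)
```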
Correlation Coefficient
r = [nΣXY – (ΣX)(ΣY)] / √{ [nΣX² – (ΣX)²] [nΣY² – (ΣY)²] }
For Triple A Construction, r = 0.8333
The correlation coefficient (r) measures the strength of the linear relationship.
Note: –1 ≤ r ≤ 1
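The same formula as a small Python sketch (NumPy assumed), with NumPy's built-in corrcoef as a cross-check:

```python
import numpy as np

x = np.array([3, 4, 6, 4, 2, 5], dtype=float)   # payroll
y = np.array([6, 8, 9, 5, 4.5, 9.5])            # sales
n = len(x)

numerator   = n * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
                      (n * np.sum(y**2) - np.sum(y)**2))
r = numerator / denominator
print(r)                         # about 0.8333
print(np.corrcoef(x, y)[0, 1])   # same value from NumPy's built-in
```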
Correlation Coefficient
Figure: possible scatter diagrams for values of r.
r is shown as Multiple R in the Excel regression output.
Standard error
s² = MSE = SSE / (n – k – 1)
The mean squared error (MSE) is the estimate of the error variance of the regression equation,
where n = number of observations in the sample and k = number of independent variables.
For Triple A Construction, MSE = 1.7188 and s = √1.7188 ≈ 1.31.
Like the standard deviation (which measures variation around the mean), the standard error s measures the variation of Y around the regression line; it is the standard deviation of the errors around the regression line and has the same units as Y. Here s ≈ 1.31 means roughly ±1.31 × $100,000 of error in predicted sales.
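The same calculation as a small sketch (NumPy assumed):

```python
import numpy as np

sse, n, k = 6.875, 6, 1
mse = sse / (n - k - 1)   # 1.71875, the estimate of the error variance
s = np.sqrt(mse)          # about 1.31, the standard error (same units as Y)
print(mse, s)
```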
Test for Linearity
An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. β1 = 0). If the significance level for the F test is low, we reject H0 and conclude there is a linear relationship.
F = MSR / MSE, where MSR = SSR / k
For Triple A Construction:
MSR = 15.625 / 1 = 15.625
F = 15.625 / 1.7188 = 9.0909
The significance level for F = 9.0909 is 0.0394, indicating we reject H0 and conclude a linear relationship exists between sales and payroll.
The p-value is the observed significance level; alpha is the chosen level of significance, i.e. 1 – confidence level.
If p < alpha, reject the null hypothesis that there is no linear relationship between X and Y.
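A sketch of the F test computation (assuming SciPy is available for the F distribution; the significance level is the upper-tail probability of the F statistic with k and n – k – 1 degrees of freedom):

```python
from scipy import stats

ssr, sse, n, k = 15.625, 6.875, 6, 1
msr = ssr / k                          # 15.625
mse = sse / (n - k - 1)                # about 1.7188
F = msr / mse                          # about 9.09
p_value = stats.f.sf(F, k, n - k - 1)  # about 0.039, matching the slide
print(F, p_value)
```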
Computer Software for Regression
In Excel, use Tools/ Data Analysis. This
is an ‘add-in’ option.
Computer Software for Regression
Multiple R in the output is the correlation coefficient (r).
The Standard Error in the output estimates the variation of Y around the regression line; just like the standard deviation (which is around the mean), it is the standard deviation of the errors around the regression line. It has the same units as Y, so a value of about 1.31 means roughly ±1.31 × $100,000 of error in predicted sales.
A p-value less than alpha (0.05 or 0.1) means the relationship between X and Y is linear.
The adjusted R Square takes into account the number of independent variables in the model.
ANOVA table
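Outside Excel, one commonly used option for the same output is Python's statsmodels package (an assumption here, not part of the course material); its OLS summary reports the R-squared, adjusted R-squared, the ANOVA F statistic and its significance, and coefficient p-values in one table:

```python
import numpy as np
import statsmodels.api as sm

payroll = np.array([3, 4, 6, 4, 2, 5], dtype=float)
sales   = np.array([6, 8, 9, 5, 4.5, 9.5])

X = sm.add_constant(payroll)   # adds the intercept column
model = sm.OLS(sales, X).fit()
print(model.summary())         # coefficients, R-squared, adjusted R-squared, F, p-values
```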
Residual Analysis: to verify that the regression assumptions are correct
Assumptions of the Regression Model
Errors are independent.
Errors are normally distributed.
Errors have a mean of zero.
Errors have a constant variance.
We make certain assumptions about the errors in a regression model which allow for statistical testing.
A plot of the errors (actual value minus the predicted value of Y), also called residuals in Excel, may highlight problems with the model.
PITFALLS: Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0). A linear regression model may not be the best model, even in the presence of a significant F test.
Constant Variance (Triple A Construction)
Assumption: the errors have constant variance.
Plot the residuals against the X values; the pattern should be random.
A residual plot showing non-constant variation in the errors is a violation.
Normal Distribution
A histogram of the residuals should look like a bell curve.
For Triple A Construction it is not possible to see the bell curve with just 6 observations; more samples are needed.
Zero Mean (Triple A Construction)
Assumption: the errors have a mean of zero.
independent errors
If the samples are collected over a period of time rather than all at once, plot the residuals against time to see whether any pattern (autocorrelation) exists.
If there is substantial autocorrelation, the validity of the regression model becomes doubtful. Autocorrelation can also be checked using the Durbin–Watson statistic.
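A minimal sketch of the Durbin–Watson statistic (a standard formula, computed here with NumPy; values near 2 suggest little autocorrelation):

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences of the residuals
    divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Example with hypothetical residuals ordered in time
print(durbin_watson([0.5, 0.8, 0.2, -0.4, -0.9, -0.3]))
```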
Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases over a period of 100 days. Because the data are collected over time, check for an autocorrelation (pattern) effect.
Figure: residuals plotted against time show a cyclical pattern, a violation of the independence assumption.
Residual analysis for validating assumptions
Nonlinear residual plot – violation
multiple regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.
Y = b0 + b1X1 + b2X2 + … + bnXn
where X1, …, Xn are the independent variables and b1, …, bn are their slopes.
Price Sq. Feet Age Condition
35000 1926 30 Good
47000 2069 40 Excellent
49900 1720 30 Excellent
55000 1396 15 Good
58900 1706 32 Mint
60000 1847 38 Mint
67000 1950 27 Mint
70000 2323 30 Excellent
78500 2285 26 Mint
79000 3752 35 Good
87500 2300 18 Good
93000 2525 17 Good
95000 3800 40 Excellent
97000 1740 12 Mint
Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.
multiple regression
67% of the variation in sales price is explained by size and age.
H0: no linear relationship, is rejected.
H0: β1 = 0 is rejected; H0: β2 = 0 is rejected.
Y = 60815.45 + 21.91(size) – 1449.34 (age)
Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.
For a 1900 square foot house that is 10 years old, the following prediction can be made:
$87,951 = 60815.45 + 21.91(1900) – 1449.34(10)
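The same prediction written as a small Python sketch (the helper name is illustrative):

```python
def predicted_price(size_sqft, age_years):
    # Wilson Realty model from the slide: Y-hat = 60815.45 + 21.91(size) - 1449.34(age)
    return 60815.45 + 21.91 * size_sqft - 1449.34 * age_years

print(predicted_price(1900, 10))   # about 87951, i.e. roughly $87,951
```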
binary or dummy variables
dummy variables
Binary (or dummy) variables are special variables that are created for qualitative data.
A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
The number of dummy variables must equal one less than the number of categories of the qualitative variable.
Return to Wilson Realty, and let’s evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.
X3 = 1 if the house is in excellent condition, 0 otherwise
X4 = 1 if the house is in mint condition, 0 otherwise
Note: If both X3 = 0 and X4 = 0, then the house is in good condition.
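A small sketch of this coding scheme in Python (the function name is illustrative):

```python
def condition_dummies(condition):
    """Return (x3, x4) for the property condition:
    x3 = 1 if excellent, x4 = 1 if mint; both 0 means good."""
    return (1 if condition == "Excellent" else 0,
            1 if condition == "Mint" else 0)

print(condition_dummies("Mint"))   # (0, 1)
print(condition_dummies("Good"))   # (0, 0)
```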
dummy variables
Y = 48329.23 + 28.21 (size) – 1981.41(age) + 23684.62 (if mint) + 16581.32 (if excellent)
As more variables are added to the model, the r² usually increases.
model building
adjusted r-Square
As more variables are added to the model, the r² usually increases.
The adjusted r² takes into account the number of independent variables in the model.
The best model is a statistically significant model with a high r² and few variables.
Note: When variables are added to the model, the value of r² can never decrease; however, the adjusted r² may decrease.
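One standard form of the adjustment, shown as a small Python sketch (the function name is illustrative):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted r-squared = 1 - (1 - r^2) * (n - 1) / (n - k - 1),
    where n = observations and k = independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(0.6944, 6, 1))   # a bit lower than 0.6944
```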
multicollinearity
Collinearity and multicollinearity create problems in interpreting the coefficients.
The overall model prediction is still good; however, individual interpretation of the variables is questionable.
Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable.
Duplication of information occurs
When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not.
A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
non-linear regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).
Linear regression model: MPG = 47.8 – 8.2(weight)
F significance = 0.0003, r² = 0.7446
non-linear regression
Nonlinear (transformed variable) regression model:
MPG = 79.8 – 30.2(weight) + 3.4(weight)²
F significance = 0.0002, r² = 0.8478
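A sketch of how such a transformed-variable model can be fitted in Python with NumPy. The Colonel Motors data set is not reproduced in these slides, so the weight/MPG values below are hypothetical and only illustrate the mechanics of adding a weight-squared column:

```python
import numpy as np

# Hypothetical data for illustration only (weight in 1,000s of pounds)
weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
mpg    = np.array([35.0, 30.0, 26.0, 24.0, 23.0, 23.0])

# Design matrix: intercept, weight, weight squared
X = np.column_stack([np.ones_like(weight), weight, weight ** 2])
coefs, *_ = np.linalg.lstsq(X, mpg, rcond=None)
b0, b1, b2 = coefs
print(f"MPG = {b0:.1f} + {b1:.1f}(weight) + {b2:.1f}(weight^2)")
```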
non-linear regression
We should not try to interpret the coefficients of the variables due to the correlation between (weight) and (weight squared).
Normally we would interpret the coefficient for X1 as the change in Y that results from a 1-unit change in X1, while holding all other variables constant.
Obviously, holding one variable constant while changing the other is impossible in this example, since if weight changes, then weight squared must change also.
This is an example of a problem that exists when multicollinearity is present.
chapter assignments on LMS
quiz in next class
Case studies