Regression Analysis
Regression Analysis
Week 2: 19th to 23rd Sept, 2011
Course Map
Introduction to Quantitative Analysis, Ch1, RSH (1 Week)
Regression Models Ch4 (1week)
Decision Analysis, Ch3, RSH (2 Weeks)
Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2 Weeks)
Linear Programming Modeling Applications: With Computer Analyses in Excel, Ch8, RSH (2 Weeks)
Simulation Modeling, Ch15, RSH (2 Weeks)
Forecasting, Ch5, RSH. (2 Weeks)
Waiting Lines and Queuing Theory Models, Ch14, RSH. (2 Weeks)
regression analysis
A very valuable tool for today's manager. Regression analysis is used to:
Understand the relationship between variables.
Predict the value of one variable based on another variable.
A regression model has:
a dependent, or response, variable (Y axis)
an independent, or predictor, variable (X axis)
How to perform Regression analysis
regression analysis
Triple A Construction Company renovates old homes in Albany. It has found that its dollar volume of renovation work is dependent on the Albany area payroll.

Local Payroll ($100,000,000's)   Triple A Sales ($100,000's)
3      6
4      8
6      9
4      5
2      4.5
5      9.5
Scatter plot
Figure: scatter plot of Triple A Sales ($100,000's) against Local Payroll ($100,000,000's).
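The slides use an Excel chart here; as an illustrative alternative, the following is a minimal Python sketch (assuming matplotlib is installed) that recreates the scatter plot from the table above:

```python
import matplotlib.pyplot as plt

# Triple A Construction data from the table above
payroll = [3, 4, 6, 4, 2, 5]      # Local Payroll ($100,000,000's)
sales   = [6, 8, 9, 5, 4.5, 9.5]  # Triple A Sales ($100,000's)

plt.scatter(payroll, sales)
plt.xlabel("Local Payroll ($100,000,000's)")
plt.ylabel("Sales ($100,000's)")
plt.title("Triple A Construction: sales vs. payroll")
plt.show()
```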
regression analysis model
Create a scatter plot, then perform regression analysis.
The underlying regression model is
Y = β0 + β1X + ε
where
Y = dependent variable (response)
X = independent variable (predictor)
β0 = intercept (value of Y when X = 0)
β1 = slope
ε = random error that cannot be predicted
Regression: understand and predict.
regression analysis model
Sample data are used to estimate the true values for the intercept and slope.
Ŷ = b0 + b1X
where Ŷ = predicted value of Y.
The difference between the actual value of Y and the predicted value (using sample data) is known as the error:
Error = (actual value) – (predicted value)
e = Y – Ŷ
regression analysis model
Sales (Y)   Payroll (X)   (X – X̄)²   (X – X̄)(Y – Ȳ)
6           3             1           1
8           4             0           0
9           6             4           4
5           4             0           0
4.5         2             4           5
9.5         5             1           2.5
Summations for each column:   42   24   10   12.5

Ȳ = 42/6 = 7        X̄ = 24/6 = 4

Calculating the required parameters:
b1 = Σ(X – X̄)(Y – Ȳ) / Σ(X – X̄)² = 12.5 / 10 = 1.25
b0 = Ȳ – b1X̄ = 7 – (1.25)(4) = 2
So, Ŷ = 2 + 1.25X
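For readers who want to check the hand calculation, here is a minimal Python/NumPy sketch (variable names are illustrative) of the same least-squares formulas:

```python
import numpy as np

# Triple A Construction sample data from the table above
payroll = np.array([3, 4, 6, 4, 2, 5], dtype=float)   # X
sales   = np.array([6, 8, 9, 5, 4.5, 9.5])            # Y

x_bar, y_bar = payroll.mean(), sales.mean()            # 4 and 7

# Slope and intercept from the least-squares formulas
b1 = np.sum((payroll - x_bar) * (sales - y_bar)) / np.sum((payroll - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(b1, b0)   # 1.25 and 2.0, so Y-hat = 2 + 1.25 X
```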
Measuring the Fit of the Linear Regression Model
To understand how well X predicts Y, we evaluate:
Variability in the Y variable:
SSR –> regression variability that is explained by the relationship between X and Y
+ SSE –> unexplained variability, due to factors other than the regression
= SST –> total variability about the mean
Coefficient of Determination (R²) –> proportion of explained variation
Correlation Coefficient (r) –> strength of the relationship between the Y and X variables
Standard Error –> standard deviation of the error around the regression line
Residual Analysis –> validation of the model
Test for Linearity –> significance of the regression model, i.e. whether a linear relationship exists
Variability
Figure: scatter plot of sales against local payroll ($100,000,000's) with the fitted regression line y = 1.25x + 2 (R² = 0.6944), showing SST, SSE, and SSR (explained variability) as deviations about the mean Ȳ and the regression line.
Variability
Sum of Squares Total (SST) measures the total variability in Y.
Sum of the Squared Error (SSE) is less than the SST because the regression line reduces the variability.
Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.
Errors (deviations) may be positive or negative. Summing the errors would be misleading, so we square the terms prior to summing.
SST = Σ(Y – Ȳ)²
SSE = Σe² = Σ(Y – Ŷ)²
SSR = Σ(Ŷ – Ȳ)²
For Triple A Construction:
SST = Σ(Y – Ȳ)² = 22.5
SSE = Σe² = Σ(Y – Ŷ)² = 6.875 (unexplained variability)
SSR = Σ(Ŷ – Ȳ)² = 15.625 (explained variability)
Note: SST = SSR + SSE
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the variability in Y that is explained by the regression equation.
r² = SSR / SST = 1 – SSE / SST
For Triple A Construction:
r² = 15.625 / 22.5 = 0.6944
69% of the variability in sales is explained by the regression based on payroll.
Note: 0 ≤ r² ≤ 1
SST, SSR, and SSE by themselves provide little direct interpretation; r² measures the usefulness of the regression.
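As a quick numeric check of these sums of squares and of r², here is a short sketch (assuming NumPy; the fitted line Ŷ = 2 + 1.25X comes from the earlier slide):

```python
import numpy as np

payroll = np.array([3, 4, 6, 4, 2, 5], dtype=float)
sales   = np.array([6, 8, 9, 5, 4.5, 9.5])

y_hat = 2 + 1.25 * payroll                     # predictions from the fitted line
sst = np.sum((sales - sales.mean()) ** 2)      # 22.5
sse = np.sum((sales - y_hat) ** 2)             # 6.875
ssr = np.sum((y_hat - sales.mean()) ** 2)      # 15.625
r2  = ssr / sst                                # about 0.6944
print(sst, sse, ssr, r2)
```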
Correlation Coefficient
r = [nΣXY – (ΣX)(ΣY)] / √{ [nΣX² – (ΣX)²] [nΣY² – (ΣY)²] }
For Triple A Construction, r = 0.8333
The correlation coefficient (r) measures the strength of the linear relationship.
Note: –1 ≤ r ≤ 1
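The same formula as a small Python sketch (NumPy assumed), with NumPy's built-in corrcoef as a cross-check:

```python
import numpy as np

x = np.array([3, 4, 6, 4, 2, 5], dtype=float)   # payroll
y = np.array([6, 8, 9, 5, 4.5, 9.5])            # sales
n = len(x)

numerator   = n * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
                      (n * np.sum(y**2) - np.sum(y)**2))
r = numerator / denominator
print(r)                         # about 0.8333
print(np.corrcoef(x, y)[0, 1])   # same value from NumPy's built-in
```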
Correlation Coefficient
Figure: possible scatter diagrams for values of r.
r is shown as Multiple R in the Excel regression output.
Standard error
s² = MSE = SSE / (n – k – 1)
The mean squared error (MSE) is the estimate of the error variance of the regression equation,
where n = number of observations in the sample and k = number of independent variables.
For Triple A Construction, MSE = 1.7188 and s = √1.7188 ≈ 1.31.
Like the standard deviation (which measures variation around the mean), the standard error s measures the variation of Y around the regression line; it is the standard deviation of the errors around the regression line and has the same units as Y. Here s ≈ 1.31 means roughly ±1.31 × $100,000 of error in predicted sales.
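The same calculation as a small sketch (NumPy assumed):

```python
import numpy as np

sse, n, k = 6.875, 6, 1
mse = sse / (n - k - 1)   # 1.71875, the estimate of the error variance
s = np.sqrt(mse)          # about 1.31, the standard error (same units as Y)
print(mse, s)
```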
Test for Linearity
An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. β1 = 0). If the significance level for the F test is low, we reject H0 and conclude there is a linear relationship.
F = MSR / MSE, where MSR = SSR / k
For Triple A Construction:
MSR = 15.625 / 1 = 15.625
F = 15.625 / 1.7188 = 9.0909
The significance level for F = 9.0909 is 0.0394, indicating we reject H0 and conclude a linear relationship exists between sales and payroll.
The p-value is the observed significance level; alpha is the chosen level of significance, i.e. 1 – confidence level.
If p < alpha, reject the null hypothesis that there is no linear relationship between X and Y.
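A sketch of the F test computation (assuming SciPy is available for the F distribution; the significance level is the upper-tail probability of the F statistic with k and n – k – 1 degrees of freedom):

```python
from scipy import stats

ssr, sse, n, k = 15.625, 6.875, 6, 1
msr = ssr / k                          # 15.625
mse = sse / (n - k - 1)                # about 1.7188
F = msr / mse                          # about 9.09
p_value = stats.f.sf(F, k, n - k - 1)  # about 0.039, matching the slide
print(F, p_value)
```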
Computer Software for Regression
In Excel, use Tools/ Data Analysis. This
is an ‘add-in’ option.
Computer Software for Regression
Multiple R in the output is the correlation coefficient (r).
The Standard Error in the output estimates the variation of Y around the regression line; just like the standard deviation (which is around the mean), it is the standard deviation of the errors around the regression line. It has the same units as Y, so a value of about 1.31 means roughly ±1.31 × $100,000 of error in predicted sales.
A p-value less than alpha (0.05 or 0.1) means the relationship between X and Y is linear.
The adjusted R Square takes into account the number of independent variables in the model.
ANOVA table
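Outside Excel, one commonly used option for the same output is Python's statsmodels package (an assumption here, not part of the course material); its OLS summary reports the R-squared, adjusted R-squared, the ANOVA F statistic and its significance, and coefficient p-values in one table:

```python
import numpy as np
import statsmodels.api as sm

payroll = np.array([3, 4, 6, 4, 2, 5], dtype=float)
sales   = np.array([6, 8, 9, 5, 4.5, 9.5])

X = sm.add_constant(payroll)   # adds the intercept column
model = sm.OLS(sales, X).fit()
print(model.summary())         # coefficients, R-squared, adjusted R-squared, F, p-values
```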
Residual Analysis: to verify that the regression assumptions are correct
Assumptions of the Regression Model
Errors are independent.
Errors are normally distributed.
Errors have a mean of zero.
Errors have a constant variance.
We make certain assumptions about the errors in a regression model which allow for statistical testing.
A plot of the errors (actual value minus the predicted value of Y), also called residuals in Excel, may highlight problems with the model.
PITFALLS: Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0). A linear regression model may not be the best model, even in the presence of a significant F test.
Constant Variance (Triple A Construction)
Assumption: the errors have constant variance.
Plot the residuals against the X values; the pattern should be random.
A residual plot showing non-constant variation in the errors is a violation.
Normal Distribution
A histogram of the residuals should look like a bell curve.
For Triple A Construction it is not possible to see the bell curve with just 6 observations; more samples are needed.
Zero Mean (Triple A Construction)
Assumption: the errors have a mean of zero.
independent errors
If the samples are collected over a period of time rather than all at once, plot the residuals against time to see whether any pattern (autocorrelation) exists.
If there is substantial autocorrelation, the validity of the regression model becomes doubtful. Autocorrelation can also be checked using the Durbin–Watson statistic.
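A minimal sketch of the Durbin–Watson statistic (a standard formula, computed here with NumPy; values near 2 suggest little autocorrelation):

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences of the residuals
    divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Example with hypothetical residuals ordered in time
print(durbin_watson([0.5, 0.8, 0.2, -0.4, -0.9, -0.3]))
```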
Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases over a period of 100 days. Because the data are collected over time, check for an autocorrelation (pattern) effect.
Figure: residuals plotted against time show a cyclical pattern, a violation of the independence assumption.
Residual analysis for validating assumptions
Nonlinear residual plot – violation
multiple regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.
Y = b0 + b1X1 + b2X2 + … + bnXn
where X1, …, Xn are the independent variables and b1, …, bn are their slopes.
Price Sq. Feet Age Condition
35000 1926 30 Good
47000 2069 40 Excellent
49900 1720 30 Excellent
55000 1396 15 Good
58900 1706 32 Mint
60000 1847 38 Mint
67000 1950 27 Mint
70000 2323 30 Excellent
78500 2285 26 Mint
79000 3752 35 Good
87500 2300 18 Good
93000 2525 17 Good
95000 3800 40 Excellent
97000 1740 12 Mint
Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.
multiple regression
67% of the variation in sales price is explained by size and age.
H0: no linear relationship, is rejected.
H0: β1 = 0 is rejected; H0: β2 = 0 is rejected.
Y = 60815.45 + 21.91(size) – 1449.34 (age)
Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.
For a 1900 square foot house that is 10 years old, the following prediction can be made:
$87,951 = 60815.45 + 21.91(1900) – 1449.34(10)
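The same prediction written as a small Python sketch (the helper name is illustrative):

```python
def predicted_price(size_sqft, age_years):
    # Wilson Realty model from the slide: Y-hat = 60815.45 + 21.91(size) - 1449.34(age)
    return 60815.45 + 21.91 * size_sqft - 1449.34 * age_years

print(predicted_price(1900, 10))   # about 87951, i.e. roughly $87,951
```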
binary or dummy variables
dummy variables
Binary (or dummy) variables are special variables that are created for qualitative data.
A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
The number of dummy variables must equal one less than the number of categories of the qualitative variable.
Return to Wilson Realty, and let’s evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.
X3 = 1 if the house is in excellent condition, 0 otherwise
X4 = 1 if the house is in mint condition, 0 otherwise
Note: If both X3 = 0 and X4 = 0, then the house is in good condition.
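A small sketch of this coding scheme in Python (the function name is illustrative):

```python
def condition_dummies(condition):
    """Return (x3, x4) for the property condition:
    x3 = 1 if excellent, x4 = 1 if mint; both 0 means good."""
    return (1 if condition == "Excellent" else 0,
            1 if condition == "Mint" else 0)

print(condition_dummies("Mint"))   # (0, 1)
print(condition_dummies("Good"))   # (0, 0)
```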
dummy variables
Y = 48329.23 + 28.21 (size) – 1981.41(age) + 23684.62 (if mint) + 16581.32 (if excellent)
As more variables are added to the model, the r² usually increases.
model building
adjusted r-Square
As more variables are added to the model, the r² usually increases.
The adjusted r² takes into account the number of independent variables in the model.
The best model is a statistically significant model with a high r² and few variables.
Note: When variables are added to the model, the value of r² can never decrease; however, the adjusted r² may decrease.
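One standard form of the adjustment, shown as a small Python sketch (the function name is illustrative):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted r-squared = 1 - (1 - r^2) * (n - 1) / (n - k - 1),
    where n = observations and k = independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(0.6944, 6, 1))   # a bit lower than 0.6944
```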
multicollinearity
Collinearity and multicollinearity create problems in interpreting the coefficients.
The overall model prediction is still good; however, individual interpretation of the variables is questionable.
Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable.
Duplication of information occurs
When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not.
A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
non-linear regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).
Linear regression model: MPG = 47.8 – 8.2(weight)
F significance = 0.0003, r² = 0.7446
non-linear regression
Nonlinear (transformed variable) regression model:
MPG = 79.8 – 30.2(weight) + 3.4(weight)²
F significance = 0.0002, r² = 0.8478
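A sketch of how such a transformed-variable model can be fitted in Python with NumPy. The Colonel Motors data set is not reproduced in these slides, so the weight/MPG values below are hypothetical and only illustrate the mechanics of adding a weight-squared column:

```python
import numpy as np

# Hypothetical data for illustration only (weight in 1,000s of pounds)
weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
mpg    = np.array([35.0, 30.0, 26.0, 24.0, 23.0, 23.0])

# Design matrix: intercept, weight, weight squared
X = np.column_stack([np.ones_like(weight), weight, weight ** 2])
coefs, *_ = np.linalg.lstsq(X, mpg, rcond=None)
b0, b1, b2 = coefs
print(f"MPG = {b0:.1f} + {b1:.1f}(weight) + {b2:.1f}(weight^2)")
```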
non-linear regression
We should not try to interpret the coefficients of the variables due to the correlation between (weight) and (weight squared).
Normally we would interpret the coefficient for X1 as the change in Y that results from a 1-unit change in X1, while holding all other variables constant.
Obviously, holding one variable constant while changing the other is impossible in this example, since if weight changes, then weight squared must change also.
This is an example of a problem that exists when multicollinearity is present.
chapter assignments on LMS
quiz in next class
Case studies